Creating a year identifier for pre-post analysis to use for diff-in-diff

Rene Natasha

Join Date: Apr 2019
Posts: 52


26 Apr 2019, 19:20

Hello All,

I am new to posting on Statalist. I hope my question is clear, and I am happy to clarify if it is confusing.

I am working with county data and trying to create a year identifier variable for each year of observations collected. About the dataset: it compiles information that enables aggregation of county-level data over multiple years for health topics.

I want to use this dataset to study a policy effect (the effect of a health clinic on community outcomes) using diff-in-diff. I want to create the time variable (the pre_post variable); however, there is no separate time variable. I am not sure whether it is possible to have a time variable with aggregate data. Ideally, the dataset would include a time variable for each year of observations.

Right now, the dataset labels the year of the observations by the name of the variable. For example
the number of doctors in LA County, California from 2010 - 2012 is three separate vars for the information:
count of doctors in 2010 = md2010,
count of doctors in 2011 = md2011,
and so forth.

My independent variable of interest, the number of health clinics (fqhc) in each county, is likewise labeled with the year at the end of the variable name:
fqhc10 = health clinics in 2010,
fqhc11 = health clinics in 2011,
fqhc12 = health clinics in 2012,
and so forth

My data looks like this

Code:

* Example generated by -dataex-. To install: ssc install dataex
* dataex statename countyname fqhc15 fqhc14 fqhc13 fqhc12 fqhc11 fqhc10
clear
input str20 statename str25 countyname int(fqhc15 fqhc14 fqhc13 fqhc12 fqhc11 fqhc10)
"South Carolina" "Abbeville"            1  1  1  1  1  1
"Louisiana"      "Acadia"               2  2  2  2  2  2
"Virginia"       "Accomack"             3  3  3  3  3  4
"Idaho"          "Ada"                 10 10  6  6  4  4
"Kentucky"       "Adair"                2  2  2  2  2  2
"Missouri"       "Adair"                3  3  3  3  3  4
"Iowa"           "Adair"                0  0  0  0  0  0
"Oklahoma"       "Adair"                3  3  3  3  2  2
"Pennsylvania"   "Adams"                1  1  1  1  1  1
"Wisconsin"      "Adams"                0  0  0  0  0  0
"Illinois"       "Adams"                3  1  1  1  1  1
"Mississippi"    "Adams"                2  2  2  2  2  1
"Ohio"           "Adams"                3  3  2  2  2  2
"Iowa"           "Adams"                0  0  0  0  0  0
"Colorado"       "Adams"               11 10 10  7  6  6
"North Dakota"   "Adams"                0  0  0  0  0  0
"Washington"     "Adams"                3  3  3  3  3  3
"Nebraska"       "Adams"                0  0  0  0  0  0
"Idaho"          "Adams"                1  1  1  1  1  1
"Indiana"        "Adams"                0  0  0  0  0  0
"Vermont"        "Addison"              2  2  2  1  0  0
"Puerto Rico"    "Adjuntas"             0  0  0  0  0  0
"Puerto Rico"    "Aguada"               0  0  0  0  0  0
"Puerto Rico"    "Aguadilla"            0  0  0  0  0  0
"Puerto Rico"    "Aguas Buenas"         1  0  0  0  0  0
"Puerto Rico"    "Aibonito"             0  0  0  0  0  0
"South Carolina" "Aiken"                4  4  5  5  1  1
"Minnesota"      "Aitkin"               1  1  1  2  2  2
"Florida"        "Alachua"              7  6  4  3  3  2
"North Carolina" "Alamance"             4  4  4  4  3  2
"California"     "Alameda"             36 33 32 32 31 28
"Colorado"       "Alamosa"              4  4  4  4  4  4
"New York"       "Albany"               1  1  1  1  1  1
"Wyoming"        "Albany"               0  0  0  0  0  0
"Virginia"       "Albemarle"            1  1  1  1  1  1
"Michigan"       "Alcona"               4  4  4  4  3  3
"Mississippi"    "Alcorn"               1  1  1  1  1  1
"Alaska"         "Aleutians East (B)"   0  0  0  0  0  0
"Alaska"         "Aleutians West (CA)"  1  1  1  1  1  1
"North Carolina" "Alexander"            0  0  0  0  0  0
"Illinois"       "Alexander"            3  3  3  2  2  2
"Virginia"       "Alexandria City"      2  3  3  3  3  2
"Oklahoma"       "Alfalfa"              1  1  1  1  1  1
"Michigan"       "Alger"                1  1  1  1  1  1
"Iowa"           "Allamakee"            0  0  0  0  0  0
"Michigan"       "Allegan"              1  1  1  1  1  1
"New York"       "Allegany"             2  2  2  2  2  2
"Maryland"       "Allegany"             2  2  2  2  2  2
"North Carolina" "Alleghany"            0  0  0  0  0  0
"Virginia"       "Alleghany"            0  0  0  0  0  0
"Pennsylvania"   "Allegheny"           22 24 24 24 24 24
"Kansas"         "Allen"                1  1  1  0  0  0
"Kentucky"       "Allen"                0  0  0  0  0  0
"Ohio"           "Allen"                3  2  1  1  1  1
"Louisiana"      "Allen"                2  2  1  1  1  1
"Indiana"        "Allen"                3  3  3  2  2  0
"South Carolina" "Allendale"            1  1  1  1  1  1
"Michigan"       "Alpena"               3  3  3  3  3  3
"California"     "Alpine"               0  0  0  0  0  0
"California"     "Amador"               0  0  0  0  0  0
"Virginia"       "Amelia"               2  3  3  3  2  2
"Virginia"       "Amherst"              1  1  1  1  0  0
"Mississippi"    "Amite"                1  1  1  1  1  1
"Puerto Rico"    "Anasco"               0  0  0  0  0  0
"Alaska"         "Anchorage (B)"        2  2  2  2  3  3
"Texas"          "Anderson"             2  2  2  3  3  3
"Tennessee"      "Anderson"             1  1  0  0  0  0
"Kansas"         "Anderson"             0  0  0  0  0  0
"South Carolina" "Anderson"             0  0  0  0  0  0
"Kentucky"       "Anderson"             0  0  0  0  0  0
"Missouri"       "Andrew"               1  1  1  1  2  1
"Texas"          "Andrews"              0  0  0  0  0  0
"Maine"          "Androscoggin"        12 13 14 13 13  3
"Texas"          "Angelina"             0  0  0  0  0  0
"Maryland"       "Anne Arundel"         3  2  4  4  5  4
"Minnesota"      "Anoka"                0  0  0  0  0  0
"North Carolina" "Anson"                1  1  1  2  2  2
"Nebraska"       "Antelope"             0  0  0  0  0  0
"Michigan"       "Antrim"               2  2  2  2  2  2
"Arizona"        "Apache"               3  3  2  2  2  2
"Iowa"           "Appanoose"            2  2  2  2  2  2
"Georgia"        "Appling"              1  1  0  0  0  0
"Virginia"       "Appomattox"           0  0  0  0  0  0
"Texas"          "Aransas"              0  0  0  0  0  0
"Colorado"       "Arapahoe"             8  8  8  5  5  5
"Texas"          "Archer"               0  0  0  0  0  0
"Colorado"       "Archuleta"            0  0  0  0  0  0
"Puerto Rico"    "Arecibo"              0  0  0  0  0  0
"Michigan"       "Arenac"               1  1  1  1  1  1
"Arkansas"       "Arkansas"             0  0  0  0  0  0
"Virginia"       "Arlington"            2  2  1  1  1  1
"Pennsylvania"   "Armstrong"            0  0  0  0  0  0
"Texas"          "Armstrong"            0  0  0  0  0  0
"Maine"          "Aroostook"           17 17 16 15 15 15
"Puerto Rico"    "Arroyo"               1  1  1  1  1  1
"Nebraska"       "Arthur"               0  0  0  0  0  0
"Louisiana"      "Ascension"            1  1  1  0  0  0
"North Carolina" "Ashe"                 0  0  0  0  0  0
"Ohio"           "Ashland"              0  0  0  0  0  0
"Wisconsin"      "Ashland"              1  1  1  1  1  1
end

As you can see in my data, I have the county and state as string variables, and the count of clinics by county is provided by individual variables for each year. I imagine that in order to do a diff-in-diff with pre-post analysis, I would need to create a time variable for each year of data I have for health clinics. I have many questions about whether this is the correct kind of dataset (county-level/aggregate data) for this analysis.

I did not include any additional variables from the dataset, but for all other health variables, they are all coded with the year indicated at the end of the variable.

Thank you so much
Stata 12 on MAC OS (but have access to Stata 15 on Windows)

Tags: difference-in-difference, time-constant variables

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

26 Apr 2019, 20:09

Welcome to Statalist, and thank you for using -dataex- to show example data in your very first post!

The difficulty you face is because your data are in wide layout, and to do what you want to do, you need long layout. Actually, to do most things in Stata you need long layout. The -reshape- command gets you there.

You will also eventually need a numeric variable to index the counties. So while we're at it, we'll do that, too.

Code:

reshape long fqhc, i(statename countyname) j(year)
egen n_county = group(statename countyname), label
xtset n_county year

With that, your data will be in long layout, and it will also have been set up for longitudinal (panel) data analysis. From here it is simple to set up a pre-post indicator variable based on the year. And then suitable -xt- commands will enable you to carry out difference-in-differences analyses on various outcomes.

Added: The commands shown above will all run without modification in version 12.
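
As a minimal sketch of the pre-post indicator mentioned above: the toy panel, the variable name pre_post, and the cutoff year 13 are all assumptions for illustration, not values from the actual dataset.

```stata
* Toy long-layout data: 2 hypothetical counties, two-digit years 10-15
clear
set obs 2
gen n_county = _n
expand 6
bysort n_county: gen year = 9 + _n

* Assumption: the policy takes effect in year 13 (hypothetical cutoff)
gen byte pre_post = (year >= 13) if !missing(year)
tabulate year pre_post
```

This only fits the case where every unit is treated at the same time; a single cutoff does not describe staggered adoption.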

Rene Natasha

Join Date: Apr 2019
Posts: 52

26 Apr 2019, 20:43

Mr. Schechter,

Thank you so much for responding to my posts! I appreciate the additional code for changing the county var to numeric.

As a follow up, with this change to the dataset, I can assume that all the other variables are reshaped to reflect the observations by the year variable as well?

I looked at some of the variables and I was confused

Code:

* Example generated by -dataex-. To install: ssc install dataex
*dataex mdnf_active10 year n_county
clear
input double mdnf_active10 byte year float n_county
 33 10  1
 33 11  1
 33 12  1
 33 13  1
 33 14  1
 33 15  1
 33 16  1
 33 17  1
367 10  2
367 11  2
367 12  2
367 13  2
367 14  2
367 15  2
367 16  2
367 17  2
 13 10  3
 13 11  3
 13 12  3
 13 13  3
 13 14  3
 13 15  3
 13 16  3
 13 17  3
  6 10  4
  6 11  4
  6 12  4
  6 13  4
  6 14  4
  6 15  4
  6 16  4
  6 17  4
 17 10  5
 17 11  5
 17 12  5
 17 13  5
 17 14  5
 17 15  5
 17 16  5
 17 17  5
  8 10  6
  8 11  6
  8 12  6
  8 13  6
  8 14  6
  8 15  6
  8 16  6
  8 17  6
 16 10  7
 16 11  7
 16 12  7
 16 13  7
 16 14  7
 16 15  7
 16 16  7
 16 17  7
197 10  8
197 11  8
197 12  8
197 13  8
197 14  8
197 15  8
197 16  8
197 17  8
 27 10  9
 27 11  9
 27 12  9
 27 13  9
 27 14  9
 27 15  9
 27 16  9
 27 17  9
  8 10 10
  8 11 10
  8 12 10
  8 13 10
  8 14 10
  8 15 10
  8 16 10
  8 17 10
 12 10 11
 12 11 11
 12 12 11
 12 13 11
 12 14 11
 12 15 11
 12 16 11
 12 17 11
  5 10 12
  5 11 12
  5 12 12
  5 13 12
  5 14 12
  5 15 12
  5 16 12
  5 17 12
 16 10 13
 16 11 13
 16 12 13
 16 13 13
end
label values n_county n_county
label def n_county 1 "Alabama Autauga", modify
label def n_county 2 "Alabama Baldwin", modify
label def n_county 3 "Alabama Barbour", modify
label def n_county 4 "Alabama Bibb", modify
label def n_county 5 "Alabama Blount", modify
label def n_county 6 "Alabama Bullock", modify
label def n_county 7 "Alabama Butler", modify
label def n_county 8 "Alabama Calhoun", modify
label def n_county 9 "Alabama Chambers", modify
label def n_county 10 "Alabama Cherokee", modify
label def n_county 11 "Alabama Chilton", modify
label def n_county 12 "Alabama Choctaw", modify
label def n_county 13 "Alabama Clarke", modify

I wanted to paste some of the output that seems suspicious. I listed the first 10 observations of 2010 active doctors by county, using the new variable n_county:

list mdnf_active10 n_county year in 1/10

     +-----------------------------------+
     | mdnf_~10          n_county   year |
     |-----------------------------------|
  1. |       33   Alabama Autauga     10 |
  2. |       33   Alabama Autauga     11 |
  3. |       33   Alabama Autauga     12 |
  4. |       33   Alabama Autauga     13 |
  5. |       33   Alabama Autauga     14 |
     |-----------------------------------|
  6. |       33   Alabama Autauga     15 |
  7. |       33   Alabama Autauga     16 |
  8. |       33   Alabama Autauga     17 |
  9. |      367   Alabama Baldwin     10 |
 10. |      367   Alabama Baldwin     11 |
     +-----------------------------------+

I am concerned that while the fqhc variable was correctly reshaped, the other variables were not.

Thank you.

Rene
Stata 12 on MAC OS (but also have access to Stata 15 on Windows)


Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

26 Apr 2019, 21:26

To be clear, the -reshape- command in #2 only reshapes the fqhc's. Your example data didn't include any others. But whatever other variables you have that need to be reshaped must also be listed in the -reshape- command. And it all must be done at once. So, if fqhc10-fqhc17 and mdnf10-mdnf17 are in your data set, the -reshape- command has to mention both:

Code:

reshape long fqhc mdnf, i(statename countyname) j(year)

And if there are more variables like that, they, too must be mentioned in the list of variables that follows -reshape long- and precedes the comma.
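
A small self-contained sketch of reshaping two stubs at once. The fqhc values are taken from the example in #1; the mdnf values are made up for illustration.

```stata
* Two stubs (fqhc, mdnf) reshaped in a single -reshape- call
clear
input str10 statename str10 countyname fqhc10 fqhc11 mdnf10 mdnf11
"Idaho"    "Ada"   4 4 120 125
"Colorado" "Adams" 6 6 300 310
end

reshape long fqhc mdnf, i(statename countyname) j(year)
list, sepby(statename countyname)
```

After this runs, each county-year is one observation, and fqhc and mdnf both vary with year.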
Rene Natasha

Join Date: Apr 2019

Posts: 52
#5

27 Apr 2019, 08:36

Thank you so much Mr. Schechter. I appreciate your response. This is very helpful. I am going to try this on the variables and see how it works out.

I do have a follow up on the use of this dataset and creating control and treatment groups but since it is slightly different topic, I will start a new post.

I will link my new thread if I do create one.

Rene Natasha

Join Date: Apr 2019
Posts: 52

27 Apr 2019, 09:47

When reshaping the data to create the year identifier, is the assumption that every variable has data for each of the years created? For example, the fqhc variable has data from 2010 to 2017, whereas the mdnf variables have data for 2000 through 2015.

Below is a sample of the different variables with the two digit year at the end of the variable. As you can see, some of the variables start at 2010 (fqhc10) and some 2013 (chcg13).

Code:

* Example generated by -dataex-. To install: ssc install dataex
* dataex totactive_nhsc15 totactive_nhsc16 totalMD_all2010 totalMD_all2015 fqhc10 chcg13 
clear
input int(totactive_nhsc15 totactive_nhsc16 totalMD_all2010) long totalMD_all2015 int(fqhc10 chcg13)
 5  5   23   22  1  1
 7  5   52   57  2  1
10 10   33   30  4  6
26 26 1297 1549  4  8
 3  6   15   16  2  1
11 11   14   13  4 10
 2  2    4    4  0  0
 5  5   16   15  2  1
 3  3  129  135  1  3
 5  6    6    5  0  0
11 14  190  196  1  0
 3  3   71   78  1  2
12 12   16   18  2  1
 2  2    3    1  0  0
15 16 1026 1321  6  7
 2  2   15   17  0  0
 3  3   14   13  3  2
 0  0   84   88  0  0
 1  1    3    3  1  1
 3  3   20   18  0  0
 1  1  110  106  0  1
 2  2   24   26  0  1
 1  1   84   87  0  0
 0  0  188  198  0  0
 1  1   25   27  0  1
 0  1   74   76  0  0
 4  4  252  270  1  6
 9  7   20   22  2  0
11 11 2418 2716  2  4
 3  5  274  272  2  3
91 91 5347 6155 28 80
 6  6   40   42  4  5
 5  5 1872 2066  1  6
 7  7   87   81  0  0
 1  1  797  887  1  1
 6  6    8    6  3  4
 6  6   66   70  1  1
 8  8    1    0  0  7
 6  6    4    0  1  4
 0  0   15   14  0  0
 4  3    2    3  2  4
 2  2  526  563  2  6
 2  2    0    1  1  1
 5  4    9    9  1  0
 0  0   13   12  0  0
 4  4   77   75  1  1
11 11   42   44  2  2
13 15  206  198  2  3
 1  1   21   17  0  0
 1  1   34   22  0  0
41 39 8167 8773 24 31
 3  3    5    7  0  1
 2  2    8    5  0  0
 8  8  256  307  1  3
 2  3   19   15  1  0
 5  4 1029 1147  0  5
 6  6    9   11  1  1
 6  6   78   83  3  4
 0  0    2    2  0  0
 1  1   71   71  0  0
 4  4    3    3  2  3
 3  3   14   12  0  2
 5  5    6    5  1  4
 0  0   55   53  0  0
 8  8 1107 1262  3 16
 2  2   60   57  3  0
 2  3  205  204  0  0
 2  2    7    8  0  0
 7  3  372  402  0  0
 1  1   16   19  0  0
 2  2    3    4  1  1
 1  1   11   11  0  0
19 21  294  299  3  3
 2  2  155  167  0  0
 2  2 1527 1647  4  4
 1  1  473  472  0  0
 4  5   15   20  2  1
 0  0    5    4  0  0
 6  6   22   15  2  3
21 21   66   62  2  2
 3  3   13   12  2  3
 3  5   16   16  0  0
 2  2    3    2  0  0
 1  1   28   39  0  0
15 15 1940 2215  5  8
 0  0    4    5  0  0
 2  2   22   28  0  0
 0  0  376  352  0  1
 3  3    7    8  1  1
 1  1   22   17  0  0
 3  3  875 1014  1  3
 2  2   68   67  0  0
 0  0    0    0  0  0
34 37  165  181 15 16
 2  2   31   31  1  2
 0  0    0    0  0  0
 3  3  112  127  0  3
 2  3   36   37  0  0
 1  1   63   58  0  0
 3  3   59   63  1  1
end

When I run the reshape command in Stata, I get the following error message:

Code:

         reshape long fqhc totalMD_active ruralclinic asc chc chcg totactive_nhsc nhsc_pcsites nhsc_fteprovider nhsc_ftepc nhsc_
> ftedental nhsc_ftemh opvisit opvisit_er ervisit_va ervisit_genhosp pop_est pop_est popmale popfemale popwmale popwfemale pop_wn
> hmale pop_wnhfemale pop_whispmale pop_whispfemale pop_blkmale pop_blkfemale pop_blkhispmale pop_blkhisfemale pop_aianmale pop_a
> ianfemale pop_asianfemale pop_asianmale pop_nhmale pop_nhfemale medicaid avg_hhsize singlehh income medfamilyincome povertystat
>  deeppoverty deeppovertypercent insured uninsured mkt mkt_new mkt_active mkt_auto hhpublicassist, i(statename countyname) j(yea
> r)
(note: j = 10 11 12 13 14 15 16 17 18 2010 2015 2016)
(note: totalMD_active10 not found)
(note: chcg10 not found)
(note: totactive_nhsc10 not found)
(note: nhsc_pcsites10 not found)
(note: nhsc_fteprovider10 not found)
(note: nhsc_ftepc10 not found)
(note: nhsc_ftedental10 not found)
(note: nhsc_ftemh10 not found)
(note: ervisit_va10 not found)
(note: pop_est10 not found)
(note: popmale10 not found)
(note: popfemale10 not found)
(note: popwmale10 not found)
(note: popwfemale10 not found)
(note: pop_wnhmale10 not found)
(note: pop_wnhfemale10 not found)
(note: pop_whispmale10 not found)
(note: pop_whispfemale10 not found)
(note: pop_blkmale10 not found)
(note: pop_blkfemale10 not found)
(note: pop_blkhispmale10 not found)
(note: pop_blkhisfemale10 not found)
(note: pop_aianmale10 not found)
(note: pop_aianfemale10 not found)
(note: pop_asianfemale10 not found)
(note: pop_asianmale10 not found)
(note: pop_nhmale10 not found)
(note: pop_nhfemale10 not found)
(note: income10 not found)
(note: medfamilyincome10 not found)
(note: povertystat10 not found)
(note: deeppoverty10 not found)
(note: deeppovertypercent10 not found)
(note: mkt10 not found)
(note: mkt_new10 not found)
(note: mkt_active10 not found)
(note: mkt_auto10 not found)
(note: hhpublicassist10 not found)
variable pop_est10 not found
r(111);

I tried dropping some of the variables from years that I do not need. I also wondered whether I need to rename the variables. Another thought was to reshape only the variables of interest and then drop or recode the other variables afterwards, but I am not sure whether that makes sense conceptually.

Thanks for any help

Rene
Stata 12 on MAC OS (but have access to Stata 15 on Windows)


Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

27 Apr 2019, 11:01

No, the variables do not have to all have the same years. You can have abc10 abc11 abc12 and xyz19 xyz20 xyz21. What will happen is that for observations with year = 10, 11, or 12, the values of xyz will be missing, and for observations with year = 19, 20, or 21, the values of abc will be missing.

The various (note: ... not found) messages are not a problem. They are simply pointing out that not all of the variables are instantiated for all of the years. As per the previous paragraph, it will simply result in missing values for the years for which those variables don't exist.
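
To illustrate the two paragraphs above, here is a toy sketch with made-up stubs abc and xyz (the id variable and all values are hypothetical). The stubs cover different years, and -reshape- fills the gaps with missing values.

```stata
* abc exists for years 10-11, xyz for years 11-12
clear
input id abc10 abc11 xyz11 xyz12
1 5 6 7 8
2 1 2 3 4
end

reshape long abc xyz, i(id) j(year)
list, sepby(id)
```

The run prints "not found" notes for abc12 and xyz10, and the corresponding cells come out as missing (.).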

However, your data is pushing the limits of what Stata can parse because you are mixing two-digit years like 10, 13, or 15 with four digit years like 2010 and 2015. So you need to do some renaming. Probably the simplest way here is to rename all the *20# variables to just *#.

Code:

gen long obs_no = _n
rename *20# *#
reshape long totactive_nhsc totalMD_all fqhc chcg, i(obs_no) j(year)

There is a different issue, however, that is leading to your red error message. In your -reshape- command you have listed the variable name stub pop_est twice. And once it processes one of them, it can't find it again so it gives you the somewhat misleading message pop_est10 not found because it doesn't quite know how to explain it better. If you remove the duplicate mention of pop_est from the command (and fix any other mistakes like that if there are any), and simplify the variable names to all two-digit years (or all four digits if you prefer, though that is a bit harder to do), then it should run correctly.
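
One possible way to guard against a repeated stub is to keep the stub list in a local macro and de-duplicate it first. This is only a sketch; the short stub list is illustrative, not the full list from #6.

```stata
* Stub list with pop_est accidentally repeated
local stubs fqhc pop_est popmale pop_est

* -list uniq- drops repeated elements, keeping first-occurrence order
local stubs : list uniq stubs
display "`stubs'"

* The cleaned macro could then be used as, for example:
* reshape long `stubs', i(statename countyname) j(year)
```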

Added: I am not sure when the syntax used in the -rename- command came into Stata; it might not be available on version 12.
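
If the group form of -rename- turns out not to be available in your version, a loop with regexm() is one possible fallback. This is a sketch using the first two rows of the example in #6; it assumes that only four-digit-year variables end in 20 followed by two digits.

```stata
clear
input totalMD_all2010 totalMD_all2015 fqhc10 chcg13
23 22 1 1
52 57 2 1
end

* Strip the "20" from names that end in a four-digit year (20xx)
foreach v of varlist *20?? {
    if regexm("`v'", "^(.*)20([0-9][0-9])$") {
        local new = regexs(1) + regexs(2)
        rename `v' `new'
    }
}
describe, simple
```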

Rene Natasha

Join Date: Apr 2019
Posts: 52

29 Apr 2019, 22:44

Thank you for that clarification. I was able to reshape the data set selecting several variables to reshape. I am wondering if I should drop the variables that I did not reshape but are associated with a specific year?

For example I reshaped fqhc10, fqhc11, fqhc12, etc and now I have # of fqhc variable by year by county. However I have other variables like md_primarycare that has 3 years of data but stored in wide format: md_primarycare16, md_primarycare15, md_primarycare14. I did not reshape these variables. I am not sure how to interpret or understand the data in those columns after the reshape? I think dropping them might reduce confusion later on in my analysis however, I worry what if I need those variables later on? I know that I can only use reshape command once in the dataset.

I provided a sample of the data and you can see that it seems a bit funky.

Code:

* Example generated by -dataex-. To install: ssc install dataex
*dataex md_pcnf15 md_pcnf14 year fqhc n_county
clear
input double(md_pcnf15 md_pcnf14) int(year fqhc) float n_county
 25  23    0 . 1
 25  23    5 . 1
 25  23    6 . 1
 25  23    7 . 1
 25  23    8 . 1
 25  23    9 . 1
 25  23   10 1 1
 25  23   11 2 1
 25  23   12 2 1
 25  23   13 2 1
 25  23   14 2 1
 25  23   15 2 1
 25  23   16 2 1
 25  23   17 2 1
 25  23   18 . 1
 25  23 2010 . 1
 25  23 2015 . 1
 25  23 2016 . 1
148 148    0 . 2
148 148    5 . 2
148 148    6 . 2
148 148    7 . 2
148 148    8 . 2
148 148    9 . 2
148 148   10 3 2
148 148   11 3 2
148 148   12 3 2
148 148   13 3 2
148 148   14 3 2
148 148   15 3 2
148 148   16 3 2
148 148   17 4 2
148 148   18 . 2
148 148 2010 . 2
148 148 2015 . 2
148 148 2016 . 2
 11  11    0 . 3
 11  11    5 . 3
 11  11    6 . 3
 11  11    7 . 3
 11  11    8 . 3
 11  11    9 . 3
 11  11   10 3 3
 11  11   11 3 3
 11  11   12 3 3
 11  11   13 3 3
 11  11   14 3 3
 11  11   15 3 3
 11  11   16 3 3
 11  11   17 3 3
 11  11   18 . 3
 11  11 2010 . 3
 11  11 2015 . 3
 11  11 2016 . 3
 12   9    0 . 4
 12   9    5 . 4
 12   9    6 . 4
 12   9    7 . 4
 12   9    8 . 4
 12   9    9 . 4
 12   9   10 1 4
 12   9   11 1 4
 12   9   12 1 4
 12   9   13 1 4
 12   9   14 1 4
 12   9   15 2 4
 12   9   16 2 4
 12   9   17 2 4
 12   9   18 . 4
 12   9 2010 . 4
 12   9 2015 . 4
 12   9 2016 . 4
 12  11    0 . 5
 12  11    5 . 5
 12  11    6 . 5
 12  11    7 . 5
 12  11    8 . 5
 12  11    9 . 5
 12  11   10 1 5
 12  11   11 1 5
 12  11   12 1 5
 12  11   13 1 5
 12  11   14 1 5
 12  11   15 1 5
 12  11   16 1 5
 12  11   17 1 5
 12  11   18 . 5
 12  11 2010 . 5
 12  11 2015 . 5
 12  11 2016 . 5
  3   3    0 . 6
  3   3    5 . 6
  3   3    6 . 6
  3   3    7 . 6
  3   3    8 . 6
  3   3    9 . 6
  3   3   10 1 6
  3   3   11 1 6
  3   3   12 1 6
  3   3   13 1 6
end
label values n_county n_county
label def n_county 1 "Alabama Autauga", modify
label def n_county 2 "Alabama Baldwin", modify
label def n_county 3 "Alabama Barbour", modify
label def n_county 4 "Alabama Bibb", modify
label def n_county 5 "Alabama Blount", modify
label def n_county 6 "Alabama Bullock", modify

Also I wanted to make sure I understood your comment in post #7.

However, your data is pushing the limits of what Stata can parse because you are mixing two-digit years like 10, 13, or 15 with four digit years like 2010 and 2015. So you need to do some renaming. Probably the simplest way here is to rename all the *20# variables to just *#.

Code:

gen long obs_no = _n
rename *20# *#
reshape long totactive_nhsc totalMD_all fqhc chcg, i(obs_no) j(year)

So I would only carry out this command if my variables mixed two-digit and four-digit years, like 2010 and 2015? And I do not need to generate an ID number variable?

Part 2 of my question:

A question about creating a pre_post variable. Now that I have the time identifier, creating the pre_post variable for a general DID seems easy. However, in my dataset different counties adopted the treatment in different years, so I am trying to figure out how to create a pre_post time variable under these circumstances.

Clyde, I looked at a Statalist post you responded to that seems similar: https://www.statalist.org/forums/for...reatment-group. Is the approach in #3 of that thread the one I would have to take?

Hopefully my two-part follow up makes sense?

Thank you,


Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#9

29 Apr 2019, 23:34

For the first part, I would advise against leaving the data in this hybrid half-long half-wide layout. It will likely get you into trouble later. If you won't need those other variables then, indeed, just drop them. If you think you will, why didn't you just include them in the -reshape- to begin with? That way you'd have a fully long layout and working with all of those variables will be straightforward.

Concerning the question about generating the variable obs_no, no, you don't need to do that. I see that in your -reshape- command you have variables statename and countyname in the i() option and they provide unique identification of observations. In the example data you posted, those variables do not appear and there was nothing in the example that I could put in the -i()- option. But the -i()- option is not optional: you must have a unique identifier (or group of variables that jointly form a unique identifier) to put there. So, when there is nothing else in the data to do that, the simplest approach is to create an obs_no variable as I did.
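
Whether a set of variables uniquely identifies observations can be checked with -isid- before reshaping. A small sketch using three rows from the example in #1:

```stata
clear
input str14 statename str10 countyname fqhc10
"Kentucky" "Adair" 2
"Missouri" "Adair" 4
"Iowa"     "Adair" 0
end

* countyname alone repeats across states, so this check fails (nonzero _rc)
capture isid countyname
display "isid countyname: return code = " _rc

* statename and countyname jointly identify each observation; no error here
isid statename countyname
```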

The -rename- command was necessary because in your example data some of the variables had 4 digit years and some had 2 digit years and -reshape- would not handle that correctly.

Concerning part 2 of your question, I no longer recommend the approaches in the post you linked to. They are valid and they work, but there is an easier way. When the different entities adopt the intervention at different times, you cannot do a classical DID analysis. But you can do a generalized DID analysis. https://www.annualreviews.org/doi/pd...-040617-013507 is a good reference that explains the technique.

In brief, you will not have a prepost variable for this analysis. Nor will you have a treat vs no treat variable. Instead you need a treatment_in_effect variable that is 1 in exactly those observations where the entity in question is receiving the treatment, and 0 in all other observations. This variable is, in fact, equivalent to the treat#prepost interaction in a classical DID analysis. When you run the regression, you regress on this special variable and you also include indicator variables for the entities themselves (which I guess in your case are states or counties or something like that) and indicator variables for the years. The coefficient of this treatment_in_effect variable is the generalized DID estimator of the intervention's effect on the outcome.
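
As a rough sketch of building such a variable in long layout: the toy data below are hypothetical, and defining "treated" as at least one FQHC operating in the county that year is an assumption, not something settled in this thread.

```stata
* Toy long-layout panel: county 1 adopts in year 12, county 2 never adopts,
* county 3 adopts in year 11 and loses its clinic in year 13
clear
input n_county year fqhc
1 10 0
1 11 0
1 12 1
1 13 2
2 10 0
2 11 0
2 12 0
2 13 0
3 10 0
3 11 1
3 12 1
3 13 0
end

* Assumption: treatment is "in effect" whenever fqhc > 0
gen byte treatment_in_effect = (fqhc > 0) if !missing(fqhc)
list, sepby(n_county)
```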
Rene Natasha

Join Date: Apr 2019

Posts: 52
#10

03 May 2019, 10:32

Thank you Mr. Schechter.

I am going to modify my dataset and will post back if I have any follow-up questions.

Thank you!
Rene Natasha

Join Date: Apr 2019

Posts: 52
#11

10 May 2019, 12:21

Hi Clyde,

I found the Wing, Simon, & Bello-Gomez (2018) article you referred me to very helpful in understanding how to build the model for a generalized DID analysis with multiple time and treatment periods, and also how to validate the method's assumptions.

I am trying to figure out how to identify the independent variable - unit of analysis (FQHC site) across the different counties in my data set. This would help me determine which counties will be in the treatment group and which will be in the control group. Before I can do that I have yet to reshape all of the necessary variables in the dataset. I read the entire article and tried to walk through my data to build the model. I want to make sure I understand your comment in #9 and also the logic outlined in the article.

In brief, you will not have a prepost variable for this analysis. Nor will you have a treat vs no treat variable. Instead you need a treatment_in_effect variable that is 1 in exactly those observations where the entity in question is receiving the treatment, and 0 in all other observations. This variable is, in fact, equivalent to the treat#prepost interaction in a classical DID analysis. When you run the regression, you regress on this special variable and you also include indicator variables for the entities themselves (which I guess in your case are states or counties or something like that) and indicator variables for the years. The coefficient of this treatment_in_effect variable is the generalized DID estimator of the intervention's effect on the outcome.

This is my assumption of how the model would look based on what you provided.

"plain language"
outcome = + β(treatment*year)+ time indicator variables + county indicator variables + FQHC indicator variables + control variables + county fixed-effects + year fixed effects + error

the model:
Y_it = α_i + α_t + β*T_it + γX_it + ϵ_igt

Y is the dependent (outcome) variable
i county
t year
α_i and α_t are county and year fixed effects, which are the constants in the model
T_it is the treatment dummy variable that equals 1 if an FQHC site has been introduced within county i by time t
X_it is the indicator variables
ϵ_it is the error term
β*T is the interaction variable or treatment-in-effect variable

I am having a hard time conceptualizing and operationalizing this so that I can at least generate the commands in Stata to show you what it would look like with my variables. I looked at a previous post on generalized DID and I am trying to see whether it applies to my question: https://www.statalist.org/forums/for...atment-periods

Last edited by Rene Natasha; 10 May 2019, 12:35. Reason: added a link to a previous statatlist post.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#12

10 May 2019, 12:40

outcome = + β(treatment*year)+ time indicator variables + county indicator variables + FQHC indicator variables + control variables + county fixed-effects + year fixed effects + error

I'm not sure. I don't have a clear understanding of what you are modeling here. I was under the impression that FQHCs come into existence at various times in your data set (and perhaps some also cease operations) and that you are looking to estimate the effects of having an FQHC in the county (or perhaps of the number of FQHC's in the county). If that's the case, the FQHC indicators aren't even definable and don't belong in the model. Also county indicators is the same thing as county fixed effects. You don't need them twice, once will do. Also, I don't know what the distinction between time indicator variables and year fixed effects is supposed to be--to my mind they are the same thing, and only need to appear once. Am I missing something there?

Finally, I see the term ϵ_igt in your equation: is the g just a typo? If not, what does it refer to?
Rene Natasha

Join Date: Apr 2019

Posts: 52
#13

11 May 2019, 13:03

Hello Mr. Schechter,

I apologize for the confusion. I think some of the things you pointed out reflect my confusion about the generalized DID. I am used to seeing the specification of fixed effects (fe) at the end of the model equation, and I was unsure whether it was doing something different in the generalized DID.

I was under the impression that FQHCs come into existence at various times in your data set (and perhaps some also cease operations) and that you are looking to estimate the effects of having an FQHC in the county (or perhaps of the number of FQHC's in the county).

Your understanding of what I am studying is correct. Here is my updated model equation:

outcome = county fixed effects + year fixed effects + β(treatment*year) + control variables + error

or

Y_it = α_i + α_t + β*T_it + γX_it + ϵ_it

where,
Y = outcome
i = county
t = time in years
T = treatment, or presence of an FQHC
β = DD estimator; this coefficient is the result I am interested in
β*T = the interaction term for a county with an FQHC during the period when it is open (the treatment-in-effect term)
X = additional indicator variables
ϵ = error term (you asked whether the g in the original was intended; including it was an error on my end)
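
A minimal sketch of how an equation of this form could be estimated in Stata. The data are simulated, and the variable names (county_id, year, treated, outcome) and the treatment timing are assumptions made up for illustration; none of them come from the actual dataset. Control variables (X) are omitted for brevity.

```stata
* Simulated toy panel: 20 hypothetical counties observed 2010-2015
clear
set seed 12345
set obs 20
gen county_id = _n
expand 6
bysort county_id: gen year = 2009 + _n

* Invented, staggered treatment timing (assumption for illustration only):
* counties 1-10 have an FQHC from 2013 on, counties 11-15 from 2014 on
gen byte treated = (county_id <= 10 & year >= 2013) | ///
    (inrange(county_id, 11, 15) & year >= 2014)

* Invented outcome with a built-in treatment effect of 2
gen outcome = 50 + 2*treated + rnormal()

* County fixed effects via -xtreg, fe-; year fixed effects via i.year
xtset county_id year
xtreg outcome i.treated i.year, fe vce(cluster county_id)
```

In this sketch the coefficient on 1.treated plays the role of β. Clustering the standard errors by county is a common choice for panel data of this kind, not something the equation itself requires.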

In building this, I have not yet considered any lag or lead time between an FQHC opening or closing and its effect as a treatment taking hold. I know that lead and lag times can be incorporated when building the model.

Prior to even building the model, I want to identify the control and treatment groups which for me would be counties that have FQHCs and those that do not.

Thank you for your help so far.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#14

12 May 2019, 12:54

Now your model looks right and makes sense to me. I do agree that, in any DID (classical or generalized) analysis you have to give consideration to the possibility that effects are delayed or phased in (or, in some less common circumstances, anticipated). That, of course, is a substantive issue in the subject matter domain, and not something I could help you with from a statistical perspective.

Good luck. Post back if there is anything else you want help with.
2 likes
Comment
Rene Natasha

Join Date: Apr 2019

Posts: 52
#15

21 Jun 2019, 17:31

Hi Clyde Schechter,

It has been a long time, but I was finally able to reshape the dataset and label the newly created variables. I dropped the wide-format variables from the dataset.

My follow-up question on reshaping the dataset is this: for all commands moving forward, do I have to use the xi prefix in any of the analysis?
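
For context, in Stata 11 and later most estimation commands accept factor-variable notation (the i. prefix on a variable) directly, so the xi prefix is generally not needed. The short comparison below uses the auto dataset that ships with Stata, not the county data from this thread, purely as an illustration.

```stata
* Illustration on Stata's shipped example data (not the thread's dataset)
sysuse auto, clear

* Older style: -xi- creates indicator variables for rep78 first
xi: regress price mpg i.rep78

* Factor-variable notation: indicators are handled internally, no -xi-
regress price mpg i.rep78
```

Both commands fit the same model; the second does not add indicator variables to the dataset.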

As I mentioned, the FQHCs (the treatment variable) opened and closed at different times, and some counties have more than one FQHC, each opening and closing at a different time. To identify the counties with FQHCs using the reshaped data, do I need some sort of forvalues or other loop command to find the counties and years where a clinic was once open but is now closed? So, for example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
*dataex list fqhc year countystate in 1/18, table
clear
input int fqhc byte year str30 countystate
1 10 "Autauga, AL"
2 11 "Autauga, AL"
2 12 "Autauga, AL"
2 13 "Autauga, AL"
2 14 "Autauga, AL"
2 15 "Autauga, AL"
2 16 "Autauga, AL"
2 17 "Autauga, AL"
. 18 "Autauga, AL"
3 10 "Baldwin, AL"
3 11 "Baldwin, AL"
3 12 "Baldwin, AL"
3 13 "Baldwin, AL"
3 14 "Baldwin, AL"
3 15 "Baldwin, AL"
3 16 "Baldwin, AL"
4 17 "Baldwin, AL"
. 18 "Baldwin, AL"
end

This gets me the following output:

PHP Code:

     +---------------------------+
     | fqhc   year   countystate |
     |---------------------------|
  1. |    1     10   Autauga, AL |
  2. |    2     11   Autauga, AL |
  3. |    2     12   Autauga, AL |
  4. |    2     13   Autauga, AL |
  5. |    2     14   Autauga, AL |
     |---------------------------|
  6. |    2     15   Autauga, AL |
  7. |    2     16   Autauga, AL |
  8. |    2     17   Autauga, AL |
  9. |    .     18   Autauga, AL |
 10. |    3     10   Baldwin, AL |
     |---------------------------|
 11. |    3     11   Baldwin, AL |
 12. |    3     12   Baldwin, AL |
 13. |    3     13   Baldwin, AL |
 14. |    3     14   Baldwin, AL |
 15. |    3     15   Baldwin, AL |
     |---------------------------|
 16. |    3     16   Baldwin, AL |
 17. |    4     17   Baldwin, AL |
 18. |    .     18   Baldwin, AL |
     +---------------------------+

I want to determine in which years a county went from having zero FQHCs to at least one FQHC, and to identify the list of counties where this happened. I was told that the forvalues command would allow a loop to go through all of the data, determine which counties went from 0 to at least 1 FQHC, and create a new variable. I could also create a similar command to tell me whether, over a period of time, a county went from 0 FQHCs to >=1 FQHCs and then lost them (back to zero).
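
The transitions described above can probably be flagged without an explicit loop, using the by: prefix and subscripting on the long-format data. The sketch below is one possible approach, not a definitive solution: the three counties and their values are invented for illustration (the excerpt listed in this post contains no zero counts), and the new variable names gained, lost, ever_gained and ever_lost are hypothetical.

```stata
* Invented toy data in the same layout as the reshaped dataset
clear
input int fqhc byte year str30 countystate
0 10 "County A, XX"
0 11 "County A, XX"
1 12 "County A, XX"
2 13 "County A, XX"
1 10 "County B, XX"
1 11 "County B, XX"
1 12 "County B, XX"
1 13 "County B, XX"
0 10 "County C, XX"
1 11 "County C, XX"
0 12 "County C, XX"
0 13 "County C, XX"
end

* Within each county, in year order, compare each year with the previous one.
* The if qualifier skips the first year and any pair with a missing count.
bysort countystate (year): gen byte gained = fqhc >= 1 & fqhc[_n-1] == 0 ///
    if _n > 1 & !missing(fqhc, fqhc[_n-1])
bysort countystate (year): gen byte lost = fqhc == 0 & fqhc[_n-1] >= 1 ///
    if _n > 1 & !missing(fqhc, fqhc[_n-1])

* County-level flags: did the transition happen in any year?
by countystate: egen byte ever_gained = max(gained)
by countystate: egen byte ever_lost = max(lost)

* Years in which a county went from 0 to at least 1 FQHC
list countystate year if gained == 1

* Counties that ever made that transition
levelsof countystate if ever_gained == 1
```

In the toy data, County A gains an FQHC in year 12, County C gains one in year 11 and loses it in year 12, and County B never changes. The missing-value guard matters because Stata treats missing as larger than any number, so fqhc >= 1 would otherwise be true for a missing count.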

I guess my first task, before even building the model, is to get a list of the counties that went from not having an FQHC (pre-period) to having at least 1 FQHC (post-period). That would help in building a demographics table by pre/post period and would also give me a list of treatment counties.

Let me know if I should start a new topic for this question. I am really stuck. I have reshaped my data, but so far the listing above is all I have to work with when it comes to identifying counties without FQHCs.
Comment
