Difference-in-Difference with a series of cross section data*

Tenzing Doma

Join Date: Mar 2021
Posts: 5

Difference-in-Difference with a series of cross section data*

14 Mar 2021, 14:16

Hello, everyone!

I am fairly bad at Stata and it's been some time since I attended any econometrics class. I am trying to work out how to complete a DID analysis using the BRFSS data with geolocation data on farmers' markets around the USA by county-level. The BRFSS only provides county info till 2012 so I am using data from the years 2009-2012. Therefore I would like to test the effect of the farmers market on BMI by county-level.

I have created a dummy variable

Code:

egen fmktarea = max(anyfarmersmkt),by(countyid)

a dummy variable for the area of the study; at least one farmers market in the county

Code:

gen fmktreated  = 0
replace fmktreated =1 if (numfarmersmkt>=1)

. dataex year bmi fmktarea fmktreated

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year float(bmi fmktarea fmktreated)
2009 24.58 1 1
2009 28.09 1 1
2009 27.32 1 1
2009 29.11 1 1
2009 23.78 1 1
2009 23.54 1 1
2009  34.4 1 1
2009 22.36 1 1
2009 19.88 1 1
2009 30.45 1 1
2009 29.35 1 1
2009 27.51 1 1
2009  25.6 1 1
2009 20.59 1 1
2009 33.73 1 1
2009 23.96 1 1
2009  25.9 1 1
2009 44.72 1 1
2009 20.16 1 1
2009 20.41 1 1
2009 33.07 1 1
2009 22.91 1 1
2009 26.56 1 1
2009 42.77 1 1
2009 27.86 1 1
2009  29.1 1 1
2009 23.35 1 1
2009 26.64 1 1
2009 31.54 1 1
2009 35.85 1 1
2009 31.63 1 1
2009 23.85 1 1
2009 32.35 1 1
2009  25.9 1 1
2009 37.84 1 1
2009 33.36 1 1
2009 25.07 1 1
2009 26.31 1 1
2009 25.14 1 1
2009 45.01 1 1
2009 27.31 1 1
2009  25.9 1 1
2009 23.08 1 1
2009 24.38 1 1
2009  26.3 1 1
2009 33.37 1 1
2009 35.02 1 1
2009 21.68 1 1
2010 25.85 1 1
2010 28.27 1 1
2010 29.33 1 1
2010 25.29 1 1
2010 19.18 1 1
2010 31.39 1 1
2010 36.92 1 1
2010 25.82 1 1
2010 37.33 1 1
2010 26.22 1 1
2010 25.16 1 1
2010 39.66 1 1
2010  37.2 1 1
2010 21.59 1 1
2010  29.6 1 1
2010 26.68 1 1
2010 23.62 1 1
2010 24.38 1 1
2010 32.68 1 1
2010 25.88 1 1
2010 24.01 1 1
2010 25.07 1 1
2010 30.61 1 1
2010 25.88 1 1
2010 26.56 1 1
2010 23.35 1 1
2010 30.11 1 1
2010 31.06 1 1
2010 23.08 1 1
2010  27.5 1 1
2010 24.94 1 1
2010  36.7 1 1
2010  34.4 1 1
2010 28.65 1 1
2010 23.44 1 1
2010 23.46 1 1
2010 25.14 1 1
2010 20.27 1 1
2010 33.35 1 1
2010 41.63 1 1
2010 22.35 1 1
2010 28.38 1 1
2010 34.24 1 1
2010 21.57 1 1
2010 24.94 1 1
2010 22.64 1 1
2011 27.12 1 1
2011 27.34 1 1
2011 30.52 1 1
2011 31.47 1 1
2011 33.45 1 1
2011 24.33 1 1
end

------------------ copy up to and including the previous line ------------------

Listed 100 out of 1327642 observations
Use the count() option to list more

.I am a bit lost on how to proceed with this analysis. Any help would be greatly appreciated.

Last edited by Tenzing Doma; 14 Mar 2021, 14:17. Reason: clyde schechter

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

14 Mar 2021, 15:00

A DID analysis requires not only treated and untreated groups, but pre-treatment and post-treatment time periods. You mention nothing about the latter, and it isn't clear to me whether you have correctly created a variable that deals with the former. While you have given the code that created the variables fmktarea and fmktreated, since you haven't explained what the other variables in those commands mean, it's impossible to say if these variables are even useful. Also, you will need a variable which identifies the counties--which I'm guessing you have but which you didn't show.

Anyway, if you have as a starting point a variable that identifies the county, a variable that identifies the year, and a variable, coded 0/1, that indicates whether that county was treated in that year, then we can proceed from there. Please post back with a new -dataex- output that includes these three variables (plus your BMI outcome).
Comment

Tenzing Doma

Join Date: Mar 2021
Posts: 5

14 Mar 2021, 15:25

Clyde Schechter Thank you for replying to my post! I am a fan!

The data doesn't have definite pre-and post-treatment time periods; some counties don't have a farmers market in 2009 (pre*) but have a market in 2011 (post-treatment?*). I am a little confused about how to identify that.

Code:

. tab year  fmktreated

           |    the area of the
           |  study; atleast one
           |  farmers' market in
 INTERVIEW |      the county
      YEAR |         0          1 |     Total
-----------+----------------------+----------
      2009 |    42,102    267,782 |   309,884
      2010 |    37,055    283,101 |   320,156
      2011 |    38,461    317,950 |   356,411
      2012 |    35,677    305,514 |   341,191
-----------+----------------------+----------
     Total |   153,295  1,174,347 | 1,327,642

As for the identifier for treated and untreated groups:
i created a dummy variable "fmktreated" : "==0" if there are no farmers market and "==1" if there are more than or equal to one farmers market in the county

Code:

gen fmktreated  = 0
replace fmktreated =1 if (numfarmersmkt>=1)

Also, the variable identifier for the counties is "countyid" = " group(statefips countyfips)"

Code:

. dataex bmi countyid year fmktreated

----------------------- copy starting from the next line -----------------------


	Code:
	* Example generated by -dataex-. To install: ssc install dataex
clear
input float(bmi countyid) int year float fmktreated
24.58 1 2009 1
28.09 1 2009 1
27.32 1 2009 1
29.11 1 2009 1
23.78 1 2009 1
23.54 1 2009 1
 34.4 1 2009 1
22.36 1 2009 1
19.88 1 2009 1
30.45 1 2009 1
29.35 1 2009 1
27.51 1 2009 1
 25.6 1 2009 1
20.59 1 2009 1
33.73 1 2009 1
23.96 1 2009 1
 25.9 1 2009 1
44.72 1 2009 1
20.16 1 2009 1
20.41 1 2009 1
33.07 1 2009 1
22.91 1 2009 1
26.56 1 2009 1
42.77 1 2009 1
27.86 1 2009 1
 29.1 1 2009 1
23.35 1 2009 1
26.64 1 2009 1
31.54 1 2009 1
35.85 1 2009 1
31.63 1 2009 1
23.85 1 2009 1
32.35 1 2009 1
 25.9 1 2009 1
37.84 1 2009 1
33.36 1 2009 1
25.07 1 2009 1
26.31 1 2009 1
25.14 1 2009 1
45.01 1 2009 1
27.31 1 2009 1
 25.9 1 2009 1
23.08 1 2009 1
24.38 1 2009 1
 26.3 1 2009 1
33.37 1 2009 1
35.02 1 2009 1
21.68 1 2009 1
25.85 1 2010 1
28.27 1 2010 1
29.33 1 2010 1
25.29 1 2010 1
19.18 1 2010 1
31.39 1 2010 1
36.92 1 2010 1
25.82 1 2010 1
37.33 1 2010 1
26.22 1 2010 1
25.16 1 2010 1
39.66 1 2010 1
 37.2 1 2010 1
21.59 1 2010 1
 29.6 1 2010 1
26.68 1 2010 1
23.62 1 2010 1
24.38 1 2010 1
32.68 1 2010 1
25.88 1 2010 1
24.01 1 2010 1
25.07 1 2010 1
30.61 1 2010 1
25.88 1 2010 1
26.56 1 2010 1
23.35 1 2010 1
30.11 1 2010 1
31.06 1 2010 1
23.08 1 2010 1
 27.5 1 2010 1
24.94 1 2010 1
 36.7 1 2010 1
 34.4 1 2010 1
28.65 1 2010 1
23.44 1 2010 1
23.46 1 2010 1
25.14 1 2010 1
20.27 1 2010 1
33.35 1 2010 1
41.63 1 2010 1
22.35 1 2010 1
28.38 1 2010 1
34.24 1 2010 1
21.57 1 2010 1
24.94 1 2010 1
22.64 1 2010 1
27.12 1 2011 1
27.34 1 2011 1
30.52 1 2011 1
31.47 1 2011 1
33.45 1 2011 1
24.33 1 2011 1
end
label values countyid countyid
label def countyid 1 "1 1", modify
------------------ copy up to and including the previous line ------------------

Listed 100 out of 1327642 observations
Use the count() option to list more

I know I am missing a lot here. please any help is greatly appreciated. Thank you once again.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#4

14 Mar 2021, 19:33

Well, the variable numfarmersmkt in your data changes from year to year when the actual number of farmers markets in the area changes from year to year, then your task is fairly straightforward. In the example data you show, however, the variable fmkttreated is always 1 for every county in every year. If your real data is like that, you cannot possibly test your hypothesis (by DID or any other approach).

But if you do have county/year combinations with fmkttreated == 0 as well, then you can do this:

Code:

xtset countyid year xtreg bmi i.fmkttreated, fe

That said, using this fmkttreated variable seems like a very blunt instrument to use here. It's an on-off switch. Doesn't it make more sense to think that counties with more farmer's markets might respond more than counties with fewer of them. I'd be inclined to actually use the number of farmers markets itself as the treatment variable. If you do that, you might want to refine it by making it farmers markets per capita, assuming you can get the population variable, to be even more mechanistically plausible. And also, think about whether there should be a time lag in your model. It's very hard to change people's BMIs--if a new farmers market opens up, I would think it would take a year or maybe even more for its impact to show up. Also, nationally in the US, there was still a secular trend of rising BMI across the US, so you might want to include a time trend variable as well. Anyway, you get the point: think about other predictors you can add to the model. The one you have discussed in this thread so far is, I think, greatly oversimplified and has little chance of being able to penetrate the fog of causal influences on BMI. I'm generally telling people to simplify their models--but you need to do the opposite.
1 like
Comment

Tenzing Doma

Join Date: Mar 2021
Posts: 5

17 Mar 2021, 19:22

Thank you for replying. I am still trying to understand my data. I really appreciate all the help.

I don't know if this is correct but I thought if I could label counties with no farmers' market for the four years as the control group and the rest as treatment. Then try to execute the DID model however, this seems to not be working. Here are some of the issues I am facing right now:

the variable numfarmersmkt in your data changes from year to year when the actual number of farmers markets in the area changes from year to year,

==> can you elaborate more on this?

however, the variable fmkttreated is always 1 for every county in every year. If your real data is like that, you cannot possibly test your hypothesis (by DID or any other approach).

Code:

. tab   fmktreated

the area of |
 the study; |
atleast one |
   farmers' |
  market in |
 the county |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    153,295       11.55       11.55
          1 |  1,174,347       88.45      100.00
------------+-----------------------------------
      Total |  1,327,642      100.00

the variable "fmktreatd" has the value 0, however, it's quite minuscule compared to the total number of obvs. hence it looks like it's always 1. I was so a bit hesitant on the sample data when the control group is very small compared to the treatment group.

county/year combinations with fmkttreated == 0 as well

I guess I am lacking this combo; for I have a variable, countyid, and year because I tried to perform the following codes and it did not work.

Code:

xtset countyid year

repeated time values within panel

r(451);



. isid countyid year

variables countyid year do not uniquely identify the observations

r(459);


. bysort countyid year: assert _N ==1

7953 contradictions in 7962 by-groups

assertion is false

r(9);

So then i created a general:

Code:

gen id =_n

but i got this

Code:

. xtset id year
       panel variable:  id (weakly balanced)
        time variable:  year, 2009 to 2012
                delta:  1 unit

and when I tried the -xtreg- command:

Code:


. xtreg bmi i.fmktreated,fe

insufficient observations

r(2001);

Since -xtreg- didn't work; I tried the basic -reg-

Code:


. reg bmi i.fmktreated




      Source |       SS       df       MS              Number of obs = 1327642

-------------+------------------------------           F(  1,1327640) =  295.01

       Model |  8830.46255     1  8830.46255           Prob > F      =  0.0000

    Residual |  39739596.31327640   29.932509           R-squared     =  0.0002

-------------+------------------------------           Adj R-squared =  0.0002

       Total |  39748426.71327641  29.9391377           Root MSE      =  5.4711




------------------------------------------------------------------------------

         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

1.fmktreated |  -.2551937   .0148576   -17.18   0.000    -.2843142   -.2260733

       _cons |   27.91119   .0139736  1997.43   0.000      27.8838    27.93857

------------------------------------------------------------------------------

The suggestion on focusing on the number of farmers' market as a treatment variable sounds very interesting. I would like to try that next.

Last edited by Tenzing Doma; 17 Mar 2021, 19:28.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#6

17 Mar 2021, 19:49

OK. I still don't have a grasp of what is going on in your data (although it's much clearer now than it was before). I suspect you don't either.

I see your data contains over 1.3 million observations. I don't know how many counties there are in America, but I am sure it is not even of that order of magnitude. So you have many, many observations for most (every?) county even in the course of just a single year. Why is that? What do those different observations represent? How did they get into your data? At the outset I thought you had a relatively small data with a bunch of counties, and one observation in each year showing the number of farmer markets and the average BMI. But clearly your data is far more extensive than that. So what are all these observations about? This has to be known in order to decide how to organize the analysis. It is quite clear that the -xtset countyid year- approach is impossible (a Stata told you when you tried it).

Your attempt to salvage it by creating this new id variable failed because you now turned the "panel" grouping into individual observations. Not only does that defeat the purpose of using panel data estimators, it also, as you observed, makes them fail because they cannot work when each "panel" is just a single observation. That's what that "inufficient observations" message means.

Thank you for showing that there are some counties in the control group. Without that you would be completely out of luck for your analysis. That the number of control counties is small is less than optimal, but not a serious problem in a data set this size.

However, it appears that in your data every county is either always a treatment county (has at least one farmer market) or is always a control county. That being the case, a difference-in-differences analysis is impossible as that analysis requires a full fthree-fold subsetting of the observations into four categories: eventually treated but currently pre-treatment, currently treated, never treated. But you have only the last two. So you will have to rely on a simple treatment vs control contrast. Now it is well known that this type of comparison, when not based on randomization, is more vulnerable to confounding than the difference-in-differences approach. So it is really necessary for you to include covariates, that is, other variables that are associated with BMI and which might be differently distributed in the treatment and control groups. So give this a lot of thought, and since you do have such a large data set, cast the net very wide for these.

But the first priority is getting a clear understanding of what these observations represent. We won't be able to make progress until we are clear on the answers to the questions raised in the second paragraph of this response.
Comment
Tenzing Doma

Join Date: Mar 2021

Posts: 5
#7

21 Mar 2021, 18:39

Sorry for the late response; I really appreciate your help;

So you have many, many observations for most (every?) county even in the course of just a single year.

Code:

summ countyid if year ==2009 Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- countyid | 315955 1191.684 705.8377 1 2518

So the data I have is sorted by countyid, therefore the observations I have in based by [ counted = (countyfip statefip) ]; therefore the countyid represents per individuals who were surveyed. The BRFSS doesn’t offer a unique identifier, therefore when I merged the farmers’ market data and the BRFSS data; I had to create the countyid variable so the two data sets could merge. The observations from the Behavioral Risk Factor Surveillance System (BRFSS) data allow me to look into the effects of the presence of a farmers’ market in the counties. If we factor in the county-level observation, wouldn’t the significance of an effect more significant than observe just state-wide significance?

What do those different observations represent? How did they get into your data?

The other observations are from the BRFSS with health-related factors like diabetes, asthma, pregnancy, etc.

Code:

numadult byte %10.0g NUMBER OF ADULTS IN HOUSEHOLD genhlth byte %10.0g GENERAL HEALTH physhlth byte %10.0g NUMBER OF DAYS PHYSICAL HEALTH NOT GOOD menthlth byte %10.0g NUMBER OF DAYS MENTAL HEALTH NOT GOOD poorhlth byte %10.0g POOR PHYSICAL OR MENTAL HEALTH persdoc2 byte %10.0g MULTIPLE HEALTH CARE PROFESSIONALS checkup1 byte %10.0g LENGTH OF TIME SINCE LAST ROUTINE CHECKU exerany2 byte %10.0g EXERCISE IN PAST 30 DAYS cvdinfr4 byte %10.0g EVER DIAGNOSED WITH HEART ATTACK cvdcrhd4 byte %10.0g EVER DIAGNOSED WITH ANGINA OR CORONARY H cvdstrk3 byte %10.0g EVER DIAGNOSED WITH A STROKE asthma2 byte %10.0g EVER TOLD HAD ASTHMA diabetes byte %10.0g (EVER TOLD) YOU HAVE DIABETES age byte %10.0g REPORTED AGE IN YEARS

==> looking at the ideas of counties and the total amount of farmers market; since the average numbers of farmers' market is 5; can I create an indicator variable ==1 for counties with more than 5* “county with markets-treat” if the county has implemented farmers’ market effect in the year of the interview and ==0 otherwise.
And the other “ county has farmers market” takes the value of ==1 if the county farmers’ market” and ==0 otherwise.

Code:

summ numfarmersmkt Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- numfarmers~t | 1361004 5.327064 9.465987 0 136

Last edited by Tenzing Doma; 21 Mar 2021, 18:42.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#8

21 Mar 2021, 19:42

OK, now it all makes sense. What you started with is a series of cross-sectional surveys (Behavioral Risk Factor Surveillance System) and you merged that with a source of county-level data on the number of farmers markets in the county. So far so good.

Code:

xtset countyid

is clearly the way to inform Stata of the repeated observations within counties in this data. There may be some individuals who also appear in more than one wave of BRFSS, but you can't identify who they are and, the way BRFSS does its sampling, there will be very few such cases, so we will ignore that additional level of nesting.

Your research question, as I understand it, is to assess the association between number of farmers markets in the county and body mass index. I think that dichotomizing the counties into a "treated" and "untreated" group is a mistake. See https://www.fharrell.com/post/errmed/#catg for a general discussion of the problems with this approach. Briefly, you chose to dichotomize at > 5 vs <= 5 because 5 is approximately the average. But there is absolutely no reason to think that something magical happens when the 6th farmers market in a county opens (nor that anything magical happens when the 6th farmers market goes out of business.) It is far more likely that the relationship between farmers markets and BMI, if there is one, changes gradually. So what I recommend you do is scrap that approach. Find a good representation of the association by doing some graphical exploration of mean county BMI against number of farmers markets:

Code:

collapse (mean) bmi (first) numfarmersmarkets, by(countyid) graph twoway scatter bmi numfarmersmarkets

What does that graph look like? Maybe do a lowess plot to see if the relationship looks linear, or curvilinear in a way that might be straightforward to model (logarithmic, or exponential, or some simple power law). Then use that as a starting point for your modeling. So your general starting approach, in the uncollapsed data, will be -xtreg bmi some variable(s) representing number of farmers markets, re-. You will want to also include as many potentially confounding variables as you can in this model, and you definitely need to include study year as a variable as there was a secular trend in BMI over the time period you are studying. Another thing to consider is that the association of BMI with farmers markets as they change over time within a county may differ from the association with the number of farmers markets between counties. The -xthybrid- command (I believe that's from SSC) will enable you to separate out those effects.
Comment
Tenzing Doma

Join Date: Mar 2021

Posts: 5
#9

21 Mar 2021, 20:08

Code:

xtset countyid

This is what i got:

Code:

xtset countyid panel variable: countyid (unbalanced)

And the graph:

Code:

collapse (mean) bmi (first) numfarmersmkt, by(countyid) graph twoway scatter bmi numfarmersmkt

What does that graph look like? Maybe do a lowess plot to see if the relationship looks linear, or curvilinear in a way that might be straightforward to model (logarithmic, or exponential, or some simple power law)

lowess smoother.gph
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#10

22 Mar 2021, 11:18

So, the most obvious thing in that graph is that there are two highly influential outlier points containing a huge number of farmers markets. Are those perhaps data errors? Or is something else strange going on in those counties? Were it not for those two data points, I would say that lowess is showing you a more or less linear decline in BMI with increasing numbers of farmers markets. That would incline me to model this with -xtreg bmi numfarmersmarkets and lots of potential confounders, re-.

But I worry about those two points. Is that one county in two separate years? Two counties each in one year? Is it credible that the number of farmersmarkets there abruptly jumped that high? If so, how did that happen?

That said, it is also plausible that there would be diminishing returns to the increase in farmers markets at some point. So if there is no good reason to exclude those two data points, I would be comfortable with perhaps using a linear spline with a joinpoint (knot) at 50 farmers markets to specify the model instead of just the simple numfarmersmarkets term. See -help mkspline-.
1 like
Comment

Announcement

Difference-in-Difference with a series of cross section data*

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment