Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Difference-in-Difference with a series of cross section data*

    Hello, everyone!

    I am fairly bad at Stata and it's been some time since I attended any econometrics class. I am trying to work out how to complete a DID analysis using the BRFSS data with geolocation data on farmers' markets around the USA by county-level. The BRFSS only provides county info till 2012 so I am using data from the years 2009-2012. Therefore I would like to test the effect of the farmers market on BMI by county-level.


    I have created a dummy variable
    Code:
    egen fmktarea = max(anyfarmersmkt),by(countyid)
    a dummy variable for the area of the study; at least one farmers market in the county
    Code:
    gen fmktreated  = 0
    replace fmktreated =1 if (numfarmersmkt>=1)

    . dataex year bmi fmktarea fmktreated

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year float(bmi fmktarea fmktreated)
    2009 24.58 1 1
    2009 28.09 1 1
    2009 27.32 1 1
    2009 29.11 1 1
    2009 23.78 1 1
    2009 23.54 1 1
    2009  34.4 1 1
    2009 22.36 1 1
    2009 19.88 1 1
    2009 30.45 1 1
    2009 29.35 1 1
    2009 27.51 1 1
    2009  25.6 1 1
    2009 20.59 1 1
    2009 33.73 1 1
    2009 23.96 1 1
    2009  25.9 1 1
    2009 44.72 1 1
    2009 20.16 1 1
    2009 20.41 1 1
    2009 33.07 1 1
    2009 22.91 1 1
    2009 26.56 1 1
    2009 42.77 1 1
    2009 27.86 1 1
    2009  29.1 1 1
    2009 23.35 1 1
    2009 26.64 1 1
    2009 31.54 1 1
    2009 35.85 1 1
    2009 31.63 1 1
    2009 23.85 1 1
    2009 32.35 1 1
    2009  25.9 1 1
    2009 37.84 1 1
    2009 33.36 1 1
    2009 25.07 1 1
    2009 26.31 1 1
    2009 25.14 1 1
    2009 45.01 1 1
    2009 27.31 1 1
    2009  25.9 1 1
    2009 23.08 1 1
    2009 24.38 1 1
    2009  26.3 1 1
    2009 33.37 1 1
    2009 35.02 1 1
    2009 21.68 1 1
    2010 25.85 1 1
    2010 28.27 1 1
    2010 29.33 1 1
    2010 25.29 1 1
    2010 19.18 1 1
    2010 31.39 1 1
    2010 36.92 1 1
    2010 25.82 1 1
    2010 37.33 1 1
    2010 26.22 1 1
    2010 25.16 1 1
    2010 39.66 1 1
    2010  37.2 1 1
    2010 21.59 1 1
    2010  29.6 1 1
    2010 26.68 1 1
    2010 23.62 1 1
    2010 24.38 1 1
    2010 32.68 1 1
    2010 25.88 1 1
    2010 24.01 1 1
    2010 25.07 1 1
    2010 30.61 1 1
    2010 25.88 1 1
    2010 26.56 1 1
    2010 23.35 1 1
    2010 30.11 1 1
    2010 31.06 1 1
    2010 23.08 1 1
    2010  27.5 1 1
    2010 24.94 1 1
    2010  36.7 1 1
    2010  34.4 1 1
    2010 28.65 1 1
    2010 23.44 1 1
    2010 23.46 1 1
    2010 25.14 1 1
    2010 20.27 1 1
    2010 33.35 1 1
    2010 41.63 1 1
    2010 22.35 1 1
    2010 28.38 1 1
    2010 34.24 1 1
    2010 21.57 1 1
    2010 24.94 1 1
    2010 22.64 1 1
    2011 27.12 1 1
    2011 27.34 1 1
    2011 30.52 1 1
    2011 31.47 1 1
    2011 33.45 1 1
    2011 24.33 1 1
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 100 out of 1327642 observations
    Use the count() option to list more

    .I am a bit lost on how to proceed with this analysis. Any help would be greatly appreciated.
    Last edited by Tenzing Doma; 14 Mar 2021, 14:17. Reason: clyde schechter

  • #2
    A DID analysis requires not only treated and untreated groups, but pre-treatment and post-treatment time periods. You mention nothing about the latter, and it isn't clear to me whether you have correctly created a variable that deals with the former. While you have given the code that created the variables fmktarea and fmktreated, since you haven't explained what the other variables in those commands mean, it's impossible to say if these variables are even useful. Also, you will need a variable which identifies the counties--which I'm guessing you have but which you didn't show.

    Anyway, if you have as a starting point a variable that identifies the county, a variable that identifies the year, and a variable, coded 0/1, that indicates whether that county was treated in that year, then we can proceed from there. Please post back with a new -dataex- output that includes these three variables (plus your BMI outcome).

    Comment


    • #3
      Clyde Schechter Thank you for replying to my post! I am a fan!


      The data doesn't have definite pre-and post-treatment time periods; some counties don't have a farmers market in 2009 (pre*) but have a market in 2011 (post-treatment?*). I am a little confused about how to identify that.
      Code:
      . tab year  fmktreated
      
                 |    the area of the
                 |  study; atleast one
                 |  farmers' market in
       INTERVIEW |      the county
            YEAR |         0          1 |     Total
      -----------+----------------------+----------
            2009 |    42,102    267,782 |   309,884
            2010 |    37,055    283,101 |   320,156
            2011 |    38,461    317,950 |   356,411
            2012 |    35,677    305,514 |   341,191
      -----------+----------------------+----------
           Total |   153,295  1,174,347 | 1,327,642

      As for the identifier for treated and untreated groups:
      i created a dummy variable "fmktreated" : "==0" if there are no farmers market and "==1" if there are more than or equal to one farmers market in the county

      Code:
      gen fmktreated  = 0
      replace fmktreated =1 if (numfarmersmkt>=1)
      Also, the variable identifier for the counties is "countyid" = " group(statefips countyfips)"

      Code:
      . dataex bmi countyid year fmktreated
      
      ----------------------- copy starting from the next line -----------------------
      
      
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(bmi countyid) int year float fmktreated
      24.58 1 2009 1
      28.09 1 2009 1
      27.32 1 2009 1
      29.11 1 2009 1
      23.78 1 2009 1
      23.54 1 2009 1
       34.4 1 2009 1
      22.36 1 2009 1
      19.88 1 2009 1
      30.45 1 2009 1
      29.35 1 2009 1
      27.51 1 2009 1
       25.6 1 2009 1
      20.59 1 2009 1
      33.73 1 2009 1
      23.96 1 2009 1
       25.9 1 2009 1
      44.72 1 2009 1
      20.16 1 2009 1
      20.41 1 2009 1
      33.07 1 2009 1
      22.91 1 2009 1
      26.56 1 2009 1
      42.77 1 2009 1
      27.86 1 2009 1
       29.1 1 2009 1
      23.35 1 2009 1
      26.64 1 2009 1
      31.54 1 2009 1
      35.85 1 2009 1
      31.63 1 2009 1
      23.85 1 2009 1
      32.35 1 2009 1
       25.9 1 2009 1
      37.84 1 2009 1
      33.36 1 2009 1
      25.07 1 2009 1
      26.31 1 2009 1
      25.14 1 2009 1
      45.01 1 2009 1
      27.31 1 2009 1
       25.9 1 2009 1
      23.08 1 2009 1
      24.38 1 2009 1
       26.3 1 2009 1
      33.37 1 2009 1
      35.02 1 2009 1
      21.68 1 2009 1
      25.85 1 2010 1
      28.27 1 2010 1
      29.33 1 2010 1
      25.29 1 2010 1
      19.18 1 2010 1
      31.39 1 2010 1
      36.92 1 2010 1
      25.82 1 2010 1
      37.33 1 2010 1
      26.22 1 2010 1
      25.16 1 2010 1
      39.66 1 2010 1
       37.2 1 2010 1
      21.59 1 2010 1
       29.6 1 2010 1
      26.68 1 2010 1
      23.62 1 2010 1
      24.38 1 2010 1
      32.68 1 2010 1
      25.88 1 2010 1
      24.01 1 2010 1
      25.07 1 2010 1
      30.61 1 2010 1
      25.88 1 2010 1
      26.56 1 2010 1
      23.35 1 2010 1
      30.11 1 2010 1
      31.06 1 2010 1
      23.08 1 2010 1
       27.5 1 2010 1
      24.94 1 2010 1
       36.7 1 2010 1
       34.4 1 2010 1
      28.65 1 2010 1
      23.44 1 2010 1
      23.46 1 2010 1
      25.14 1 2010 1
      20.27 1 2010 1
      33.35 1 2010 1
      41.63 1 2010 1
      22.35 1 2010 1
      28.38 1 2010 1
      34.24 1 2010 1
      21.57 1 2010 1
      24.94 1 2010 1
      22.64 1 2010 1
      27.12 1 2011 1
      27.34 1 2011 1
      30.52 1 2011 1
      31.47 1 2011 1
      33.45 1 2011 1
      24.33 1 2011 1
      end
      label values countyid countyid
      label def countyid 1 "1 1", modify
      ------------------ copy up to and including the previous line ------------------ Listed 100 out of 1327642 observations Use the count() option to list more


      I know I am missing a lot here. please any help is greatly appreciated. Thank you once again.



      Comment


      • #4
        Well, the variable numfarmersmkt in your data changes from year to year when the actual number of farmers markets in the area changes from year to year, then your task is fairly straightforward. In the example data you show, however, the variable fmkttreated is always 1 for every county in every year. If your real data is like that, you cannot possibly test your hypothesis (by DID or any other approach).

        But if you do have county/year combinations with fmkttreated == 0 as well, then you can do this:

        Code:
        xtset countyid year
        xtreg bmi i.fmkttreated, fe
        That said, using this fmkttreated variable seems like a very blunt instrument to use here. It's an on-off switch. Doesn't it make more sense to think that counties with more farmer's markets might respond more than counties with fewer of them. I'd be inclined to actually use the number of farmers markets itself as the treatment variable. If you do that, you might want to refine it by making it farmers markets per capita, assuming you can get the population variable, to be even more mechanistically plausible. And also, think about whether there should be a time lag in your model. It's very hard to change people's BMIs--if a new farmers market opens up, I would think it would take a year or maybe even more for its impact to show up. Also, nationally in the US, there was still a secular trend of rising BMI across the US, so you might want to include a time trend variable as well. Anyway, you get the point: think about other predictors you can add to the model. The one you have discussed in this thread so far is, I think, greatly oversimplified and has little chance of being able to penetrate the fog of causal influences on BMI. I'm generally telling people to simplify their models--but you need to do the opposite.

        Comment


        • #5
          Thank you for replying. I am still trying to understand my data. I really appreciate all the help.

          I don't know if this is correct but I thought if I could label counties with no farmers' market for the four years as the control group and the rest as treatment. Then try to execute the DID model however, this seems to not be working. Here are some of the issues I am facing right now:

          the variable numfarmersmkt in your data changes from year to year when the actual number of farmers markets in the area changes from year to year,
          ==> can you elaborate more on this?


          however, the variable fmkttreated is always 1 for every county in every year. If your real data is like that, you cannot possibly test your hypothesis (by DID or any other approach).

          Code:
          . tab   fmktreated
          
          the area of |
           the study; |
          atleast one |
             farmers' |
            market in |
           the county |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 |    153,295       11.55       11.55
                    1 |  1,174,347       88.45      100.00
          ------------+-----------------------------------
                Total |  1,327,642      100.00
          the variable "fmktreatd" has the value 0, however, it's quite minuscule compared to the total number of obvs. hence it looks like it's always 1. I was so a bit hesitant on the sample data when the control group is very small compared to the treatment group.

          county/year combinations with fmkttreated == 0 as well


          I guess I am lacking this combo; for I have a variable, countyid, and year because I tried to perform the following codes and it did not work.


          Code:
          xtset countyid year
          
          repeated time values within panel
          
          r(451);
          
          
          
          . isid countyid year
          
          variables countyid year do not uniquely identify the observations
          
          r(459);
          
          
          . bysort countyid year: assert _N ==1
          
          7953 contradictions in 7962 by-groups
          
          assertion is false
          
          r(9);


          So then i created a general:

          Code:
          gen id =_n
          but i got this
          Code:
          . xtset id year
                 panel variable:  id (weakly balanced)
                  time variable:  year, 2009 to 2012
                          delta:  1 unit
          and when I tried the -xtreg- command:

          Code:
          
          . xtreg bmi i.fmktreated,fe
          
          insufficient observations
          
          r(2001);
          Since -xtreg- didn't work; I tried the basic -reg-

          Code:
          
          . reg bmi i.fmktreated
          
          
          
          
                Source |       SS       df       MS              Number of obs = 1327642
          
          -------------+------------------------------           F(  1,1327640) =  295.01
          
                 Model |  8830.46255     1  8830.46255           Prob > F      =  0.0000
          
              Residual |  39739596.31327640   29.932509           R-squared     =  0.0002
          
          -------------+------------------------------           Adj R-squared =  0.0002
          
                 Total |  39748426.71327641  29.9391377           Root MSE      =  5.4711
          
          
          
          
          ------------------------------------------------------------------------------
          
                   bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          
          -------------+----------------------------------------------------------------
          
          1.fmktreated |  -.2551937   .0148576   -17.18   0.000    -.2843142   -.2260733
          
                 _cons |   27.91119   .0139736  1997.43   0.000      27.8838    27.93857
          
          ------------------------------------------------------------------------------

          The suggestion on focusing on the number of farmers' market as a treatment variable sounds very interesting. I would like to try that next.
          Last edited by Tenzing Doma; 17 Mar 2021, 19:28.

          Comment


          • #6
            OK. I still don't have a grasp of what is going on in your data (although it's much clearer now than it was before). I suspect you don't either.

            I see your data contains over 1.3 million observations. I don't know how many counties there are in America, but I am sure it is not even of that order of magnitude. So you have many, many observations for most (every?) county even in the course of just a single year. Why is that? What do those different observations represent? How did they get into your data? At the outset I thought you had a relatively small data with a bunch of counties, and one observation in each year showing the number of farmer markets and the average BMI. But clearly your data is far more extensive than that. So what are all these observations about? This has to be known in order to decide how to organize the analysis. It is quite clear that the -xtset countyid year- approach is impossible (a Stata told you when you tried it).

            Your attempt to salvage it by creating this new id variable failed because you now turned the "panel" grouping into individual observations. Not only does that defeat the purpose of using panel data estimators, it also, as you observed, makes them fail because they cannot work when each "panel" is just a single observation. That's what that "inufficient observations" message means.

            Thank you for showing that there are some counties in the control group. Without that you would be completely out of luck for your analysis. That the number of control counties is small is less than optimal, but not a serious problem in a data set this size.

            However, it appears that in your data every county is either always a treatment county (has at least one farmer market) or is always a control county. That being the case, a difference-in-differences analysis is impossible as that analysis requires a full fthree-fold subsetting of the observations into four categories: eventually treated but currently pre-treatment, currently treated, never treated. But you have only the last two. So you will have to rely on a simple treatment vs control contrast. Now it is well known that this type of comparison, when not based on randomization, is more vulnerable to confounding than the difference-in-differences approach. So it is really necessary for you to include covariates, that is, other variables that are associated with BMI and which might be differently distributed in the treatment and control groups. So give this a lot of thought, and since you do have such a large data set, cast the net very wide for these.

            But the first priority is getting a clear understanding of what these observations represent. We won't be able to make progress until we are clear on the answers to the questions raised in the second paragraph of this response.

            Comment


            • #7
              Sorry for the late response; I really appreciate your help;

              So you have many, many observations for most (every?) county even in the course of just a single year.
              Code:
              summ countyid if year ==2009
              
              Variable | Obs Mean Std. Dev. Min Max
              -------------+--------------------------------------------------------
              countyid | 315955 1191.684 705.8377 1 2518
              So the data I have is sorted by countyid, therefore the observations I have in based by [ counted = (countyfip statefip) ]; therefore the countyid represents per individuals who were surveyed. The BRFSS doesn’t offer a unique identifier, therefore when I merged the farmers’ market data and the BRFSS data; I had to create the countyid variable so the two data sets could merge. The observations from the Behavioral Risk Factor Surveillance System (BRFSS) data allow me to look into the effects of the presence of a farmers’ market in the counties. If we factor in the county-level observation, wouldn’t the significance of an effect more significant than observe just state-wide significance?

              What do those different observations represent? How did they get into your data?
              The other observations are from the BRFSS with health-related factors like diabetes, asthma, pregnancy, etc.

              Code:
              numadult        byte    %10.0g                NUMBER OF ADULTS IN HOUSEHOLD
              genhlth         byte    %10.0g                GENERAL HEALTH
              physhlth        byte    %10.0g                NUMBER OF DAYS PHYSICAL HEALTH NOT GOOD
              menthlth        byte    %10.0g                NUMBER OF DAYS MENTAL HEALTH NOT GOOD
              poorhlth        byte    %10.0g                POOR PHYSICAL OR MENTAL HEALTH
              persdoc2        byte    %10.0g                MULTIPLE HEALTH CARE PROFESSIONALS
              checkup1        byte    %10.0g                LENGTH OF TIME SINCE LAST ROUTINE CHECKU
              exerany2        byte    %10.0g                EXERCISE IN PAST 30 DAYS
              cvdinfr4        byte    %10.0g                EVER DIAGNOSED WITH HEART ATTACK
              cvdcrhd4        byte    %10.0g                EVER DIAGNOSED WITH ANGINA OR CORONARY H
              cvdstrk3        byte    %10.0g                EVER DIAGNOSED WITH A STROKE
              asthma2         byte    %10.0g                EVER TOLD HAD ASTHMA
              diabetes        byte    %10.0g                (EVER TOLD) YOU HAVE DIABETES
              age             byte    %10.0g                REPORTED AGE IN YEARS


              ==> looking at the ideas of counties and the total amount of farmers market; since the average numbers of farmers' market is 5; can I create an indicator variable ==1 for counties with more than 5* “county with markets-treat” if the county has implemented farmers’ market effect in the year of the interview and ==0 otherwise.
              And the other “ county has farmers market” takes the value of ==1 if the county farmers’ market” and ==0 otherwise.

              Code:
              summ  numfarmersmkt
              
                  Variable |       Obs        Mean    Std. Dev.       Min        Max
              -------------+--------------------------------------------------------
              numfarmers~t |   1361004    5.327064    9.465987          0        136
              Last edited by Tenzing Doma; 21 Mar 2021, 18:42.

              Comment


              • #8
                OK, now it all makes sense. What you started with is a series of cross-sectional surveys (Behavioral Risk Factor Surveillance System) and you merged that with a source of county-level data on the number of farmers markets in the county. So far so good.

                Code:
                xtset countyid
                is clearly the way to inform Stata of the repeated observations within counties in this data. There may be some individuals who also appear in more than one wave of BRFSS, but you can't identify who they are and, the way BRFSS does its sampling, there will be very few such cases, so we will ignore that additional level of nesting.

                Your research question, as I understand it, is to assess the association between number of farmers markets in the county and body mass index. I think that dichotomizing the counties into a "treated" and "untreated" group is a mistake. See https://www.fharrell.com/post/errmed/#catg for a general discussion of the problems with this approach. Briefly, you chose to dichotomize at > 5 vs <= 5 because 5 is approximately the average. But there is absolutely no reason to think that something magical happens when the 6th farmers market in a county opens (nor that anything magical happens when the 6th farmers market goes out of business.) It is far more likely that the relationship between farmers markets and BMI, if there is one, changes gradually. So what I recommend you do is scrap that approach. Find a good representation of the association by doing some graphical exploration of mean county BMI against number of farmers markets:

                Code:
                collapse (mean) bmi (first) numfarmersmarkets, by(countyid)
                graph twoway scatter bmi numfarmersmarkets
                What does that graph look like? Maybe do a lowess plot to see if the relationship looks linear, or curvilinear in a way that might be straightforward to model (logarithmic, or exponential, or some simple power law). Then use that as a starting point for your modeling. So your general starting approach, in the uncollapsed data, will be -xtreg bmi some variable(s) representing number of farmers markets, re-. You will want to also include as many potentially confounding variables as you can in this model, and you definitely need to include study year as a variable as there was a secular trend in BMI over the time period you are studying. Another thing to consider is that the association of BMI with farmers markets as they change over time within a county may differ from the association with the number of farmers markets between counties. The -xthybrid- command (I believe that's from SSC) will enable you to separate out those effects.

                Comment


                • #9



                  Code:
                  xtset countyid
                  This is what i got:

                  Code:
                  xtset countyid
                         panel variable:  countyid (unbalanced)
                  And the graph:

                  Code:
                  collapse (mean) bmi (first) numfarmersmkt, by(countyid)
                  graph twoway scatter bmi numfarmersmkt
                  What does that graph look like? Maybe do a lowess plot to see if the relationship looks linear, or curvilinear in a way that might be straightforward to model (logarithmic, or exponential, or some simple power law)

                  lowess smoother.gph

                  Comment


                  • #10
                    So, the most obvious thing in that graph is that there are two highly influential outlier points containing a huge number of farmers markets. Are those perhaps data errors? Or is something else strange going on in those counties? Were it not for those two data points, I would say that lowess is showing you a more or less linear decline in BMI with increasing numbers of farmers markets. That would incline me to model this with -xtreg bmi numfarmersmarkets and lots of potential confounders, re-.

                    But I worry about those two points. Is that one county in two separate years? Two counties each in one year? Is it credible that the number of farmersmarkets there abruptly jumped that high? If so, how did that happen?

                    That said, it is also plausible that there would be diminishing returns to the increase in farmers markets at some point. So if there is no good reason to exclude those two data points, I would be comfortable with perhaps using a linear spline with a joinpoint (knot) at 50 farmers markets to specify the model instead of just the simple numfarmersmarkets term. See -help mkspline-.

                    Comment

                    Working...
                    X