
  • DiD with multiple time periods, one intervention, 1 treatment and 1 control

    Hi,

    Like many others I'm completely new to Stata. I have a model which I think is quite simple, yet I'm struggling to implement it in Stata.
    Essentially, I have a large dataset of yearly observations of players in a particular sport (by yearly I mean that the data on each player are his summary for that year, not game-by-game records). My data cover roughly 15 years.
    The story relates to how financial incentives alter player behaviour. The control group was not given the incentives, whilst the treatment group was given the incentives, so every player belongs to one of two groups. The basic structure is a DiD model with multiple pre- and post-treatment periods (5 and 8 respectively), and I'm not sure how to go about this.

    A problem I have is that since the time span of my data is quite long, there is a lot of player attrition (it is very rare for a player to play 15 years continuously), and there are missing values. That is, I might have player A for 2003 and 2007, but nothing thereafter or in between. Or I might only have player B for the year 2006 and nothing else. So there is quite a bit of variation in the player data I have. This poses a problem for me in thinking about how to organize my data. Which players should I use (e.g. only players for whom I have at least 2 observations pre- and post-treatment)? What does Stata do with all the missing values? Should I just somehow 'pool' all my data, i.e. treat each year as a new sample of players and do the analysis like that? This is the code I ran:

    reg y time time1 time2 time3 time5 time6 time7 time8 time9 time10 time11 time12 time13 treated did

    Where I have the time dummies to control for time trends, 'treated' is an indicator for the treated group of players, and did is my interaction term. I suspect this is wrong, as the outcome variable does not have a t subscript. Is that correct? The second regression I ran was:

    xtreg y time time1 time2 time3 time5 time6 time7 time8 time9 time10 time11 time12 time13 treated did, fe

    Which did not work (the did estimate came up as omitted). I can sort of understand the model on paper (I've looked at Pischke's notes on how to model a DiD with multiple time periods); I find it hard translating it into Stata. Also, I'm confused about whether I should use player fixed effects or group fixed effects in the model, or both?

    The data is structured like:

    Player variable 1 variable 2 variable 3 year
    A
    A
    B

    Your help is much appreciated!



  • #2
    I have more questions than answers for you.

    You state that the basic structure is a DID model, but I am suspicious that you do not in fact have DID data. It is almost inconceivable (and would almost require a real conspiracy in the missing data) that you would end up with the did variable being omitted unless you either didn't correctly create the did variable or your data is not really DID data.

    So, DID data consists of observations on two groups of subjects. One group receives the treatment (incentives, in your case) and the other does not. Moreover, in the classic DID design, all of the subjects in the treated group begin treatment at the same point in time. And data are available on both the treated and untreated group both before and after that time. Does this describe your data? If not, you have something that is not quite (or perhaps not at all) a DID design and some modification to your analysis will be required.

    You don't say anything about what your outcome variable y is. Because you used it with -regress- and -xtreg-, I imagine it is a (quasi-)continuous variable. But more specifically what the variable y actually represents in the world will be an important factor in deciding how to deal with your missing data problem. In particular, you will need to give some thought to how and why the missing values came to be missing, and whether that missingness is independent of the (unknown) true values for those observations (or can be made independent by conditioning on variables that you have measurements of.)

    The simplest and best-case scenario is if you do have DID data as described above, and the missing values are independent of the true values (or can be made so by conditioning on available data).

    Your description of your data set mentions year and an identification variable for player. Then you say you have variable 1, variable 2, and variable 3. Well, of course, you don't really have those, because those aren't legal variable names in Stata. More to the point, it isn't at all clear what those variables are, or what role, if any, they play in this analysis. You also apparently do not have variable y, your outcome, nor anything that indicates which players are in the treatment group and which are untreated. So I suspect your description of the data set is far from what you really have.

    Let me assume that you do, in fact, have variables for treated vs untreated (I'll call that one group, coded 1/0), and that the variable y really is in your data set. If you have calculated a variable, did, for the interaction term, get rid of it: it is better to let Stata do this for you automatically with factor variable notation. (Read -help fvvarlist-.) Finally, you know in which year all the players simultaneously began getting the treatment (incentives). For the sake of illustration I'll assume that year was 2007. In this setting, you could do this model as follows:

    Code:
    xtset player year
    xtdes
    gen byte pre_post = (year >= 2007) // REPLACE 2007 BY THE ACTUAL CHANGE YEAR
    xtreg y i.group##i.pre_post i.year, fe // INCLUDE OTHER COVARIATES IF YOU LIKE
    margins group#pre_post
    margins pre_post, dydx(group)
    Now, given missing data, you will probably find some observations omitted. Perhaps so many that your analysis is based on too few cases to be meaningful. Or maybe there won't be too many gaps in the data. Inspecting all of the output of this will show us how extensive your missing data problem is. Then you can begin to ponder and discuss ways of dealing with it.

    Be prepared in advance: because every player is either always in the treated or always in the untreated group, the group variable will be constant within player, and so Stata will tell you that it is omitted due to collinearity (with the fixed effect). That is not a problem, and you should not give it even a moment's worry. The pre_post variable most definitely should not be omitted by Stata; if, somehow, it is, then you have either miscoded it or you have a very pathological pattern of missing data that spuriously creates this problem. If the latter, then your data simply cannot support a DID analysis unless we find a way to fill some of those data gaps. Similarly, the group#pre_post interaction term will not be omitted if your data are capable of fitting a DID model.
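
    If something does come up omitted unexpectedly, a quick check along these lines (just a sketch, using the variable names above) will show where the problem lies:

    Code:
    tab group pre_post   // are all four group-by-period cells populated?
    by player (pre_post), sort: gen byte pp_varies = (pre_post[1] != pre_post[_N])
    tab pp_varies        // 1 = observation on a player seen both before and after the change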

    So perhaps this is enough to get you rolling. Or perhaps your situation does not fit the description, or you have difficulties running the suggested code or interpreting the results. If so, do post back with clear descriptions and more questions.



    • #3
      Thanks for that Clyde!

      Okay first of all I will describe my project and data a bit better.

      Dataset: Basically for every season I have a list of players and their relevant statistics for that season. I will use a couple of these statistics as my outcome variable which is continuous, e.g. free throw percentage over the entire year for a player, or maybe another outcome variable. I created the dataset by merging all the yearly statistics pages into one big sheet. Now, this is sort of a panel in the sense that I have players who I observe over the years, but there will also be random players who maybe only played for a year.

      The Story: Two similar teams exist; one was given financial incentives (you guessed right, in 2007), whilst the other was not. I coded this in Stata with tabulate team, gen(t); t1 was designated as my treated team, and t2 as the control team.

      Approaches: I can look at the stats on either side of the intervention (so the 2006 season and then the 2007 season), which would lend itself to a basic DiD setup (i.e. create a two-period dataset from my original dataset). I feel this might be a waste of my much larger dataset. Alternatively, I can try to create a panel with differing 'windows': code in Stata to only use players for whom I have data for 2 periods before and at least 3 periods after, and then repeat this exercise with differing before and after periods to create differing panels. (This in itself might be a useful exercise, as I can then compare how more experienced players, i.e. those with more pre-treatment years, reacted vs those with less experience.) Or I can just continue with my current approach and try to create this massive panel, which will inevitably have missing values for players, because players retire, get injured, or get dropped from the team. This is a worry because, as I said, I have 14 years of data, so it is very rare to have a player playing over all those years.

      What you wrote: Thank you for helping me understand my model better. My dataset is as you described, so I think DiD is the right way to go. If I understand your code correctly we have:

      xtset player year: This sets my data up as a panel for Stata to use. I first had to create numerical player identifiers, since player was a string variable. I did this via the encode command, so each player now has a unique numerical ID.
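
      For anyone following along, that step looks something like this (a sketch; playernum is my name for the encoded variable):

      Code:
      encode player, gen(playernum)   // map string names to numeric IDs with value labels
      xtset playernum year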

      xtdes: Describes my data. This looks worrying for me. 75% of the players are observed for one year or less, and 95% are observed for 5 years or less.

      gen byte pre_post = (year >= 2007): This generates an indicator for the years from the treatment year onwards (although I'm unsure what the byte is? I assume it is just an indicator for time).

      xtreg y i.group##i.pre_post i.year, fe: This is a far more elegant way of writing my model (thanks for that!). The ## indicates interaction terms plus the level terms, whilst i.year puts in time dummies. The fe is at the individual level?

      Okay so I ran the following regression:

      xtreg fr i.t1##i.pre_post i.year, fe

      Where fr is my outcome variable, and t1 is my treatment group indicator.

      Results:

      note: 1.t1 omitted because of collinearity
      note: 1.t1#1.pre_post omitted because of collinearity
      note: 2015.year omitted because of collinearity


      So I think I have a problem here. Either my data isn't great, my code is wrong, or (probably) both.

      Thanks again






      • #4
        Although I suspect something is wrong. In Excel I highlighted all my duplicate players (which highlights all the players who played more than one year) and filtered them out, so that the only remaining players were the ones who played for just a single year. Slightly less than half the observations are players playing a single season, yet Stata is saying 75% of players are observed for a year or less.
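
        For what it's worth, the player-level distribution can be tabulated directly in Stata rather than in Excel. A sketch, assuming one observation per player-year and the encoded playernum ID from earlier:

        Code:
        isid playernum year                  // error out if player-year is not unique
        by playernum, sort: gen nyears = _N  // seasons observed per player
        by playernum: gen byte firstobs = (_n == 1)
        tab nyears if firstobs               // one row per player, not per observation

        Note also that xtdes reports percentages of players, while the Excel filter counts observations, so the two figures are not measuring quite the same thing.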



        • #5
          Though it's really just an aside, let me respond first to the question about the use of "byte" in that command. When you use the -generate- command, by default Stata creates the variable as a float, which takes up 4 bytes of memory (for each observation). Stata allows you to override that default and use different storage types: double (8 bytes), long (4 bytes like float, but holding integers exactly, so more significant digits because there is no exponent), int (2 bytes) and byte (1 byte). Since a 0/1 variable only needs a single byte (actually it only needs one-eighth of a byte, a single bit) to carry all its information, I have a habit of specifying byte storage. That's because I've been programming computers since the early 1960's. In the old days computers had very small memories; memory was very expensive. So you had to be stingy with it when you programmed or you would quickly run out of it. Nowadays that's rarely necessary--most computers have plenty of memory and, if needed, you can always get more for very little cost. But old habits die hard.

          TL;DR--the "byte" specification is unnecessary and you can omit it unless your data set is so large and your computer's memory so small that you are pushing its limits.
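
          For anyone curious, the difference is easy to see in a trivial illustration (variable names invented):

          Code:
          gen flag_float = (year >= 2007)        // default storage type: float, 4 bytes
          gen byte flag_byte = (year >= 2007)    // explicit byte storage, 1 byte
          describe flag_float flag_byte          // shows the storage type of each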

          It sounds like the missing observations are due to the players actually not playing in those seasons. So, from a perspective of how the missingness biases the data, these may not really be missing data, they are "not in universe" observations that don't belong in the analysis anyway. On the other hand, it may well be that the not playing status is endogenous: if they performed poorly last season they get kicked off the team next season--that would be a serious problem, and one that I don't really know how to handle. It sounds vaguely like something of a Heckman selection model, but I really don't know if that makes sense or not--it's way out of my field.

          But apart from the issues of bias, it looks like your pattern of missing data is actually undermining your DID model. Your code looks correct to me, so we ordinarily wouldn't see "note: 1.t1#1.pre_post omitted because of collinearity". (The other two omissions are expected and normal in this context.)

          So, what could 1.t1#1.pre_post be collinear with? There are only three possibilities: the constant term, pre_post itself, or the fixed effect. The first possibility, the constant term, would mean that pre_post is always 0 or always 1--which would mean that all of your data (after excluding any players who only played one season) precedes 2007, or alternatively, that all of your data is on or after 2007. That seems unlikely and contradicts what you describe, so that's probably not it. If 1.t1#1.pre_post is collinear with pre_post itself, that would mean that in one of your treatment groups all of the data precedes 2007 and in the other all of it is at or after 2007 (again, after excluding all players who only played a single season). If 1.t1#1.pre_post is collinear with the fixed effect, it means that each player (again, excluding those who played only a single season) either played all of their seasons before 2007, or played all of their seasons in or after 2007.

          You need to check your data for these possibilities. If either one is correct, then your pattern of missing data has indeed conspired to disrupt your DID design. It is absolutely critical that your data set include some players who have observations both before and on or after 2007 in order to do a DID. In fact, these are the only players who are directly informative!
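
          A few lines like the following (a sketch, reusing variable names from earlier posts) would check the fixed-effect possibility directly:

          Code:
          by playernum, sort: egen byte has_pre = max(year < 2007)    // any pre-2007 season?
          by playernum: egen byte has_post = max(year >= 2007)        // any 2007-or-later season?
          by playernum: gen byte firstobs = (_n == 1)
          count if firstobs & has_pre & has_post   // the directly informative players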

          Now, you have already indicated that the number of single-season players in your Stata data appears to be bigger than that in your original spreadsheet. So the first thing I would do is carefully investigate this. Work back through all the steps of importation and data cleaning to see if you have somehow lost or mangled the data. Perhaps a fixed-up data set will make the other problems go away. If not, we can try to think about alternative analyses.



          • #6
            Hi Clyde,

            Thank you once again. I went back to my dataset and I think I solved the issue (the issue was that I wasn't correctly identifying my players, which was really screwing things up). In addition, going back to the data made me spot some other mistakes, so thanks for that! Also thanks for explaining the "byte" storage type. Although it might not be completely relevant for my analysis, it's just nice understanding everything that Stata is doing, so thanks again for that explanation!

            My second question is how do I tell Stata to only include players who have played continuously for a specified number of years? For example, how do I tell Stata to drop all the players for whom I don't have data before 2007 (the pre-treatment period) and all those for whom I don't have data after 2007? Further, how do I get Stata to perform different combinations of this procedure? That is, suppose I only want to include players who have played at least two years before 2007 and at least one or two (etc.) years post-2007 (e.g. include only players for whom I have stats in 2005, 2006, 2007, 2008, 2009 and 2010).
            Not sure if that makes sense. I understand there will be issues of selection bias, but at the moment I'm more worried about getting the baseline results; I can then worry about potential confounding problems.



            • #7
              I can issue simple "drop if year==2005" commands, but these leave a lot to be desired, as I don't end up with exactly the players I want.



              • #8
                For example, how do I tell Stata to drop all the players for whom I don't have data before 2007 (the pre-treatment period) and all those for whom I don't have data after 2007?
                Code:
                by player, sort: egen earliest = min(year)
                by player, sort: egen latest = max(year)
                drop if earliest >= 2007 | latest < 2007   // keep only players seen both before and in/after 2007
                include only players for whom I have stats in 2005, 2006, 2007, 2008, 2009 and 2010
                Code:
                isid player year
                by player, sort: egen byte focal_years = total(inrange(year, 2005, 2010))
                drop if focal_years < 6   // 6 = count of years from 2005 through 2010
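
                For the more general version of your question (at least some minimum number of pre- and post-treatment seasons, not necessarily consecutive), a sketch along these lines should work, with the thresholds adjusted to taste:

                Code:
                by player, sort: egen pre_n = total(year < 2007)    // pre-treatment seasons per player
                by player: egen post_n = total(year >= 2007)        // post-treatment seasons per player
                keep if pre_n >= 2 & post_n >= 1
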
                Added: You are actually quite fortunate that the errors you made in building your data set tripped up your analysis so early in the game. It would have been far worse had you gotten plausible-looking results and gone on to do more analyses based on those, or present them to an audience, only to find out later that you have built your castle in the air! I'll use this opportunity to rant on one of my pet peeves.

                [BEGIN RANT]
                Data management is under-valued. Most analysts I know find it boring, except when it is frustrating. They want to get straight to the modeling. So they often give it short shrift, ending up with a data set that is just a ticking time bomb.

                But, in my view, data management is really the most interesting and challenging part of any project. Real-world data come from an enormous number of sources, all of which do things in different ways. So you have to learn how the data were collected and coded, and you then have to harmonize the different sources with your own analytic needs. This means you need a deep understanding of both the genesis of the raw data you start with and the goals of your own analysis, so you can work your way from here to there. There are lots of ways to make errors along the way. Handling missing values correctly, in particular, is often quite tricky. Variables obtained from different sources can have name clashes that must be resolved, or different coding. Even variables obtained from the same source can look deceptively similar but actually have subtle differences that end up mattering a great deal.

                On top of that, we know that any real-world data set of appreciable size will contain errors. That's just the way it is. But your job is to minimize the extent to which those errors infect your analyses and distort your conclusions. Thorough data management "traps the alligators upstream." That is, you anticipate the kinds of errors that are common and check for them at the very beginning of building your analytic data set. You look for out-of-range values and variables with implausible numbers of missing values. You look for internal inconsistencies within observations, e.g. pregnant males. You look for inconsistencies across observations: a person whose birthdate differs across observations. You find all of those and resolve them in some way before you begin even the simplest analyses.

                Operationally, in Stata, that means that your do-file(s) for creating analytic data sets are replete with -assert- commands. Better to puzzle out now why your data set reports a pregnant male with two different birthdates, and fix it, than to have somebody else point it out to you while you're presenting results that include that case! Better to deal with and correct body temperatures below 30C in apparently healthy people before you include them in a regression and end up having to retract your published paper with the fabulous, surprising, statistically significant effect of body temperature on which political party the person voted for!
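
                To make that concrete, a data-management do-file for a data set like the one in this thread might include hypothetical checks like these:

                Code:
                assert inrange(year, 2000, 2015)    // no impossible season years (bounds illustrative)
                assert !missing(playernum)          // every observation has an identified player
                isid playernum year                 // exactly one record per player-year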
                [END RANT]
                Last edited by Clyde Schechter; 26 Jan 2017, 12:50. Reason: Correct typo in code.



                • #9

                  Hi Clyde,

                  The following line isn't working:

                  by player, sort: gen byte focal_years = total(inrange(year, 2005, 2010))

                  It keeps coming up with 'unknown function total()'.

                  Also, yes, I agree data management should be taken more seriously. Unfortunately it's just not considered as exciting as the modelling part (a mentality I've obviously just fallen prey to).



                  • #10
                    Although I think the following has worked:

                    egen focal = total(inrange(year, 2005, 2010)), by(playernum)

                    drop if focal < 6



                    • #11
                      Actually no, that has just given me all the players for whom I have at least 6 observations. It has not restricted them to the 2005-2010 period.



                      • #12
                        I have combined that with drop if year<2005 and drop if year>2010, and it has worked. I checked using:
                        by player, sort: egen earliest1 = min(year)
                        by player, sort: egen latest1 = max(year)

                        And it seems every player's earliest year is 2005 and latest is 2010.
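
                        An -assert-, in the spirit of Clyde's rant, would make this check automatic rather than visual:

                        Code:
                        assert earliest1 == 2005 & latest1 == 2010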

                        Sorry for the spam!



                        • #13
                          OK. My original code in #8, as you discovered, is wrong: it should have been -egen-, not -gen-. My error, sorry.

                          But I don't understand why #10 didn't work as planned. It should. I tested it out in some artificial data and verified it:

                          Code:
                          clear*
                          set obs 200
                          
                          gen player = mod(_n, 10) + 1
                          gen year = int(2000 + 12*runiform())
                          summ year
                          duplicates drop
                          
                          by player, sort: egen focal_years = total(inrange(year, 2005, 2010))
                          
                          forvalues y = 2005/2010 {
                              egen byte has`y' = max(year == `y'), by(player)   // 1 if player has a season in year `y'
                          }
                          assert (focal_years == 6) == (has2005 & has2006 & has2007 & has2008 & has2009 & has2010)
                          Anyway, I'm glad you've managed to solve your problem.



                          • #14
                            I strongly encourage everyone reading this thread to go back to Clyde's comments about data management in post #8. Understanding how your data were collected can often shape the decisions you make about which analysis methods to use. Failing to understand the "genesis of your raw data" can produce garbage results, which often don't smell but are rubbish nonetheless.



                            • #15
                              Hi guys,

                              The study: I am running a diff-in-diff between two countries, where my dependent variable is a sports statistic and my 'treatment' is financial incentives: players in one country received the incentives, and players in the other country did not. So: did the treated group 'improve' compared to the non-treated group?

                              Econ Questions:
                              I've been starting to wonder at what level I should cluster my standard errors.

                              Clustering at the country level would mean only 2 clusters, which produces unreliable standard errors. In fact, I find it hard to think of why errors might be serially correlated among sports statistics at the country level. I guess I could see country-specific league shocks, e.g. league structure changes which might have an impact on players within the league, or maybe one country gets flush with money which it reinvests in its players, and this effect persists over time. Could these be 'good' explanations for serially correlated errors?

                              But I am also wondering whether I should cluster at the region level: break the countries up into regions, and then cluster on those. This is not satisfactory either, as I would think any error shocks would take place within the leagues of the different countries (i.e. we should cluster at the country level), not within regions, and I feel uncomfortable having individual and country fixed effects but then clustering at a different level.

                              Another question relates to the interpretation of the time dummies in the diff-in-diff. Would the interpretation be, e.g., "how much more the year 2004 affected the dependent variable in American vs Canadian players (once we difference them, of course)"? Further, I'm not sure whether I should add a time trend and interact it with each of the countries instead of time dummies; is there a specific test for this?

                              Also, as an aside, my regression specification so far includes fixed effects (which I assume are at the player level) and an i.group##i.treatment term, which would include country-level fixed effects. As Clyde mentioned earlier, this variable should be omitted in my analysis (and it is), but I was wondering whether it is necessary at all (no player moves countries in my sample). Apologies for the amount of questions; I've been trying to think my way through these and thought it would be helpful to discuss them with a third party.

                              Stata Questions:

                              1. I'm thinking of looking at whether the rate of players dropping in and out of teams changed over the period: for example, whether there is more turnover in 2005 than in 2004. Essentially I want to see the proportion of players in each team who are the same from year to year (i.e. those for whom we have repeated observations). Is there a way to implement this in Stata? Essentially I just want to see whether turnover is increasing. I'm not sure if what I'm saying is very clear, so sorry for that.
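
                              One possible sketch of this (my own guess at what's wanted, assuming one observation per player-year and a team/country identifier like the team variable from earlier): flag players who also appeared in the previous season, then average that flag by team and year:

                              Code:
                              preserve
                              by playernum (year), sort: gen byte returning = (year == year[_n-1] + 1)
                              collapse (mean) retention = returning, by(team year)   // share of players retained from the previous season
                              list team year retention, sepby(team)
                              restore

                              A falling retention share would correspond to rising turnover.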

                              2. Also, I've tried conducting DW tests to check whether I have autocorrelation in my errors (hence my worry about clustering), and I could not get the test to work (I keep getting "estat dwatson not valid" as an error).

                              3. In relation to clustering by region: how would I go about creating regions in Stata? That is, if a player plays in New York, I would want to code that as 'East', and so forth. Then I would have to destring the variable if I wanted to use it in a regression.
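
                              For question 3, a sketch (the city variable and the city names are invented for illustration). Note that -encode-, not -destring-, is the tool here: -destring- only converts numbers stored as strings, whereas -encode- turns genuine string categories into a labeled numeric variable:

                              Code:
                              gen region = ""
                              replace region = "East" if inlist(city, "New York", "Boston", "Philadelphia")
                              replace region = "West" if inlist(city, "Los Angeles", "San Francisco")
                              encode region, gen(region_num)   // labeled numeric version for use in regressions

                              Something like vce(cluster region_num) could then be appended to the -xtreg- call.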

                              Thank you





