  • #16
    Stata question 1: this is too vague to answer. What does "over the years" mean? Do you mean stay on the same team from the beginning to the end of the data on them? Do you mean that they stay on some team for more than one year? On all teams they are ever on for more than one year? Or some other time period other than just more than one year? If you can make this more precise, I'm sure there's a way to code it.

    Stata question 2: Perhaps somebody else who knows this command can help. In any case, though, you should show the commands leading up to it: -estat- is a post-estimation command, and sometimes how it works depends a lot on the original estimation command. The problem may be there.

    Stata question 3. First there is the substantive question: what is a meaningful region in your context? Perhaps that is pre-defined by the leagues of the sport you are studying. Or perhaps there is some other economic basis for deciding which teams belong to which region. Anyway, that's not a Stata issue. But let me assume you've resolved that issue. The easiest way to do this is to create a new file with just two variables: team name and region--one observation for each team. Then you can -merge 1:m- that one with your original file, and that will assign the region to each observation for that team. When creating the region variable, you can create it as a string and then use the -encode- command to make a numeric variable out of it. If you are not familiar with -merge- and -encode-, do read the corresponding manual sections. -encode- is quite simple to learn and use. (The hard part is learning when it is appropriate to use!) -merge- is a bit more complicated, but the manual section is quite clear and has good examples.
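    To make the -merge-/-encode- workflow concrete, here is a minimal sketch; the file and variable names are hypothetical, and the crosswalk would of course list all of your teams:

    Code:
    * build the crosswalk: one observation per team (hypothetical names)
    clear
    input str20 team str10 region
    "Boston"  "East"
    "Seattle" "West"
    end
    encode region, gen(region_code)   // numeric version of the string region
    save region_map, replace

    * back in the main data, attach a region to every observation for each team
    use maindata, clear
    merge m:1 team using region_map, keep(master match) nogenerate
    This runs the merge from the main file with -merge m:1-, which is just the mirror image of running -merge 1:m- from the crosswalk file.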

    Let me make a general comment here regarding your econ questions, though not specifically answering any of them. It seems you have inherently multi-level data. You have repeated observations nested within players who are nested within teams which are nested in regions which are nested in countries! I recognize that mixed-effects multi-level models are viewed somewhat skeptically in economics and there is a strong preference for fixed effects estimators. And I understand that fixed-effects estimators offer unsurpassed advantages in controlling for omitted variable bias regarding any time-invariant attributes. And I understand that they provide consistent estimates, where random-effects may or may not. But omitted variable bias isn't the only problem one faces in data analysis, and often one can come very close to dealing with it in other ways that are not as constricting as fixed-effects models. And consistent estimates from a mis-specified model are not necessarily better than inconsistent estimates from a properly-specified model that also accounts for more sources of variation. In a data set with this many levels, I think it is likely that a mixed-effects multi-level model is the way to go. Then you don't have the dilemma of "which level should I cluster at" (a question that is asked because it is recognized that all of the possible answers are in some way wrong). You "cluster" at all relevant levels. Omitted variable bias can be partly, and often nearly completely, dealt with by including relevant covariates. With only two countries, I would not include a country-level in the model: I would just add an indicator variable for Canada vs US. But I would include all of the other levels of nesting in the model. Think about it.
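    For concreteness, a multi-level specification along those lines might look something like the following sketch; the variable names are placeholders, and the details (covariates, random slopes, etc.) would need thought:

    Code:
    * random intercepts for teams, and for players within teams;
    * Canada vs US enters as a simple indicator (names hypothetical)
    mixed outcome i.canada covariates || team: || player: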

    Comment


    • #17
      Hi Clyde, thanks for that (especially for the mixed-effects modelling suggestion; I will look into this and let you know how it turns out).

      My question with regard to seeing which players drop out stems from me wondering whether, when the financial incentives are turned on, players change their behavior and become more/less likely to drop out of the team due to performance. I'm not trying to specify a model, just some descriptive stats really, as they might be helpful in trying to understand my results. So in my mind I could just make a descriptive line graph which might show a discontinuity in the [old]/[new] proportion (averaged across teams within a country) in the treatment year. Suppose in 2006 this proportion, averaged across the teams in the treatment country, was 50%, but in 2007 the proportion dropped to 25% and then hit 20% in 2008, etc. Then I could just eyeball the graph and see whether there was any discontinuity in either country. So for each year I would average that proportion across all the teams, and see how it changes over time. I hope that makes more sense.

      I'm running my regressions again, except this time with some player level covariates. A couple of questions:

      1. As with many sports statistics, they are related to each other just by the way they are calculated. For example, if my outcome variable is 'shots attempted', then of course 'shots successful' will be related to it by the way both are defined, and as such, should I include it as a covariate? I was wondering how to deal with these (at the moment my method is not to include any of these variables as covariates). Further, some have more subtle relationships: for example, if my outcome variable is 'total goals in a season' and I also have a covariate 'highest number of goals in a game', the second will contribute to the first, but in a much more subtle way than in the first example. Any suggestions here? Should I be completely conservative and not include any of these? I hesitate to do this because you see a number of studies which regress, say, unemployment on the population, or GDP on government expenditures. So it might not actually be a problem.

      2. I'm trying to conduct a preliminary sensitivity analysis by assuming that the control country was actually the treated country, and I'm getting results which make me suspicious of my specification. When I switch the treatment countries, my estimates switch in sign: for example, my DiD estimate goes from -5 to 5 with the same p-value. Now I can understand why this would be the case, but I have read some lecture notes which say you should be suspicious of this.

      Thanks again,

      Comment


      • #18
        To make sure we're on the same page: This is the regression I've run:

        xtreg 'player outcome' i.USA##i.Treatment i.year covariates, fe
        margins USA#Treatment
        margins Treatment, dydx(USA)


        In my results I have:

        1. One year before the intervention, which is not shown in the results (I assume Stata has just omitted this due to the dummy variable trap, although I thought it would show up with `omitted' next to it), and one year post treatment, which shows up but has `omitted' next to it. The country fixed effect is also omitted because it is perfectly collinear with the player fixed effect.

        2. In my margins table, the entries under the `Delta-method Std. Err.' column show up as (not estimable), and I was wondering what that means. Under the dy/dx column I just have a single dot.

        Also, is there a way to include age as a covariate? I ask because I would think age would be perfectly collinear with year, i.e. a player is 28 in 2007 but 29 in 2008 (unfortunately I only have data on age in years, not months and days). Relatedly, how do we interpret the coefficients on the covariates? I can interpret the coefficients on the dummies for treatment and the interaction terms.

        Comment


        • #19
          Regarding #17:

          So maybe you want to do something like this:
          Code:
          by player (year), sort: gen changed_team = (team != team[_n-1]) if _n > 1
          and then you could use that variable, suitably aggregated over teams, treatment groups, etc. to characterize turnover.
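          For instance, one hypothetical way to aggregate it into a yearly turnover rate by team (this assumes team is numeric, e.g. created with -encode-):

          Code:
          preserve
          * turnover rate = share of a team's players who just changed teams
          collapse (mean) turnover = changed_team, by(team year)
          line turnover year if team == 1   // eyeball one team's series
          restore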

          Question 1. This really depends on your research goals. Modeling "shots successful" with and without "shots attempted" are both legitimate approaches. The results have different meanings. Which is appropriate depends on the specific question you are trying to ask. I should also note that in the context of count variables where the outcome variable represents a subset of the covariate, you might consider another approach: a generalized linear model with a binomial link function. Thus something like: -xtgee shots_successful predictor_variables, family(binomial shots_attempted) -.

          Question 2. I don't know why anyone would consider this suspicious. It is exactly what should happen when you exchange the coding of the two treatment groups.

          Regarding #18

          1. This is also normal and, as you have recognized, is due to multicollinearity.

          2. If I understand your design and model correctly, only US teams were subject to the intervention, not Canadian. So the Canadian#Intervention cell is empty in your design. That leads to the inestimability you see.

          You will not be able to include age as a covariate in this model. As you suspect, it is collinear with year within person, so it participates in a multicollinearity relationship with the player-level fixed effects and the year indicators. There is no way around this in a fixed-effects model. In a random-effects model you might get away with it. I say might because you could encounter convergence difficulties. The only way to know for sure is to try it. In principle it is possible. Another approach, if you think age effects are important, is to do -xtreg, be-.
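          In Stata terms, the two fallbacks just mentioned would look roughly like this (outcome and covariate names are placeholders):

          Code:
          * random-effects model: age can now enter the model
          xtreg outcome age i.USA##i.Treatment covariates, re

          * between-effects model: regresses player means on player means
          xtreg outcome age covariates, be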

          The coefficients of covariates are interpreted the same way as the coefficients of the principal variables of interest. The coefficient of covariate X is the expected difference in outcome associated with a unit difference in the value of X within a person.

          Comment


          • #20
            That code worked a charm!
            I'm still having trouble getting my head around which controls to include (I think this should be a fairly easy decision and I feel I'm complicating it). Basically, I find it unnatural to include controls like 'shots attempted' if my outcome variable is goals scored. The point of the entire model is to see whether we see a difference in the outcome variable for players who received incentives vs those that didn't. Intuitively, I feel the only real 'control' should be matches played within a season, since these can vary systematically between countries, and hence it seems to be a natural control. With my data, if my outcome variable is calculated as y=[x/(w-v)], I might have controls like x, w, v, and for some reason it feels strange to include these. For example, suppose my outcome variable is something strange like [home runs/number of times batted], and I have a control which is 'home runs': isn't this an odd control to use? So then where to draw the line between odd controls and relevant controls? Apologies, since this is not really a Stata question.

            Comment


            • #21
              So then where to draw the line between odd controls and relevant controls?
              Several of the examples of odd controls you've given fall under a simple rule that we use in epidemiology:

              Rule: If you have a causal chain of events whereby A leads to B which in turn leads to C, in estimating the effect of A on C you should NOT control for B.
              On the other hand if A and B both can influence C, and B's distribution differs depending on A but in a non-causal way, then controlling for B when estimating the effect of A on C is permissible. (It may or may not be the right thing to do in given circumstances and you need to think about whether the adjusted or unadjusted analysis provides the better answer to your particular research question.)

              So if home runs scored is the endpoint, clearly being at bat is one part of the path leading to runs scored. And to the extent that it is under the control of the player and subject to influence by incentives, it lies causally between incentives and runs scored. So it would be folly to control for times at bat when modeling the effect of incentives on runs scored.

              Similarly, on the other side, games played is exogenous to the mechanism under study because it is set by national policies and incentives to players cannot modify it. Games played certainly can influence runs scored, but, because it is not on a causal path from incentives, it is quite reasonable to control for games played. (Though one could argue that it might be better to control for the team's nationality so as to capture in a single variable a broad range of exogenous effects, some of which might be difficult to define or measure in isolation.)

              So I think your instincts are good, and perhaps bumping them up against the rule given above will verify them, or, occasionally, suggest that you need to rethink something. Note that the rule requires that you first have in mind an abstract model of the process leading from intervention to outcome and what are the causal steps between them.

              Apologies since this is not really a STATA question.
              No apologies called for. It's a statistical question, and statistical questions are always welcome in this Forum even if they are not specifically about Stata.


              Comment


              • #22
                That really helped me sort out my model, thanks!
                One question with respect to "Though one could argue that it might be better to control for the team's nationality so as to capture in a single variable a broad range of exogenous effects, some of which might be difficult to define or measure in isolation." A dummy variable for the team's nationality could capture this, but it would be perfectly collinear with the treatment group (as every team in the treatment group belongs to one country). I was thinking of including a team dummy to capture 'team fixed effects', although then I will run into a problem in that some players (although not many) might be switching teams, so it will not be time-invariant, but it will also have very low variation, so it may not be of any use. So then maybe matches played by the team might be a good control variable, as opposed to matches played by the player (a player plays more rashly after financial incentives --> gets kicked out of the team (so a low number of matches) --> affects my outcome variable).
                Secondly, since I have a very large panel over 10 years, and since players age and drop out and come back in at random points (e.g. maybe get kicked out for two years and then make a comeback), I'm trying to find ways to reduce this. Essentially, if we think of DiD as an 'experiment' that happened at a point in time, then with my current set-up of including everyone who has at least one observation on either side of the treatment, my analysis would include people who played in a season 3 years before the intervention and maybe one season after the intervention, so how much information are they really giving me? Could I instead just use all the players who played continuously for every year in a specific range? For example, focus on 2 years before and after the intervention only, instead of my current set-up, which has around 5 years before and 7/8 years after.
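                In case it helps, a restriction like that could be coded along these lines (the window years 2005-2008 are hypothetical):

                Code:
                * keep only players observed in every year of the window
                bysort player: egen years_in_window = total(inrange(year, 2005, 2008))
                keep if years_in_window == 4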

                Comment


                • #23
                  Essentially, if we think of DiD as an 'experiment' that happened at a point in time, then with my current set-up of including everyone who has at least one observation on either side of the treatment, my analysis would include people who played in a season 3 years before the intervention and maybe one season after the intervention, so how much information are they really giving me? Could I instead just use all the players who played continuously for every year in a specific range? For example, focus on 2 years before and after the intervention only, instead of my current set-up, which has around 5 years before and 7/8 years after.
                  Well, the statistical principle that would concern me here is that by restricting the analysis to people who played 2 years before and after the intervention, you may be getting a biased sample of players. Think about it: is there something different about the behavior of people who manage to play four consecutive years? Perhaps they are more persistent, or have more consistent skill levels. It probably has a reduced prevalence of players with drug or alcohol problems, or highly dysfunctional families. (I'm just speculating here. I know nothing about athletes, I'm just reasoning from general principles about people and employment.) It seems reasonable to me to think that these people might respond differently to incentives than the people who are excluded from your analysis. So your results would really not be generalizable to the consequences of instituting an incentive system overall. That might not be a problem if you could tell in advance who is going to stick around for four years--but you can't, so just knowing that the incentive system has a certain effect on that subgroup of players isn't all that helpful.

                  It's up to you to think about whether those who stick around for four years differ in a relevant way from those who don't. My prior is that they do, but you know something about athletes and I don't. So that has to be your judgment call. (Actually, you could run some subset analyses and compare the two types of athletes.)

                  One question with respect to "Though one could argue that it might be better to control for the team's nationality so as to capture in a single variable a broad range of exogenous effects, some of which might be difficult to define or measure in isolation." A dummy variable for the team's nationality could capture this, but it would be perfectly collinear with the treatment group (as every team in the treatment group belongs to one country).
                  Well, if the country indicator would be collinear with the intervention variable, then it can't be included, plain and simple.

                  Comment


                  • #24
                    Thanks again for that Clyde.

                    I have talked to some people about my model and now I'm a bit more confused as to what is going on.
                    So to recap and get my head around this:

                    I am using player level data to model a diff-in-diff design. I have data spanning around 13 years, with 4 pretreatment years, and 9 treatment years.
                    My 'POST' variable is year>=2007 (the year the intervention began). I have teams from two countries, the USA and Canada, and only USA teams received financial incentives, so my 'Treatment' group is 'USA'.
                    I have player level panel data and I run the following reg:

                    Code:
                    xtset player year
                    xtreg 'player outcome' i.USA##i.POST i.year covariates, fe
                    In symbols this would look something like: y = time-invariant player attributes (fixed effects) + time dummies for every year + USA dummy + POST dummy + interaction + covariates + error

                    After talking to some people, I am now a bit confused as to why Stata is omitting certain variables. So please correct me if I'm wrong:

                    Running that regression, my USA dummy is omitted, and two time dummies are omitted. I have 13 time dummies, and 1 must naturally be omitted to avoid the dummy variable trap (Stata chooses to drop the time dummy in the period before the intervention). A second time dummy is dropped due to collinearity with the POST variable. One question here is that the POST dummy = 1 for all years after and including 2007, so naturally every time dummy will be collinear with the POST dummy for that year. So in some sense, aren't all the year dummies post-intervention collinear? That is, the time dummy for 2008 will equal 1, and POST in that year will also equal 1. But the same can be said for 2009, 2010, etc. Why is Stata only choosing to remove one time dummy in the post-intervention period?

                    Now on to the USA dummy being omitted: is it omitted because a) it is collinear with the `fixed effects', or b) after 2007 everyone who received the treatment was a USA player, hence the interaction term and the USA dummy are both equal to 1 after 2007? In the case of a), I would have thought my fixed effects are `player fixed effects', as I assume Stata generates the fixed effects based on the xtset command. So the player `fixed effects' should pick up everything time-invariant about a specific player, such as his `ability', and in the case of my data, where no player changes his country, the fixed-effects model should also pick up which country he is from. Consequently, the USA country dummy is being omitted. This is what I had thought to be the case. However, some people have told me that the USA dummy should not be omitted, because of b): the collinearity only applies after 2007, which means Stata will retain information on the variable (and hence the variable itself) before the treatment period begins.

                    I thought I had sorted this problem out, but apparently not?

                    Comment


                    • #25
                      Further, I was thinking of controlling for team dummies. I was wondering whether this would be an informative control, and how one would interpret it.
                      Since players don't change teams very often, will this only be informative for players who change teams? Secondly, how would I interpret any coefficient here, i.e. what would a -0.5 coefficient in front of 'team new york' actually mean? My thinking is that, since some teams are just inherently 'better', it might be useful to control for some sort of team effects (e.g. some teams might go on a hot streak one year and all players within the team might play much better for that year; I will cluster my standard errors across teams in this example).

                      Also with clustering. I want to cluster on teams, but since players change teams I have decided to put in the following code: vce(cluster team) nonest dfadj

                      There isn't much information about -nonest- and -dfadj- on these forums or the net, and I was wondering whether anyone could shed more light on them. This seems to be the way most people go about this problem, though.

                      Comment


                      • #26
                        Re #24: Actually, you did have it sorted out properly. But you have now gotten yourself confused. Let's deal with the player fixed effects issue first.

                        Each player either stays in the USA or stays in Canada during his/her entire sojourn in your data set. So for any given player, the variable USA will be either constantly 0 or constantly 1. That makes the USA variable collinear with the fixed effect for that player--so it's going to be omitted. (If it were true that the USA variable were only 1 in observations after year 2007, then USA would not be collinear with the fixed effect. But if that were true, the variable USA would not be an appropriate variable to specify the treatment group in a DID model.)

                        Now let's turn to the time collinearities. Run the commands below. They create a data set that has the same time-variable structure as yours. There is an indicator for each year from 2003 through 2015. And pre_post is defined as 1 for year >= 2007. There are two different collinearity relationships here. The first is the generic collinearity among indicators for a categorical variable: the sum of all those indicators is 1 because, regardless of the level of the categorical variable, one of the indicators is 1 and the others are 0. This is the "dummy variable trap" you refer to, which requires dropping one of the variables. Any one will do. The first -assert- command confirms this linear relationship.

                        But even if we drop one of the time indicators, there is a second collinearity that remains unresolved. Since pre_post is 1 whenever year >= 2007 and 0 otherwise, it will always be the case that pre_post equals the sum of the yd2007 through yd2015 variables. That's because when year < 2007, both pre_post and all of the yd2007-yd2015 variables are zero. When year >= 2007, pre_post is 1, but it is also true that exactly one of the yd2007-yd2015 variables is 1 and the others are zero. To break this collinearity, Stata must drop one of the yd2007-yd2015 variables, or it must drop pre_post. It could choose any one of these; they are all equally "at fault" for the collinearity. I believe the way Stata does it is to remove the one that appears last in the varlist of the regression command. The second -assert- command confirms this relationship.

                        But you can see that it is not true that each of the year indicators is separately collinear with pre_post. Take yd2008 as an example. In observations with year == 2008, we do have yd2008 == pre_post, as both are 1. But in year 2007, or 2009 through 2015, pre_post is 1 but yd2008 is 0. And in 2003 through 2006, pre_post and yd2008 are both zero. So pre_post == yd2008 holds in some years, but not in others. So they are not collinear as a pair.

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input float(year pre_post yd2003 yd2004 yd2005 yd2006 yd2007 yd2008 yd2009 yd2010 yd2011 yd2012 yd2013 yd2014 yd2015)
                        2003 0 1 0 0 0 0 0 0 0 0 0 0 0 0
                        2004 0 0 1 0 0 0 0 0 0 0 0 0 0 0
                        2005 0 0 0 1 0 0 0 0 0 0 0 0 0 0
                        2006 0 0 0 0 1 0 0 0 0 0 0 0 0 0
                        2007 1 0 0 0 0 1 0 0 0 0 0 0 0 0
                        2008 1 0 0 0 0 0 1 0 0 0 0 0 0 0
                        2009 1 0 0 0 0 0 0 1 0 0 0 0 0 0
                        2010 1 0 0 0 0 0 0 0 1 0 0 0 0 0
                        2011 1 0 0 0 0 0 0 0 0 1 0 0 0 0
                        2012 1 0 0 0 0 0 0 0 0 0 1 0 0 0
                        2013 1 0 0 0 0 0 0 0 0 0 0 1 0 0
                        2014 1 0 0 0 0 0 0 0 0 0 0 0 1 0
                        2015 1 0 0 0 0 0 0 0 0 0 0 0 0 1
                        end
                        
                        // SHOW COLINEARITY AMONG YEAR INDICATORS THEMSELVES
                        assert yd2003+yd2004+yd2005+yd2006+yd2007+yd2008+yd2009+yd2010+yd2011 ///
                            +yd2012+yd2013+yd2014+yd2015 == 1
                            
                        //    SHOW COLINEARITY OF pre_post WITH YEAR INDICATORS
                        assert pre_post == yd2007 + yd2008 + yd2009 + yd2010 + yd2011 + yd2012 ///
                            + yd2013 + yd2014 + yd2015
                        I hope this helps.

                        Comment


                        • #27
                          Re: #25

                          For those outcomes where there is reason to believe that a team effect exists, it would make sense to include team indicators. This is only viable if at least some players change teams over time, which apparently is the case in your data. And when there are also player fixed-effects in the model, the observations on players who never change teams do, nevertheless, contribute to the estimation. Evidently the design, where some players stay on a single team, but some move around, will be less efficient than one in which players were fully crossed with teams! But that is not the real world, and it would probably be quite difficult to design a study that maximized the efficiency of this aspect of your study. I don't think, however, that you will be able to use -vce(cluster team)- in your analyses, precisely because the players are not nested in teams. Perhaps that is what your -nonest dfadj- options are about. I'm not familiar with them, have never used them, don't know what they do, and can't comment. Perhaps somebody who knows these can chime in and advise you.

                          From my perspective, what's going wrong here is that you are trying to force what is truly a 3-level multiple-membership model into a two-level analysis. So something is going to give: you can't have it all. I understand that mixed-effects models are viewed skeptically in economic analysis, but it is also true that, in general, results are not what they appear to be if the model is mis-specified, even if the model has been estimated with a procedure that produces consistent estimates for that model.

                          If you are willing to go to multilevel modeling this would look something like:

                          Code:
                          mixed outcome i.USA##i.pre_post || _all:R.team || player:
                          However you resolve these conflicting desiderata, the interpretation of a team effect such as new york = -0.5 is the same. It means that, all else equal, the expected value of the outcome for an observation (of any player in any year) is 0.5 units lower if the player is on the new york team that year than if he/she is on the reference-category team (or, if it's a random effect, it's 0.5 units lower than average). Note, by the way, that this model would not capture a team going on a hot streak in some year. The team effects, whether fixed or random, only capture time-invariant attributes of the team.

                          Comment


                          • #28
                            Thanks again (and again) for that Clyde.

                            I am now thinking about formal diff-in-diff tests for the parallel trends assumption. The idea is to include leads and lags of the treatment interacted with the time dummies. There are some lecture notes by Pischke from the LSE (page 7 is the relevant page) (http://econ.lse.ac.uk/staff/spischke...valuation3.pdf); equivalently, just google diff-in-diff parallel trend assumption test, and there is a Stack Exchange question which deals with this (http://stats.stackexchange.com/quest...mon-trend-betw). The idea is to test whether the diff-in-diff coefficient before the treatment period was significant (i.e. testing the coefficients on the leads). How would I code this in Stata? So far I have done:

                            gen treatment=year>=2007

                            gen lag1=treatment*2006

                            Then I repeat this until the year 2003

                            and for the leads I have:

                            gen lead1=treatment*2007

                            And do this until the year 2015

                            Then I run: xtreg "outcome" i.year "lags" "leads", fe

                            Stata omits my lags and leads because of collinearity (which makes sense, since I also have year dummies).

                            Any idea about how to go about this?

                            On another note: my constant term in the diff-in-diff always has a p-value of 0.000, and I'm not too sure what the constant measures. I have read that it is the "average of the n-1 dummies" and has no interpretation.

                            Secondly, is it normal to obtain really low R-squareds in diff-in-diffs? My R-squareds are usually below 0.1. In addition, my F statistics always seem to show joint significance (which is not a problem, but makes me a bit suspicious, since it nearly always tells me my variables are jointly significant, i.e. I always get something like Prob > F = 0.0012 or something else absurdly small).

                            Further, and this is what worries me most, when I make a graph there seems to be a clear `positive effect' of the intervention. But when I conduct my diff-in-diff, it either shows a negative effect or no effect. One reason I am thinking of is that the diff-in-diff in my case is showing the effect of "controlgroup-treatmentgroup" as opposed to "treatmentgroup-controlgroup", i.e. the sign should be positive but it is negative. This should not be the case, as my regression specifies a dummy of 1 if a player lies in the treatment group.

                            On another note (and sorry for all these questions; I have just spent some time on this and these are the latest issues I have encountered): sometimes when I change the treatment year to one year before the actual treatment, I continue to get significant results. I am thinking of deleting all the years after the treatment, then changing the treatment year and seeing whether this effect still holds; this might also help me test the parallel trend assumption. Is this a decent way to go?

                            thank you once again



                            • #29
                              I don't think it is possible to test the parallel trends assumption this way when you are including yearly time effects. The time effects themselves capture the effects of any time trends that are present. In my work with longitudinal data (I'm an epidemiologist--we do lots of things differently from economists), I seldom incorporate yearly time indicators. I might incorporate one or two indicators for years that are identifiably special in some way related to the outcome (e.g. 1918, because of the world flu pandemic, in a study of mortality). But across-the-board adjustment for yearly "shocks" just isn't in our toolbox. We do, however, often look for time trends. Typically that is done by including a continuous time variable, or sometimes a linear spline, or a polynomial if we expect non-linear trends. So a typical DID analysis might look like this:

                              Code:
                              xtset panelvar time
                              mkspline pre 2007 post = time // SEE -help mkspline-
                              xtreg outcome i.treatment##c.(pre post) covariates
                              
                              // TEST OF PARALLEL TRENDS BEFORE
                              test 1.treatment#c.pre
                              
                              // TEST OF TREATMENT EFFECT
                              test 1.treatment#c.post
                              As for the low R2, it is what it is. Have you done any plotting of predicted vs. observed values? Take a look at those; they'll probably look like what you expect with a low R2: a relationship too weak to be visible to the eye. It may just be that your variables, all taken together, don't exert as much influence on your outcomes as do excluded (unmeasured, unobservable) variables and just plain luck. That would be my expectation for variables like sports outcomes.
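                              If it helps, a minimal sketch of such a plot after your estimation command might look like this (outcome and yhat are placeholder names):

                              Code:
                              * hypothetical sketch: predicted vs. observed values after -xtreg-
                              predict yhat, xb              // linear prediction from the fixed portion
                              twoway (scatter outcome yhat) (lfit outcome yhat)
                              With a low R2 you will typically see a wide, diffuse cloud around the fitted line.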

                              The constant term in any linear regression represents the expected value of the outcome when all of the predictors take on zero value. In most situations that doesn't actually correspond to any realistic configuration of the variables, so it is of no real meaning. Even when there can be (or even are) observations where all of the predictors are zero, unless there is something really interesting or special about them, there is no reason to devote any attention to the constant term.
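                              If you ever do want the constant to be interpretable, one common device is to center a continuous predictor first. A minimal sketch, with hypothetical variables y and x:

                              Code:
                              * hypothetical sketch: centering a predictor to make _cons meaningful
                              summarize x, meanonly
                              gen x_centered = x - r(mean)
                              regress y x_centered          // _cons is now the expected y at the mean of x
                              But in most DID settings there is still no reason to dwell on the constant.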

                              If your treatment variable is coded 1 for the treatment group and 0 for control, and your pre-post variable is coded 0 for before and 1 for after, then a positive effect will be reflected as a positive 1.treatment#1.pre_post coefficient. This thread has gotten rather long, and I no longer remember exactly what kind of regression you are using in your analysis. If you are using -xtreg, fe-, you may be encountering a situation where the within-player effect of the intervention (which is what -xtreg, fe- estimates) has the opposite sign of the between-player effect (which is what you would see in a graph, unless you explicitly labeled each point to show which player it corresponds to). Here's an example:
                              Code:
                              clear
                              set obs 5
                              gen panel_id = _n
                              expand 2
                              
                              set seed 1234
                              // y falls within each panel (-_n) but rises across panels (4*panel_id)
                              by panel_id, sort: gen y = 4*panel_id - _n + 3 + rnormal(0, 0.5)
                              // x rises both within and across panels
                              by panel_id: gen x = panel_id + _n
                              
                              xtset panel_id
                              
                              xtreg y x, fe    // within estimator: negative slope
                              regress y x      // pooled OLS: positive slope
                              
                              //    GRAPH THE DATA TO SHOW WHAT'S HAPPENING
                              separate y, by(panel_id)
                              
                              graph twoway connect y? x || lfit y x
                              If you run the above, -xtreg, fe- will show a negative coefficient of y on x, but -regress- will show a positive one. (Both statistically significant, by the way, not that that matters.) The graph shows what's going on. But note that if the graph didn't explicitly identify the panels, its overall appearance would just suggest a typical positive correlation.

                              As for your final thought about deleting all the years after treatment and changing the treatment year as a test of parallel trends, I'm not sure exactly what you mean by this.

                              I hope these comments are helpful.







                              • #30
                                Yes, the test for the diff-in-diff did seem strange; I will try to sort that out. Also, yes, I am using -xtreg, fe-.
                                Thank you for explaining the between-groups and within-groups issue. One question which arises now is: can visual inspection really help me understand whether the parallel trends assumption holds or not? The graph I have made plots the average outcome variable for the players in each country across the years. Suppose players A and B are from the same country. If player A has a score of 50 and player B a score of 100 in 2000, and scores of 20 and 40 respectively in 2001, my graph will have a point at 75 in 2000 and a point at 30 in 2001 for that country. I then compare these points across countries and see whether there was a divergence post-intervention.

                                But as you say, this is a measure of the between variation, not the within variation that -xtreg, fe- uses. If I were comparing, say, unemployment rates between states, I would expect the two to coincide, so the graph would give a good indication of whether the parallel trends assumption holds. On a related note, the diff-in-diff measures the ATOT (average treatment effect on the treated); in that sense, wouldn't my graph allow me to verify the parallel trends assumption, since the average treatment effect would be the mean of the means (i.e. the average change for a player, averaged over all players), which is what my graph shows? Or have I got this completely wrong?
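                                In case it clarifies what I mean, here is a rough sketch of how I might instead plot the within variation (all variable names are placeholders):

                                Code:
                                * hypothetical sketch: group means of within-player demeaned outcomes
                                by player_id, sort: egen y_playermean = mean(outcome)
                                gen y_within = outcome - y_playermean
                                collapse (mean) y_within, by(treated year)
                                twoway connected y_within year, by(treated)
                                This would remove each player's own average level before comparing the groups over time.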

                                thank you

