  • Test of effect of binary predictor on binary outcome, panel

    Dear users

    I have the following type of analysis that needs doing. I'm inexperienced with logistic regression so any pointers would be great.

    Outcome: binary (variable name: status; values employed and unemployed).

    Predictor: binary (variable name: treatment; values treatment and control)

    Data: collected from entrants to youth employability programmes. Entrants are randomly assigned to clusters (training sites, variable name siteid), where one or other type of employability programme is being run (treatment/control). It is a panel dataset with 2 waves: baseline and endline. In long format this time dimension is captured in the wave variable, which takes values of either wave 1 or wave 2. Individuals are uniquely identified with variable id.

    I am trying to discern whether entrants to control programmes have a significantly lower probability of becoming employed at wave 2, compared to entrants to treatment programmes.

    The test I thought I should be doing was a conditional logistic regression (using clogit) - looking only at individuals who have transitioned between employment states and comparing the likelihood of this transition between treatment and control groups.

    However this yields the error: outcome does not vary in any group. I suppose this is because the treatment variable is time-invariant and observation-invariant.

    Should I instead be using panel regression with random effects? eg:

    Code:
    xtset id wave
    xtlogit status treatment, re vce(cluster siteid)

    Thanks,
    Zoheb
    Last edited by Zoheb Khan; 19 Oct 2016, 08:02.

  • #2
    PS - descriptives on a range of individual, household and community-level variables suggest that the randomisation procedure was efficient, and that at least on observable characteristics the treatment and control groups are equal/comparable (another motivator for the use of RE?).



    • #3
      If the -treatment- variable is time-invariant, you cannot estimate the effect of the program on employment status using this variable alone. Your problem actually sounds a lot like a difference-in-differences setting:

      Code:
      reg status i.treatment##i.wave
      or with individual fixed effects

      Code:
      xtreg status i.treatment##i.wave, fe
      I'm not an expert on random effects. I generally use fixed-effects models because I'm always a bit worried about the potential bias in random-effects models.



      • #4
        Thanks - I hadn't considered a differences in differences approach, as I'm not familiar with it. I've been reading about it and have some more possibly stupid questions:

        1. Would I need to recode the treatment variable such that at wave 1 (baseline, before programmes are rolled out), it is equal to zero for all respondents? Ie everyone is untreated at wave 1, and only the treatment group is treated at wave 2? If I did this, would I then be able to use treatment as a predictor in a (normal) logistic regression with fixed effects, given that it now varies across time?

        2. I've read that the DID approach disregards the panel structure. I worry about serial autocorrelation, particularly on a variable like employment status. How would you suggest I correct for this? Or is the second example you give also based on a DID model (but with fixed effects), in which case the panel structure would be accounted for by xtreg?

        3. All the material I've come across since yesterday assumes a continuous dependent variable in the DID model, or in the case of a binary outcome, still models it in a linear way via OLS regression. I assume that this is because in the context of marginal change in probability, ie effect of the treatment, the effect of non-linearity becomes negligible. Is this why your suggestions are based on linear models?

        4. For the DID approach would I need to structure the dataset in such a way that it is a balanced panel?

        5. These are the reasons that I thought validated the logistic regression with random effects approach: (a) for employment status, variation within individuals over time is expected to be substantially lower than variation across individuals, at least in this context of just a 2-wave panel; (b) the non-variation of treatment over time (unless I recode the variable so that it varies at least for the treatment group); (c) I'm not particularly concerned with omitted variable bias, because of efficient randomisation between treatment and control.

        Thank you,
        Zoheb
        Last edited by Zoheb Khan; 20 Oct 2016, 03:01.



        • #5
          Hello,

          I'd like to post a new question relating to the above which replaces the set of questions in post #4:

          For code

          Code:
          reg status i.treatment##i.wave
          What's the justification for not including the main effects of wave and treatment?

          My understanding is the interaction term measures how much bigger (smaller) the change in the treatment group was compared to the change in the control group. If I included a main effect for wave, would this measure the change in control between wave 1 and 2 (or equivalently the expected change for the treatment group if they were actually part of the control group)?

          Thanks
          Zoheb



          • #6
            To answer your questions (or at least to try):

            1. In a DiD setting you need two dummies: a time-invariant dummy distinguishing the treatment group from the control group, and a dummy indicating whether the observation was made before or after the (potential) treatment. Assuming individual 1 belongs to the treatment group and individual 2 to the control group, the coding should look something like this:

            Code:
            id     treatment     wave
            1      1             0
            1      1             1
            2      0             0
            2      0             1
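            In a fixed-effects model, you would need (at least in my mind) just an indicator that is 0 before the treatment and 1 after the treatment (i.e. the wave dummy in my example above). The time-invariant treatment dummy is absorbed by the individual fixed effects, so only its interaction with that indicator is identified.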
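
            A minimal sketch of this coding, assuming wave is stored as 1/2 and treatment is already a numeric 0/1 variable as described in #1. The names post and treated_post are made up for illustration:

            ```stata
            * Assumption: wave takes the values 1 (baseline) and 2 (endline)
            generate byte post = (wave == 2) if !missing(wave)

            * Indicator that is 1 only for the treatment group at endline;
            * this is the same variable as the treatment-by-wave interaction
            generate byte treated_post = treatment * post

            * Check the coding: treated_post should be 1 in exactly one cell
            tabulate treatment post
            tabulate treated_post
            ```

            Recoding treatment to zero for everyone at baseline, as asked in #4, would produce exactly treated_post, which is why the factor-variable interaction makes the manual recode unnecessary.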

            2. As far as I know, serial correlation is a problem in panel models as well. It does not simply go away because you use xtreg instead of reg. In both models, you may account for it by using clustered standard errors, adding the option vce(cluster id).
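
            A rough sketch of what this might look like for both models, using the variable names from #1. The last command is an assumption about the design rather than part of the advice above: since programmes are assigned by site, clustering on siteid may be the more conservative choice.

            ```stata
            xtset id wave

            * Pooled DiD with standard errors clustered on the individual
            regress status i.treatment##i.wave, vce(cluster id)

            * Same specification with individual fixed effects
            * (the treatment main effect is dropped as time-invariant)
            xtreg status i.treatment##i.wave, fe vce(cluster id)

            * Alternative (assumption): cluster on the unit of assignment instead
            regress status i.treatment##i.wave, vce(cluster siteid)
            ```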

            3. That's correct. DiD models are typically based on linear models because the interpretation of interaction terms in non-linear models is not straightforward. With continuous dependent variables, this is no problem. It might become a problem with binary dependent variables because the linear prediction is no longer bounded between 0 and 1. This problem only arises if you use additional covariates in the DiD model. If you just use i.treatment##i.wave it is not a problem, because the regression essentially compares means, and thus the linear prediction is bounded between 0 and 1. You can check how severe the problem of out-of-range predictions is by generating the linear prediction (predict p, xb) and then drawing a kernel density plot or using other descriptive statistics (e.g. the percentage of observations predicted below 0 or above 1).
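
            The check described above could be sketched as follows. The variable name phat is hypothetical, and any additional covariates would simply be added to the regress line:

            ```stata
            regress status i.treatment##i.wave
            predict double phat if e(sample), xb

            * Number of fitted values outside the unit interval
            count if (phat < 0 | phat > 1) & !missing(phat)

            * Distribution of the fitted values
            summarize phat, detail
            kdensity phat
            ```

            With only the two dummies and their interaction, phat takes just four distinct values (the four cell means), so the count should be zero.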

            4. A balanced panel always helps, but it is no more necessary than in a fixed-effects regression. In both models, I would include a set of time dummies to account for the fact that the observations come from different points in time. Panel attrition is, however, a serious problem for both models if it is extensive and systematic (e.g. treated individuals drop out of the panel more or less frequently than controls).
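
            One simple way to look at attrition by group, sketched under the assumption that the data are in long format with the variables from #1. The names n_obs, complete and first are hypothetical:

            ```stata
            * Number of waves with a non-missing outcome per person
            egen byte n_obs = count(status), by(id)
            generate byte complete = (n_obs == 2)

            * Keep one row per person, then compare completion rates across arms
            egen byte first = tag(id)
            tabulate complete treatment if first, column chi2
            ```

            The chi-squared test here ignores the clustering by site, so it is only a first descriptive check.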

            5. I'm not an expert on random-effects models, since they are not as common as fixed-effects models in economics. Therefore, I cannot give you any substantial advice on whether a random-effects model is appropriate here. A pragmatic approach may be to estimate both models (DiD and RE) and check whether the coefficient of interest differs in a meaningful way. Although this approach has drawn some criticism, another way to validate the results of your RE model is to estimate a similar fixed-effects model and compare the two with a Hausman test.


            Code:
            webuse union, clear
            xtlogit union age grade i.not_smsa south##c.year, re        // Random effects model
            estimates store RE    
            xtlogit union age grade i.not_smsa south##c.year, fe        // Fixed effects model
            estimates store FE
            hausman FE RE
            
                             ---- Coefficients ----
                         |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
                         |       FE           RE         Difference          S.E.
            -------------+----------------------------------------------------------------
                     age |    .0710973     .0156732        .0554241        .0948768
                   grade |    .0816111     .0870851       -.0054741        .0380104
              1.not_smsa |    .0224809    -.2511884        .2736693        .0776386
                 1.south |   -2.856488    -2.839112        -.017376        .2155589
                    year |   -.0636853    -.0068604       -.0568249        .0954997
            south#c.year |
                      1  |    .0264136     .0238506         .002563        .0023827
            ------------------------------------------------------------------------------
                                     b = consistent under Ho and Ha; obtained from xtlogit
                      B = inconsistent under Ha, efficient under Ho; obtained from xtlogit
            
                Test:  Ho:  difference in coefficients not systematic
            
                              chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                                      =       17.48
                            Prob>chi2 =      0.0077
            If the null hypothesis of the Hausman test is not rejected, you're fine. If it is rejected (as in this example, with p = 0.0077), it does not necessarily mean that you cannot use an RE model, but at the very least you need to explain why an RE model is still preferable in your mind.
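
            Applied to the variables from #1, the same comparison might look like the sketch below. This is untested on the actual data and rests on two assumptions: the fixed-effects logit will drop the time-invariant treatment main effect, so only the remaining common coefficients are compared, and hausman needs the default (non-clustered) variance estimates.

            ```stata
            xtset id wave

            * Random-effects logit
            xtlogit status i.treatment##i.wave, re
            estimates store RE

            * Fixed-effects (conditional) logit; uses only people whose status changes
            xtlogit status i.treatment##i.wave, fe
            estimates store FE

            hausman FE RE
            ```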

            6. (in #5) The syntax with a double # will include both the dummies separately and an interaction. The following commands are equivalent:


            Code:
              
            reg status i.treatment##i.wave
            reg status i.treatment i.wave i.treatment#i.wave
            You're right about the interpretation, except that I think the effect you call "main effects of wave" is already included in the model (see above). The coefficient of wave measures how the dependent variable changes for the control group from period 0 to period 1. The coefficient of treatment measures the pre-treatment difference between the two groups (i.e. in period 0). As you said, the interaction shows by how much the post-treatment difference differs from the pre-treatment difference. In other words, the interaction gives the treatment effect on the treated (ATT or ATET).
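
            To see this interpretation in the data, one option is to compare the four cell means with the regression output. This is a sketch using the variable names from #1:

            ```stata
            * Share employed in each treatment-by-wave cell
            tabulate treatment wave, summarize(status) nostandard nofreq

            * Same four means, recovered from the regression
            regress status i.treatment##i.wave
            margins treatment#wave
            ```

            The interaction coefficient should equal (treated endline mean minus treated baseline mean) minus (control endline mean minus control baseline mean), computed from the table.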

            I hope this helps.



            • #7
              Hi Sebastian

              Thank you - this is very helpful. It's given me a lot to think about but is also immediately useful.

              Schoenes Wochenende,
              Zoheb
