  • Problem with (pseudo) Panel Data - unbalanced panel

    Hi everyone,

    I am looking for help working on an unbalanced panel data set.

    My dataset includes:
    - 55,957 observations
    - 4 waves (time variable)
    - 33873 individuals (cross section variable)
    - no missing values (I checked using the command "misstable sum")

    My dataset comes from a survey, and individuals often did not respond to the questionnaire in every wave.
    I used xtset to declare my dataset as panel data
    Code:
    . xtset ident wave
           panel variable:  ident (unbalanced)
            time variable:  wave, 1 to 5, but with gaps
                    delta:  1 unit
    I used xtdescribe to see details
    Code:
    . xtdescribe
    
       ident:  2, 3, ..., 39561                                  n =      33873
        wave:  1, 2, ..., 5                                      T =          4
               Delta(wave) = 1 unit
               Span(wave)  = 5 periods
               (ident*wave uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             1       1       1         1         2       4       4
    
         Freq.  Percent    Cum. |  Pattern
     ---------------------------+---------
         9686     28.60   28.60 |  ....1
         5297     15.64   44.23 |  ...11
         4182     12.35   56.58 |  11...
         3711     10.96   67.53 |  ...1.
         3274      9.67   77.20 |  .1...
         1766      5.21   82.41 |  11.11
         1622      4.79   87.20 |  1....
         1415      4.18   91.38 |  .1.11
         1068      3.15   94.53 |  11.1.
         1852      5.47  100.00 | (other patterns)
     ---------------------------+---------
        33873    100.00         |  XX.XX
    My dependent variable is binary and I want to estimate a fixed effects model.

    If I run the regression (in the unbalanced panel)
    Code:
    xtlogit smoking i.female age i.employment ep013_mod thinc_m Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w4 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, fe
    Stata drops the lion's share of the observations (all those individuals who did not respond to every wave, I guess) and the output is as follows:
    Code:
    note: multiple positive outcomes within groups encountered.
    note: 32,480 groups (52,257 obs) dropped because of all positive or
          all negative outcomes.
    note: 1.female omitted because of no within-group variance.
    note: c1 omitted because of no within-group variance.
    note: c2 omitted because of no within-group variance.
    note: c3 omitted because of no within-group variance.
    note: c4 omitted because of no within-group variance.
    note: c5 omitted because of no within-group variance.
    note: c6 omitted because of no within-group variance.
    note: c7 omitted because of no within-group variance.
    note: c8 omitted because of no within-group variance.
    note: c9 omitted because of no within-group variance.
    note: c10 omitted because of no within-group variance.
    
    Iteration 0:   log likelihood = -1281.1499  
    Iteration 1:   log likelihood = -1270.7461  
    Iteration 2:   log likelihood = -1270.5738  
    Iteration 3:   log likelihood = -1270.5513  
    Iteration 4:   log likelihood = -1270.5462  
    Iteration 5:   log likelihood =  -1270.545  
    Iteration 6:   log likelihood = -1270.5448  
    Iteration 7:   log likelihood = -1270.5447  
    
    Conditional fixed-effects logistic regression   Number of obs     =      3,697
    Group variable: ident                           Number of groups  =      1,392
    
                                                    Obs per group:
                                                                  min =          2
                                                                  avg =        2.7
                                                                  max =          4
    
                                                    LR chi2(19)       =     118.51
    Log likelihood  = -1270.5447                    Prob > chi2       =     0.0000
    
    -----------------------------------------------------------------------------------------------
                          smoking |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ------------------------------+----------------------------------------------------------------
                           female |
                       1. female  |          0  (omitted)
                              age |  -.3267943   .1637562    -2.00   0.046    -.6477504   -.0058381
                                  |
                       employment |
              permanent employee  |  -.0258121   .2591523    -0.10   0.921    -.5337413    .4821171
        short-term civil servant  |   .5217493   .7114688     0.73   0.463    -.8727039    1.916202
         permanent civil servant  |  -.1018764   .3151085    -0.32   0.746    -.7194778     .515725
    permanently sick or disabled  |  -.3074872    .368893    -0.83   0.405    -1.030504    .4155298
                       homemaker  |  -.0474993   .3501035    -0.14   0.892    -.7336896     .638691
                      unemployed  |   .1966054   .3356884     0.59   0.558    -.4613317    .8545425
                           other  |  -.4163475   .4389228    -0.95   0.343     -1.27662    .4439255
                   seld-employed  |  -.3035182   .3188496    -0.95   0.341     -.928452    .3214155
             employee  undefined  |   .0960151   .2638505     0.36   0.716    -.4211224    .6131526
         civil servant undefined  |   .1273983   .3037235     0.42   0.675    -.4678888    .7226853
       employee or self-employed  |   12.53894   436.7359     0.03   0.977    -843.4476    868.5255
                                  |
                        ep013_mod |   .0024455   .0041442     0.59   0.555     -.005677     .010568
                          thinc_m |   9.67e-07   1.04e-06     0.93   0.351    -1.06e-06    3.00e-06
              Long_term_UNEM_RATE |   .5184911   .1590076     3.26   0.001      .206842    .8301402
             Short_term_UNEM_RATE |  -.1705346   .0830117    -2.05   0.040    -.3332345   -.0078347
                               w1 |  -1.330023   1.093346    -1.22   0.224    -3.472942    .8128962
                               w2 |  -.9778144   .6955041    -1.41   0.160    -2.340977    .3853486
                               w4 |    .415658   .3283179     1.27   0.206    -.2278332    1.059149
                               c1 |          0  (omitted)
                               c2 |          0  (omitted)
                               c3 |          0  (omitted)
                               c4 |          0  (omitted)
                               c5 |          0  (omitted)
                               c6 |          0  (omitted)
                               c7 |          0  (omitted)
                               c8 |          0  (omitted)
                               c9 |          0  (omitted)
                              c10 |          0  (omitted)
    -----------------------------------------------------------------------------------------------

    So far, I considered the following solutions:
    1) -ipolate-: in my case I cannot assume that the missing values evolve linearly, so I'd rather not use it
    2) clustering by country and year of birth, which does not seem to work because of repeated time values
    [egen long both = group(oldcountry dn003_mod)]
    Code:
    . xtset both wave
         repeated time values within panel
    3) dropping all observations that cause the imbalance in the panel (but then I would only analyze 6% of my dataset).

    My question is: how can I deal with an unbalanced panel? Are there methods to balance it? Can I treat this as a pseudo panel? (Since I am only looking at static effects, I would not mind losing the dynamics.) If so, how does this work in Stata?
    I am working on my master's thesis and I feel a bit lost. I would greatly appreciate any ideas / solutions to my problem.

    Many thanks
    Katharina Koe

  • #2
    Katharina:
    welcome to this forum.
    As far as your query is concerned:
    1) Stata can handle both balanced and unbalanced panel datasets;
    2) hence, do not drop the observations with missing values; if you think that -ipolate- is not the way to go, you may want to consider the -mi- suite of commands;
    3) however, the main issue with your dataset rests on the perfect prediction;
    4) if you have repeated time values within your panels, you can -xtset- your data with the -panelid- only (provided you do not plan to use time-series operators):
    Code:
    xtset ident
    5) is there any reason why you should go (conditional) -fe- instead of -re-?
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Dear Carlo,
      thank you very much for your help!

      1) that's good to know
      2) I will look into -mi-
      3) I am not sure I understand what you mean. I am trying to analyze whether individual behavior (smoking) changes over time when my independent variables change. I do not need the effect at the individual level; an effect in aggregate terms would be sufficient.
      4) thank you, I am trying that right now. My Stata (Stata 15.1 SE) has been calculating for 30 minutes already, so I hope to get some results.

      5) I used -fe- instead of -re- because (that's what I learned in my econometrics class)
      RE can accommodate time-invariant variables but makes the unrealistic assumption that the omitted heterogeneity is uncorrelated with the regressors
      FE allows for correlation between the omitted heterogeneity and the regressors but cannot accommodate time-invariant variables.
      In my data, I observe more between variation than within variation.
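
      The between/within comparison mentioned above can be checked directly. This is a minimal sketch, assuming the data are already -xtset- on ident and that the variable names are those used earlier in the thread; it is not the poster's own code.

      ```stata
      * Sketch only: decompose variation into between and within components
      xtset ident wave
      xtsum smoking age thinc_m

      * For the binary outcome: how many individuals are ever observed
      * in each state, and how stable that state is within individuals
      xttab smoking
      ```

      A small within standard deviation relative to the between one in the -xtsum- output is what makes a fixed-effects estimator lean on few observations.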

      Please correct me if I am wrong here.

      Many thanks!
      Katharina


      • #4
        Katharina:
        in 3) I meant what Stata warned you about: some predictors are omitted due to no within-group variance;
        Your point # 5) about the difference between -fe- and -re- specification is correct. Just out of curiosity: did you test (conditional, in -xtlogit-) -fe- vs -re- specification via -hausman-?
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          Carlo,
          thank you for your answer.

          The Hausman test suggests that I should indeed use FE.

          I ran both the FE and RE regressions, stored the results, and then conducted the Hausman test
          Code:
          xtlogit smoking i.female age i.iv009_mod eduyears_mod i.mar_stat hhsize i.partnerinhh ch001_ ///
              i.sphus bmi i.alcohol i.physicalinac i.employment ep013_mod i.co007_ thinc_m ///
              GDP_growth_rate Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w3 w4 ///
              c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, fe
          
          estimates store fixed
          
          xtlogit smoking i.female age i.iv009_mod eduyears_mod i.mar_stat hhsize i.partnerinhh ch001_ ///
              i.sphus bmi i.alcohol i.physicalinac i.employment ep013_mod i.co007_ thinc_m ///
              GDP_growth_rate Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w3 w4 ///
              c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, re
          
          
          estimates store random
          
          hausman fixed random
          
           b = consistent under Ho and Ha; obtained from xtlogit
                    B = inconsistent under Ha, efficient under Ho; obtained from xtlogit
          
              Test:  Ho:  difference in coefficients not systematic
          
                           chi2(26) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                                    =      223.09
                          Prob>chi2 =      0.0000


          • #6
            However, I am still struggling with the FE model because I lose too many observations.

            - observations: 52221
            - individuals: n = 31740
            - waves (time): T = 4

            If I run the FE model
            Code:
            xtlogit smoking i.female age i.iv009_mod eduyears_mod i.mar_stat hhsize i.partnerinhh ch001_ i.sphus bmi i.ever_smoked i.alcohol i.physicalinac i.employment ep013_mod i.co007_ thinc_m GDP_growth_rate Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w3 w4 w5 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, fe
            Stata reports
            Code:
            note: multiple positive outcomes within groups encountered.
            note: 30,391 groups (48,642 obs) dropped because of all positive or
                  all negative outcomes.
            note: 1.female omitted because of no within-group variance.
            note: eduyears_mod omitted because of no within-group variance.
            note: c1 omitted because of no within-group variance.
            note: c2 omitted because of no within-group variance.
            note: c3 omitted because of no within-group variance.
            note: c4 omitted because of no within-group variance.
            note: c5 omitted because of no within-group variance.
            note: c6 omitted because of no within-group variance.
            note: c7 omitted because of no within-group variance.
            note: c8 omitted because of no within-group variance.
            note: c9 omitted because of no within-group variance.
            note: c10 omitted because of no within-group variance.
            where c1 to c10 are the country dummies.

            What does it mean that "note: 30,391 groups (48,642 obs) dropped because of all positive or all negative outcomes."?
            And is there some solution for that?

            Would it be possible to group the individuals into cohorts?
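
            On grouping individuals into cohorts: the following is a minimal sketch of a pseudo panel, assuming that oldcountry identifies the country and dn003_mod the year of birth, as in the -egen- call earlier in the thread. Collapsing to one row per cohort and wave is the step that avoids the "repeated time values within panel" error; n_cell is a hypothetical helper name, and the covariate list is only illustrative.

            ```stata
            * Sketch only: build cohort-by-wave cells and estimate on cell means
            preserve
            egen long cohort = group(oldcountry dn003_mod)

            * One row per cohort and wave; smoking becomes the share of smokers in the cell
            collapse (mean) smoking age thinc_m (count) n_cell = ident, by(cohort wave)

            xtset cohort wave
            * The outcome is now a proportion, so a linear FE model is used here
            xtreg smoking age thinc_m, fe
            restore
            ```

            Cells built from very few respondents give noisy means, so it may be worth inspecting n_cell before trusting the estimates.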

            Thank you!


            • #7
              Katharina:
              Stata tells you that the dropped groups have no within-panel variation in the outcome (each of those persons either always or never smokes across the waves in which they are observed): hence they contribute nothing to the conditional likelihood and their inclusion in the regression would be uninformative.
              Consistently with the -fe- machinery, if persons do not move to other countries within the same panel, -country- will be omitted.
              Unfortunately, I think there's nothing you can do but stick with the limited sample size.
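
              To see how many individuals can contribute at all, one could count those whose smoking status changes across waves. This is a minimal sketch using the variable names from the thread; smoke_min, smoke_max, switcher and one_per_id are hypothetical helper names.

              ```stata
              * Sketch only: flag individuals whose outcome varies over their observed waves
              bysort ident: egen byte smoke_min = min(smoking)
              bysort ident: egen byte smoke_max = max(smoking)
              generate byte switcher = smoke_max > smoke_min

              * Count each individual once
              egen byte one_per_id = tag(ident)
              tabulate switcher if one_per_id
              ```

              Only the individuals with switcher equal to 1 can enter the conditional fixed-effects logit, whatever the number of waves they answered.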
              Kind regards,
              Carlo
              (Stata 19.0)


              • #8
                Thank you Carlo.
                It seems like I misunderstood the country fixed effect. It makes sense that the individuals in my panel did not move to another country.

                What I am interested in is rather the country-specific effect of taxes / prices of cigarettes.
                In addition, I would like to capture the downward trend in smoking over time that some countries show.

                Do you have an idea how I can model these?
                Do I need to set:
                Code:
                 xtset country year
                Many thanks!
                Last edited by Katharina Koe; 08 May 2018, 08:07.


                • #9
                  Katharina:
                  let's stick with the backbones of -fe- estimator:
                  - -fe- is the right estimator to investigate what happens within the same panel as time goes by (do people quit smoking or not when moving from, say, year 1 to year 2?), whereas -fe- will tell you basically nothing about possible changes between different panels as time goes by;
                  - in a nutshell, the -fe- estimator gets rid of time-invariant predictors (e.g., country, if the individual does not change country as time goes by) and estimates the coefficients of time-varying predictors, with the implicit shortcoming that if a predictor expected to vary over time does not do so in your dataset, it will be omitted by the -fe- machinery and no coefficient will be estimated;
                  - that said, the -fe- estimator works well when there's enough (whatever that qualitative term may mean) variation in the time-varying predictors;
                  - if you -xtset country year-, your -panelid- will be country instead of the individual and, since you have observations (i.e., panel units) nested within countries, Stata will warn you about repeated -timevar- values within the same panel. I also suspect that considering data at the country level does not allow you to draw any conclusion at the individual level (https://en.wikipedia.org/wiki/Ecological_fallacy).
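
                  One way to keep the individual as the panel unit and still look at country-level prices and trends is sketched below. It assumes a numeric country identifier (country) and a cigarette price or tax measure that varies within country across waves (cig_price); both names are hypothetical and do not appear in the thread.

                  ```stata
                  * Sketch only: country and cig_price are assumed variable names
                  xtset ident wave

                  * cig_price changes over waves within a person, so -fe- can identify it;
                  * i.country#c.wave gives each country its own linear trend in wave
                  xtlogit smoking i.employment thinc_m c.cig_price i.country#c.wave, fe
                  ```

                  Stata may still omit some of these terms if they turn out to be collinear with other time-varying regressors such as age or wave dummies.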

                  All that said, it is also true that, in some research fields other than economics, fixed effects are often side-tracked in favour of the -re- specification, despite the -hausman- outcome (Clyde Schechter touched on this feature many times on this forum).
                  Obviously, any statistical strategy should be defensible, especially against reviewers' criticisms: hence, I would recommend you to discuss the whole matter with a colleague and/or with your supervisor.
                  Kind regards,
                  Carlo
                  (Stata 19.0)


                  • #10
                    Carlo, thanks again for your help!
                    I will discuss this with my thesis supervisor.
                    I was thinking about Mundlak's approach, but since the logit model is nonlinear, I am not entirely convinced by this approach either.
                    Anyways, thank you a lot for your help!
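
                    For reference, the Mundlak device is usually implemented by adding individual means of the time-varying covariates to a random-effects model. The following is a minimal sketch with an illustrative covariate list taken from the thread; the mean_* names are hypothetical, and whether the approach is defensible in a logit is exactly the open question raised above.

                    ```stata
                    * Sketch only: individual means of time-varying covariates
                    foreach v of varlist age thinc_m ep013_mod {
                        bysort ident: egen double mean_`v' = mean(`v')
                    }

                    * RE logit augmented with the means (correlated random effects)
                    xtlogit smoking i.female age thinc_m ep013_mod ///
                        mean_age mean_thinc_m mean_ep013_mod, re
                    ```

                    Unlike the conditional -fe- estimator, this keeps individuals whose smoking status never changes, at the cost of assuming that the individual effect is related to the covariates only through their means.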
