Problem with (pseudo) Panel Data - unbalanced panel

Katharina Koe

Join Date: May 2018
Posts: 13

07 May 2018, 03:18

Hi everyone,

I am looking for help working on an unbalanced panel data set.

My dataset includes:
- 55,957 observations
- 4 waves (time variable)
- 33873 individuals (cross section variable)
- no missing values (I checked using the command "misstable sum")

My dataset is a survey, and individuals often did not respond to the questionnaire in every wave.
I used xtset to declare my dataset as panel data.

Code:

. xtset ident wave
       panel variable:  ident (unbalanced)
        time variable:  wave, 1 to 5, but with gaps
                delta:  1 unit

I used xtdescribe to see details

Code:

. xtdescribe

   ident:  2, 3, ..., 39561                                  n =      33873
    wave:  1, 2, ..., 5                                      T =          4
           Delta(wave) = 1 unit
           Span(wave)  = 5 periods
           (ident*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       1         1         2       4       4

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------
     9686     28.60   28.60 |  ....1
     5297     15.64   44.23 |  ...11
     4182     12.35   56.58 |  11...
     3711     10.96   67.53 |  ...1.
     3274      9.67   77.20 |  .1...
     1766      5.21   82.41 |  11.11
     1622      4.79   87.20 |  1....
     1415      4.18   91.38 |  .1.11
     1068      3.15   94.53 |  11.1.
     1852      5.47  100.00 | (other patterns)
 ---------------------------+---------
    33873    100.00         |  XX.XX

My dependent variable is binary and I want to estimate a fixed effects model.

If I run the regression (in the unbalanced panel)

Code:

xtlogit smoking i.female age i.employment ep013_mod thinc_m Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w4 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, fe

Stata drops the lion's share of the observations (all those individuals that did not respond to every wave, I guess) and the output is as follows:

Code:

note: multiple positive outcomes within groups encountered.
note: 32,480 groups (52,257 obs) dropped because of all positive or
      all negative outcomes.
note: 1.female omitted because of no within-group variance.
note: c1 omitted because of no within-group variance.
note: c2 omitted because of no within-group variance.
note: c3 omitted because of no within-group variance.
note: c4 omitted because of no within-group variance.
note: c5 omitted because of no within-group variance.
note: c6 omitted because of no within-group variance.
note: c7 omitted because of no within-group variance.
note: c8 omitted because of no within-group variance.
note: c9 omitted because of no within-group variance.
note: c10 omitted because of no within-group variance.

Iteration 0:   log likelihood = -1281.1499  
Iteration 1:   log likelihood = -1270.7461  
Iteration 2:   log likelihood = -1270.5738  
Iteration 3:   log likelihood = -1270.5513  
Iteration 4:   log likelihood = -1270.5462  
Iteration 5:   log likelihood =  -1270.545  
Iteration 6:   log likelihood = -1270.5448  
Iteration 7:   log likelihood = -1270.5447  

Conditional fixed-effects logistic regression   Number of obs     =      3,697
Group variable: ident                           Number of groups  =      1,392

                                                Obs per group:
                                                              min =          2
                                                              avg =        2.7
                                                              max =          4

                                                LR chi2(19)       =     118.51
Log likelihood  = -1270.5447                    Prob > chi2       =     0.0000

-----------------------------------------------------------------------------------------------
                      smoking |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------+----------------------------------------------------------------
                       female |
                   1. female  |          0  (omitted)
                          age |  -.3267943   .1637562    -2.00   0.046    -.6477504   -.0058381
                              |
                   employment |
          permanent employee  |  -.0258121   .2591523    -0.10   0.921    -.5337413    .4821171
    short-term civil servant  |   .5217493   .7114688     0.73   0.463    -.8727039    1.916202
     permanent civil servant  |  -.1018764   .3151085    -0.32   0.746    -.7194778     .515725
permanently sick or disabled  |  -.3074872    .368893    -0.83   0.405    -1.030504    .4155298
                   homemaker  |  -.0474993   .3501035    -0.14   0.892    -.7336896     .638691
                  unemployed  |   .1966054   .3356884     0.59   0.558    -.4613317    .8545425
                       other  |  -.4163475   .4389228    -0.95   0.343     -1.27662    .4439255
               seld-employed  |  -.3035182   .3188496    -0.95   0.341     -.928452    .3214155
         employee  undefined  |   .0960151   .2638505     0.36   0.716    -.4211224    .6131526
     civil servant undefined  |   .1273983   .3037235     0.42   0.675    -.4678888    .7226853
   employee or self-employed  |   12.53894   436.7359     0.03   0.977    -843.4476    868.5255
                              |
                    ep013_mod |   .0024455   .0041442     0.59   0.555     -.005677     .010568
                      thinc_m |   9.67e-07   1.04e-06     0.93   0.351    -1.06e-06    3.00e-06
          Long_term_UNEM_RATE |   .5184911   .1590076     3.26   0.001      .206842    .8301402
         Short_term_UNEM_RATE |  -.1705346   .0830117    -2.05   0.040    -.3332345   -.0078347
                           w1 |  -1.330023   1.093346    -1.22   0.224    -3.472942    .8128962
                           w2 |  -.9778144   .6955041    -1.41   0.160    -2.340977    .3853486
                           w4 |    .415658   .3283179     1.27   0.206    -.2278332    1.059149
                           c1 |          0  (omitted)
                           c2 |          0  (omitted)
                           c3 |          0  (omitted)
                           c4 |          0  (omitted)
                           c5 |          0  (omitted)
                           c6 |          0  (omitted)
                           c7 |          0  (omitted)
                           c8 |          0  (omitted)
                           c9 |          0  (omitted)
                          c10 |          0  (omitted)
-----------------------------------------------------------------------------------------------

So far, I considered the following solutions:
1) ipolate: in my case I cannot assume that the missing values follow a linear pattern, so I would rather not use it
2) clustering by country and year of birth, which does not seem to work because of repeated time values
[egen long both = group(oldcountry dn003_mod)]

Code:

. xtset both wave
     repeated time values within panel

3) dropping all observations that cause the imbalance in the panel (but then I would only analyze 6% of my dataset).
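
For reference, one way the pseudo-panel idea under 2) might be sketched is to collapse the individual data to cohort-by-wave means before calling -xtset-, so that each cohort appears only once per wave. This is only a rough sketch: the cohort definition (oldcountry by dn003_mod), the new names cohort and n_cell, the choice of regressors, and the linear model on cell means are all assumptions, not something I have verified.

```stata
* Sketch only: build a pseudo panel of cohort-by-wave means.
* Assumes oldcountry (country) and dn003_mod (year of birth) define the cohort;
* cohort and n_cell are new, hypothetical variable names.
preserve
egen long cohort = group(oldcountry dn003_mod)

* One row per cohort and wave: means of the variables, plus the cell size
collapse (mean) smoking age thinc_m (count) n_cell = smoking, by(cohort wave)

* Each cohort now has at most one observation per wave, so xtset accepts it
xtset cohort wave

* Inspect cell sizes; very small cells give noisy cohort means
summarize n_cell, detail

* Linear FE model for the cohort share of smokers
xtreg smoking age thinc_m, fe
restore
```

The "repeated time values" error above arises because several individuals share the same cohort and wave; after the collapse that is no longer the case.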

My question is: how can I deal with an unbalanced panel? Are there methods to balance it? Can I treat this as a pseudo panel? (Since I am only looking at static effects, I would not mind losing the dynamics.) If so, how does this work in Stata?
I am working on my master's thesis and feel a bit lost. I would greatly appreciate any ideas / solutions to my problem.

Many thanks
Katharina Koe


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17702
#2

07 May 2018, 05:03

Katharina:
welcome to this forum.
As far as your query is concerned:
1) Stata can handle both balanced and unbalanced panel datasets;
2) hence, do not drop the observations with missing values; if you think that -ipolate- is not the way to go, you may want to consider the -mi- suite of commands;
3) however, the main issue with your dataset rests on the perfect prediction;
4) if you have repeated time values in your panels, you can -xtset- your data with the -panelid- only (if you do not plan to use time-series operators):

Code:

xtset ident

;
5) is there any reason why you should go (conditional) -fe- instead of -re-?

Kind regards,
Carlo
(Stata 19.0)
Katharina Koe

Join Date: May 2018

Posts: 13
#3

07 May 2018, 09:27

Dear Carlo,
thank you very much for your help!

1) that's good to know
2) I will look into -mi-
3) I am not sure I understand what you mean. I am trying to analyze whether individual behavior (smoking) changes over time when my independent variables change. I do not need the effect at the individual level; the effect in aggregate terms would be sufficient.
4) thank you, I am trying that right now. My Stata (Stata 15.1 SE) has been calculating for 30 minutes already, so I hope to get some results.

5) I used -fe- instead of -re- because (that's what I learned in my econometrics class)
RE can accommodate time-invariant variables but makes the unrealistic assumption that the omitted heterogeneity is uncorrelated with the regressors
FE allows for correlation between the omitted heterogeneity and the regressors but cannot accommodate time-invariant variables.
In my data, I observe more between variation than within variation.
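
To make that last statement concrete, the between and within variation can be looked at with -xtsum-. A minimal sketch, using a few variables from my model above (which ones to list is my own choice):

```stata
* Decompose the variation of a few variables into between and within components.
* Requires the data to be xtset (here: xtset ident wave).
xtsum smoking age thinc_m ep013_mod

* For the binary outcome, xttrans shows how often individuals switch status
xttrans smoking, freq
```
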

Please correct me if I am wrong here.

Many thanks!
Katharina
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17702
#4

07 May 2018, 10:01

Katharina:
in 3) I meant what Stata warned you about: some predictors are omitted due to no within-group variance;
Your point # 5) about the difference between -fe- and -re- specification is correct. Just out of curiosity: did you test (conditional, in -xtlogit-) -fe- vs -re- specification via -hausman-?

Kind regards,
Carlo
(Stata 19.0)

Katharina Koe

Join Date: May 2018
Posts: 13

08 May 2018, 05:39

Carlo,
thank you for your answer.

The Hausman test indeed suggests that I should use FE.

I ran both the FE and RE regressions, stored my results, and then conducted the Hausman test:

Code:

xtlogit smoking i.female age i.iv009_mod eduyears_mod i.mar_stat hhsize i.partnerinhh ch001_ ///
    i.sphus bmi i.alcohol i.physicalinac i.employment ep013_mod i.co007_ thinc_m ///
    GDP_growth_rate Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w3 w4 ///
    c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, fe

estimates store fixed

xtlogit smoking i.female age i.iv009_mod eduyears_mod i.mar_stat hhsize i.partnerinhh ch001_ ///
    i.sphus bmi i.alcohol i.physicalinac i.employment ep013_mod i.co007_ thinc_m ///
    GDP_growth_rate Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w3 w4 ///
    c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, re

estimates store random

hausman fixed random

 b = consistent under Ho and Ha; obtained from xtlogit
          B = inconsistent under Ha, efficient under Ho; obtained from xtlogit

    Test:  Ho:  difference in coefficients not systematic

                 chi2(26) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =      223.09
                Prob>chi2 =      0.0000


Katharina Koe

Join Date: May 2018
Posts: 13

08 May 2018, 05:47

However, I am still struggling with the FE model because I lose too many observations.

- observations: 52221
- individuals: n = 31740
- wave (time): t = 4

If I run the FE model

Code:

xtlogit smoking i.female age i.iv009_mod eduyears_mod i.mar_stat hhsize i.partnerinhh ch001_ i.sphus bmi i.ever_smoked i.alcohol i.physicalinac i.employment ep013_mod i.co007_ thinc_m GDP_growth_rate Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w3 w4 w5 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, fe

Stata reports

Code:

note: multiple positive outcomes within groups encountered.
note: 30,391 groups (48,642 obs) dropped because of all positive or
      all negative outcomes.
note: 1.female omitted because of no within-group variance.
note: eduyears_mod omitted because of no within-group variance.
note: c1 omitted because of no within-group variance.
note: c2 omitted because of no within-group variance.
note: c3 omitted because of no within-group variance.
note: c4 omitted because of no within-group variance.
note: c5 omitted because of no within-group variance.
note: c6 omitted because of no within-group variance.
note: c7 omitted because of no within-group variance.
note: c8 omitted because of no within-group variance.
note: c9 omitted because of no within-group variance.
note: c10 omitted because of no within-group variance.

where c1 to c10 are the country dummies.

What does "note: 30,391 groups (48,642 obs) dropped because of all positive or all negative outcomes." mean?
And is there some solution for that?

Would it be possible to group the individuals into cohorts?

Thank you!


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17702
#7

08 May 2018, 06:48

Katharina:
Stata tells you that the omitted groups have no within-panel variation as far as the outcome is concerned: hence their inclusion in the regression is basically uninformative.
Consistently with the -fe- machinery, if persons do not move to other countries within the same panel, -country- will be omitted.
Unfortunately, I think there's nothing you can do but stick with the limited sample size.
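
If you want to see this in your data, something along the following lines counts how many individuals ever change smoking status (the only ones that contribute to the conditional likelihood). It is just a sketch; smoke_min, smoke_max, switcher and first_obs are names I made up:

```stata
* Flag individuals whose smoking status varies across their observed waves.
bysort ident: egen byte smoke_min = min(smoking)
bysort ident: egen byte smoke_max = max(smoking)
generate byte switcher = smoke_min != smoke_max

* Count each individual once
egen byte first_obs = tag(ident)
tabulate switcher if first_obs
```

Individuals observed in a single wave can never be switchers, so with a median T_i of 1 in your -xtdescribe- output a large share of dropped groups is to be expected.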

Kind regards,
Carlo
(Stata 19.0)
Katharina Koe

Join Date: May 2018

Posts: 13
#8

08 May 2018, 08:04

Thank you Carlo.
It seems like I misunderstood the country fixed effect. It makes sense that in my panel individuals did not move to another country.

What I am interested in is rather the country-specific effect of taxes / prices of cigarettes.
And additionally the effect that in some countries smoking shows a downward sloping trend over time.

Do you have an idea how I can model these?
Do I need to set:

Code:

xtset country year

Many thanks!

Last edited by Katharina Koe; 08 May 2018, 08:07.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17702
#9

08 May 2018, 08:49

Katharina:
let's stick with the backbones of -fe- estimator:
- -fe- is the right estimator to investigate what happens within the same panel as time goes by (do people quit smoking or not when moving from, say, year 1 to year 2?), whereas -fe- will tell you basically nothing about possible changes between different panels as time goes by;
- in a nutshell, the -fe- estimator gets rid of time-invariant predictors (eg, country, if the patient does not change country as time goes by) and estimates coefficients of time-varying predictors, with the implicit shortcoming that if a predictor expected to vary over time does not do so in your dataset, it will be omitted due to the -fe- machinery and no coefficient will be estimated;
- that said, the -fe- estimator works well when there's enough (whatever that qualitative term may mean) variation in the time-varying predictors;
- if you -xtset country year- your -panelid- will be country instead of patients and, since you have observations (ie, panel units) nested within countries, Stata will warn you about repeated -timevar- within the same panel. I also suspect that considering data at the country level does not allow you to draw any conclusion at the individual level (https://en.wikipedia.org/wiki/Ecological_fallacy).
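
A side remark on country-specific trends: although the country dummies themselves drop out under -fe-, an interaction of country with the time variable does vary within individuals and can in principle be estimated while keeping the individual as the panel unit. A hedged sketch, assuming a numeric country identifier (I use oldcountry from your first post, which may not be the right variable) and a reduced regressor list:

```stata
* Keep the individual as the panel unit
xtset ident wave

* Country-specific linear trends in the log odds of smoking:
* i.oldcountry#c.wave varies within ident even though oldcountry does not.
xtlogit smoking i.employment thinc_m i.oldcountry#c.wave, fe
```

Stata may omit one of the trends if they turn out to be collinear with other time-related regressors such as age or the wave dummies.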

All that said, it is also true that, in some research fields different from economics, fixed effects are often side-tracked in favour of the -re- specification, despite the -hausman- outcome (Clyde Schechter touched on this feature many times on this forum).
Obviously, any statistical strategy should be defensible, especially against reviewers' criticisms: hence, I would recommend you to discuss the whole matter with a colleague and/or with your supervisor.

Kind regards,
Carlo
(Stata 19.0)
Katharina Koe

Join Date: May 2018

Posts: 13
#10

08 May 2018, 11:55

Carlo, thanks again for your help!
I will discuss this with my thesis supervisor.
I was thinking about Mundlak's approach but then the logit model is not linear, so I am also not entirely convinced by this approach.
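
For the record, what I have in mind is roughly the following: add the individual means of the time-varying regressors to an RE logit (often called a correlated random effects specification). This is only a sketch with a reduced variable list; the m_ prefix is my own naming, and the means are taken over whichever waves each person is observed in.

```stata
* Individual-specific means of the time-varying regressors
foreach v of varlist age thinc_m ep013_mod Long_term_UNEM_RATE Short_term_UNEM_RATE {
    bysort ident: egen double m_`v' = mean(`v')
}

* RE logit augmented with the means; time-invariant regressors such as
* i.female can stay in the model
xtlogit smoking i.female age thinc_m ep013_mod Long_term_UNEM_RATE ///
    Short_term_UNEM_RATE m_* i.wave, re

* Joint test of the mean terms; rejecting suggests the plain RE assumption is doubtful
testparm m_*
```
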
Anyways, thank you a lot for your help!