Hi everyone,
I am looking for help working with an unbalanced panel data set.
My dataset includes:
- 55,957 observations
- 4 waves (time variable)
- 33,873 individuals (cross-section variable)
- no missing values (I checked using the command "misstable sum")
My dataset is a survey, and individuals often did not respond to the questionnaire in every wave.
I used xtset to declare the data as panel data:
Code:
. xtset ident wave
       panel variable:  ident (unbalanced)
        time variable:  wave, 1 to 5, but with gaps
                delta:  1 unit
I used xtdescribe to see the details:
Code:
. xtdescribe

   ident:  2, 3, ..., 39561                                  n =      33873
    wave:  1, 2, ..., 5                                      T =          4
           Delta(wave) = 1 unit
           Span(wave)  = 5 periods
           (ident*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       1         1         2       4       4

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------
     9686     28.60   28.60 |  ....1
     5297     15.64   44.23 |  ...11
     4182     12.35   56.58 |  11...
     3711     10.96   67.53 |  ...1.
     3274      9.67   77.20 |  .1...
     1766      5.21   82.41 |  11.11
     1622      4.79   87.20 |  1....
     1415      4.18   91.38 |  .1.11
     1068      3.15   94.53 |  11.1.
     1852      5.47  100.00 | (other patterns)
 ---------------------------+---------
    33873    100.00         |  XX.XX
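The participation patterns reported by xtdescribe can also be summarised per person. This is only a minimal sketch, assuming the data are xtset on ident and wave as shown above; nwaves and one_per_id are hypothetical helper variables that do not exist in the dataset.

```stata
* Number of waves in which each individual is observed
* (nwaves and one_per_id are hypothetical helper variables)
bysort ident: generate nwaves = _N

* Flag exactly one row per individual so that people, not rows, are counted
egen one_per_id = tag(ident)

* Distribution of the number of observed waves across individuals
tabulate nwaves if one_per_id
```

The tabulation should mirror the "Distribution of T_i" line in the xtdescribe output, which makes it a quick consistency check before modelling.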
If I run the regression (in the unbalanced panel)
Code:
xtlogit smoking i.female age i.employment ep013_mod thinc_m Long_term_UNEM_RATE Short_term_UNEM_RATE w1 w2 w4 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10, fe
Stata drops the lion's share of the observations (all those individuals who did not respond to every wave, I guess) and the output is as follows:
Code:
note: multiple positive outcomes within groups encountered.
note: 32,480 groups (52,257 obs) dropped because of all positive or
      all negative outcomes.
note: 1.female omitted because of no within-group variance.
note: c1 omitted because of no within-group variance.
note: c2 omitted because of no within-group variance.
note: c3 omitted because of no within-group variance.
note: c4 omitted because of no within-group variance.
note: c5 omitted because of no within-group variance.
note: c6 omitted because of no within-group variance.
note: c7 omitted because of no within-group variance.
note: c8 omitted because of no within-group variance.
note: c9 omitted because of no within-group variance.
note: c10 omitted because of no within-group variance.

Iteration 0:   log likelihood = -1281.1499
Iteration 1:   log likelihood = -1270.7461
Iteration 2:   log likelihood = -1270.5738
Iteration 3:   log likelihood = -1270.5513
Iteration 4:   log likelihood = -1270.5462
Iteration 5:   log likelihood =  -1270.545
Iteration 6:   log likelihood = -1270.5448
Iteration 7:   log likelihood = -1270.5447

Conditional fixed-effects logistic regression   Number of obs     =      3,697
Group variable: ident                           Number of groups  =      1,392

                                                Obs per group:
                                                              min =          2
                                                              avg =        2.7
                                                              max =          4

                                                LR chi2(19)       =     118.51
Log likelihood  = -1270.5447                    Prob > chi2       =     0.0000

-----------------------------------------------------------------------------------------------
                      smoking |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------+----------------------------------------------------------------
                       female |
                    1. female |          0  (omitted)
                          age |  -.3267943   .1637562    -2.00   0.046    -.6477504   -.0058381
                              |
                   employment |
           permanent employee |  -.0258121   .2591523    -0.10   0.921    -.5337413    .4821171
     short-term civil servant |   .5217493   .7114688     0.73   0.463    -.8727039    1.916202
      permanent civil servant |  -.1018764   .3151085    -0.32   0.746    -.7194778     .515725
 permanently sick or disabled |  -.3074872    .368893    -0.83   0.405    -1.030504    .4155298
                    homemaker |  -.0474993   .3501035    -0.14   0.892    -.7336896     .638691
                   unemployed |   .1966054   .3356884     0.59   0.558    -.4613317    .8545425
                        other |  -.4163475   .4389228    -0.95   0.343     -1.27662    .4439255
                seld-employed |  -.3035182   .3188496    -0.95   0.341     -.928452    .3214155
           employee undefined |   .0960151   .2638505     0.36   0.716    -.4211224    .6131526
      civil servant undefined |   .1273983   .3037235     0.42   0.675    -.4678888    .7226853
    employee or self-employed |   12.53894   436.7359     0.03   0.977    -843.4476    868.5255
                              |
                    ep013_mod |   .0024455   .0041442     0.59   0.555     -.005677     .010568
                      thinc_m |   9.67e-07   1.04e-06     0.93   0.351    -1.06e-06    3.00e-06
          Long_term_UNEM_RATE |   .5184911   .1590076     3.26   0.001      .206842    .8301402
         Short_term_UNEM_RATE |  -.1705346   .0830117    -2.05   0.040    -.3332345   -.0078347
                           w1 |  -1.330023   1.093346    -1.22   0.224    -3.472942    .8128962
                           w2 |  -.9778144   .6955041    -1.41   0.160    -2.340977    .3853486
                           w4 |    .415658   .3283179     1.27   0.206    -.2278332    1.059149
                           c1 |          0  (omitted)
                           c2 |          0  (omitted)
                           c3 |          0  (omitted)
                           c4 |          0  (omitted)
                           c5 |          0  (omitted)
                           c6 |          0  (omitted)
                           c7 |          0  (omitted)
                           c8 |          0  (omitted)
                           c9 |          0  (omitted)
                          c10 |          0  (omitted)
-----------------------------------------------------------------------------------------------
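The note "32,480 groups (52,257 obs) dropped because of all positive or all negative outcomes" suggests that the loss comes from individuals whose smoking status never changes over the waves in which they are observed, rather than from skipped waves as such: the conditional fixed-effects logit only uses groups with variation in the outcome. A rough way to check this, assuming smoking is coded 0/1 and the data are xtset on ident and wave (smk_min, smk_max, changer and id_tag are hypothetical helper variables):

```stata
* Within-person minimum and maximum of the binary outcome
* (smk_min, smk_max, changer and id_tag are hypothetical helper variables)
bysort ident: egen smk_min = min(smoking)
bysort ident: egen smk_max = max(smoking)

* changer = 1 if the person is observed both smoking and not smoking
generate byte changer = smk_min != smk_max

* Count individuals with and without a change in smoking status
egen id_tag = tag(ident)
tabulate changer if id_tag

* Observations belonging to changers; compare with "Number of obs" above
count if changer
```

If the counts are close to the 1,392 groups and 3,697 observations reported in the output, balancing the panel would not bring the dropped observations back, because individuals observed only once or with a constant outcome carry no within-person information for this estimator.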
So far, I have considered the following solutions:
1) ipolate: in my case I cannot assume that the missing observations follow a linear path, so I would rather not use it
2) clustering by country and year of birth, which does not seem to work because of repeated time values
[egen long both = group(oldcountry dn003_mod)]
Code:
. xtset both wave
repeated time values within panel
3) dropping all observations that cause the imbalance in the panel (but then I would only analyze 6% of my dataset).
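The "repeated time values within panel" message appears because many individuals share the same combination of both and wave, so the two variables do not identify rows uniquely. A pseudo-panel is usually built from cell averages, with one row per cohort and wave. The following is only a sketch under assumptions: both was created as above, cell means of the chosen regressors are meaningful, and a linear model for the cell-level smoking share is acceptable. n_cell is a hypothetical helper variable and the regressor list is illustrative.

```stata
* Work on a temporary copy; the individual-level data are restored at the end
preserve

* Cell size: how many individuals stand behind each cohort-by-wave mean
* (n_cell is a hypothetical helper variable)
bysort both wave: generate n_cell = _N

* One row per cohort-by-wave cell with means of the outcome and regressors
collapse (mean) smoking age thinc_m Long_term_UNEM_RATE Short_term_UNEM_RATE ///
         (first) n_cell, by(both wave)

* both and wave now identify each row uniquely, so xtset no longer complains
xtset both wave

* smoking is now a cell share, so a linear fixed-effects model is one option
xtreg smoking age thinc_m Long_term_UNEM_RATE Short_term_UNEM_RATE i.wave, fe

restore
```

Inspecting n_cell before estimating is worthwhile: cells defined by country and year of birth can be small, and means based on a handful of respondents are noisy.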
My question is: how can I deal with an unbalanced panel? Are there methods to balance it? Can I treat this as a pseudo-panel (since I am only looking at static effects, I would not mind losing the dynamics)? If so, how does this work in Stata?
I am working on my master's thesis and feel a bit lost. I would greatly appreciate any ideas or solutions to my problem.
Many thanks
Katharina Koe