  • How to Create a Balanced Panel Dataset

    Hi everyone,
    I am working with an unbalanced panel of individuals observed over 5 waves:


         pid:  401011, 401012, ..., 799987               n =      24955
        wave:  1, 2, ..., 5                              T =          5
               Delta(wave) = 1 unit
               Span(wave)  = 5 periods
               (pid*wave uniquely identifies each observation)

    Distribution of T_i:     min      5%     25%     50%     75%     95%     max
                               1       1       1       2       4       5       5

         Freq.  Percent    Cum. |  Pattern
     ---------------------------+---------
          4140    16.59   16.59 |  ....1
          3786    15.17   31.76 |  11111
          3355    13.44   45.21 |  ...11
          1761     7.06   52.26 |  ..111
          1736     6.96   59.22 |  .1111
          1733     6.94   66.16 |  1....
          1637     6.56   72.72 |  111..
          1321     5.29   78.02 |  ...1.
          1313     5.26   83.28 |  1111.
          4173    16.72  100.00 | (other patterns)
     ---------------------------+---------
         24955   100.00         |  XXXXX


    I want to create a balanced panel consisting of individuals observed over the last 3 waves. However, when I run the code below my panel remains unbalanced:

    egen wanted = total(inrange(wave, 3, 5)), by(pid)
    keep if wanted==3

    xtdes

         pid:  401014, 401016, ..., 791879               n =       7283
        wave:  1, 2, ..., 5                              T =          5
               Delta(wave) = 1 unit
               Span(wave)  = 5 periods
               (pid*wave uniquely identifies each observation)

    Distribution of T_i:     min      5%     25%     50%     75%     95%     max
                               3       3       4       5       5       5       5

         Freq.  Percent    Cum. |  Pattern
     ---------------------------+---------
          3786    51.98   51.98 |  11111
          1761    24.18   76.16 |  ..111
          1736    23.84  100.00 |  .1111
     ---------------------------+---------
          7283   100.00         |  XXXXX

    Which code should I use to get a balanced panel for the last 3 waves? Thanks.

  • #2
    Charles:
    this is a very bad idea indeed.
    Following your approach, you'll end up with a selected subsample that may bear little resemblance to the original dataset, and any inference from it is likely to be biased.
    That said, after the two commands you already ran, you may want to try:
    Code:
    keep if wave>=3
    Kind regards,
    Carlo
    (Stata 19.0)
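
A note on why the panel stays unbalanced after the original two commands: egen with total() only flags the individuals who appear in all of waves 3 to 5, and keep if wanted==3 then keeps every row for those individuals, including their rows for waves 1 and 2 (hence the 11111 and .1111 patterns in the second xtdes output). A minimal sketch of the two steps combined, run on a small made-up panel. The data values are hypothetical; only the variable names pid and wave come from the post, and the sketch assumes pid and wave uniquely identify each observation.

```stata
* Toy unbalanced panel (hypothetical data, not the poster's)
clear
input pid wave
1 1
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 1
4 2
4 3
4 5
end

* Count, per pid, how many of waves 3-5 are observed
egen wanted = total(inrange(wave, 3, 5)), by(pid)

* Keep only pids observed in all three of waves 3, 4 and 5 ...
keep if wanted == 3

* ... then drop their rows for waves 1 and 2, which the step above leaves in
keep if inrange(wave, 3, 5)
drop wanted

* Declare the panel and inspect the participation patterns
xtset pid wave
xtdescribe
```

On this toy data only pids 1 and 2 survive, each with waves 3, 4 and 5, so xtset should report the panel as strongly balanced.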



    • #3
      Hi Charles, Carlo is right in the sense that if the entire sample is properly representative of a population and the final three waves are not, restricting the sample to those waves will bias sample statistics and regression coefficients. Depending on your study, in the extreme this could make your results the opposite of what the data actually say, though that won't necessarily happen.

      What is it you're trying to accomplish by making the dataset balanced? Is it a matter of robust summary statistics, or that you haven't found an estimation procedure that works for your hypothesis? You might do just as well removing pids which are not present in every wave.
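
The last suggestion, keeping only pids present in every wave, could be sketched as follows. This is an illustration on made-up data, assuming one row per pid-wave pair and five waves in total (the 5 is hard-coded). In the tabulation from the original post it would correspond to the 3786 individuals with pattern 11111.

```stata
* Toy unbalanced panel (hypothetical data, not the poster's)
clear
input pid wave
1 1
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 1
3 2
3 3
3 5
end

* With one row per pid-wave, _N within pid is the number of waves observed;
* keep only the individuals seen in all 5 waves
bysort pid: keep if _N == 5

* Declare the panel and inspect the participation patterns
xtset pid wave
xtdescribe
```

Here only pid 1 remains, with all five waves.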
