
  • How to keep participants present in every wave/year in panel data?

    I am using the British Household Panel Survey (BHPS) and have appended all 18 waves together. I now want to keep only those participants who have observations present in each wave. Some may have entered the survey in wave 2 or wave 3 or some started in wave 1 but dropped out later; these I want to drop.

    Here are 10 observations out of over 200,000:

    1000209 2 2 3 10002251 6 0 3488.5703125 1991 1
    1000381 1 2 2 10004491 6 0 1789.7335205078125 1991 1
    2000148 1 1 1 10004491 6 0 6345.7158203125 1992 2
    1000381 1 2 2 10004521 6 0 1789.7335205078125 1991 1
    2000148 1 2 2 10004521 6 0 5826.1171875 1992 2
    3000192 1 2 2 10004521 6 0 5101.91064453125 1993 3
    1000667 2 2 2 10007857 3 0 7200.06005859375 1991 1
    2000296 2 2 3 10007857 3 0 9829.087890625 1992 2
    3000257 2 2 2 10007857 3 0 8795.4912109375 1993 3
    8410658 2 2 2 10007857 3 0 2258.809814453125 1998 8


    This is what I have done:

    use "Q:\fulldata.dta"
    (Contains individual-level data for respondents)

    . tsset pid wave
    panel variable: pid (unbalanced)
    time variable: wave, 1 to 18, but with gaps
    delta: 1 unit

    . bysort pid: keep if _N ==18
    (206,781 observations deleted)

    My problem here is that it just deleted all the observations instead of keeping those present in every wave.

    Can someone advise?

    Thanks in advance

    p.s. first Statalist post, forgive me if the formatting is wrong.

  • #2
    If -bysort pid: keep if _N == 18- resulted in all observations being deleted, that would suggest that there is nobody in the data set who participated in all 18 waves. That doesn't really seem very surprising to me.
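
    One way to check this is to tabulate how many waves each person contributes. The following is only a sketch on a made-up toy panel (the pid values and the helper variables n_waves and one_per_pid are hypothetical); in the real data, skip the -input- block and look for a row at 18 in the table.

    ```stata
    clear
    * Hypothetical toy panel: person 101 is in 3 waves, 102 in 2, 103 in 1
    input long pid byte wave
    101 1
    101 2
    101 3
    102 2
    102 3
    103 1
    end

    * Number of waves observed for each person
    bysort pid (wave): gen byte n_waves = _N

    * Flag exactly one observation per person so each person is counted once
    egen byte one_per_pid = tag(pid)

    * Distribution of the number of waves per person
    tabulate n_waves if one_per_pid
    ```

    If the largest value in that table is below 18, then -keep if _N == 18- will necessarily drop everything.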

    The data example you showed is not helpful. There are no variable names: it's anybody's guess which variable is pid, which is wave, etc. Also, a data example with 10 observations is not very helpful for solving a problem relating to data chunks of 18 observations! And even had you included all that, the description shown would still be missing attributes of the data that are sometimes important for answering the question posed (though probably not in this particular situation). The useful way to show example data is with the -dataex- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
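
    As a rough illustration of what a -dataex- call might look like here, the sketch below builds a tiny made-up panel (the values are hypothetical) and then exports the two key variables. With real data already in memory, only the last two commands would be needed.

    ```stata
    clear
    * Hypothetical toy data standing in for the real panel
    input long pid byte wave
    101 1
    101 2
    102 2
    102 3
    end

    * Sort so that each person's waves appear together, then write out
    * the first few observations in a form that can be pasted into Stata
    sort pid wave
    dataex pid wave in 1/4
    ```

    The text that -dataex- prints in the Results window is what gets copied into the forum post.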



    • #3
      Hello,

      I have the same issue as above. I am using the Understanding Society database and I need to keep those who participated in every wave from 1 to 5.

      I tried to install dataex, but because I am connected via bin to the uni server I could not install it.
      Here is an example of the data, where pidp is the person identifier, followed by wave, age, and number of children:




      thank you
      pidp wave age nchildren
      223725 5 38 1 .
      261125 2 24 1 .
      261125 3 25 1 .
      261125 4 26 1 .
      299885 4 32 1 .
      537205 2 34 1 .
      537205 3 35 1 .
      541285 3 25 1 .
      541285 4 26 1 .
      541285 5 27 1 .
      665045 2 28 1 .
      665045 3 29 1 0
      665045 4 30 1 0
      665045 5 31 1 0
      813285 2 40 1 .
      813285 3 41 1 .
      813285 4 42 1 .
      813285 5 43 1 .
      940445 2 30 1 .
      945205 2 36 1 .
      952005 2 59 1 .
      956765 2 55 1 .
      956765 3 56 1 .
      956765 4 57 1 .
      956765 5 58 1 .
      1114525 2 36 1 .
      1390605 2 19 1 .
      1731965 5 22 1 0
      1833965 2 45 1 0
      1833965 3 46 1 0
      1833965 4 47 1 .
      1833965 5 48 1 0
      2292285 3 36 1 .
      2292285 4 36 1 .
      2292285 5 38 1 .
      2297045 5 16 1 .
      2626845 2 32 1 2
      2626845 4 34 1 2
      2665605 2 38 1 .
      2665605 3 39 1 .
      2665605 4 40 1 .
      2665605 5 41 1 .
      2817245 2 48 1 2
      2817245 4 50 1 0
      2817245 5 51 1 .
      2825405 3 30 1 .
      2825405 4 31 1 .
      2825405 5 32 1 .
      2932845 2 27 1 .
      2932845 3 28 1 .
      2932845 4 29 1 .
      3063405 3 39 1 .
      3063405 4 40 1 .
      3489765 2 68 1 .
      3489765 3 69 1 .
      3489765 4 70 1 .
      3489765 5 71 1 .
      3565245 4 24 1 .
      3565245 5 25 1 .
      3567285 4 20 1 .
      3567285 5 21 1 .
      3568645 4 17 1 .
      3568645 5 17 1 .










      • #4
        Code:
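        * Verify that pidp and wave jointly identify observations, and sort by them
        isid pidp wave, sort
        * Verify that wave takes only the values 1 through 5
        assert inlist(wave, 1, 2, 3, 4, 5)
        * Keep only people who have an observation in each of the 5 waves
        by pidp (wave): keep if _N == 5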
        Note: In your example data, nobody is in all five waves. In fact, nobody is in wave 1. So if your real data set is like this, you will end up with an empty data set.
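
        To make the note concrete, here is a small sketch that runs the same three commands on a handful of rows copied from the example in #3 (only pidp and wave are entered; the other columns are left out for brevity).

        ```stata
        clear
        * A few pidp/wave pairs taken from the example data in #3
        input long pidp byte wave
        665045 2
        665045 3
        665045 4
        665045 5
        2626845 2
        2626845 4
        223725 5
        end

        isid pidp wave, sort
        assert inlist(wave, 1, 2, 3, 4, 5)
        by pidp (wave): keep if _N == 5

        * Nobody has all five waves, so no observations are left
        count
        ```

        Running -tabulate wave- on the full data set before the -keep- is a quick way to see whether wave 1 is present at all.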



        • #5
          Thank you for your reply. I appreciate your help.
