Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing duplicates within years in panel data

    Hello,

    I am struggling with something that I wanted to share. I have an unbalanced panel dataset. The panel has been described as follows:

    xtset xwaveid wave, year
    Here, the wave indicates the year, and xwaveid indicates the unique id for each respondent to the survey. However, it is the case that several people from the same household can have been surveyed as well. Households are enumerated using hhid, however this id is specific for that particular wave, and changes every wave.

    Basically, what I want to do is, for each wave (2006-2014):

    - Look at each hhid, and find the number of different xwaveid's under that hhid for that wave.
    - If there is only one xwaveid for that hhid then there is no problem for that wave.
    - If, however, there are more than one xwaveid's for that hhid for that wave, then I only want to keep the xwaveid for that wave which is older, by looking at hgage.

    So I want to remove duplicates at a yearly level and not over the entire dataset. I know what I want to do but I could not translate it into a coding language. I have below provided an example of my data. I would highly appreciate any help.

    Best,
    Merve

    P.S. Basically while it is possible to create a longitudinal individual dataset, I am trying to construct an individual dataset so that individuals will be representative of distinct households over time.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long xwaveid float wave long hhid int hgage
     108110 2009     81 36
     400830 2008   1061 24
     115868 2008   1061 26
     109993 2007   1431 25
     700012 2007   1431 30
     109850 2006   1521 26
     118285 2006   1561 36
     118286 2006   1561 37
     108023 2008   1631 38
     108023 2010   2381 40
     109850 2007   2441 27
     115525 2006   4071 22
     115525 2008   4171 24
     115525 2007   4541 23
     118285 2008   5991 38
     118286 2008   5991 39
     118286 2007   6471 38
     118285 2007   6471 37
     400075 2006   7801 26
     100619 2006   7801 24
     115131 2008   7971 28
     108023 2009   8241 39
     115131 2007   8641 27
     114769 2009   8891 18
     115131 2006   8901 26
     115131 2010   9352 30
     114322 2006   9881 35
     108110 2014 100111 41
     108110 2013 100121 40
     108110 2012 100181 39
     116083 2014 100631 26
     114322 2007  10131 36
     114322 2008  10331 37
    1000282 2014 110251 29
     115525 2014 110251 30
    1000282 2013 110281 28
     115525 2013 110281 29
     109013 2012 110511 33
     113069 2012 111501 50
     106444 2014 115121 21
     115131 2014 116851 34
     115131 2013 117292 33
     108902 2014 117391 29
     114263 2013 118321 29
     114263 2012 118751 28
     114570 2014 119452 21
     114570 2013 120381 20
     114570 2012 120881 19
     106945 2014 124391 35
     800232 2014 128451 32
     101763 2014 128451 34
     800232 2013 129981 31
     101763 2013 129981 33
     800232 2012 130691 30
     101763 2012 130691 32
     114263 2014 130991 30
     114569 2012 131211 21
    1000305 2012 131211 62
     115131 2012 133951 32
     113075 2014 134071 70
    1000276 2013 140491 28
     104369 2013 140491 36
    1000276 2012 140531 27
     104369 2012 140531 35
     119207 2014 140701 24
     117956 2012 140761 30
     119207 2013 140771 23
     116020 2014 140771 25
     114768 2013 140861 27
     104369 2014 141031 37
    1000276 2014 141031 29
     800232 2009  14191 27
     101763 2009  14191 29
     109809 2014 142551 40
     109808 2014 142551 40
     109809 2013 142681 39
     109808 2013 142681 39
     118272 2012 143251 25
     114768 2014 144221 28
     119381 2013 144861 31
     119380 2013 144861 40
     119380 2014 144881 41
     119381 2014 144881 32
     118272 2013 145321 26
     101411 2014 146651 30
     116593 2006  14691 51
     116594 2006  14691 48
     101411 2013 147041 29
     118272 2014 157451 27
     107223 2009  15921 20
     113341 2012 159371 23
     113341 2013 159831 24
     113341 2014 160171 25
     101979 2008  16131 18
     114769 2008  16891 17
     101979 2006  16961 16
     111422 2011 170021 25
     600377 2011 170021 24
     110312 2012 170021 84
     116423 2011 170031 49
    end
    format %ty wave
    label values hgage FHGAGE

  • #2
    Code:
    bys hhid wave (hgage): keep if _n==_N

    ADDED IN EDIT: Missing values are sorted last. Therefore, with missing values present for the variable "hgage", you can do the following:

    Code:
    gen hgage2= -hgage
    bys hhid wave (hgage2): keep if _n==1
    Last edited by Andrew Musau; 02 Apr 2019, 07:26.

    Comment


    • #3
      Hi Andrew,

      There were no missing values in hgage. Therefore your first code worked perfectly fine. Thank you very much for your help.
      Have a nice day.

      Comment


      • #4
        Hi Andrew,

        I think I made a big logical mistake in doing this for every round. What I have been trying to do now is:

        1. For each xwaveid, go to the first year it appears in the data (any year from 2006 to 2014).
        2. Look at the hhid, are there other individuals from the same household, i.e. with the same hhid for that year?
        3. If yes, then look at their ages. Keep the xwaveid with the higher age, not only for that year but for all subsequent years, i.e. remove the younger xwaveid from all subsequent years.
        4. If no, then no problem.

        Could you guide me on how I could write the code for this one?
        Kind regards


        Comment


        • #5
          2. Look at the hhid, are there other individuals from the same household, i.e. with the same hhid for that year?
          In that particular year (from #1) or in any year? It is possible that another individual was in the household in a year different from the year when xwaveid first appears. Secondly, what happens if a certain xwaveid is the oldest in one household but then moves to another household where there is someone older. Do we delete that individual's observations from the first household too?

          Comment


          • #6
            The following will search for an xwaveid the first year it appears on the data set and determine its household. In the same year, it will check whether there are other xwaveid in the same household and if this is the case, it will delete the observations of the younger xwaveid across all years in the data set. It assumes there are no missing values for the variable hgage. See if this is what you wanted.


            Code:
            by xwaveid (wave), sort: gen hhinit=hhid[1]
            bys hhinit wave: egen tag= count(xwaveid)
            replace tag = tag>1
            bys hhinit wave (hgage): gen tag2 = _n==_N & tag
            gen tag3= tag-tag2
            bys xwaveid: egen tag4= max(tag3)
            drop if tag4

            Comment


            • #7
              Hi Andrew,

              Thank you very much. I have used your code and it has helped with perfectly.

              Comment

              Working...
              X