Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Panel Data 1: Dropping observations from panel without unbalancing... i.e. dropping entire subject when needing to drop observations

    I have used PSID tools to create a panel consisting exclusively of family-level variables. I am attempting to clean my data by dropping the entire subject when I need to drop observations to make sure I do not unbalance my panel (already unbalanced everything once). For example, I am trying to drop every subject that has a change in family composition (famcompch > 0 for any year).

    Variables if interest are:
    x11101ll = ID variable
    wave = year variable
    famcompch = the variable I am attempting to clean

    Here is an example of some of the variables from dataex:

    input long x11101ll int(wave x11102) byte(xsqnr famcompch)
    4003 1 . . .
    4003 3 . . .
    4003 5 . . .
    4003 7 . . .
    4003 1999 2 1 0
    4003 2001 96 1 1
    4003 2003 1392 1 0
    4003 2005 289 1 0
    4003 2007 148 1 0
    4004 1 . . .
    4004 3 . . .
    4004 5 . . .
    4004 7 . . .
    4004 1999 6129 1 3
    4004 2001 5987 1 0
    4004 2003 6278 1 0
    4004 2005 2356 1 0
    4004 2007 5399 1 0
    4006 1 . . .
    4006 3 . . .
    4006 5 . . .
    4006 7 . . .
    4006 1999 4920 2 0
    4006 2001 5599 2 0
    4006 2003 4812 1 3
    4006 2005 4097 1 0
    4006 2007 720 1 0
    4031 1 . . .
    4031 3 . . .
    4031 5 . . .
    4031 7 . . .
    4031 1999 1702 1 0
    4031 2001 285 2 4
    4031 2003 1427 2 0
    4031 2005 1157 2 0
    4031 2007 196 2 0
    4033 1 . . .
    4033 3 . . .
    4033 5 . . .
    4033 7 . . .
    4033 1999 2 4 0
    4033 2001 5479 1 5
    4033 2003 6061 1 2
    4033 2005 641 1 0
    4033 2007 189 1 0
    4039 1 . . .
    4039 3 . . .
    4039 5 . . .
    4039 7 . . .
    4039 1999 2 3 0
    4039 2001 96 3 1
    4039 2003 1392 3 0
    4039 2005 289 3 0
    4039 2007 148 3 0

    I have read probably every thread on statalist and google about how to accomplish my goal, yet I still end up dropping my entire dataset if I use -bysort-, or just not getting results if I use -egen-. After reading -help by- I've tried my own hand at this, and I always get something like:


    . bysort x11101ll (famcompch): keep if famcompch[_N] == 0
    (126,207 observations deleted)


    Help!

  • #2
    Code:
    bys x11101ll: egen maxvar = max(famcompch)
    drop if maxvar>0

    Comment


    • #3
      Thank you so much. Worked perfectly. I am assuming I can use the same code for cleaning up other variables where I want to drop the max or min?

      Comment


      • #4
        If you have other variables you need to drop based on the same condition (!>0) it should work fine, since maxvar will only be >0 if any of the observations for that variable are >0 within an id.

        Comment


        • #5
          How would this work if I wanted to drop based on something being > or < a certain threshold? Such as dropping each id where var1 > 0 or where var2 > 5, depending on where the survey coded the missing info. I have a bunch of these to clean up.

          Even more specifically, if you see my other post "Panel Data 2," I'm trying to drop excess ids. I'm running family-level analysis, and I have family-level data, but it's been grafted onto each individual in the the family. Each family member has the same values for all the variables, so it effectively weights larger families (i.e. family A has 4 obs for each variables, family B has one). I believe I can drop excess family members by using variable "xsqnr", which is the sequence number for when each family member was interviewed. If I just keep all the xsqnr = 1, I should be fine and then have one panel subject per HH. So I'm trying to figure out how to code to either keep xsqnr = 1, or to drop xsqnr > 1, without unbalancing my panel.

          Thanks again.

          Comment


          • #6
            Edit - misunderstood what you were asking for.

            Does this not do what you want?

            Code:
            drop if xsqnr > 1
            Last edited by Ali Atia; 11 Dec 2020, 13:33.

            Comment


            • #7
              Let me try again. It might be that xsqnr isn’t what I thought it was. But if you take a look at my Panel Data 2 question it generally describes the dilemma of trying to drop individuals so I have only one subject per family per year

              Comment


              • #8
                Solved my Panel Data 2 question. I went back to wide format, and dropped all the xsqnr99>1. 1 corresponded to the head of household for that year. That allowed me to track family units just by tracking the person who was a head in 1999 (but might not be a head in subsequent waves).

                Now I'm back to the problem of cleaning data. Similar to your above suggestion, I attempted to remove "0" values for state of residence ("state"). Problem is that the maxvar generated does not match "state." See below:

                Code:
                 tab state
                
                 Psid State |
                         Of |
                  Residence |
                       Code |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          0 |         99        0.38        0.38
                          1 |        345        1.34        1.73
                          2 |        357        1.39        3.11
                          3 |        667        2.59        5.71
                
                . bys x11101ll: egen maxvar = max(state)
                
                . tab maxvar
                
                     maxvar |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          0 |         50        0.19        0.19
                          1 |        300        1.17        1.36
                          2 |        280        1.09        2.45
                          3 |        610        2.37        4.82
                
                . drop if maxvar<1
                (50 observations deleted)
                
                . tab maxvar
                
                     maxvar |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          1 |        300        1.17        1.17
                          2 |        280        1.09        2.26
                          3 |        610        2.38        4.64
                
                . tab state
                
                 Psid State |
                         Of |
                  Residence |
                       Code |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          0 |         49        0.19        0.19
                          1 |        345        1.34        1.54
                          2 |        357        1.39        2.93
                          3 |        667        2.60        5.53
                Why does the maxvar not match state? Any help would be appreciated. I have a few variables with 0 or 99, etc., as a value where I need to drop the subject with the 0 or 99 observation to maintain a balanced panel.

                Comment


                • #9
                  As a follow up, except for the state of residence variable, I was able to clean up most of my other variables (those with various debt and wealth measures) using

                  bysort [id] (var1) : drop if var1[_N]>x

                  For some reason when I ran this with my race variable, it dropped the whole data set, but -bys x11101ll: egen maxvar = max(race)- and -drop if maxvar>7- worked nicely.

                  The final sticking point is my state of residence variable. As in the example above, the egen maxvar approach will not match up with the underlying race variable. Similarly, if I try to use the above bysort code, only some of the relevant obervations are dropped. For example:

                  Code:
                  . bysort x11101ll (state) : drop if state[_N]<1
                  (35 observations deleted)
                  
                  . tab state
                  
                   Psid State |
                           Of |
                    Residence |
                         Code |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                            0 |         37        0.22        0.22
                            1 |        213        1.24        1.46
                            2 |        226        1.32        2.77
                            3 |        490        2.86        5.63
                  As you can see, I can't seem to get either code to work for "state." Similarly, I also have one value of "99" for state that I also need to drop, but am having problems with that as well.

                  Comment


                  • #10
                    In response to #8, that behavior makes sense. You're dropping panels where the highest value is 0, not panels where 0 ever appears. A better approach would be to use egen's min function:

                    Code:
                    bys x11101ll: egen minvar = min(state)
                    Where (assuming there are no negatives), minvar would be equal to 0 if 0 appears ≥1 times within a panel. You can then drop if minvar == 0.

                    Comment


                    • #11
                      Another way to do this which also lets you drop subjects where state == 99:

                      Code:
                      bys x11101ll: egen totalvar = total(state==99)
                      drop if totalvar > 0
                      This can be applied more widely because it doesn't rely on the value you want to drop being the max or min value within an id.

                      Comment

                      Working...
                      X