
  • Question on how to eliminate only certain data

    Good Evening and sorry for the not very descriptive title. I am not sure how exactly to word my issue for a title.

    I have a large amount of data that I would like to cut down to the relevant panel part. A variable (sa0110 in the code below) indicates whether the household was also in the previous wave; its value is the id of the household in the first wave. The variable survey indicates whether the information is from wave 1 or wave 2.

    So in the example below one can see that households 234 and 456 were in both wave 1 and 2. Households 123, 345, 567, 678, 789 and 890 were not. So I would like to eliminate all these households from the data so I would only have households that only appear in either wave 1 or 2; in the example below that would be 234 and 456. However, 234 and 456 also are in wave 1, so I am not sure how I would keep these observations in wave 1 and in wave 2 while eliminating only those observations whose household appears in only one wave. How would I do this?

    The variable implicate is just an indicator for the five multiple imputations and the original data point.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte survey int id byte implicate int sa0110
    1 123 0   .
    1 123 1   .
    1 123 2   .
    1 123 3   .
    1 123 4   .
    1 123 5   .
    1 234 0   .
    1 234 1   .
    1 234 2   .
    1 234 3   .
    1 234 4   .
    1 234 5   .
    1 345 0   .
    1 345 1   .
    1 345 2   .
    1 345 3   .
    1 345 4   .
    1 345 5   .
    1 456 0   .
    1 456 1   .
    1 456 2   .
    1 456 3   .
    1 456 4   .
    1 456 5   .
    1 567 0   .
    1 567 1   .
    1 567 2   .
    1 567 3   .
    1 567 4   .
    1 567 5   .
    1 678 0   .
    1 678 1   .
    1 678 2   .
    1 678 3   .
    1 678 4   .
    1 678 5   .
    1 789 0   .
    1 789 1   .
    1 789 2   .
    1 789 3   .
    1 789 4   .
    1 789 5   .
    1 890 0   .
    1 890 1   .
    1 890 2   .
    1 890 3   .
    1 890 4   .
    1 890 5   .
    2 234 1 234
    2 234 2 234
    2 234 3 234
    2 234 4 234
    2 234 5 234
    2 234 0 234
    2 456 1 456
    2 456 2 456
    2 456 3 456
    2 456 4 456
    2 456 5 456
    2 456 0 456
    end

  • #2
    So in the example below one can see that household 234 and 456 were in both wave 1 and 2. Households 123, 345, 567, 678, 789 and 890 were not. So I would like to eliminate all these households out of the data so I would only have household that only appear in either wave 1 or 2, in the example below that would be 234 and 456. However 234 and 456 also are in wave 1 so I am not sure sure how I would keep these observations in wave 1 and in wave 2 while eliminating only those observation which household only appear in one wave.
    This paragraph appears to contradict itself, and I do not understand what you actually want and why.

    When you refer to "all these" households, do you mean just 123, 345, 567, 678, 789, and 890, or also 234 and 456? The reference is ambiguous. "That would be 234 and 456" is simply not true if you are referring to appearing "in either wave 1 or 2." Do you perhaps mean both waves 1 and 2?

    Please clarify.



    • #3
      Clyde,
      Ah, sorry for explaining that so poorly. It should be:

      So in the example below one can see that households 234 and 456 were in both wave 1 and 2. Households 123, 345, 567, 678, 789 and 890 were not. So I would like to eliminate all these households from the data so I would only have households that appeared in BOTH wave 1 and 2; in the example below that would be 234 and 456.
      Basically, as you can see, 234 and 456 appear in both waves; however, in wave 1 these two households have a missing entry in the variable sa0110. sa0110 is a variable that appears in the second wave to mark households that also appeared in the first wave. So naturally I would like to keep the observations from both waves for 234 and 456.

      However I am unsure how to do so, since I cannot think of a way to distinguish the data within wave 1. How can I keep 234 and 456 in wave 1 while removing, as an example, 123?



      • #4
        I'm still not sure I understand. And, in particular, the variable sa0110 strikes me as superfluous--what am I missing? Anyway, if I do understand what you want, the following will do it:

        Code:
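        * Flag households that have at least one observation in each wave
        by id, sort: egen in_wave_1 = max(survey == 1)
        by id: egen in_wave_2 = max(survey == 2)
        * Keep only households observed in both waves
        keep if in_wave_1 & in_wave_2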
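
        As a cross-check, the same selection can be sketched without creating new variables, by sorting on survey within id and looking at the first and last wave seen for each household. This is only a sketch: it assumes survey takes just the values 1 and 2 and is never missing, and it uses a reduced, one-row-per-wave version of the example data so that it runs on its own.

        ```stata
        * Reduced version of the example data: one row per household and wave
        clear
        input byte survey int id
        1 123
        1 234
        1 456
        2 234
        2 456
        end

        * Within each id, sort by survey so that survey[1] is the earliest
        * wave observed for the household and survey[_N] is the latest
        bysort id (survey): keep if (survey[1] == 1) & (survey[_N] == 2)

        * Households 234 and 456 remain, each with one row per wave
        tabulate id survey
        ```

        With the full example data the same line should leave 24 observations: six implicates for each of the two households in each wave.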



        • #5
          Yes, thank you Clyde! That seems to have worked! Sorry I didn't explain it well enough. Which part do I need to explain more accurately?

          To copy the explanation from the survey's variable list:

          SA0010 household identification number (which I renamed to id in my data set above)

          SA0110 past household ID (only to be provided by countries with a panel component)
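
          For illustration only: in the example data sa0110 always equals id, so matching on id is enough. If wave-2 households instead carried a new id and SA0110 were the only link back to wave 1, a sketch along the following lines might apply. The data below are made up (ids 901 and 902 do not come from the survey), and panel_id and in_both are hypothetical names. It assumes sa0110 is missing in wave 1 and for households that are new in wave 2.

          ```stata
          * Hypothetical data: household 234 reappears in wave 2 under the new id 901
          clear
          input byte survey int id int sa0110
          1 123 .
          1 234 .
          2 901 234
          2 902 .
          end

          * Common identifier: the wave-1 id, taken from sa0110 in wave 2
          generate long panel_id = cond(survey == 2, sa0110, id)

          * A household is in both waves if its panel_id occurs in wave 1 and wave 2
          bysort panel_id (survey): generate byte in_both = (survey[1] == 1) & (survey[_N] == 2)

          * Drop wave-1-only households and wave-2 households with no past id
          keep if in_both & !missing(panel_id)
          list survey id sa0110 panel_id
          ```

          Here only the two rows belonging to the wave-1 household 234 are kept: its wave-1 row and the wave-2 row with id 901.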



          • #6
            Well, if the code has done what you want, no further explanation is needed. Glad it worked.
