Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • data manipulation

    Dear All, I have the following survey data set:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(ID year firstmarr final)
    1 1991 1990 2011
    1 1993 1990 2011
    1 1997 1990 2011
    1 2000 1990 2011
    1 2004 1990 2011
    1 2006 1990 2011
    1 2009 1990 2011
    1 2011 1990 2011
    2 1993 1992 2011
    2 1997 1992 2011
    2 2000 1992 2011
    2 2006 1992 2011
    3 2000 2000 2011
    3 2006 2000 2011
    3 2009 2000 2011
    end
    Suppose that I have three individuals (ID=1,2,3) and eight waves of surveys (1991,1993,1997,2000,2004,2006,2009, and the last one 2011). But note that some individuals engage in all surveys but others do not. I'd like to form two new variables (ID1 and year1) which consist of all individuals and all waves (years) that are larger than or equal to the first married year `firstmarr'. The desired results are
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(ID_01 year_01)
    1 1991
    1 1993
    1 1997
    1 2000
    1 2004
    1 2006
    1 2009
    1 2011
    2 1993
    2 1997
    2 2000
    2 2004
    2 2006
    2 2009
    2 2011
    3 2000
    3 2004
    3 2006
    3 2009
    3 2011
    end
    For ID=1, he participates in all surveys and all survey years are larger than his first married year 1990, so that we keep its ID and all years as the results. For ID=2, while he only participates in four surveys (and his first married year is 1992), I'd like to keep the last seven survey years (1993,1997,2000,2004,2006,2009, and 2011, all are larger than 1992). For ID=3, he only participates in three surveys and his first married year is 2000, A such, I'd like to keep the last five survey years (2000,2004,2006,2009, and 2011, all are larger than or equal to 2000).

    Any suggestions?
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    Hi River,

    does this do the trick?

    Code:
    clear
    input float(ID year firstmarr final)
    1 1991 1990 2011
    1 1993 1990 2011
    1 1997 1990 2011
    1 2000 1990 2011
    1 2004 1990 2011
    1 2006 1990 2011
    1 2009 1990 2011
    1 2011 1990 2011
    2 1993 1992 2011
    2 1997 1992 2011
    2 2000 1992 2011
    2 2006 1992 2011
    3 2000 2000 2011
    3 2006 2000 2011
    3 2009 2000 2011
    end
    
    // temporary variables
    version 14
    tempvar expandtarget obscounter offset
    
    // save total number of survey waves
    quietly : levelsof year , local(wavelist)
    local wavecount : word count `wavelist'
    
    // determine how many waves to exclude for each person
    generate `expandtarget'=`wavecount'
    foreach wave of local wavelist {
        replace `expandtarget'=`expandtarget'-1 if (`wave'<=firstmarr)
    }
    generate `offset'=`wavecount'-`expandtarget'
    
    // reduce dataset to one observation per ID
    keep ID `expandtarget' `offset'
    duplicates drop
    
    // expand dataset to number of waves (minus diminisher) per person
    expand `expandtarget'
    
    // ex-post fillin of wave information
    bysort ID : generate `obscounter'=_n
    generate wave=ustrword(`"`wavelist'"',`obscounter'+`offset')
    
    // rename variables to target format
    rename (ID wave) (ID_01 year_01)
    
    // list result
    list ID_01 year_01 , sepby(ID_01)
    Hopefully, I did not miss any detail of the question.

    Regards
    Bela
    Last edited by Daniel Bela; 04 Aug 2017, 01:13.

    Comment


    • #3
      Dear Daniel, Many thanks for your very helpful suggestion. Two slight modifications are made, though.
      Code:
      replace `expandtarget'=`expandtarget'-1 if (`wave' < firstmarr)
      and I
      Code:
      destring wave, replace
      to obtain numerical values.

      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)

      Comment


      • #4
        A simpler method to approach this problem is to identify the complete number of survey years and then create a corresponding ID for each individual in the sample.

        Code:
        *//Keep ID with all survey years
        keep if ID==1
        keep year
        
        *Expand N where N= no. of individuals in the sample
        expand 3
        
        *Create ID for each individual at each year
        sort year
        by year: gen ID=_n
        You can then merge using ID year and then do the required manipulations.

        Comment


        • #5
          Dear Andrew, Thanks for your suggestion. By augmenting your code, it works as follows.
          Code:
          tempfile main
          save "`main'"
          
          // Keep ID with all survey years
          keep if ID == 1
          keep year 
          
          // Expand N where N= no. of individuals in the sample
          expand 3
          
          // Create ID for each individual at each year
          sort year
          by year: gen ID = _n
          sort ID year
          
          merge 1:1 ID year using "`main'"
          bys ID: egen m = mean(firstmarr)
          keep if year >= m
          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)

          Comment


          • #6
            Originally posted by River Huang View Post
            Dear Daniel, Many thanks for your very helpful suggestion. Two slight modifications are made, though.
            Code:
            replace `expandtarget'=`expandtarget'-1 if (`wave' < firstmarr)
            and I
            Code:
            destring wave, replace
            to obtain numerical values.
            Sorry, you're right about both; I misread the condition to exclude waves in the original post. Regarding the wave variable: In generating it, my code should read
            Code:
            generate wave=real(ustrword(`"`wavelist'"',`obscounter'+`offset'))
            so that it is generated as a numeric variable in the first place (hence, no need for -destring- afterwards).

            Andrew Musau's solution also works perfectly on the example data; it won't, however, as soon as the ID variable is not filled with consecutive numbers, but something else (be it string IDs, or non-consecutively numbered numerical IDs), I suppose.

            Regards
            Bela

            Comment


            • #7
              Thank you again, Daniel.

              Ho-Chuan (River) Huang
              Stata 19.0, MP(4)

              Comment


              • #8
                Andrew Musau's solution also works perfectly on the example data; it won't, however, as soon as the ID variable is not filled with consecutive numbers, but something else (be it string IDs, or non-consecutively numbered numerical IDs), I suppose
                You are correct Daniel Bela. In the case of a string var, use the standard approach first

                Code:
                *\\ID is a string var
                destring ID, replace
                For non consecutive integers (ID variable), create an auxiliary variable with a one-to-one correspondence with the non consecutive ID var but still maintain the latter in the dataset

                Code:
                bys ID: gen ID2= 1 if _n==1
                replace ID2 = sum(ID2)
                Then merge on ID2. This works because the code relies only on uniqueness of ID and not both ID and year.

                Comment


                • #9
                  We're starting to enter a theoretical discussion here, as River Huang's problem has been solved by now.
                  Anyways:

                  Originally posted by Andrew Musau View Post

                  [...]. In the case of a string var, use the standard approach first

                  Code:
                  *\\ID is a string var
                  destring ID, replace
                  This does not work for non-numeric IDs -- destring can only convert numerical values that are stored in a string variable; I was talking about non-numeric content (such as, for instance, "A1", "A2", "B1", "B2"), as is often the case for social security numbers and the like.

                  Originally posted by Andrew Musau View Post
                  For non consecutive integers (ID variable), create an auxiliary variable with a one-to-one correspondence with the non consecutive ID var but still maintain the latter in the dataset
                  Code:
                  bys ID: gen ID2= 1 if _n==1
                  replace ID2 = sum(ID2)
                  Then merge on ID2. This works because the code relies only on uniqueness of ID and not both ID and year.
                  I don't see how this should work without saving the correspondence table between the original IDs and the artificial consecutive-numerical-IDs beforehand. Then there is a way to achieve this by, after merging things together, re-replacing IDs through merging with the correspondence table again. My guess is that with larger dataset sizes, this becomes less efficient than my approach. Of course, it is nonetheless a valid solution to the problem.

                  Regards
                  Bela

                  Comment


                  • #10
                    I am probably missing something. Because this will be useful to myself and others who face a similar problem in the future, can you create an example that illustrates the complications that you foresee?

                    Comment

                    Working...
                    X