data manipulation

River Huang

Join Date: Mar 2016

Posts: 1908
#1

data manipulation

03 Aug 2017, 18:11

Dear All, I have the following survey data set:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(ID year firstmarr final) 1 1991 1990 2011 1 1993 1990 2011 1 1997 1990 2011 1 2000 1990 2011 1 2004 1990 2011 1 2006 1990 2011 1 2009 1990 2011 1 2011 1990 2011 2 1993 1992 2011 2 1997 1992 2011 2 2000 1992 2011 2 2006 1992 2011 3 2000 2000 2011 3 2006 2000 2011 3 2009 2000 2011 end

Suppose that I have three individuals (ID=1,2,3) and eight waves of surveys (1991,1993,1997,2000,2004,2006,2009, and the last one 2011). But note that some individuals engage in all surveys but others do not. I'd like to form two new variables (ID1 and year1) which consist of all individuals and all waves (years) that are larger than or equal to the first married year `firstmarr'. The desired results are

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(ID_01 year_01) 1 1991 1 1993 1 1997 1 2000 1 2004 1 2006 1 2009 1 2011 2 1993 2 1997 2 2000 2 2004 2 2006 2 2009 2 2011 3 2000 3 2004 3 2006 3 2009 3 2011 end

For ID=1, he participates in all surveys and all survey years are larger than his first married year 1990, so that we keep its ID and all years as the results. For ID=2, while he only participates in four surveys (and his first married year is 1992), I'd like to keep the last seven survey years (1993,1997,2000,2004,2006,2009, and 2011, all are larger than 1992). For ID=3, he only participates in three surveys and his first married year is 2000, A such, I'd like to keep the last five survey years (2000,2004,2006,2009, and 2011, all are larger than or equal to 2000).

Any suggestions?

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Tags: None

Daniel Bela

Join Date: Apr 2014
Posts: 246

04 Aug 2017, 01:10

Hi River,

does this do the trick?

Code:

clear
input float(ID year firstmarr final)
1 1991 1990 2011
1 1993 1990 2011
1 1997 1990 2011
1 2000 1990 2011
1 2004 1990 2011
1 2006 1990 2011
1 2009 1990 2011
1 2011 1990 2011
2 1993 1992 2011
2 1997 1992 2011
2 2000 1992 2011
2 2006 1992 2011
3 2000 2000 2011
3 2006 2000 2011
3 2009 2000 2011
end

// temporary variables
version 14
tempvar expandtarget obscounter offset

// save total number of survey waves
quietly : levelsof year , local(wavelist)
local wavecount : word count `wavelist'

// determine how many waves to exclude for each person
generate `expandtarget'=`wavecount'
foreach wave of local wavelist {
    replace `expandtarget'=`expandtarget'-1 if (`wave'<=firstmarr)
}
generate `offset'=`wavecount'-`expandtarget'

// reduce dataset to one observation per ID
keep ID `expandtarget' `offset'
duplicates drop

// expand dataset to number of waves (minus diminisher) per person
expand `expandtarget'

// ex-post fillin of wave information
bysort ID : generate `obscounter'=_n
generate wave=ustrword(`"`wavelist'"',`obscounter'+`offset')

// rename variables to target format
rename (ID wave) (ID_01 year_01)

// list result
list ID_01 year_01 , sepby(ID_01)

Hopefully, I did not miss any detail of the question.

Regards
Bela

Last edited by Daniel Bela; 04 Aug 2017, 01:13.

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#3

04 Aug 2017, 01:29

Dear Daniel, Many thanks for your very helpful suggestion. Two slight modifications are made, though.

Code:

replace `expandtarget'=`expandtarget'-1 if (`wave' < firstmarr)

and I

Code:

destring wave, replace

to obtain numerical values.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10291
#4

04 Aug 2017, 01:42

A simpler method to approach this problem is to identify the complete number of survey years and then create a corresponding ID for each individual in the sample.

Code:

*//Keep ID with all survey years keep if ID==1 keep year *Expand N where N= no. of individuals in the sample expand 3 *Create ID for each individual at each year sort year by year: gen ID=_n

You can then merge using ID year and then do the required manipulations.
Comment

River Huang

Join Date: Mar 2016
Posts: 1908

04 Aug 2017, 02:50

Dear Andrew, Thanks for your suggestion. By augmenting your code, it works as follows.

Code:

tempfile main
save "`main'"

// Keep ID with all survey years
keep if ID == 1
keep year 

// Expand N where N= no. of individuals in the sample
expand 3

// Create ID for each individual at each year
sort year
by year: gen ID = _n
sort ID year

merge 1:1 ID year using "`main'"
bys ID: egen m = mean(firstmarr)
keep if year >= m

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Comment

Daniel Bela

Join Date: Apr 2014

Posts: 246
#6

04 Aug 2017, 03:16

Originally posted by River Huang View Post

Dear Daniel, Many thanks for your very helpful suggestion. Two slight modifications are made, though.

Code:

replace `expandtarget'=`expandtarget'-1 if (`wave' < firstmarr)

and I

Code:

destring wave, replace

to obtain numerical values.

Sorry, you're right about both; I misread the condition to exclude waves in the original post. Regarding the wave variable: In generating it, my code should read

Code:

generate wave=real(ustrword(`"`wavelist'"',`obscounter'+`offset'))

so that it is generated as a numeric variable in the first place (hence, no need for -destring- afterwards).

Andrew Musau's solution also works perfectly on the example data; it won't, however, as soon as the ID variable is not filled with consecutive numbers, but something else (be it string IDs, or non-consecutively numbered numerical IDs), I suppose.

Regards
Bela
Comment
River Huang

Join Date: Mar 2016

Posts: 1908
#7

04 Aug 2017, 03:35

Thank you again, Daniel.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10291
#8

04 Aug 2017, 04:04

Andrew Musau's solution also works perfectly on the example data; it won't, however, as soon as the ID variable is not filled with consecutive numbers, but something else (be it string IDs, or non-consecutively numbered numerical IDs), I suppose

You are correct Daniel Bela. In the case of a string var, use the standard approach first

Code:

*\\ID is a string var destring ID, replace

For non consecutive integers (ID variable), create an auxiliary variable with a one-to-one correspondence with the non consecutive ID var but still maintain the latter in the dataset

Code:

bys ID: gen ID2= 1 if _n==1 replace ID2 = sum(ID2)

Then merge on ID2. This works because the code relies only on uniqueness of ID and not both ID and year.
Comment
Daniel Bela

Join Date: Apr 2014

Posts: 246
#9

04 Aug 2017, 04:40

We're starting to enter a theoretical discussion here, as River Huang's problem has been solved by now.
Anyways:

Originally posted by Andrew Musau View Post

[...]. In the case of a string var, use the standard approach first

Code:

*\\ID is a string var destring ID, replace

This does not work for non-numeric IDs -- destring can only convert numerical values that are stored in a string variable; I was talking about non-numeric content (such as, for instance, "A1", "A2", "B1", "B2"), as is often the case for social security numbers and the like.

Originally posted by Andrew Musau View Post

For non consecutive integers (ID variable), create an auxiliary variable with a one-to-one correspondence with the non consecutive ID var but still maintain the latter in the dataset

Code:

bys ID: gen ID2= 1 if _n==1 replace ID2 = sum(ID2)

Then merge on ID2. This works because the code relies only on uniqueness of ID and not both ID and year.

I don't see how this should work without saving the correspondence table between the original IDs and the artificial consecutive-numerical-IDs beforehand. Then there is a way to achieve this by, after merging things together, re-replacing IDs through merging with the correspondence table again. My guess is that with larger dataset sizes, this becomes less efficient than my approach. Of course, it is nonetheless a valid solution to the problem.

Regards
Bela
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10291
#10

04 Aug 2017, 04:56

I am probably missing something. Because this will be useful to myself and others who face a similar problem in the future, can you create an example that illustrates the complications that you foresee?
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment