Collapse data so that subsequent waves stack?

Jaycob Applegate

Join Date: Jan 2023
Posts: 40

20 Jul 2023, 10:21

Hi all,

I apologize if the title is unclear. I have the following data structure.

Code:

        +-----------------------------------------------------------------------------------------------------------+
        |   hhidpn        kidid   wave    riwstat   kabyea~k   keduc_~k   keduc_~4   valid_~d   indica~r   total_~r |
        |-----------------------------------------------------------------------------------------------------------|
166124. | 74892040   0748920101      8   4.nr,ali       1966         16         16          6          1          3 |
166125. | 74892040   0748920101      9   4.nr,ali       1966         16         16          6          1          3 |
166126. | 74892040   0748920101     10   1.resp,a       1966         16         16          6          1          3 |
        |-----------------------------------------------------------------------------------------------------------|
166127. | 74892040   0748920102      8   4.nr,ali       1968         12         12          6          1          5 |
166128. | 74892040   0748920102      9   4.nr,ali       1968         12         12          6          1          5 |
166129. | 74892040   0748920102     10   1.resp,a       1968         12         12          6          1          5 |
166130. | 74892040   0748920102     11   1.resp,a       1968         12         12          6          1          5 |
166131. | 74892040   0748920102     12   1.resp,a       1968         12         12          6          1          5 |
        |-----------------------------------------------------------------------------------------------------------|
166132. | 74892040   0748920103      8   4.nr,ali       1970         16         16          6          1          5 |
166133. | 74892040   0748920103      9   4.nr,ali       1970         16         16          6          1          5 |
166134. | 74892040   0748920103     10   1.resp,a       1970         16         16          6          1          5 |
166135. | 74892040   0748920103     11   1.resp,a       1970         16         16          6          1          5 |
166136. | 74892040   0748920103     12   1.resp,a       1970         16         16          6          1          5 |
        |-----------------------------------------------------------------------------------------------------------|
166137. | 74892040   0748920104      8   4.nr,ali       1971         11         11          6          1          5 |
166138. | 74892040   0748920104      9   4.nr,ali       1971         11         11          6          1          5 |
166139. | 74892040   0748920104     10   1.resp,a       1971         11         11          6          1          5 |
166140. | 74892040   0748920104     11   1.resp,a       1971         11         11          6          1          5 |
166141. | 74892040   0748920104     12   1.resp,a       1971         11         11          6          1          5 |
        |-----------------------------------------------------------------------------------------------------------|
166142. | 74892040   0748920105      8   4.nr,ali       1973         16         16          6          1          5 |
166143. | 74892040   0748920105      9   4.nr,ali       1973         16         16          6          1          5 |
166144. | 74892040   0748920105     10   1.resp,a       1973         16         16          6          1          5 |
166145. | 74892040   0748920105     11   1.resp,a       1973         16         16          6          1          5 |
166146. | 74892040   0748920105     12   1.resp,a       1973         16         16          6          1          5 |
        |-----------------------------------------------------------------------------------------------------------|
166147. | 74892040   0748920106      8   4.nr,ali       1976          9          9          6          1          5 |
166148. | 74892040   0748920106      9   4.nr,ali       1976          9          9          6          1          5 |
166149. | 74892040   0748920106     10   1.resp,a       1976          9          9          6          1          5 |
166150. | 74892040   0748920106     11   1.resp,a       1976          9          9          6          1          5 |
166151. | 74892040   0748920106     12   1.resp,a       1976          9          9          6          1          5 |
        +-----------------------------------------------------------------------------------------------------------+

In this example, children (kidid) are nested within respondents (hhidpn). At this point, the only relevant variables are the respondent-level variables, which are attached to every child row, and the aggregate child-level characteristics, which I have already created. I now want to remove the children and reduce this to a respondent-level file. The problem is that in many cases not all children are present in the same waves. This occurs because of death, but also because of the age cutoffs I am using. In the example, you can see that child 1 has waves 8-10, whereas the later children have waves 8-12. If I were to keep only the first child, I would miss out on the extra data.

In the past, I have used collapse (firstnm) for this. Unfortunately, I have a large dataset, and the collapse takes a very long time to run, on top of the decoding/encoding I do to preserve missing data. Is collapse still my best solution, or is there a quicker or simpler way?


Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#2

20 Jul 2023, 10:49

-gcollapse-, part of the -gtools- suite, by Mauricio Caceres Bravo, is available from SSC or from github.com/mcaceresb/stata-gtools. It is appreciably faster than -collapse- in very large data sets, and it supports the -collapse- syntax, so there is essentially no learning curve for using it.
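
As a rough sketch of what the switch might look like, assuming -gtools- installs cleanly from SSC on your platform: the toy data below are made up (two children of one respondent in wave 8), and only the variable names are taken from the listing in #1.

```stata
* One-time install from SSC (needs internet access)
ssc install gtools, replace

* Made-up toy data: two children of one respondent, both observed in wave 8
clear
input long hhidpn str10 kidid byte wave byte riwstat
74892040 "0748920101" 8 .
74892040 "0748920102" 8 4
end

* Same syntax as -collapse-: first nonmissing value per respondent-wave
gcollapse (firstnm) riwstat, by(hhidpn wave)
list
```

Because the syntax matches -collapse-, an existing -collapse- line can usually be tried with -gcollapse- just by changing the command name.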

I don't understand how decoding/encoding preserves missing data. What is the issue here?
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#3

20 Jul 2023, 10:59

I don't know if this will be faster than -gcollapse-, but the following will be faster than -collapse- because it avoids all the overhead of command parsing and the housekeeping required to handle all of the many possible things that -collapse- can be asked to do:

Code:

capture program drop one_hh
program define one_hh
    foreach v of varlist list_of_variables_needing_firstnm_calculation {
        local t: type `v'
        gen `t' firstnm = `v' in 1
        replace firstnm = cond(missing(firstnm[_n-1]), `v', firstnm[_n-1]) in 2/L
        replace `v' = firstnm in L
        drop firstnm
    }
    keep in L
    exit
end

runby one_hh, by(hhidpn) verbose

-runby- is written by Robert Picard and me. It is available from SSC.
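
To see what the program does inside one by-group, here is a minimal sketch of the same carry-forward logic on made-up data, using only built-in commands. The variable x, its values, and the helper variable obs are hypothetical. It uses by: rather than -runby-, and groups by hhidpn and wave as an assumption, so it illustrates the logic and says nothing about speed.

```stata
* Made-up toy data: x stands in for any variable needing firstnm
clear
input long hhidpn byte wave double x
1 8 .a
1 8 5
1 8 7
2 8 .a
2 8 .b
end

* Keep the original row order within each group
gen long obs = _n
sort hhidpn wave obs

* Carry the first nonmissing value forward, row by row
by hhidpn wave: gen double firstnm = x if _n == 1
by hhidpn wave: replace firstnm = cond(missing(firstnm[_n-1]), x, firstnm[_n-1]) if _n > 1

* The last row of each group now holds the result
by hhidpn wave: keep if _n == _N
replace x = firstnm
drop firstnm obs
list
```

In this sketch the first group ends up with x = 5, and the second group, where every value is missing, keeps the last missing code it saw (.b) instead of being reset to system missing.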
Jaycob Applegate

Join Date: Jan 2023

Posts: 40
#4

20 Jul 2023, 12:20

Originally posted by Clyde Schechter

-gcollapse-, part of the -gtools- suite, by Mauricio Caceres Bravo, is available from SSC or from github.com/mcaceresb/stata-gtools. It is appreciably faster than -collapse- in very large data sets, and it supports the -collapse- syntax, so there is essentially no learning curve for using it.

I don't understand how decoding/encoding preserves missing data. What is the issue here?

The majority of my variables have various extended missing codes. My understanding was that when collapsing, these would collapse to system missing, even if I am interested in keeping the extended missing values. Thank you for the example code. I was also not aware of the gtools suite.

One related question: if I wanted to collapse all variables, is there a shortcut for doing so with a wildcard? I can't use the wildcard directly because it also matches the variables I am collapsing by. I would like to be able to do something like "collapse *(- hhidpn wave)".
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#5

20 Jul 2023, 12:35

I would like to be able to do something like "collapse *(- hhidpn wave)"

Code:

ds hhidpn wave, not
local vbles `r(varlist)'

will place the names of all variables in your data set other than hhidpn and wave into local macro vbles.

You can then use that with -collapse- or -gcollapse-: -collapse (firstnm) `vbles', by(hhidpn wave)-.
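
A minimal worked example on made-up data (the variables x and y and their values are hypothetical) might look like the following. It needs to be run as a single do-file or block, because the local macro disappears once the block that defined it finishes.

```stata
* Made-up toy data: two rows for one respondent-wave
clear
input long hhidpn byte wave double x double y
1 8 . 2
1 8 5 .
end

* Every variable except the two grouping variables
ds hhidpn wave, not
local vbles `r(varlist)'
display "`vbles'"

* First nonmissing value of each remaining variable per respondent-wave
collapse (firstnm) `vbles', by(hhidpn wave)
list
```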

However, if you are going to try the code in #3, this approach will not work, because you cannot pass the local macro into program one_hh. So to use the code in #3 I would do it as:

Code:

ds hhidpn wave, not
char _dta[vbles] `r(varlist)'

Then in the code for program one_hh, change the -foreach- command to:

Code:

foreach v of varlist `:char _dta[vbles]' {
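
As a small illustration of why the characteristic works where the local macro does not, here is a sketch on made-up data. The program show_vbles and the variables x and y are hypothetical: a program cannot see its caller's local macros, but it can read characteristics stored with the data in memory.

```stata
* Made-up toy data
clear
input long hhidpn byte wave double x double y
1 8 . 2
1 8 5 .
end

* Store the variable list with the dataset itself
ds hhidpn wave, not
char _dta[vbles] `r(varlist)'

* Hypothetical helper: reads the list back from the dataset characteristic
capture program drop show_vbles
program define show_vbles
    foreach v of varlist `:char _dta[vbles]' {
        display "`v'"
    }
end

show_vbles
```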

Finally, I (mis?)understood your question to involve calculating the first non-missing value in each household. But what you say in #4 suggests you actually need to do it for each combination of household and wave. If that is correct, change -runby one_hh, by(hhidpn)- to -runby one_hh, by(hhidpn wave)-.

Last edited by Clyde Schechter; 20 Jul 2023, 12:38.
Jaycob Applegate

Join Date: Jan 2023

Posts: 40
#6

20 Jul 2023, 13:02

That works, thank you Clyde!