Using only the first response in a CPS wave

Sebastian Lara

Join Date: Jan 2018

Posts: 9
#1

09 Jan 2018, 15:20

I am working with CPS data, which is collected in waves (a household is surveyed for four months, is off for eight, and then is surveyed for four more months; each survey month is a wave). I am using waves 1-4. One of the variables is age, and I only want to keep the response from wave 1 and copy it to the age variable for waves 2-4. Each person has a unique identifier (CPSIDP) that I can use to match the waves.

My professor said this, but I am not sure I understand:

"On the question of picking off wave 1 age etc: There are a number of ways it could work. It depends on variable names and the organization of the dataset that you're working with. Here is one way that can work if you have one big data file that has a column for age in each of the possible surveys:

First drop all of the observations that are not wave one. Then you can generate a new variable that is the sum of the age variables. Of course, the age doesn't exist for any of the months that are not wave==1. So, this should just give you the age in wave==1. It might be that the generate command adding up the various age variables doesn't work because the age is coded as missing in the other periods (and the sum when any one element is missing is mis"

Thank you in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

09 Jan 2018, 16:03

I don't think I would do it this way--even the explanation gives several problems that could easily arise. Here's what I would do.

Code:

// VERIFY AGE IS CODED CONSISTENTLY IN ALL OBSERVATIONS
// OF THE SAME PARTICIPANT IN ANY GIVEN WAVE
by CPSIDP (age), sort: assert age == age[1] | missing(age)

// CREATE AN AGE AT WAVE 1 VARIABLE FOR EACH PARTICIPANT
// IN ALL OBSERVATIONS FOR THAT PARTICIPANT
by CPSIDP: egen age_wave_1 = max(cond(wave == 1, age, .))

Notes: Large data sets often contain errors and inconsistencies. This code starts by verifying that age is at least consistently reported across all observations of the same participant in wave 1. It then identifies that unique age, and copies it into a new variable, age_wave_1, in all observations for that participant. I do not recommend just clobbering the values of the original age variable in subsequent waves because a) you might need that information later for some currently unforeseen reasons, and b) the variable would contain misleading information, as age for some participants will change over the course of the different waves.
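As an illustration of the second command only, here is a minimal sketch on a toy data set. The identifiers and ages below are invented for the example and are not CPS data; the variable names CPSIDP, wave, and age are assumed to match yours.

```stata
// Toy data, invented for illustration only (not CPS values)
clear
input double CPSIDP byte(wave age)
101 1 30
101 2 30
101 3 31
202 2 45
202 3 45
end

// Copy each participant's wave-1 age into all of that participant's observations
by CPSIDP, sort: egen age_wave_1 = max(cond(wave == 1, age, .))
list, sepby(CPSIDP)
```

In this sketch participant 101 gets age_wave_1 = 30 in every row, while participant 202, who is never observed in wave 1, gets a missing value.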

Because you did not provide example data, this code is not tested. It may contain errors. It may actually be unsuitable for your data set, as I have made assumptions about how your data is organized. If your data are not as I imagine them to be, then we have both wasted our time here. I was willing to hazard a guess about your data because I have worked with CPS data myself on occasion. But one shouldn't rely on such happy coincidences. In the future, if you want assistance with code, you should always post example data. Please read the forum FAQ for excellent advice about posting effective questions that are likely to draw timely and helpful responses. Pay particular attention to #12 which describes the helpful way to show example Stata data.
Sebastian Lara

Join Date: Jan 2018

Posts: 9
#3

09 Jan 2018, 16:43

Thank you. For the sake of time we were thinking of doing it the original way (overwriting the values of the original age variable in subsequent waves), and noting that as a limitation of our analysis. But this could also work! I will double check. And I took note of your last paragraph; my apologies.
Sebastian Lara

Join Date: Jan 2018

Posts: 9
#4

10 Jan 2018, 12:47

I was able to use the code that you offered (changing the variable names to match), and these were the results I got:

For the first line, I presume this means that 132,680 individuals either had a birthday at some point during the first four waves of rotation, or their age wasn't recorded to begin with?

What is the discrepancy between the number of contradictions in the first line of code, and the number of missing values generated in the second line of code? Is the difference between the two those observations that didn't have an age value to begin with (although I presume age is a pretty straightforward variable unlikely to be missing)?

My data set is comprised of observations from many years and rotations all in one file, sorted by year.

Thank you in advance!!
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#5

10 Jan 2018, 13:26

For the first line, I presume this means that 132,680 individuals either had a birthday at some point during the first four waves of rotation, or their age wasn't recorded to begin with?

No, it means that 132,680 individuals had two different non-missing values for age reported. People with no age recorded at all would not be contradictions, because the -missing(age)- part of the condition would always be satisfied for them. However, I note that I neglected to restrict the condition to a single wave. So if people reported different ages in different waves, these would be flagged as contradictions. So that first command is wrong and should have been:

Code:

by cpsidp wave (age), sort: assert age == age[1] | missing(age)

You may find that many of the contradictions go away with this correction. Then there is the issue of what to do with any contradictions that might remain. Some of them may indeed be birthdays that occur during the course of a wave. Here's how I would approach these:

Code:

// FIRST MARK THE CONTRADICTIONS
by cpsidp wave (age), sort: egen byte problem = max(age != age[1] & !missing(age))

// LOOK FOR BIRTHDAYS WITHIN WAVES: AGE CHANGES BY AT MOST 1
// AND OBSERVATIONS ARE IN CHRONOLOGICAL ORDER
gen byte age_missing = missing(age)
by cpsidp wave (age_missing date), sort: gen byte in_order = age <= age[_n+1]
by cpsidp wave age_missing (date): gen age_change = age[_N]-age[1]
replace age_change = 0 if age_missing
by cpsidp wave (in_order), sort: replace in_order = in_order[1]
by cpsidp wave: egen mitigated = min(in_order & age_change <= 1)
replace problem = 0 if mitigated

Note: Because you have not shared example data, I cannot test this code. The logic here is somewhat complicated, so I'm not 100% sure that I have it right. But if I do, at this point those observations with problem == 1 are CPSIDP's and waves where there are inconsistent reports of age within the wave and where the inconsistencies are not chronologically compatible with the mere passage of a birthday. You can -list- or -browse- them to inspect them and figure out what to do about them.

What is the discrepancy between the number of contradictions in the first line of code, and the number of missing values generated in the second line of code?

There is no discrepancy; they are different things. In fact, there is no overlap between them. The number of contradictions is people who have inconsistent reports of non-missing values of age in wave 1. The missing observations generated are people for whom no value of age is reported in wave 1.

I presume age is a pretty straightforward variable unlikely to be missing

It is not particularly uncommon for people to decline to disclose their age, for a variety of reasons. Also, if you are using a public use data set from the CPS, they may have withheld ages for some people to protect confidentiality. I don't recall what their policies regarding this are. You seem to have about 11% non-response here (in terms of observations, the percentage of people may be different if different people have different numbers of observations). That's not great, but it's not terrible.

Sebastian Lara

Join Date: Jan 2018
Posts: 9
#6

10 Jan 2018, 19:19

Here is an example of my dataset (I use mish as the variable for wave):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year byte(month mish age) double cpsidp
2008 1 1 18 20080100020104
2008 2 2 18 20080100020104
2008 1 1 32 20080100022101
2008 2 2 32 20080100022101
2008 1 1 16 20080100022605
2008 2 2 16 20080100022605
2008 3 3 16 20080100022605
2008 4 4 17 20080100022605
2008 2 2 19 20080100028903
2008 3 3 19 20080100028903
2008 4 4 19 20080100028903
2008 1 1 28 20080100032901
2008 2 2 29 20080100032901
2008 3 3 29 20080100032901
2008 4 4 29 20080100032901
2008 1 1 21 20080100033801
2008 2 2 22 20080100033801
2008 3 3 22 20080100033801
2008 4 4 22 20080100033801
2008 1 1 20 20080100033802
2008 1 1 26 20080100033803
2008 2 2 26 20080100033803
2008 3 3 26 20080100033803
2008 4 4 26 20080100033803
2008 3 3 20 20080100033804
2008 4 4 20 20080100033804
2008 1 1 20 20080100038203
2008 2 2 20 20080100038203
2008 3 3 20 20080100038203
2008 4 4 20 20080100038203
2008 1 1 23 20080100038801
2008 2 2 23 20080100038801
2008 3 3 23 20080100038801
2008 4 4 23 20080100038801
2008 1 1 28 20080100038802
2008 2 2 28 20080100038802
2008 3 3 28 20080100038802
2008 4 4 28 20080100038802
2008 3 3 16 20080100043102
2008 4 4 16 20080100043102
2008 1 1 18 20080100050902
2008 2 2 18 20080100050902
2008 1 1 29 20080100051101
2008 2 2 29 20080100051101
2008 3 3 29 20080100051101
2008 4 4 29 20080100051101
2008 1 1 32 20080100051102
2008 2 2 32 20080100051102
2008 3 3 32 20080100051102
2008 4 4 32 20080100051102
2008 1 1 32 20080100062001
2008 1 1 32 20080100062002
2008 1 1 31 20080100063001
2008 2 2 32 20080100063001
2008 3 3 32 20080100063001
2008 4 4 32 20080100063001
2008 1 1 16 20080100063002
2008 2 2 16 20080100063002
2008 3 3 16 20080100063002
2008 4 4 16 20080100063002
2008 1 1 18 20080100063901
2008 2 2 18 20080100063901
2008 3 3 18 20080100063901
2008 4 4 18 20080100063901
2008 1 1 30 20080100066801
2008 2 2 30 20080100066801
2008 3 3 30 20080100066801
2008 4 4 30 20080100066801
2008 1 1 24 20080100068801
2008 2 2 24 20080100068801
2008 3 3 24 20080100068801
2008 4 4 24 20080100068801
2008 1 1 20 20080100068802
2008 2 2 20 20080100068802
2008 3 3 20 20080100068802
2008 4 4 20 20080100068802
2008 1 1 17 20080100069203
2008 2 2 17 20080100069203
2008 3 3 17 20080100069203
2008 4 4 17 20080100069203
2008 1 1 22 20080100069204
2008 2 2 22 20080100069204
2008 3 3 22 20080100069204
2008 4 4 22 20080100069204
2008 1 1 29 20080100069801
2008 2 2 29 20080100069801
2008 3 3 29 20080100069801
2008 4 4 30 20080100069801
2008 1 1 29 20080100070002
2008 2 2 29 20080100070002
2008 3 3 30 20080100070002
2008 4 4 30 20080100070002
2008 1 1 20 20080100070103
2008 2 2 20 20080100070103
2008 3 3 20 20080100070103
2008 4 4 20 20080100070103
2008 1 1 22 20080100078301
2008 2 2 22 20080100078301
2008 3 3 22 20080100078301
2008 4 4 22 20080100078301
end
label values month month_lbl
label def month_lbl 1 "January", modify
label def month_lbl 2 "February", modify
label def month_lbl 3 "March", modify
label def month_lbl 4 "April", modify
label values mish mish_lbl
label def mish_lbl 1 "One", modify
label def mish_lbl 2 "Two", modify
label def mish_lbl 3 "Three", modify
label def mish_lbl 4 "Four", modify
label values age age_lbl
label def age_lbl 16 "16", modify
label def age_lbl 17 "17", modify
label def age_lbl 18 "18", modify
label def age_lbl 19 "19", modify
label def age_lbl 20 "20", modify
label def age_lbl 21 "21", modify
label def age_lbl 22 "22", modify
label def age_lbl 23 "23", modify
label def age_lbl 24 "24", modify
label def age_lbl 26 "26", modify
label def age_lbl 28 "28", modify
label def age_lbl 29 "29", modify
label def age_lbl 30 "30", modify
label def age_lbl 31 "31", modify
label def age_lbl 32 "32", modify

I just chose a chunk of 100 observations, and I think there are two people in there who exhibit what I am talking about. When I run the first line of code

Code:

by cpsidp mish (age), sort: egen byte problem = max(age != age[1] & !missing(age))

I end up getting people reported as 'problems' when they don't look like they should be:

[Image: problemvariable.png]

As you can see, the one person who is a 'problem' got labeled correctly, but everyone else, who should be fine, is also labeled as a 'problem.' (These people are included in the example dataset above.)

Furthermore, in the second line of code,

Code:

by cpsidp wave (age_missing date), sort: gen byte in_order = age <= age[_n+1]

what did you mean by the variable date? I could not find one like that in the IPUMS-CPS either.

Thank you so much! Hopefully I gave you enough information to make this as productive as possible.

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#7

10 Jan 2018, 20:04

Well, I assumed that there was a date variable that identified the date of participation for each observation. Evidently you have, instead, only a month and year. So we can combine those into a date variable that will serve the purpose here.
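A minimal sketch of that combination, assuming the variables are named year and month as in your example data. The name date is chosen only so that it matches the variable used in the code in #5:

```stata
// Build a Stata monthly date from the separate year and month variables
gen date = ym(year, month)

// Display it as a monthly date (e.g. 2008m1) rather than a raw integer
format date %tm
```

Because there is only one observation per month here, a monthly date is enough to put each participant's observations in chronological order.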

I don't think you read down and understood the code before starting it. The first command is just a start. It does, temporarily, identify observations as problems that are not. But then all the other lines go about correcting that over-generation. Notice the line in the code that says -replace problem = 0 if mitigated- at the end: if certain conditions apply, the "problem" designation is rescinded. So you can't interpret "problem" at the end of that first command. You have to execute the entire code.

That said, there are other discrepancies between the code and what I assumed the data to be. You appear not to have a wave variable. Unless that is mish. But if that is mish, it appears in your example that you have just one observation per participant in each wave, so there is no possibility of inconsistency within a wave, and we have, instead, to check for age inconsistencies across waves. That would be much harder, and perhaps effectively impossible without more precise dates for the observations.

So before we go any farther, let's clarify some things:

Is mish the variable that records wave? If not, what is mish? And what is the variable that does record wave?

Can the same participant have more than one observation in the data for the same wave?

Did you run

Code:

by cpsidp wave (age), sort: assert age == age[1] | missing(age)

or some variant of that with the appropriate variable name for wave, and, if so, how many contradictions were found?
Sebastian Lara

Join Date: Jan 2018

Posts: 9
#8

11 Jan 2018, 09:43

mish is just month in sample at the household level, which would be the same as wave.

When I ran

Code:

by cpsidp mish (age), sort: assert age == age[1] | missing(age)

I found no contradictions.

As for age inconsistencies across waves, that is exactly what I am trying to check. If it would be too hard, then I think for the purposes of my analysis, just generating a new age variable where the first wave age is used would be fine. My question is how to handle those who are for some reason not present in the first wave but have subsequent waves available: how would I use the second wave age instead, and so on?
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#9

11 Jan 2018, 10:25

OK, since you found no contradictions, all of that extra code in #5 is unnecessary. To generate a variable containing the age recorded in the earliest wave for which the participant is in the sample, it would be:

Code:

by cpsidp mish (age), sort: assert age == age[1] | missing(age)
by cpsidp (mish age): gen age_at_first_participation = age[1]

Note that age_at_first_participation will have a missing value if the participant did not report an age during whatever wave he/she first participated in (even if he/she reports an age in a subsequent wave). If this is not what you want, consider instead:

Code:

by cpsidp mish (age), sort: assert age == age[1] | missing(age)
by cpsidp (mish): assert age <= age[_n+1] if !missing(age) // SEE NOTE BELOW
by cpsidp (age), sort: gen first_reported_age = age[1]

Note: This is a partial consistency check. It requires that the ages reported in successive waves be non-decreasing. Despite the vagaries of how many birthdays can occur between waves (see next paragraph), it can't be negative. If you get any contradictions from this -assert- statement it is a data problem that needs to be fixed.
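If that -assert- does fail, it only reports how many contradictions there are and stops. One possible way to see the offending observations is sketched below. It assumes the same variable names (cpsidp, mish, age); age_drop is a name I made up for the flag.

```stata
// Flag an observation whose non-missing age exceeds the age in the participant's next wave
// (if the next age is missing, age > . is false, so nothing is flagged)
by cpsidp (mish), sort: gen byte age_drop = age > age[_n+1] & !missing(age)

// How many such observations are there, and which ones?
count if age_drop
list cpsidp mish age if age_drop
```

The flag marks the observation just before the drop in age, so you will want to look at the following wave for the same cpsidp as well.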

By the way, the reason I think identifying age inconsistencies across waves is intractable is that the number of birthdays that can occur between waves 1 and 4 could vary considerably in this design. While most of the higher numbers would be errors, some would not be and without actual dates of participation and birthdates, it would be impossible to sort out.
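Short of a full consistency check, you could at least look at how much reported age changes between each participant's first and last observed wave. This is only a rough sketch under the same naming assumptions; age_span and first_obs are made-up names.

```stata
// Change in reported age between a participant's first and last observed wave
by cpsidp (mish), sort: gen age_span = age[_N] - age[1]

// Mark one observation per participant so each person is counted once
by cpsidp (mish): gen byte first_obs = _n == 1

tabulate age_span if first_obs
```

Note that age_span will be missing whenever the first or last age is missing, and the tabulation will not tell you which of the larger values are errors.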