Replace missing values with those of previous values or those of the partner

Chris Boulis

Join Date: Feb 2019

Posts: 368
#1

30 Nov 2020, 17:58

Hi Statalist.

I am using annual survey responses on marital status from a panel study (one response for each partner in a married or cohabiting couple) for my analysis. Around 0.8% of responses are missing. I want to replace each missing value with the same partner's lagged value or, if that is also missing, with the other partner's value; it is not uncommon for only one partner in a couple to answer this question, and sometimes individuals skip it in one or more years. Is this approach reasonable, particularly when both lagged and future values exist and they are the same?

I'm sure my draft code does not achieve what I want it to, so help here is appreciated (mrcurr1: marital status of male partner; mrcurr2: marital status of female partner).

Code:

bys id (wave): replace mrcurr1 = L.mrcurr1 if missing(mrcurr1)
bys id (wave): replace mrcurr1 = L.mrcurr2 if missing(mrcurr1) // in case the current wave response is missing for both
bys id (wave): replace mrcurr2 = L.mrcurr2 if missing(mrcurr2)
bys id (wave): replace mrcurr2 = L.mrcurr1 if missing(mrcurr2)

Sample data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte(wave mrcurr1 mrcurr2)
11 61  8 . 2
11 61 12 . 2
11 61 13 . 2
11 61 14 . 2
11 61 15 . 2
11 61 16 . 2
11 61 17 . 2
11 61 18 . 2
13 60 12 . 1
13 60 13 . 1
13 60 14 . 1
13 60 15 . 1
13 60 16 . 1
13 60 17 . 1
13 60 18 . 1
16 64  8 2 .
16 64  9 2 .
16 64 10 2 .
16 64 11 1 .
16 64 12 1 .
16 64 13 1 .
16 64 14 1 .
16 64 15 1 .
16 64 16 1 .
16 64 17 1 .
16 64 18 1 .
23 24  6 . .
23 24  7 . .
23 24  8 . .
23 24  9 . .
23 24 10 . .
23 24 11 . .
23 24 12 . .
end

Last edited by Chris Boulis; 30 Nov 2020, 18:01.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

30 Nov 2020, 18:33

I'm sure my draft code does not achieve what I want it to

Why do you say that? It seems to me that it does. It eliminates most of the missing values of both mrcurr* variables. The ones it does not eliminate are:

1. Missing values occurring in the very first wave of an id's data: in that case there is no previous value to carry forward.
2. Missing values occurring after an absence of participation over several waves. L.mrcurr1 (or 2) refers to the value from the wave that is numbered one less: if waves have been skipped, then the lag is undefined. It may be that you want to carry forward results even if you have to go back several waves before getting to the last non-missing one. In that case, change L.mrcurr1 to mrcurr1[_n-1] and L.mrcurr2 to mrcurr2[_n-1] throughout the code, and that will happen.
3. Situations like id 23 where neither mrcurr1 nor mrcurr2 has ever been answered--so there is nothing to carry forward.
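
To make point 2 concrete, here is a minimal sketch on made-up data (the variable x, its values, and the skipped waves are illustrative assumptions, not taken from the survey). It contrasts L., which looks for the wave numbered one less, with [_n-1], which looks at the previous observation whatever its wave number.

```stata
* Illustrative data only: one id, with waves 3 and 4 skipped
clear
input long id byte(wave x)
1 1 2
1 2 .
1 5 .
1 6 .
end

* L. needs xtset data and refers to wave - 1, which may not be in the data
xtset id wave
gen byte x_lag = x
replace x_lag = L.x_lag if missing(x_lag)

* [_n-1] refers to the previous observation within id, whatever its wave
gen byte x_prev = x
bysort id (wave): replace x_prev = x_prev[_n-1] if missing(x_prev)

list id wave x x_lag x_prev, sepby(id)
```

In this sketch x_lag stays missing in waves 5 and 6 because wave 4 is not in the data, whereas x_prev is filled in every wave.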
Chris Boulis

Join Date: Feb 2019

Posts: 368
#3

30 Nov 2020, 19:32

Thank you Clyde Schechter for your great feedback.

Why do you say that?

Because I still found some missing values (probably due to missing waves or missing values in the first wave of the couple's entry into the survey), but these should be addressed by replacing "L." with "[_n-1]" as you noted.

1. This could be addressed by the 2nd and 4th lines (#1) where I replace the missing value with the other partner's response, as long as that is not itself missing. Otherwise, I could replace the missing value, e.g. in wave 1, with the next observed value, e.g. from wave 2 or later, using [_n+1]. Is that correct?
2. Should I always use "[_n-1]" instead of "L."? Or should I continue to use "L." if I only want to refer to the response in the previous wave, and use "[_n-1]" when I want to search back to the last response even if that is four waves prior?
3. Yes I will leave the cases where neither ever answer the question. However, the code above should replace missing values if there are random missing values from one partner or the other over time, right?

Here's my updated code. Help writing this more efficiently is appreciated.

Code:

bys id (wave): replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
bys id (wave): replace mrcurr1 = mrcurr2[_n-1] if missing(mrcurr1)
bys id (wave): replace mrcurr1 = mrcurr1[_n+1] if missing(mrcurr2[_n-1])
bys id (wave): replace mrcurr2 = mrcurr2[_n-1] if missing(mrcurr2)
bys id (wave): replace mrcurr2 = mrcurr1[_n-1] if missing(mrcurr2)
bys id (wave): replace mrcurr2 = mrcurr2[_n+1] if missing(mrcurr1[_n-1])

Last edited by Chris Boulis; 30 Nov 2020, 19:35.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#4

30 Nov 2020, 21:39

Code:

bys id (wave): replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
bys id (wave): replace mrcurr1 = mrcurr2[_n-1] if missing(mrcurr1)
bys id (wave): replace mrcurr1 = mrcurr1[_n+1] if missing(mrcurr1)
bys id (wave): replace mrcurr2 = mrcurr2[_n-1] if missing(mrcurr2)
bys id (wave): replace mrcurr2 = mrcurr1[_n-1] if missing(mrcurr2)
bys id (wave): replace mrcurr2 = mrcurr2[_n+1] if missing(mrcurr2)

I have no suggestions for greater efficiency. The code could be made a bit more compact by putting it in a loop over mrcurr1 and mrcurr2, but that would just make the code less transparent and harder to read and understand. I rarely write loops over code to run it twice: I usually just leave the doubled code (unless the doubled code itself is very, very long, or if I anticipate expanding the loop to involve more than just two iterations.)
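
For readers curious what the loop version mentioned above might look like, here is a hedged sketch on made-up data. The local macro names i and j and the toy values are assumptions for illustration; the three replacements inside the loop mirror the doubled code, with the same order of fallbacks.

```stata
* Illustrative data only
clear
input long id byte(wave mrcurr1 mrcurr2)
1 1 . 2
1 2 . .
1 3 1 .
end

forvalues i = 1/2 {
    * j is the index of the other partner: 2 when i is 1, and 1 when i is 2
    local j = 3 - `i'
    bysort id (wave): replace mrcurr`i' = mrcurr`i'[_n-1] if missing(mrcurr`i')
    by id (wave): replace mrcurr`i' = mrcurr`j'[_n-1] if missing(mrcurr`i')
    by id (wave): replace mrcurr`i' = mrcurr`i'[_n+1] if missing(mrcurr`i')
}

list, sepby(id)
```

The loop saves three lines at the cost of macro substitution in every command, which is the loss of transparency described above.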

I have made two changes to the code (the conditions on the third and sixth commands) where the logic was incorrect. The condition for this code should always be -if missing(X)- where X is the same variable being replaced. If one of the first two commands finds a non-missing substitute for mrcurr1, then you don't want to then replace it again with mrcurr1[_n+1]: you want to stick with what you have already gotten. In fact, the way you had it, you could even end up replacing your newly found non-missing value with a missing value. So the code as I have revised it basically searches for the most recently available value of mrcurr1 if mrcurr1 is missing. If that still leaves it missing, then it tries the most recently available value of mrcurr2. And if that still doesn't work, it looks for the next available value of mrcurr1. It then does the analogous series of replacements for any missing value of mrcurr2.
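
The risk of overwriting a found value with a missing one can be seen in a small sketch on made-up data. The variables wrong and right are hypothetical copies of mrcurr1 created only so the two conditions can be compared side by side.

```stata
* Illustrative data only: the partner never answers
clear
input long id byte(wave mrcurr1 mrcurr2)
1 1 . .
1 2 1 .
1 3 . .
end

gen byte wrong = mrcurr1
gen byte right = mrcurr1

* Condition on a different variable: can undo an earlier fill
bysort id (wave): replace wrong = wrong[_n-1] if missing(wrong)
by id (wave): replace wrong = wrong[_n+1] if missing(mrcurr2[_n-1])

* Condition on the variable being replaced: only touches what is still missing
by id (wave): replace right = right[_n-1] if missing(right)
by id (wave): replace right = right[_n+1] if missing(right)

list, sepby(id)
```

In wave 3 the carried-forward value of wrong is replaced by wrong[_n+1], which lies beyond the end of the group and is therefore missing, while right keeps the value it was given.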

L. should only be used to refer to the numerically previous wave--which may or may not actually be there in the data set. If you want the most recent available even if you have to go back several, then it has to be [_n-1].
Chris Boulis

Join Date: Feb 2019

Posts: 368
#5

01 Dec 2020, 04:23

Thank you Clyde Schechter. Very helpful as always. Thank you and I'm happy the code was mostly correct. So the only missing values for mrcurr1 and mrcurr2 should only relate to those noted in "3" (in #3) i.e. where neither in a couple respond to the marital status question the whole time they are in the survey. To test this, I would like to view these couples in the browse window. I tried:

Code:

br id p_id wave mrcurr1 mrcurr2 if missing(mrcurr1) & missing(mrcurr2)

but I believe this only captures the waves in which they don't respond and I am not sure if it includes all the waves these couples were in the survey. Help appreciated.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#6

02 Dec 2020, 14:12

but I believe this only captures the waves in which they don't respond

That is correct. To see all the waves for a couple who ever both fail to respond, it would be:

Code:

by id p_id, sort: egen to_show = max(missing(mrcurr1) & missing(mrcurr2))
browse if to_show
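
As a hedged aside, the same egen approach can separate couples who are missing in at least one wave from couples who never answer at all, by using min() instead of max(). The variable names ever_both_missing and never_answered and the toy rows below are assumptions for illustration.

```stata
* Illustrative data only: couple 23/24 never answers, couple 183/139 misses one wave
clear
input long(id p_id) byte(wave mrcurr1 mrcurr2)
23 24 6 . .
23 24 7 . .
183 139 5 . .
183 139 6 1 1
end

* max() of a 0/1 expression is 1 if it is true in ANY wave of the couple
bysort id p_id: egen byte ever_both_missing = max(missing(mrcurr1) & missing(mrcurr2))

* min() of the same expression is 1 only if it is true in EVERY wave
by id p_id: egen byte never_answered = min(missing(mrcurr1) & missing(mrcurr2))

list, sepby(id)
```

Browsing with -if never_answered- would then show only the couples for whom there is nothing to carry forward or backward.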

Chris Boulis

Join Date: Feb 2019

Posts: 368
#7

02 Dec 2020, 17:15

Thank you Clyde Schechter. Interestingly, the code is picking up couples (around 15) that actually do respond to mrcurr most of the time, but whose responses in the initial waves (e.g. the first one or two) are missing. I'm happy I found this, as I thought the earlier code would replace initial missing values with future values, but this has not occurred on these occasions. I've included example data to help with a possible solution:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte(wave mrcurr1 mrcurr2)
183 139  5 . .
183 139  6 1 1
183 139 11 1 1
183 139 12 1 1
183 139 13 1 1
183 139 14 1 1
183 139 15 1 1
183 139 16 1 1
183 139 17 1 1
183 139 18 1 1
223 224  1 . .
223 224  2 . .
223 224  3 1 1
223 224  5 1 1
223 224  7 1 1
223 224  8 1 1
223 224  9 1 1
223 224 10 1 1
223 224 11 1 1
223 224 12 1 1
223 224 13 1 1
223 224 14 1 1
223 224 15 1 1
223 224 16 1 1
223 224 18 1 1
106 484  1 . .
106 484  2 . .
106 484  3 . .
106 484  4 . 1
106 484  5 1 1
106 484  6 1 1
106 484  7 1 1
106 484  8 1 1
106 484  9 1 1
106 484 10 1 1
106 484 11 1 1
106 484 12 1 1
106 484 13 1 1
106 484 14 1 1
106 484 15 1 1
106 484 16 1 1
106 484 17 1 1
106 484 18 1 1
115 112  1 . .
115 112  2 . .
115 112  3 . 1
115 112  4 1 1
115 112  5 1 1
115 112  6 1 1
end

I appreciate your help/suggestions.

Last edited by Chris Boulis; 02 Dec 2020, 17:17.


Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#8

02 Dec 2020, 17:47

OK. I see the problem: the code would reach as far back in time as necessary to find a value, but it would only reach forward one wave and not propagate any farther than that. This code will reach arbitrarily far backwards and forward.

Code:

by id p_id (wave), sort: replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
by id p_id (wave): replace mrcurr1 = mrcurr2[_n-1] if missing(mrcurr1)
by id p_id (wave), sort: replace mrcurr2 = mrcurr2[_n-1] if missing(mrcurr1)
by id p_id (wave): replace mrcurr2 = mrcurr1[_n-1] if missing(mrcurr1)

gsort id p_id -wave
by id p_id: replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
by id p_id: replace mrcurr2 = mrcurr2[_n-1] if missing(mrcurr2)

sort id p_id wave
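
A minimal sketch of why [_n+1] reaches forward only one wave while reversing the sort order and using [_n-1] propagates all the way, on made-up data (the variables x, one_step, and cascaded are illustrative assumptions). -replace- works through the observations in order, so a value can only cascade in the direction in which the data are processed.

```stata
* Illustrative data only: the first two waves are missing
clear
input long id byte(wave x)
1 1 .
1 2 .
1 3 1
end

* [_n+1] in ascending order: wave 1 is processed before wave 2 has been filled
gen byte one_step = x
bysort id (wave): replace one_step = one_step[_n+1] if missing(one_step)

* Descending order with [_n-1]: each filled value is available to the next row
gen byte cascaded = x
gsort id -wave
by id: replace cascaded = cascaded[_n-1] if missing(cascaded)

sort id wave
list, sepby(id)
```

Here one_step is still missing in wave 1, while cascaded is filled in both of the initial waves.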


Chris Boulis

Join Date: Feb 2019

Posts: 368
#9

02 Dec 2020, 18:45

Thank you Clyde Schechter. Regarding the first four lines: I am unclear when I need to include -sort- and when not to. In the last lines, I see that by using "-wave" you can then use [_n-1] rather than [_n+1] - is that because of limitations using the latter and/or because it is more reliable to replace future missing values with prior values than the alternative? Your clarifications are appreciated.

After using the updated code in #8, I found four couples where mrcurr is missing in later waves. That said, except for the last couple, there are many missing values and missing waves, which may suggest that some kind of failure occurred (although this is not explicit), so I think I'll leave them as is.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte(wave mrcurr1 mrcurr2)
113 118  1 1 1
113 118  2 1 1
113 118  5 . 1
113 118  6 . .
113 118  7 . .
115 105  1 1 1
115 105  2 1 1
115 105  3 . 1
115 105  4 . .
115 105  5 . .
115 105  6 . .
115 105  7 . .
115 105  8 . .
115 157 15 2 2
115 157 16 1 1
118 156  1 1 1
118 156  2 1 1
118 156  3 1 1
118 156  5 . 1
118 156  6 . .
118 156  7 . .
118 156 10 . .
110 197 11 1 1
110 197 12 1 1
110 197 13 1 1
110 197 14 1 1
110 197 15 . 1
110 197 16 . .
110 197 17 . .
end

Last edited by Chris Boulis; 02 Dec 2020, 18:52.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#10

02 Dec 2020, 21:25

Sorry, my error. The third and fourth lines had the wrong variable in the -if missing()- clauses. It should be

Code:

by id p_id (wave), sort: replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
by id p_id (wave): replace mrcurr1 = mrcurr2[_n-1] if missing(mrcurr1)
by id p_id (wave): replace mrcurr2 = mrcurr2[_n-1] if missing(mrcurr2)
by id p_id (wave): replace mrcurr2 = mrcurr1[_n-1] if missing(mrcurr2)

gsort id p_id -wave
by id p_id: replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
by id p_id: replace mrcurr2 = mrcurr2[_n-1] if missing(mrcurr2)

sort id p_id wave

You need to specify the -sort- option in a -by- prefix whenever the required sort order changes, or if the preceding command has destroyed the previously established sort order. The use of -sort- in the third line of code in #8 was not necessary. I had it there simply because, in my mind, I was thinking that the third and fourth lines should be just like the first and second, except interchanging the roles of mrcurr1 and mrcurr2. So I mindlessly copied the sort.

Now, in this situation, the extra sort had no effect whatsoever because id p_id and wave uniquely identify observations, so Stata would recognize that the data were already fully sorted on those variables and just ignore the unnecessary sort. But if those variables did not uniquely identify observations, then Stata might have re-sorted the data, re-randomizing the order within id p_id and wave. That could, in some circumstances, break the logic of a sequence of commands that required that the sort order remain unchanged throughout (even though the sort order wasn't uniquely determined). That could be a particularly nasty bug to find and fix!

So it is a good idea to be more careful than I was about the use of the -sort- option in -by:- prefixes. Use it when necessary to establish the sort order for the command following -by-. Do not re-use it if the existing sort order is already correct for the command.
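
The advice above can be sketched on made-up data entered out of order (the scalar rc_unsorted and the variable seq are hypothetical names used only for this illustration). The sketch also uses -isid- to check that id, p_id, and wave uniquely identify observations, which is the situation in which the sort order is fully determined.

```stata
* Illustrative data only, deliberately entered out of wave order
clear
input long(id p_id) byte(wave mrcurr1)
115 112 3 .
115 112 1 1
115 112 2 .
end

* Without -sort-, -by- refuses to run on data that are not sorted
capture by id p_id (wave): replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
scalar rc_unsorted = _rc
display "return code on unsorted data: " rc_unsorted

* Check that the key is unique; -isid- exits with an error if it is not
isid id p_id wave

* -sort- is needed once, to establish the order
by id p_id (wave), sort: replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)

* It is not needed again while that order is still in place
by id p_id (wave): gen byte seq = _n

list, sepby(id)
```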
Chris Boulis

Join Date: Feb 2019

Posts: 368
#11

03 Dec 2020, 17:07

Hi Clyde Schechter. Wonderful, that is great. In hindsight, I should have picked up that issue with the code (my laziness), but thank you and thanks for your explanation re using -sort- after -by-. That is much clearer.