Filling missing values on variables only appearing every four years in the panel

Chris Boulis

Join Date: Feb 2019

Posts: 368
#1

Filling missing values on variables only appearing every four years in the panel

12 Nov 2020, 23:50

Hi Statalist.

I have a couple of variables that only appear in waves 4, 7, 10, 14, 18 (of a panel dataset of 18 waves). As such, these variables contain large numbers of missing values. I used the code (slightly amended from here)

Code:

replace dummy = L1.dummy if missing(dummy)

which fills in the gaps between waves 4 - 18. How can I amend this code to copy the response from wave 4 back to waves 1 to 3?

Help kindly appreciated.

Stata v.15.1
Tags: None
Chris Boulis

Join Date: Feb 2019

Posts: 368
#2

13 Nov 2020, 01:04

Added: To clarify, I want to fill the missings for waves 5 and 6 with the response in 4, waves 8 and 9 with the response in 7, etc, but as the first response only occurs in wave 4, I want to 'back date' the response for wave 4 to waves 1 to 3. Kind regards, Chris
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10481
#3

13 Nov 2020, 02:42

The inverse of the lag is the lead. So, you can also specify

Code:

replace dummy = F.dummy if missing(dummy)

Of course, sequence matters here.
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36053

13 Nov 2020, 03:02

Andrew Musau's suggestion helps, but not enough. The reason is spelled out in an ancient FAQ on this topic https://www.stata.com/support/faqs/d...issing-values/

But let's show that first, as there is no data example in this thread.

Code:

.  clear

.  set obs 18
number of observations (_N) was 0, now 18

.  gen id = 42

.  gen wave = _n

.  gen test = 1 if inlist(_n, 4, 7, 10, 14, 18)
(13 missing values generated)

.  
.  tsset id wave
       panel variable:  id (strongly balanced)
        time variable:  wave, 1 to 18
                delta:  1 unit

.  
.  clonevar tryit = test
(13 missing values generated)

.  replace tryit = L1.tryit if missing(tryit)
(10 real changes made)

.  
.  replace tryit = F1.tryit if missing(tryit)
(1 real change made)

. l

     +--------------------------+
     | id   wave   test   tryit |
     |--------------------------|
  1. | 42      1      .       . |
  2. | 42      2      .       . |
  3. | 42      3      .       1 |
  4. | 42      4      1       1 |
  5. | 42      5      .       1 |
     |--------------------------|
  6. | 42      6      .       1 |
  7. | 42      7      1       1 |
  8. | 42      8      .       1 |
  9. | 42      9      .       1 |
 10. | 42     10      1       1 |
     |--------------------------|
 11. | 42     11      .       1 |
 12. | 42     12      .       1 |
 13. | 42     13      .       1 |
 14. | 42     14      1       1 |
 15. | 42     15      .       1 |
     |--------------------------|
 16. | 42     16      .       1 |
 17. | 42     17      .       1 |
 18. | 42     18      1       1 |
     +--------------------------+

The reasoning is as follows. .

1. Stata works separately by panel, as you would hope and expect.

2. The data example has only one panel, but that's enough to discuss.

3. Stata starts with wave 1. The value of your variable is missing, but the next value found with F1 is missing too, so no change is made.

4. Same story with wave 2.

5. With wave 3, you get what you want.

So, this method is sufficient if and only if there is just one wave to replace, whenever the 2nd value in each panel is known.

Swapping L. with F. doesn't make Stata go backwards. It still works using the current order of observations.

A solution here -- as explained in the FAQ just cited -- is to reverse time temporarily. Alternatively, what you want is just an interpolation for the first run of missing values using say ipolate.

Comment

Chris Boulis

Join Date: Feb 2019
Posts: 368

13 Nov 2020, 21:52

Thank you Andrew Musau, however, as Nick Cox noted it only worked for wave 3 and not waves 1 and 2.

Thank you Nick Cox for providing the FAQ - this was helpful. Having tsset my data (tsset id wave) I used the provided code

Code:

replace myvar = F.myvar if myvar >=.

however, as you noted above and in the FAQ, while "L1.myvar" has a cascading effect forward in time, "F1.myvar" does not have the same effect (in reverse), and only fills one period - in my case, it replaced the missing value in wave 3 with the value in wave 4. I thought I would address this issue by reversing time as you suggest in the FAQ using

Code:

gsort -wave
replace myvar = L.myvar if myvar >= .

to which Stata advised "not sorted r(5)".

Can you kindly suggest a way forward using the following sample? (Note: there are individuals that entered the survey after wave 1).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte wave int(rel1 rel2)
11 63 12    .    .
11 63 13    .    .
11 63 14 2330 7000
11 63 15    .    .
11 63 16    .    .
11 63 17    .    .
11 63 18 7000 7000
14  15  1    .    .
14  15  2    .    .
14  15  3    .    .
14  15  4 2010 2010
14  15  5    .    .
14  15  6    .    .
14  15  7 2010 2010
14  15  8    .    .
14  15  9    .    .
14  15 10 2010 2010
14  15 11    .    .
14  15 12    .    .
14  15 13    .    .
14  15 14    . 2010
14  15 15    .    .
14  15 16    .    .
18  19  1    .    .
18  19  2    .    .
18  19  3    .    .
18  19  4 7000 1000
18  19  5    .    .
18  19  6    .    .
18  19  7 7000 1000
18  19  8    .    .
18  19  9    .    .
18  19 10 7000 1000
18  19 11    .    .
18  19 12    .    .
18  19 13    .    .
18  19 14 7000 1000
18  19 15    .    .
18  19 16    .    .
18  19 17    .    .
23  24  1    .    .
23  24  2    .    .
23  24  3    .    .
23  24  4 2400 2400
23  24  5    .    .
23  24  6    .    .
23  24  7 2400 2400
23  24  8    .    .
23  24  9    .    .
23  24 10 2000 2000
23  24 11    .    .
23  24 12    .    .
23  24 13    .    .
23  24 14 2400 2400
23  24 15    .    .
23  24 16    .    .
23  24 17    .    .
23  24 18 2400 2400
end

Added: In the FAQ, you use "if myvar >= .", which I adopted above assuming it is the same as "if missing(myvar)". Can you please explain the difference, if any, between "if missing(myvar)" and "if myvar >= ." and "if myvar < ."

As an aside, if I repeat the first piece of code twice, will it not 'backfill' waves 1 and 2? Kind regards, Chris

Last edited by Chris Boulis; 13 Nov 2020, 22:09.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10481
#6

13 Nov 2020, 23:36

You cannot use time-series operators with gsort, unless you create a new time variable and change the xtsettings. So either,

Code:

xtset id wave replace rel1 = l.rel1 if missing(rel1) forval i=1/3{ replace rel1 = f.rel1 if missing(rel1) }

or

Code:

bys id (wave): replace rel1 = rel1[_n-1] if missing(rel1) gsort id -wave by id: replace rel1 = rel1[_n-1] if missing(rel1)

Last edited by Andrew Musau; 13 Nov 2020, 23:39.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#7

14 Nov 2020, 03:39

The method used for reversing time in #6 is not quite that discussed in the FAQ cited earlier (because it doesn't work, among other reasons). Here's the link again: https://www.stata.com/support/faqs/d...issing-values/

Changing the sort order isn't compatible with tsset or xtset unless you create a negated version of the time variable and specify that to either command.

There is no difference between being >= . and being 1 (true) with missing() for numeric variables (or numeric scalars or constants). missing() can be applied to string arguments. Sorry, but I don't recall a reason for explaining things one way rather than another. That FAQ probably goes back about 20 years, and some details have been updated, but what wasn't broken wasn't fixed in revisions.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#8

14 Nov 2020, 21:53

Thank you for your reply Andrew Musau. I appreciate the code for the two approaches. As I have -tsset- my data, I focussed on the first approach. Note that I am dealing with data for couples, and have unique identifiers for each (id & p_id). In addition, I need to run this for six variables (rel1 rel2 rep1 rep2 ret1 ret2) - can you show me a way to include them in a single loop? Kind regards, Chris.

Last edited by Chris Boulis; 14 Nov 2020, 22:05.
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#9

14 Nov 2020, 23:14

Thank you Nick Cox. Section 6 of the FAQ "With tsset" does not appear to provide a solution for backfilling further than one data point, so the answer is likely in your comment

Changing the sort order isn't compatible with tsset or xtset unless you create a negated version of the time variable and specify that to either command

So I tried this and it seems to have worked (though I still have a few missing variables here and there, but that may be due to some other reason)

Code:

gen nwave = -wave tsset id wave replace rel1 = L.rel1 if rel1 >= . tsset id nwave replace rel1 = L.rel1 if rel1 >= . tsset id wave

If this is what you meant, can you show how I can apply it to six variables (rel1 rel2 rep1 rep2 ret1 ret2) - potentially in a single loop?
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36053

#10

15 Nov 2020, 04:21

Code:

gen nwave = -wave
tsset id wave

foreach v in rel1 rel2 rep1 rep2 ret1 ret2
    replace `v' = L.`v' if `v' >= .
}

tsset id nwave

foreach v in rel1 rel2 rep1 rep2 ret1 ret2
    replace `v' = L.`v' if `v' >= .
}

tsset id wave

Comment

Chris Boulis

Join Date: Feb 2019

Posts: 368
#11

15 Nov 2020, 16:54

Thank you Nick Cox. I really appreciate your help, guidance and support. All I can say is that it's helping. I did not have a clear picture of what the loop would look like when I posted #9. This morning it became a little clearer how I'd go about it, then I read your post and thought "yes! that's it". Maybe I'll be able to do it on my own next time. Thank you again
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#12

15 Nov 2020, 18:44

(duplicated above post)

Last edited by Chris Boulis; 15 Nov 2020, 18:47.
Comment
Shamsudini Amidu

Join Date: Jun 2022

Posts: 69
#13

01 Nov 2022, 07:37

Hi Family,
I need help to clean up my Data.
I have panel data (wave1,2 and 3) Data with 3 variables.
I realized there is some inconsistency in the responses and missing data the in ethnicity across the 3waves for same person which I think it should not be the same across since a person's ethnicity can't be change regardless of the wave.
In effect I want to use wave 1 response in ethnicity to replicate same in response to ethnicity for wave 2 and 3 of the same respondent.

I tried using this command below but it did not work out for me.
replace ethnicity=ethnicity if ethnicity ==. & wave>1

the results I get is
(0 real changes made)

Below is the dataex

input str11 FPrimaryhhid int ethnicity float wave
"1010010261" 8 1
"1010010261" . 2
"1010010261" 16 3
"1010020501" 30 1
"1010020501" . 2
"1010020501" 30 3
"1010020521" 9 1
"1010020521" . 2
"1010020521" 7 3
"1010030371" 3 1
"1010030371" . 2
"1010030371" 3 3
"1010040271" 16 1
"1010040271" . 2
"1010040271" 16 3
"1010060111" 16 1
"1010060111" . 2
"1010060111" 16 3
"1010060561" 17 1
"1010060561" . 2
"1010060561" 17 3
end

Thank you.

Shamsu

Last edited by Shamsudini Amidu; 01 Nov 2022, 08:00.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36053

#14

01 Nov 2022, 08:08

A cautious approach is to interpolate or extrapolate if and only if one distinct value is on offer.

Code:

clear 

input str11 FPrimaryhhid int ethnicity float wave
"1010010261" 8 1
"1010010261" . 2
"1010010261" 16 3
"1010020501" 30 1
"1010020501" . 2
"1010020501" 30 3
"1010020521" 9 1
"1010020521" . 2
"1010020521" 7 3
"1010030371" 3 1
"1010030371" . 2
"1010030371" 3 3
"1010040271" 16 1
"1010040271" . 2
"1010040271" 16 3
"1010060111" 16 1
"1010060111" . 2
"1010060111" 16 3
"1010060561" 17 1
"1010060561" . 2
"1010060561" 17 3
end


egen min = min(eth), by(FPrimaryhhid) 

egen max = max(eth), by(FPrimaryhhid) 

gen ethnicity2  = max if max == min

order FPrimary wave  

list, sepby(FPrimaryhhid) 


     +-----------------------------------------------------+
     | FPrimary~d   wave   ethnic~y   min   max   ethnic~2 |
     |-----------------------------------------------------|
  1. | 1010010261      1          8     8    16          . |
  2. | 1010010261      2          .     8    16          . |
  3. | 1010010261      3         16     8    16          . |
     |-----------------------------------------------------|
  4. | 1010020501      1         30    30    30         30 |
  5. | 1010020501      2          .    30    30         30 |
  6. | 1010020501      3         30    30    30         30 |
     |-----------------------------------------------------|
  7. | 1010020521      1          9     7     9          . |
  8. | 1010020521      2          .     7     9          . |
  9. | 1010020521      3          7     7     9          . |
     |-----------------------------------------------------|
 10. | 1010030371      1          3     3     3          3 |
 11. | 1010030371      2          .     3     3          3 |
 12. | 1010030371      3          3     3     3          3 |
     |-----------------------------------------------------|
 13. | 1010040271      1         16    16    16         16 |
 14. | 1010040271      2          .    16    16         16 |
 15. | 1010040271      3         16    16    16         16 |
     |-----------------------------------------------------|
 16. | 1010060111      1         16    16    16         16 |
 17. | 1010060111      2          .    16    16         16 |
 18. | 1010060111      3         16    16    16         16 |
     |-----------------------------------------------------|
 19. | 1010060561      1         17    17    17         17 |
 20. | 1010060561      2          .    17    17         17 |
 21. | 1010060561      3         17    17    17         17 |
     +-----------------------------------------------------+

Comment

Shamsudini Amidu

Join Date: Jun 2022

Posts: 69
#15

01 Nov 2022, 08:40

Hi Nick Cox,
Thanks soo much! it is working.
But to prefill response for ethnicity2 whose min is not equal to the maximum?
Is it okay to use the formulae below;
replace ethnicity2= min if ethnicity2==.

to prefill ethnicity2 using the min value for respondent who have missing ethnicity2?

Your advice sir.

I appreciate your time.

Thank you.
Comment

Announcement