Replacing missing values by copying non-missing values

Mat Sko

Join Date: Apr 2019
Posts: 52

Replacing missing values by copying non-missing values

02 Jun 2019, 11:20

Hi all,

I have the following situation:

The variables I have selected for a regression contain values for the years 1999-2011, except for the binary variable "poolacc"(= does the property has a swimming pool?) for which I have only values for the year 2011. I believe that I have to copy the same values of 2011 to the other years before (1999-2010).

How can I fill in the dots for years 1999-2010 of this variable 'poolacc' with the values of this particular year 2011 in an efficient way?

Here is some data with 3 variables; the id (I generated myself) ; hhmove=year moved into house; poolacc=presence of a pool (1= Yes, 2=No).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id int hhmove byte poolacc
  1 2001 .
  2 1999 .
  3 1999 .
  4 1999 .
  5 1999 .
  6 2000 .
  7 1999 .
  8 2000 .
  9 2000 .
 10 1999 .
 11 2000 .
 12 2001 .
 13 2000 .
 14 1999 .
 15 2001 .
 16 2001 .
 17 1999 .
 18 1999 .
 19 1999 .
 20 1999 .
 21 1999 .
 22 2000 .
 23 2001 .
 24 2000 .
 25 2000 .
 26 1999 .
 27 2000 .
 28 1999 .
 29 1999 .
 30 1999 .
 31 1999 .
 32 2000 .
 33 2001 .
 34 1999 .
 35 2000 .
 36 2000 .
 37 2000 .
 38 2001 .
 39 2000 .
 40 1999 .
 41 2001 .
 42 1999 .
 43 2001 .
 44 2001 .
 45 2000 .
 46 2001 .
 47 2001 .
 48 2001 .
 49 2000 .
 50 1999 .
 51 1999 .
 52 1999 .
 53 2000 .
 54 1999 .
 55 1999 .
 56 2001 .
 57 1999 .
 58 2001 .
 59 1999 .
 60 1999 .
 61 2000 .
 62 1999 .
 63 1999 .
 64 2001 .
 65 1999 .
 66 2000 .
 67 1999 .
 68 2000 .
 69 1999 .
 70 1999 .
 71 2001 .
 72 2000 .
 73 2000 .
 74 2000 .
 75 2001 .
 76 1999 .
 77 2001 .
 78 2000 .
 79 1999 .
 80 1999 .
 81 2000 .
 82 1999 .
 83 2001 .
 84 1999 .
 85 2001 .
 86 2001 .
 87 2000 .
 88 1999 .
 89 2000 .
 90 2000 .
 91 2000 .
 92 2000 .
 93 2000 .
 94 2000 .
 95 2001 .
 96 2000 .
 97 2001 .
 98 2000 .
 99 1999 .
100 2000 .
end

I tried the following code, but the data remained unchanged...:

Code:

bysort id (hhmove): replace poolacc = poolacc[_n -1] if poolacc ==. & hhmove < 2011

and

Code:

by id (hhmove), sort: replace poolacc= poolacc[_n-1] if poolacc>= . & hhmove<2011

(retrieved from the Stata website).

Does anyone know which code I need to put in to solve this issue?

Thanks in advance!

Tags: copy, missing, replace

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

02 Jun 2019, 12:22

It is often awkward to fill values backward in time from later observations. Your sample code is filling in values in the wrong direction - copying from earlier to later. But you can't easily copy from later to earlier, because Stata goes through your data in the order from earlier to later.

But your data has a very useful characteristic: the observations end in 2011, and that is the year that you want to copy values from. So this example points the way

Code:

. * Example generated by -dataex-. To install: ssc install dataex . clear . input float id int hhmove byte poolacc id hhmove poolacc 1. 1 2009 . 2. 1 2010 . 3. 1 2011 0 4. 2 2009 . 5. 2 2010 . 6. 2 2011 1 7. 3 2009 . 8. 3 2010 . 9. 3 2011 . 10. end . by id (hhmove), sort: replace poolacc = poolacc[_N] (4 real changes made) . list, noobs sepby(id) +-----------------------+ | id hhmove poolacc | |-----------------------| | 1 2009 0 | | 1 2010 0 | | 1 2011 0 | |-----------------------| | 2 2009 1 | | 2 2010 1 | | 2 2011 1 | |-----------------------| | 3 2009 . | | 3 2010 . | | 3 2011 . | +-----------------------+

will do what you set out to do.

With that said, I have a concern. If a property had no pool in 2011, it seems reasonable to assume there was not a pool in the earlier years that was removed. But if a property had a pool in 2011, it may have been built sometime between 1999 and 2011, so it is not clear that copying the value from 2011 is a good idea.

Realistically, you have a single variable "property had a pool in 2011" and that is how I would interpret the filled-in variable. I don't know how you intend to use this variable in your analysis, but that's really all you can correctly infer from it. .
Comment
Mat Sko

Join Date: Apr 2019

Posts: 52
#3

02 Jun 2019, 13:11

Hi William,

Thanks for your response.

I tried it for the full data, but '0 real changes made' appears on screen after entering the suggested code.

FYI: 2013 is actually the last year in my data, but I will focus on the period between 1999-2011 by entering

Code:

keep if hhmove>=1999 & hhmove<=2011

just before.

Furthermore, the binary variable 'poolacc' is either 1= if pool is present or 2= if not present in my data instead of the standard 1/0. Do I have to switch these values before using the replace code (and how)?

(Regarding the assumption, because the same houses are interviewed every 2 years, it was okay for my mentor to assume the house characteristics were the same prior to 2011)
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

02 Jun 2019, 14:07

Code:

. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input float id int hhmove byte poolacc

            id    hhmove   poolacc
  1. 1 2009 .
  2. 1 2010 .
  3. 1 2011 1
  4. 1 2012 .
  5. 1 2013 .
  6. 2 2009 .
  7. 2 2010 .
  8. 2 2011 2
  9. 2 2012 .
 10. 2 2013 . 
 11. 3 2009 .
 12. 3 2010 .
 13. 3 2011 .
 14. 3 2012 .
 15. 3 2013 .
 16. end

. by id (hhmove), sort: egen fixed = max(poolacc)
(5 missing values generated)

. list, noobs sepby(id)

  +-------------------------------+
  | id   hhmove   poolacc   fixed |
  |-------------------------------|
  |  1     2009         .       1 |
  |  1     2010         .       1 |
  |  1     2011         1       1 |
  |  1     2012         .       1 |
  |  1     2013         .       1 |
  |-------------------------------|
  |  2     2009         .       2 |
  |  2     2010         .       2 |
  |  2     2011         2       2 |
  |  2     2012         .       2 |
  |  2     2013         .       2 |
  |-------------------------------|
  |  3     2009         .       . |
  |  3     2010         .       . |
  |  3     2011         .       . |
  |  3     2012         .       . |
  |  3     2013         .       . |
  +-------------------------------+

Comment

Mat Sko

Join Date: Apr 2019

Posts: 52
#5

10 Jun 2019, 09:12

Hi William,

I tried the suggested code above, and it generated a variable 'fixed' with values 1 and 2, but it copied the values of variable 'poolacc' only from rows HHMOVE=2011.
So for the other years of HHMOVE, the variable 'fixed' still shows missing values/dots.

My aim is to fill in the missing values of poolacc/fixed for the whole period before 2011 as well (by repeating the same values to the prior years).

Do I you have an idea how to achieve this?

Many thanks for your help.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

10 Jun 2019, 09:54

Code:

replace poolacc = fixed

seems to be the obvious solution.
Comment
Mat Sko

Join Date: Apr 2019

Posts: 52
#7

15 Jun 2019, 11:29

Hi,

Your answer #4 seems to be the answer to my problem.

However:

say variable POOLACC has 10000 observations in yearvariable HHMOVE=2011; the code generates a variable 'FIXED' with the same 10000 values but again only for HHMOVE=2011.
The values for the other years, i.e. HHMOVE=2005 of variable FIXED remain missing/dots.

I don't understand why the missing values of FIXED are not filled in with values 1 or 2 as your answer #4 shows it does??

FYI: the exact code I am entering in Stata is the following:

Code:

clear all set more off cd "C:\Users\eljattaa\Desktop\MP" use "C:\Users\eljattaa\Desktop\MP\data analyse\FULL merge pers.dta" keep if hhmove>=1999 & hhmove<=2011 keep if poolacc>=0 gen id = _n by id (hhmove), sort: egen fixed = max(poolacc) replace poolacc=fixed

Thanks.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

15 Jun 2019, 12:44

I see now in post #1 that you tell us you generated the ID yourself, and in post #7 we reveal that you assigned a different ID to each observation with

Code:

gen id = _n

That id is of no use for what you are doing.

You did not fully explain what your data is, but it seems to be data for a number of properties(?) over a number of years. I am assuming each property has one observation for each year of data, as my invented data in post #2 has three observations for each of three households, and the households are identified as ID 1, 2, and 3.

Because the identifier you generated is different for every observation, when poolacc is nonmissing the other observations for that property have a different ID and nothing is copied to them.

Your data presumably has some way of telling which observations belong to the same property. Use that for the identifier, not your made-up id.
Comment

Announcement

Replacing missing values by copying non-missing values

Comment

Comment

Comment

Comment

Comment

Comment

Comment