Filling in missing data from multiple variables

Beatrix Kim

Join Date: Jul 2016
Posts: 4

Filling in missing data from multiple variables

16 Jul 2016, 23:17

I have this data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year str3 c1 float(area1 rain1) str3 c2 float(area2 rain2) str3 c3 float(area3 rain3)
1991 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1992 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1993 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1994 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1995 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1996 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1997 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1998 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1999 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
2000 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
2001 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1991 "RUS"     .     . "UKR"     .     . "NA"      .     .
1991 "RUS"     .     . "BLR"     .     . "NA"      .     .
1991 "UKR"     .     . "BLR"     .     . "NA"      .     .
1991 "RUS"     .     . "BLR"     .     . "UKR"     .     .
1992 "RUS"     .     . "UKR"     .     . "NA"      .     .
1992 "RUS"     .     . "BLR"     .     . "NA"      .     .
1992 "UKR"     .     . "BLR"     .     . "NA"      .     .
1992 "RUS"     .     . "BLR"     .     . "UKR"     .     .
1993 "RUS"     .     . "UKR"     .     . "NA"      .     .
1993 "RUS"     .     . "BLR"     .     . "NA"      .     .
1993 "UKR"     .     . "BLR"     .     . "NA"      .     .
1993 "RUS"     .     . "BLR"     .     . "UKR"     .     .
end

This is a river basin with 3 countries. The first 4 observations are the whole basin per year with area and rainfall. The remaining observations are combinations of the countries within the basin. I need to fill in the area and rain data throughout the dataset, by country. As you can see, sometimes the country is in c1, sometimes in c2, sometimes in c3. For example, in observation 13, I need to fill area2 and rain2 for BLR, but that data is located in area3 and rain3 in obs 1-11.
I have tried xfill (installed by typing in: net from http://www.sealedenvelope.com/), but get errors. I have also tried this loop, but it says I am specifying too many variables (I think with the * to get all the country and area variables):

Code:

levelsof c*, local(lm_country)
foreach c in `lm_country' {
    levelsof area* if c* == `country', local(lm_area)
    replace area* = `lm_area' if c == `country'
}

I appreciate any insight - I'm really stumped here.

I have also cross-posted this on Stack Overflow.

Stata - Fill in missing data conditional on data in other variables - Stack Overflow

http://stackoverflow.com

I have data that looks like this: year c1 area1 rain1 c2 area2 rain2 c3 area3 rain3 1991 RUS 17.93 460.1 UKR 57.97 564.7 BLR 24.19 618.91 1992 RUS 17.93 460.1 UKR 57.97 564.7 BLR 24.19 618.91 1...

Tags: data management, loop, missing data

Beatrix Kim

Join Date: Jul 2016

Posts: 4
#2

17 Jul 2016, 00:44

ADDITIONAL attempt added: This further attempt results in filling in the area* variable from only the first country, not the various countries in each c* column:

Code:

foreach v of varlist area* { by year (`v'), sort: assert `v' == `v'[1] | missing(`v') by year (`v'): replace `v' = `v'[1] }

I need the replace `v' statement to be "if area* value matches c* country in that observation", but I have yet to figure out how to write that in code.
Comment
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#3

17 Jul 2016, 03:31

The first thing to do with this shape of data is to reshape the data, then it is easier to achieve your goals, try something like:

Code:

g id=_n reshape long c area rain, i(id)j(j_var) sort c year duplicates drop c year, force //YOU MAY KEEP OR DROP THE DUPLICATES I PREFER TO DROP foreach i of var area rain { sort c `i' bys c (`i'): replace `i'=`i'[_n-1] if mi(`i') } sort c year

Last edited by Oded Mcdossi; 17 Jul 2016, 04:03.
1 like
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

17 Jul 2016, 06:03

Welcome to Statalist!

Experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data, and Oded's solution follows that advice.

With that said, if there are reasons I'm unaware of to prefer the layout you show, the following should do what you need. Thank you for providing your sample data in dataex format; it made writing and testing this code straightforward. Note that I only use the data for 1991-1993 to keep the example small.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year str3 c1 float(area1 rain1) str3 c2 float(area2 rain2) str3 c3 float(area3 rain3)
1991 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1992 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1993 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
1991 "RUS"     .     . "UKR"     .     . "NA"      .     .
1991 "RUS"     .     . "BLR"     .     . "NA"      .     .
1991 "UKR"     .     . "BLR"     .     . "NA"      .     .
1991 "RUS"     .     . "BLR"     .     . "UKR"     .     .
1992 "RUS"     .     . "UKR"     .     . "NA"      .     .
1992 "RUS"     .     . "BLR"     .     . "NA"      .     .
1992 "UKR"     .     . "BLR"     .     . "NA"      .     .
1992 "RUS"     .     . "BLR"     .     . "UKR"     .     .
1993 "RUS"     .     . "UKR"     .     . "NA"      .     .
1993 "RUS"     .     . "BLR"     .     . "NA"      .     .
1993 "UKR"     .     . "BLR"     .     . "NA"      .     .
1993 "RUS"     .     . "BLR"     .     . "UKR"     .     .
end

generate row_id = _n
reshape long c area rain, i(row_id) j(j123)
sort year c row_id
list if c=="RUS" & year==1991, clean
by year c (row_id): replace area = area[1] if _n>1
by year c (row_id): replace rain = rain[1] if _n>1
list if c=="RUS" & year==1991, clean
reshape wide
list, clean

Comment

Beatrix Kim

Join Date: Jul 2016

Posts: 4
#5

17 Jul 2016, 11:20

This is fantastic, thank you!

One question about the j(j123) piece of reshape long. What is the difference between j(j123) and j(j_var) as in Oded's example?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

17 Jul 2016, 13:19

Nothing whatever, other than personal tastes. I wasn't looking at Oded's example when I prepared mine.
Comment
Beatrix Kim

Join Date: Jul 2016

Posts: 4
#7

17 Jul 2016, 22:06

I see, thank you. I tried in both ways and saw no difference, but I wasn't sure if that was a coincidence. I appreciate all your help!
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

18 Jul 2016, 08:16

The "j" option on the reshape long command simply specifies the name of a variable that will be created by reshape long to hold the "suffix" (for lack of a better term) of the variables from which the elements of the row were taken. In your case, this will be either a 1, 2, or 3, corresponding to c1/area1/rain1, c2/area2/rain2, and c3/area3/rain3. The output from my first list command shows that for Russia in 1991, j123=1 because the data were in c1/area1/rain1. Were you to change "RUS" to "BLR" in that list command, j123 would be 3 in the output.

If you expect to use Stata extensively (as opposed to "not so much after I finish the class I'm taking", a perfectly acceptable position) it will be worth your while to review my code while reading the documentation for reshape in the Stata Data-Management Reference Manual PDF. (This PDF is included in the Stata installation (since version 11) and accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.) The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. But much data these days comes in from spreadsheets in wide layout, so reshape becomes an essential part of the professional user's toolkit.
Comment

Announcement

Filling in missing data from multiple variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment