Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling in missing data from multiple variables

    I have this data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year str3 c1 float(area1 rain1) str3 c2 float(area2 rain2) str3 c3 float(area3 rain3)
    1991 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1992 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1993 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1994 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1995 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1996 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1997 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1998 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1999 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    2000 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    2001 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
    1991 "RUS"     .     . "UKR"     .     . "NA"      .     .
    1991 "RUS"     .     . "BLR"     .     . "NA"      .     .
    1991 "UKR"     .     . "BLR"     .     . "NA"      .     .
    1991 "RUS"     .     . "BLR"     .     . "UKR"     .     .
    1992 "RUS"     .     . "UKR"     .     . "NA"      .     .
    1992 "RUS"     .     . "BLR"     .     . "NA"      .     .
    1992 "UKR"     .     . "BLR"     .     . "NA"      .     .
    1992 "RUS"     .     . "BLR"     .     . "UKR"     .     .
    1993 "RUS"     .     . "UKR"     .     . "NA"      .     .
    1993 "RUS"     .     . "BLR"     .     . "NA"      .     .
    1993 "UKR"     .     . "BLR"     .     . "NA"      .     .
    1993 "RUS"     .     . "BLR"     .     . "UKR"     .     .
    end
    This is a river basin with 3 countries. The first 4 observations are the whole basin per year with area and rainfall. The remaining observations are combinations of the countries within the basin. I need to fill in the area and rain data throughout the dataset, by country. As you can see, sometimes the country is in c1, sometimes in c2, sometimes in c3. For example, in observation 13, I need to fill area2 and rain2 for BLR, but that data is located in area3 and rain3 in obs 1-11.
    I have tried xfill (installed by typing in: net from http://www.sealedenvelope.com/), but get errors. I have also tried this loop, but it says I am specifying too many variables (I think with the * to get all the country and area variables):
    Code:
    levelsof c*, local(lm_country)
    foreach c in `lm_country' {
        levelsof area* if c* == `country', local(lm_area)
        replace area* = `lm_area' if c == `country'
    }
    I appreciate any insight - I'm really stumped here.

    I have also cross-posted this on Stack Overflow.
    I have data that looks like this: year c1 area1 rain1 c2 area2 rain2 c3 area3 rain3 1991 RUS 17.93 460.1 UKR 57.97 564.7 BLR 24.19 618.91 1992 RUS 17.93 460.1 UKR 57.97 564.7 BLR 24.19 618.91 1...

  • #2
    ADDITIONAL attempt added: This further attempt results in filling in the area* variable from only the first country, not the various countries in each c* column:

    Code:
    foreach v of varlist area* {
        by year (`v'), sort: assert `v' == `v'[1] | missing(`v')    
        by year (`v'): replace `v' = `v'[1]
    }
    I need the replace `v' statement to be "if area* value matches c* country in that observation", but I have yet to figure out how to write that in code.

    Comment


    • #3
      The first thing to do with this shape of data is to reshape the data, then it is easier to achieve your goals, try something like:
      Code:
      g id=_n
      reshape long c area rain, i(id)j(j_var)
      sort c year
      duplicates drop c year, force //YOU MAY KEEP OR DROP THE DUPLICATES I PREFER TO DROP
      foreach i of var area rain {
          sort c `i'
          bys c (`i'): replace `i'=`i'[_n-1] if mi(`i')
      }    
      sort c year
      Last edited by Oded Mcdossi; 17 Jul 2016, 04:03.

      Comment


      • #4
        Welcome to Statalist!

        Experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data, and Oded's solution follows that advice.

        With that said, if there are reasons I'm unaware of to prefer the layout you show, the following should do what you need. Thank you for providing your sample data in dataex format; it made writing and testing this code straightforward. Note that I only use the data for 1991-1993 to keep the example small.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int year str3 c1 float(area1 rain1) str3 c2 float(area2 rain2) str3 c3 float(area3 rain3)
        1991 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
        1992 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
        1993 "RUS" 17.83 460.1 "UKR" 57.97 564.7 "BLR" 24.19 618.1
        1991 "RUS"     .     . "UKR"     .     . "NA"      .     .
        1991 "RUS"     .     . "BLR"     .     . "NA"      .     .
        1991 "UKR"     .     . "BLR"     .     . "NA"      .     .
        1991 "RUS"     .     . "BLR"     .     . "UKR"     .     .
        1992 "RUS"     .     . "UKR"     .     . "NA"      .     .
        1992 "RUS"     .     . "BLR"     .     . "NA"      .     .
        1992 "UKR"     .     . "BLR"     .     . "NA"      .     .
        1992 "RUS"     .     . "BLR"     .     . "UKR"     .     .
        1993 "RUS"     .     . "UKR"     .     . "NA"      .     .
        1993 "RUS"     .     . "BLR"     .     . "NA"      .     .
        1993 "UKR"     .     . "BLR"     .     . "NA"      .     .
        1993 "RUS"     .     . "BLR"     .     . "UKR"     .     .
        end
        
        generate row_id = _n
        reshape long c area rain, i(row_id) j(j123)
        sort year c row_id
        list if c=="RUS" & year==1991, clean
        by year c (row_id): replace area = area[1] if _n>1
        by year c (row_id): replace rain = rain[1] if _n>1
        list if c=="RUS" & year==1991, clean
        reshape wide
        list, clean

        Comment


        • #5
          This is fantastic, thank you!

          One question about the j(j123) piece of reshape long. What is the difference between j(j123) and j(j_var) as in Oded's example?

          Comment


          • #6
            Nothing whatever, other than personal tastes. I wasn't looking at Oded's example when I prepared mine.

            Comment


            • #7
              I see, thank you. I tried in both ways and saw no difference, but I wasn't sure if that was a coincidence. I appreciate all your help!

              Comment


              • #8
                The "j" option on the reshape long command simply specifies the name of a variable that will be created by reshape long to hold the "suffix" (for lack of a better term) of the variables from which the elements of the row were taken. In your case, this will be either a 1, 2, or 3, corresponding to c1/area1/rain1, c2/area2/rain2, and c3/area3/rain3. The output from my first list command shows that for Russia in 1991, j123=1 because the data were in c1/area1/rain1. Were you to change "RUS" to "BLR" in that list command, j123 would be 3 in the output.

                If you expect to use Stata extensively (as opposed to "not so much after I finish the class I'm taking", a perfectly acceptable position) it will be worth your while to review my code while reading the documentation for reshape in the Stata Data-Management Reference Manual PDF. (This PDF is included in the Stata installation (since version 11) and accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.) The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. But much data these days comes in from spreadsheets in wide layout, so reshape becomes an essential part of the professional user's toolkit.

                Comment

                Working...
                X