Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing missing values by copying non-missing values

    Hi all,

    I have the following situation:

    The variables I have selected for a regression contain values for the years 1999-2011, except for the binary variable "poolacc"(= does the property has a swimming pool?) for which I have only values for the year 2011. I believe that I have to copy the same values of 2011 to the other years before (1999-2010).

    How can I fill in the dots for years 1999-2010 of this variable 'poolacc' with the values of this particular year 2011 in an efficient way?

    Here is some data with 3 variables; the id (I generated myself) ; hhmove=year moved into house; poolacc=presence of a pool (1= Yes, 2=No).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id int hhmove byte poolacc
      1 2001 .
      2 1999 .
      3 1999 .
      4 1999 .
      5 1999 .
      6 2000 .
      7 1999 .
      8 2000 .
      9 2000 .
     10 1999 .
     11 2000 .
     12 2001 .
     13 2000 .
     14 1999 .
     15 2001 .
     16 2001 .
     17 1999 .
     18 1999 .
     19 1999 .
     20 1999 .
     21 1999 .
     22 2000 .
     23 2001 .
     24 2000 .
     25 2000 .
     26 1999 .
     27 2000 .
     28 1999 .
     29 1999 .
     30 1999 .
     31 1999 .
     32 2000 .
     33 2001 .
     34 1999 .
     35 2000 .
     36 2000 .
     37 2000 .
     38 2001 .
     39 2000 .
     40 1999 .
     41 2001 .
     42 1999 .
     43 2001 .
     44 2001 .
     45 2000 .
     46 2001 .
     47 2001 .
     48 2001 .
     49 2000 .
     50 1999 .
     51 1999 .
     52 1999 .
     53 2000 .
     54 1999 .
     55 1999 .
     56 2001 .
     57 1999 .
     58 2001 .
     59 1999 .
     60 1999 .
     61 2000 .
     62 1999 .
     63 1999 .
     64 2001 .
     65 1999 .
     66 2000 .
     67 1999 .
     68 2000 .
     69 1999 .
     70 1999 .
     71 2001 .
     72 2000 .
     73 2000 .
     74 2000 .
     75 2001 .
     76 1999 .
     77 2001 .
     78 2000 .
     79 1999 .
     80 1999 .
     81 2000 .
     82 1999 .
     83 2001 .
     84 1999 .
     85 2001 .
     86 2001 .
     87 2000 .
     88 1999 .
     89 2000 .
     90 2000 .
     91 2000 .
     92 2000 .
     93 2000 .
     94 2000 .
     95 2001 .
     96 2000 .
     97 2001 .
     98 2000 .
     99 1999 .
    100 2000 .
    end
    I tried the following code, but the data remained unchanged...:

    Code:
    bysort id (hhmove): replace poolacc = poolacc[_n -1] if poolacc ==. & hhmove < 2011
    and

    Code:
    by id (hhmove), sort: replace poolacc= poolacc[_n-1] if poolacc>= . & hhmove<2011
    (retrieved from the Stata website).

    Does anyone know which code I need to put in to solve this issue?


    Thanks in advance!

  • #2
    It is often awkward to fill values backward in time from later observations. Your sample code is filling in values in the wrong direction - copying from earlier to later. But you can't easily copy from later to earlier, because Stata goes through your data in the order from earlier to later.

    But your data has a very useful characteristic: the observations end in 2011, and that is the year that you want to copy values from. So this example points the way
    Code:
    . * Example generated by -dataex-. To install: ssc install dataex
    . clear
    
    . input float id int hhmove byte poolacc
    
                id    hhmove   poolacc
      1. 1 2009 .
      2. 1 2010 .
      3. 1 2011 0
      4. 2 2009 .
      5. 2 2010 .
      6. 2 2011 1
      7. 3 2009 .
      8. 3 2010 .
      9. 3 2011 .
     10. end
    
    . by id (hhmove), sort: replace poolacc = poolacc[_N]
    (4 real changes made)
    
    . list, noobs sepby(id)
    
      +-----------------------+
      | id   hhmove   poolacc |
      |-----------------------|
      |  1     2009         0 |
      |  1     2010         0 |
      |  1     2011         0 |
      |-----------------------|
      |  2     2009         1 |
      |  2     2010         1 |
      |  2     2011         1 |
      |-----------------------|
      |  3     2009         . |
      |  3     2010         . |
      |  3     2011         . |
      +-----------------------+
    will do what you set out to do.

    With that said, I have a concern. If a property had no pool in 2011, it seems reasonable to assume there was not a pool in the earlier years that was removed. But if a property had a pool in 2011, it may have been built sometime between 1999 and 2011, so it is not clear that copying the value from 2011 is a good idea.

    Realistically, you have a single variable "property had a pool in 2011" and that is how I would interpret the filled-in variable. I don't know how you intend to use this variable in your analysis, but that's really all you can correctly infer from it. .

    Comment


    • #3
      Hi William,

      Thanks for your response.

      I tried it for the full data, but '0 real changes made' appears on screen after entering the suggested code.

      FYI: 2013 is actually the last year in my data, but I will focus on the period between 1999-2011 by entering
      Code:
      keep if hhmove>=1999 & hhmove<=2011
      just before.

      Furthermore, the binary variable 'poolacc' is either 1= if pool is present or 2= if not present in my data instead of the standard 1/0. Do I have to switch these values before using the replace code (and how)?

      (Regarding the assumption, because the same houses are interviewed every 2 years, it was okay for my mentor to assume the house characteristics were the same prior to 2011)

      Comment


      • #4
        Code:
        . * Example generated by -dataex-. To install: ssc install dataex
        . clear
        
        . input float id int hhmove byte poolacc
        
                    id    hhmove   poolacc
          1. 1 2009 .
          2. 1 2010 .
          3. 1 2011 1
          4. 1 2012 .
          5. 1 2013 .
          6. 2 2009 .
          7. 2 2010 .
          8. 2 2011 2
          9. 2 2012 .
         10. 2 2013 . 
         11. 3 2009 .
         12. 3 2010 .
         13. 3 2011 .
         14. 3 2012 .
         15. 3 2013 .
         16. end
        
        . by id (hhmove), sort: egen fixed = max(poolacc)
        (5 missing values generated)
        
        . list, noobs sepby(id)
        
          +-------------------------------+
          | id   hhmove   poolacc   fixed |
          |-------------------------------|
          |  1     2009         .       1 |
          |  1     2010         .       1 |
          |  1     2011         1       1 |
          |  1     2012         .       1 |
          |  1     2013         .       1 |
          |-------------------------------|
          |  2     2009         .       2 |
          |  2     2010         .       2 |
          |  2     2011         2       2 |
          |  2     2012         .       2 |
          |  2     2013         .       2 |
          |-------------------------------|
          |  3     2009         .       . |
          |  3     2010         .       . |
          |  3     2011         .       . |
          |  3     2012         .       . |
          |  3     2013         .       . |
          +-------------------------------+

        Comment


        • #5
          Hi William,

          I tried the suggested code above, and it generated a variable 'fixed' with values 1 and 2, but it copied the values of variable 'poolacc' only from rows HHMOVE=2011.
          So for the other years of HHMOVE, the variable 'fixed' still shows missing values/dots.

          My aim is to fill in the missing values of poolacc/fixed for the whole period before 2011 as well (by repeating the same values to the prior years).

          Do I you have an idea how to achieve this?

          Many thanks for your help.

          Comment


          • #6
            Code:
            replace poolacc = fixed
            seems to be the obvious solution.

            Comment


            • #7
              Hi,

              Your answer #4 seems to be the answer to my problem.

              However:

              say variable POOLACC has 10000 observations in yearvariable HHMOVE=2011; the code generates a variable 'FIXED' with the same 10000 values but again only for HHMOVE=2011.
              The values for the other years, i.e. HHMOVE=2005 of variable FIXED remain missing/dots.

              I don't understand why the missing values of FIXED are not filled in with values 1 or 2 as your answer #4 shows it does??

              FYI: the exact code I am entering in Stata is the following:

              Code:
              clear all
              set more off
              
              cd "C:\Users\eljattaa\Desktop\MP"
              use "C:\Users\eljattaa\Desktop\MP\data analyse\FULL merge pers.dta"
              
              keep if hhmove>=1999 & hhmove<=2011
              keep if poolacc>=0
              
              gen id = _n
              by id (hhmove), sort: egen fixed = max(poolacc)
              replace poolacc=fixed
              Thanks.

              Comment


              • #8
                I see now in post #1 that you tell us you generated the ID yourself, and in post #7 we reveal that you assigned a different ID to each observation with
                Code:
                gen id = _n
                That id is of no use for what you are doing.

                You did not fully explain what your data is, but it seems to be data for a number of properties(?) over a number of years. I am assuming each property has one observation for each year of data, as my invented data in post #2 has three observations for each of three households, and the households are identified as ID 1, 2, and 3.

                Because the identifier you generated is different for every observation, when poolacc is nonmissing the other observations for that property have a different ID and nothing is copied to them.

                Your data presumably has some way of telling which observations belong to the same property. Use that for the identifier, not your made-up id.

                Comment

                Working...
                X