Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling missing values with mean

    Hello everyone,

    I am struggling with filling missing values with mean values, because I have to separate several cases and fill missing values when it is appropriate.

    I want to fill missing values (.) if those missing spaces are both blocked with numbers except 0 and -99. Mean value would be the average of those two values, the front one and the back.

    If there is either 0 or -99 in either front or back, then missing spaces should be left as it is.

    Below is my data form.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte pid int(salary2000 salary2001) byte salary2002 int(salary2003 salary2004 salary2005 salary2006)
    1 100   . .   . 250 250 280
    2   .   . . 300 450 450 600
    3   0   0 0  80  70  80  85
    4 150 170 .   .  50 -99 -99
    5 -99 -99 0   0 100 180   .
    end

    The form I want should look like below!


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte pid int(salary2000 salary2001 salary2002 salary2003 salary2004 salary2005 salary2006)
    1 100  75  75  75 250 250 280
    2   .   .   . 300 450 450 600
    3   0   0   0  80  70  80  85
    4 150 170 110 110  50 -99 -99
    5 -99 -99   0   0 100 180   .
    end

    I think it is quite complicated, and I really need help!

    Thank you very much.

    Halim.

  • #2
    I can't follow this. The imputed values are just 75 for identifier 1 and 110 and for identifier 4. 110 is the mean of 50 and 170, but 75 isn't the mean of 100 and 250.

    Generally in Stata, this kind of data should be put in long layout, not wide, as below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte pid int(salary2000 salary2001) byte salary2002 int(salary2003 salary2004 salary2005 salary2006)
    1 100   . .   . 250 250 280
    2   .   . . 300 450 450 600
    3   0   0 0  80  70  80  85
    4 150 170 .   .  50 -99 -99
    5 -99 -99 0   0 100 180   .
    end
    
    reshape long salary, i(pid) j(year)
    
    list, sepby(pid)
    
         +---------------------+
         | pid   year   salary |
         |---------------------|
      1. |   1   2000      100 |
      2. |   1   2001        . |
      3. |   1   2002        . |
      4. |   1   2003        . |
      5. |   1   2004      250 |
      6. |   1   2005      250 |
      7. |   1   2006      280 |
         |---------------------|
      8. |   2   2000        . |
      9. |   2   2001        . |
     10. |   2   2002        . |
     11. |   2   2003      300 |
     12. |   2   2004      450 |
     13. |   2   2005      450 |
     14. |   2   2006      600 |
         |---------------------|
     15. |   3   2000        0 |
     16. |   3   2001        0 |
     17. |   3   2002        0 |
     18. |   3   2003       80 |
     19. |   3   2004       70 |
     20. |   3   2005       80 |
     21. |   3   2006       85 |
         |---------------------|
     22. |   4   2000      150 |
     23. |   4   2001      170 |
     24. |   4   2002        . |
     25. |   4   2003        . |
     26. |   4   2004       50 |
     27. |   4   2005      -99 |
     28. |   4   2006      -99 |
         |---------------------|
     29. |   5   2000      -99 |
     30. |   5   2001      -99 |
     31. |   5   2002        0 |
     32. |   5   2003        0 |
     33. |   5   2004      100 |
     34. |   5   2005      180 |
     35. |   5   2006        . |
         +---------------------+
    
    .

    Comment


    • #3
      Sorry Nick,

      I put the wrong number.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte pid int(salary2000 salary2001 salary2002 salary2003 salary2004 salary2005 salary2006)
      1 100 175 175 175 250 250 280
      2   .   .   . 300 450 450 600
      3   0   0   0  80  70  80  85
      4 150 170 110 110  50 -99 -99
      5 -99 -99   0   0 100 180   .
      end

      Also, I made my data wide because every line should show each individual's salary at each year.

      Thank you!

      Comment


      • #4
        This is what I think you're asking for. Good luck doing anything non-trivial with that data layout. I still recommend against it.


        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte pid int(salary2000 salary2001) byte salary2002 int(salary2003 salary2004 salary2005 salary2006)
        1 100   . .   . 250 250 280
        2   .   . . 300 450 450 600
        3   0   0 0  80  70  80  85
        4 150 170 .   .  50 -99 -99
        5 -99 -99 0   0 100 180   .
        end
        
        reshape long salary, i(pid) j(year)
        
        gen previous = salary if !inlist(salary, 0, -99)
        gen next = previous 
        
        bysort pid (year) : replace previous = previous[_n-1] if missing(previous) 
        
        gen negyear = -year 
        bysort pid (negyear) : replace next = next[_n-1] if missing(next) 
        sort pid year 
        drop negyear 
        
        gen wanted = cond(salary < ., salary, (previous + next)/2)
        list, sepby(pid)
        
            +------------------------------------------------+
             | pid   year   salary   previous   next   wanted |
             |------------------------------------------------|
          1. |   1   2000      100        100    100      100 |
          2. |   1   2001        .        100    250      175 |
          3. |   1   2002        .        100    250      175 |
          4. |   1   2003        .        100    250      175 |
          5. |   1   2004      250        250    250      250 |
          6. |   1   2005      250        250    250      250 |
          7. |   1   2006      280        280    280      280 |
             |------------------------------------------------|
          8. |   2   2000        .          .    300        . |
          9. |   2   2001        .          .    300        . |
         10. |   2   2002        .          .    300        . |
         11. |   2   2003      300        300    300      300 |
         12. |   2   2004      450        450    450      450 |
         13. |   2   2005      450        450    450      450 |
         14. |   2   2006      600        600    600      600 |
             |------------------------------------------------|
         15. |   3   2000        0          .     80        0 |
         16. |   3   2001        0          .     80        0 |
         17. |   3   2002        0          .     80        0 |
         18. |   3   2003       80         80     80       80 |
         19. |   3   2004       70         70     70       70 |
         20. |   3   2005       80         80     80       80 |
         21. |   3   2006       85         85     85       85 |
             |------------------------------------------------|
         22. |   4   2000      150        150    150      150 |
         23. |   4   2001      170        170    170      170 |
         24. |   4   2002        .        170     50      110 |
         25. |   4   2003        .        170     50      110 |
         26. |   4   2004       50         50     50       50 |
         27. |   4   2005      -99         50      .      -99 |
         28. |   4   2006      -99         50      .      -99 |
             |------------------------------------------------|
         29. |   5   2000      -99          .    100      -99 |
         30. |   5   2001      -99          .    100      -99 |
         31. |   5   2002        0          .    100        0 |
         32. |   5   2003        0          .    100        0 |
         33. |   5   2004      100        100    100      100 |
         34. |   5   2005      180        180    180      180 |
         35. |   5   2006        .        180      .        . |
             +------------------------------------------------+

        Comment

        Working...
        X