Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • filling in missing values for panel data

    Hi there,

    I have panel data in which the education variable has missing values. I must use the surrounding non-missing values to fill in the missing ones.

    The education variable is categorical with values 1 to 3, with a higher value indicating better education (1 = low education, 2 = moderate education, 3 = high education).
    Hence, when looking at education over time for each individual, the value at time t+1 must be equal to or higher than the value at time t.




    Here is example data. The variable educ is my current data, and I need to obtain the variable neweduc.
    In this dataset only 1 or 2 consecutive observations are missing, but in my actual dataset this differs widely, with about 20 non-missing observations in a row at max. Hence, I don't think using subscripts [_n+-t] is useful. Could anyone help me solve this? Thank you in advance (:

    Code:
    clear all
    input id time educ neweduc
    1 1 1 1
    1 2 2 2
    1 3 . 2
    1 4 2 2
    1 5 . .
    2 1 2 2
    2 2 . 2
    2 3 . 2
    2 4 2 2
    2 5 3 3
    3 1 . 1
    3 2 . 1
    3 3 1 1
    3 4 . 1
    3 5 1 1
    4 1 2 2
    4 2 3 3
    4 3 . 3
    4 4 . 3
    4 5 3 3
    end
    
    
    list, sepby(id)
    
         +----------------------------+
         | id   time   educ   neweduc |
         |----------------------------|
      1. |  1      1      1         1 |
      2. |  1      2      2         2 |
      3. |  1      3      .         2 |
      4. |  1      4      2         2 |
      5. |  1      5      .         . |
         |----------------------------|
      6. |  2      1      2         2 |
      7. |  2      2      .         2 |
      8. |  2      3      .         2 |
      9. |  2      4      2         2 |
     10. |  2      5      3         3 |
         |----------------------------|
     11. |  3      1      .         1 |
     12. |  3      2      .         1 |
     13. |  3      3      1         1 |
     14. |  3      4      .         1 |
     15. |  3      5      1         1 |
         |----------------------------|
     16. |  4      1      2         2 |
     17. |  4      2      3         3 |
     18. |  4      3      .         3 |
     19. |  4      4      .         3 |
     20. |  4      5      3         3 |
         +----------------------------+

    Thus, per individual the following needs to happen.

    - All observations before the last time educ = 1 need to be educ = 1 as well (can be seen for individual 3: the last observation with educ = 1 is at time = 5, so the observations at time=1,..,4 need to be educ = 1)

    - All observations between the first and last observation with educ = 2 need to be educ = 2 as well (can be seen for individual 1 and 2: all observations between the first time educ = 2 and the last time educ = 2 should be educ = 2 as well)

    - All observations after the first time educ = 3 need to be educ = 3 as well (can be seen for individual 4: at time = 2 is the first observation of educ = 3, which means that education must be educ = 3 for time = 2,...5)




  • #2
    Filling upwards works in your example

    Code:
    gen wanted= educ
    gsort id -time
    by id: replace wanted= wanted[_n-1] if missing(wanted)
    But I presume that you are looking for a combination of up and down. It is not clear whether you want the earlier value of the later value to replace the missing values.

    Code:
    gsort id -time
    by id: gen tofill=sum(!missing(educ))>=1
    gen wanted= educ
    bys id (time): replace wanted= wanted[_n-1] if missing(wanted) & tofill
    gsort id -time
    by id: replace wanted= wanted[_n-1] if missing(wanted)
    Res.:

    Code:
    . l, sepby(id)
    
         +----------------------------------------------+
         | id   time   educ   neweduc   tofill   wanted |
         |----------------------------------------------|
      1. |  1      5      .         .        0        . |
      2. |  1      4      2         2        1        2 |
      3. |  1      3      .         2        1        2 |
      4. |  1      2      2         2        1        2 |
      5. |  1      1      1         1        1        1 |
         |----------------------------------------------|
      6. |  2      5      3         3        1        3 |
      7. |  2      4      2         2        1        2 |
      8. |  2      3      .         2        1        2 |
      9. |  2      2      .         2        1        2 |
     10. |  2      1      2         2        1        2 |
         |----------------------------------------------|
     11. |  3      5      1         1        1        1 |
     12. |  3      4      .         1        1        1 |
     13. |  3      3      1         1        1        1 |
     14. |  3      2      .         1        1        1 |
     15. |  3      1      .         1        1        1 |
         |----------------------------------------------|
     16. |  4      5      3         3        1        3 |
     17. |  4      4      .         3        1        3 |
     18. |  4      3      .         3        1        3 |
     19. |  4      2      3         3        1        3 |
     20. |  4      1      2         2        1        2 |
         +----------------------------------------------+
    
    .
    Last edited by Andrew Musau; 17 May 2021, 02:32.

    Comment


    • #3
      Andrew Musau

      Oh wow, this definitely helped a ton. I finally solved it (: Thank you, have a blessed day.


      For anyone looking at this with a similar issue, this linked gave some nice insights as well

      https://www.ls3.soziologie.uni-muenc...tacommands.pdf

      Or type in -useful stata analysis for longitudinal data- and you can find it as well.

      Comment

      Working...
      X