Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling missing values on variables only appearing every four years in the panel

    Hi Statalist.

    I have a couple of variables that only appear in waves 4, 7, 10, 14, 18 (of a panel dataset of 18 waves). As such, these variables contain large numbers of missing values. I used the code (slightly amended from here)
    Code:
    replace dummy = L1.dummy if missing(dummy)
    which fills in the gaps between waves 4 - 18. How can I amend this code to copy the response from wave 4 back to waves 1 to 3?

    Help kindly appreciated.

    Stata v.15.1

  • #2
    Added: To clarify, I want to fill the missings for waves 5 and 6 with the response in 4, waves 8 and 9 with the response in 7, etc, but as the first response only occurs in wave 4, I want to 'back date' the response for wave 4 to waves 1 to 3. Kind regards, Chris

    Comment


    • #3
      The inverse of the lag is the lead. So, you can also specify

      Code:
      replace dummy = F.dummy if missing(dummy)
      Of course, sequence matters here.

      Comment


      • #4
        Andrew Musau's suggestion helps, but not enough. The reason is spelled out in an ancient FAQ on this topic https://www.stata.com/support/faqs/d...issing-values/


        But let's show that first, as there is no data example in this thread.

        Code:
        .  clear
        
        .  set obs 18
        number of observations (_N) was 0, now 18
        
        .  gen id = 42
        
        .  gen wave = _n
        
        .  gen test = 1 if inlist(_n, 4, 7, 10, 14, 18)
        (13 missing values generated)
        
        .  
        .  tsset id wave
               panel variable:  id (strongly balanced)
                time variable:  wave, 1 to 18
                        delta:  1 unit
        
        .  
        .  clonevar tryit = test
        (13 missing values generated)
        
        .  replace tryit = L1.tryit if missing(tryit)
        (10 real changes made)
        
        .  
        .  replace tryit = F1.tryit if missing(tryit)
        (1 real change made)
        
        . l
        
             +--------------------------+
             | id   wave   test   tryit |
             |--------------------------|
          1. | 42      1      .       . |
          2. | 42      2      .       . |
          3. | 42      3      .       1 |
          4. | 42      4      1       1 |
          5. | 42      5      .       1 |
             |--------------------------|
          6. | 42      6      .       1 |
          7. | 42      7      1       1 |
          8. | 42      8      .       1 |
          9. | 42      9      .       1 |
         10. | 42     10      1       1 |
             |--------------------------|
         11. | 42     11      .       1 |
         12. | 42     12      .       1 |
         13. | 42     13      .       1 |
         14. | 42     14      1       1 |
         15. | 42     15      .       1 |
             |--------------------------|
         16. | 42     16      .       1 |
         17. | 42     17      .       1 |
         18. | 42     18      1       1 |
             +--------------------------+
        The reasoning is as follows. .

        1. Stata works separately by panel, as you would hope and expect.

        2. The data example has only one panel, but that's enough to discuss.

        3. Stata starts with wave 1. The value of your variable is missing, but the next value found with F1 is missing too, so no change is made.

        4. Same story with wave 2.

        5. With wave 3, you get what you want.

        So, this method is sufficient if and only if there is just one wave to replace, whenever the 2nd value in each panel is known.

        Swapping L. with F. doesn't make Stata go backwards. It still works using the current order of observations.

        A solution here -- as explained in the FAQ just cited -- is to reverse time temporarily. Alternatively, what you want is just an interpolation for the first run of missing values using say
        ipolate.

        Comment


        • #5
          Thank you Andrew Musau, however, as Nick Cox noted it only worked for wave 3 and not waves 1 and 2.

          Thank you Nick Cox for providing the FAQ - this was helpful. Having tsset my data (tsset id wave) I used the provided code
          Code:
          replace myvar = F.myvar if myvar >=.
          however, as you noted above and in the FAQ, while "L1.myvar" has a cascading effect forward in time, "F1.myvar" does not have the same effect (in reverse), and only fills one period - in my case, it replaced the missing value in wave 3 with the value in wave 4. I thought I would address this issue by reversing time as you suggest in the FAQ using
          Code:
          gsort -wave
          replace myvar = L.myvar if myvar >= .
          to which Stata advised "not sorted r(5)".

          Can you kindly suggest a way forward using the following sample? (Note: there are individuals that entered the survey after wave 1).
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long(id p_id) byte wave int(rel1 rel2)
          11 63 12    .    .
          11 63 13    .    .
          11 63 14 2330 7000
          11 63 15    .    .
          11 63 16    .    .
          11 63 17    .    .
          11 63 18 7000 7000
          14  15  1    .    .
          14  15  2    .    .
          14  15  3    .    .
          14  15  4 2010 2010
          14  15  5    .    .
          14  15  6    .    .
          14  15  7 2010 2010
          14  15  8    .    .
          14  15  9    .    .
          14  15 10 2010 2010
          14  15 11    .    .
          14  15 12    .    .
          14  15 13    .    .
          14  15 14    . 2010
          14  15 15    .    .
          14  15 16    .    .
          18  19  1    .    .
          18  19  2    .    .
          18  19  3    .    .
          18  19  4 7000 1000
          18  19  5    .    .
          18  19  6    .    .
          18  19  7 7000 1000
          18  19  8    .    .
          18  19  9    .    .
          18  19 10 7000 1000
          18  19 11    .    .
          18  19 12    .    .
          18  19 13    .    .
          18  19 14 7000 1000
          18  19 15    .    .
          18  19 16    .    .
          18  19 17    .    .
          23  24  1    .    .
          23  24  2    .    .
          23  24  3    .    .
          23  24  4 2400 2400
          23  24  5    .    .
          23  24  6    .    .
          23  24  7 2400 2400
          23  24  8    .    .
          23  24  9    .    .
          23  24 10 2000 2000
          23  24 11    .    .
          23  24 12    .    .
          23  24 13    .    .
          23  24 14 2400 2400
          23  24 15    .    .
          23  24 16    .    .
          23  24 17    .    .
          23  24 18 2400 2400
          end
          Added: In the FAQ, you use "if myvar >= .", which I adopted above assuming it is the same as "if missing(myvar)". Can you please explain the difference, if any, between "if missing(myvar)" and "if myvar >= ." and "if myvar < ."

          As an aside, if I repeat the first piece of code twice, will it not 'backfill' waves 1 and 2? Kind regards, Chris
          Last edited by Chris Boulis; 13 Nov 2020, 22:09.

          Comment


          • #6
            You cannot use time-series operators with gsort, unless you create a new time variable and change the xtsettings. So either,

            Code:
            xtset id wave
            replace rel1 = l.rel1 if missing(rel1)
            forval i=1/3{
               replace rel1 = f.rel1 if missing(rel1)
            }
            or

            Code:
            bys id (wave): replace rel1 = rel1[_n-1] if missing(rel1)
            gsort id -wave
            by id: replace rel1 = rel1[_n-1] if missing(rel1)
            Last edited by Andrew Musau; 13 Nov 2020, 23:39.

            Comment


            • #7
              The method used for reversing time in #6 is not quite that discussed in the FAQ cited earlier (because it doesn't work, among other reasons). Here's the link again: https://www.stata.com/support/faqs/d...issing-values/

              Changing the sort order isn't compatible with tsset or xtset unless you create a negated version of the time variable and specify that to either command.

              There is no difference between being
              >= . and being 1 (true) with missing() for numeric variables (or numeric scalars or constants). missing() can be applied to string arguments. Sorry, but I don't recall a reason for explaining things one way rather than another. That FAQ probably goes back about 20 years, and some details have been updated, but what wasn't broken wasn't fixed in revisions.

              Comment


              • #8
                Thank you for your reply Andrew Musau. I appreciate the code for the two approaches. As I have -tsset- my data, I focussed on the first approach. Note that I am dealing with data for couples, and have unique identifiers for each (id & p_id). In addition, I need to run this for six variables (rel1 rel2 rep1 rep2 ret1 ret2) - can you show me a way to include them in a single loop? Kind regards, Chris.
                Last edited by Chris Boulis; 14 Nov 2020, 22:05.

                Comment


                • #9
                  Thank you Nick Cox. Section 6 of the FAQ "With tsset" does not appear to provide a solution for backfilling further than one data point, so the answer is likely in your comment
                  Changing the sort order isn't compatible with tsset or xtset unless you create a negated version of the time variable and specify that to either command
                  So I tried this and it seems to have worked (though I still have a few missing variables here and there, but that may be due to some other reason)
                  Code:
                  gen nwave = -wave
                  tsset id wave
                  replace rel1 = L.rel1 if rel1 >= .
                  
                  tsset id nwave
                  replace rel1 = L.rel1 if rel1 >= .
                  tsset id wave
                  If this is what you meant, can you show how I can apply it to six variables (rel1 rel2 rep1 rep2 ret1 ret2) - potentially in a single loop?

                  Comment


                  • #10
                    Code:
                    gen nwave = -wave
                    tsset id wave
                    
                    foreach v in rel1 rel2 rep1 rep2 ret1 ret2
                        replace `v' = L.`v' if `v' >= .
                    }
                    
                    tsset id nwave
                    
                    foreach v in rel1 rel2 rep1 rep2 ret1 ret2
                        replace `v' = L.`v' if `v' >= .
                    }
                    
                    tsset id wave

                    Comment


                    • #11
                      Thank you Nick Cox. I really appreciate your help, guidance and support. All I can say is that it's helping. I did not have a clear picture of what the loop would look like when I posted #9. This morning it became a little clearer how I'd go about it, then I read your post and thought "yes! that's it". Maybe I'll be able to do it on my own next time. Thank you again

                      Comment


                      • #12
                        (duplicated above post)
                        Last edited by Chris Boulis; 15 Nov 2020, 18:47.

                        Comment


                        • #13
                          Hi Family,
                          I need help to clean up my Data.
                          I have panel data (wave1,2 and 3) Data with 3 variables.
                          I realized there is some inconsistency in the responses and missing data the in ethnicity across the 3waves for same person which I think it should not be the same across since a person's ethnicity can't be change regardless of the wave.
                          In effect I want to use wave 1 response in ethnicity to replicate same in response to ethnicity for wave 2 and 3 of the same respondent.

                          I tried using this command below but it did not work out for me.
                          replace ethnicity=ethnicity if ethnicity ==. & wave>1

                          the results I get is
                          (0 real changes made)

                          Below is the dataex

                          input str11 FPrimaryhhid int ethnicity float wave
                          "1010010261" 8 1
                          "1010010261" . 2
                          "1010010261" 16 3
                          "1010020501" 30 1
                          "1010020501" . 2
                          "1010020501" 30 3
                          "1010020521" 9 1
                          "1010020521" . 2
                          "1010020521" 7 3
                          "1010030371" 3 1
                          "1010030371" . 2
                          "1010030371" 3 3
                          "1010040271" 16 1
                          "1010040271" . 2
                          "1010040271" 16 3
                          "1010060111" 16 1
                          "1010060111" . 2
                          "1010060111" 16 3
                          "1010060561" 17 1
                          "1010060561" . 2
                          "1010060561" 17 3
                          end


                          Thank you.

                          Shamsu
                          Last edited by Shamsudini Amidu; 01 Nov 2022, 08:00.

                          Comment


                          • #14
                            A cautious approach is to interpolate or extrapolate if and only if one distinct value is on offer.

                            Code:
                            clear 
                            
                            input str11 FPrimaryhhid int ethnicity float wave
                            "1010010261" 8 1
                            "1010010261" . 2
                            "1010010261" 16 3
                            "1010020501" 30 1
                            "1010020501" . 2
                            "1010020501" 30 3
                            "1010020521" 9 1
                            "1010020521" . 2
                            "1010020521" 7 3
                            "1010030371" 3 1
                            "1010030371" . 2
                            "1010030371" 3 3
                            "1010040271" 16 1
                            "1010040271" . 2
                            "1010040271" 16 3
                            "1010060111" 16 1
                            "1010060111" . 2
                            "1010060111" 16 3
                            "1010060561" 17 1
                            "1010060561" . 2
                            "1010060561" 17 3
                            end
                            
                            
                            egen min = min(eth), by(FPrimaryhhid) 
                            
                            egen max = max(eth), by(FPrimaryhhid) 
                            
                            gen ethnicity2  = max if max == min
                            
                            order FPrimary wave  
                            
                            list, sepby(FPrimaryhhid) 
                            
                            
                                 +-----------------------------------------------------+
                                 | FPrimary~d   wave   ethnic~y   min   max   ethnic~2 |
                                 |-----------------------------------------------------|
                              1. | 1010010261      1          8     8    16          . |
                              2. | 1010010261      2          .     8    16          . |
                              3. | 1010010261      3         16     8    16          . |
                                 |-----------------------------------------------------|
                              4. | 1010020501      1         30    30    30         30 |
                              5. | 1010020501      2          .    30    30         30 |
                              6. | 1010020501      3         30    30    30         30 |
                                 |-----------------------------------------------------|
                              7. | 1010020521      1          9     7     9          . |
                              8. | 1010020521      2          .     7     9          . |
                              9. | 1010020521      3          7     7     9          . |
                                 |-----------------------------------------------------|
                             10. | 1010030371      1          3     3     3          3 |
                             11. | 1010030371      2          .     3     3          3 |
                             12. | 1010030371      3          3     3     3          3 |
                                 |-----------------------------------------------------|
                             13. | 1010040271      1         16    16    16         16 |
                             14. | 1010040271      2          .    16    16         16 |
                             15. | 1010040271      3         16    16    16         16 |
                                 |-----------------------------------------------------|
                             16. | 1010060111      1         16    16    16         16 |
                             17. | 1010060111      2          .    16    16         16 |
                             18. | 1010060111      3         16    16    16         16 |
                                 |-----------------------------------------------------|
                             19. | 1010060561      1         17    17    17         17 |
                             20. | 1010060561      2          .    17    17         17 |
                             21. | 1010060561      3         17    17    17         17 |
                                 +-----------------------------------------------------+

                            Comment


                            • #15
                              Hi Nick Cox,
                              Thanks soo much! it is working.
                              But to prefill response for ethnicity2 whose min is not equal to the maximum?
                              Is it okay to use the formulae below;
                              replace ethnicity2= min if ethnicity2==.

                              to prefill ethnicity2 using the min value for respondent who have missing ethnicity2?

                              Your advice sir.

                              I appreciate your time.

                              Thank you.

                              Comment

                              Working...
                              X