Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling missing strings in panel data

    Hello,

    I want to fill missing strings by using the last observation.
    For values I can do it with a command like
    Code:
     replace dummy=dummy[_n-1] if dummy==.
    However, if i want to do it with strings, Stata reports r(109) - type mismatch. Hence my question is how to do the filling with strings in a dataset like this:
    Code:
    * Example generated by -dataex-.    To install: ssc install    dataex
    clear
    input str22 countryname str3 ifs    float q_date_all
    "Dominican Republic" "DOM" 212
    ""                   ""    213
    ""                   ""    214
    ""                   ""    215
    "Dominican Republic" "DOM" 216
    ""                   ""    217
    ""                   ""    218
    ""                   ""    219
    "Ecuador"            "ECU"  60
    ""                   ""     61
    ""                   ""     62
    ""                   ""     63
    "Ecuador"            "ECU"  64
    end
    format %tq q_date_all

    E.g. I want to fill the countryname for the 2-4 line with "Dominican Republic" from line 1, lines 6-8 with line 5, etc. Thanks you for your help in advance!
    Last edited by Manni Hitzel; 05 Dec 2016, 03:53.

  • #2
    All is explained at http://www.stata.com/support/faqs/da...issing-values/ but simply

    Code:
     
     dummy==.
    should be

    Code:
     
     dummy == ""
    or

    Code:
    missing(dummy)

    Comment


    • #3
      Thanks a lot, the " " was my mistake

      Comment


      • #4
        I have a more complicate issue
        I would like to fill in only time independent variables, and obviously leave time-dependent values alone, even if missing.
        the following returns an invalid syntax:

        by(id): replace somevar=somevar[_n-1] if somevar[_n-1]==somevar[_n-2]

        the following workaround..works

        gen somevar2 = somevar[_n-1] if id==id[_n-1]
        replace somevar=var2 if somevar==.
        drop somevar2


        but cannot run for strings. When writing a foreach loop, I therefore have to prefix a capture, run for numeric and repeat for strings
        However, the code cannot discriminate time-fixed from time-varying variables which happen to have missing values
        If I add a test for that:

        capture noisily replace `var'=`var'2 if `var'==. & `var'[_n-1] == `var'2[_n-1]

        the id with 2 observations (which perhaps are the source of the first error?) will have missing time fixed variables in the second observation and still there can be time-fixed variable having consecutively the same values

        Is there a way to do this? carryforward doesn't change the issue i suppose

        Thanks!

        Comment


        • #5
          The first statement fails, I guess, because you are using option-like syntax where it doesn't apply.

          Code:
          by id: replace somevar=somevar[_n-1] if somevar[_n-1]==somevar[_n-2]
          would be legal if data were in the right sort order.

          Code:
          capture noisily replace `var'=`var'2 if `var'==. & `var'[_n-1] == `var'2[_n-1]
          is a little difficult to discuss because you don't define your local macro.

          But your context is clearly panel data management and for that a by: prefix (explicitly or implicitly) is essential unless you want to try to re-invent it the hard way.

          You can check for the number of distinct non-missing values and then use the sole non-missing string when there is one.

          Code:
          egen tag = tag(id somevar) 
          egen ndistinct = total(tag), by(id) 
          
          * if somevar is string, non-missing values are sorted to the end of each block 
          bysort id (somevar) : replace somevar = somevar[_N] if missing(somevar) & ndistinct == 1

          Comment


          • #6
            The syntax error in your first attempt arises because you have enclosed id in parentheses. If you go -by id:...-, the command will run. It may not, however, be what you want. When _n == 1, both somevar[_n-1] and somevar[_n-2] are non-existent, and Stata will treat those as missing values. Consequently they will be equal. The -if- condition therefore will always replace the first value by a missing value. If the second value is also missing, then the second observation will, again, be replaced by a missing value, and, as you can see, we are setting up for a cascade of missing values down the line, which is not what you want.

            It sounds like you want to loop over a bunch of variables, some string and some numeric. You want to fill in missing values with preceding values if the two preceding values agreed (which is not the same as saying that these variables are time invariant), and you want to leave non-missing values as they are. Your data must contain some sort of time variable that orders the data within id. I'll assume that variable is called year.
            Code:
            foreach v of varlist whatever {
                display `"`v'"'
                by id (year): replace `v' = `v'[_n-1] if missing(`v') ///
                    & `v'[_n-1] == `v'[_n-2]
            }
            Note that the use of the -missing()- function works with both string and numeric variables, so a single line of code covers both types.

            Added: crossed with Nick's response, which makes essentially the same points.

            Comment


            • #7
              Thank you, this is incredibly useful as usual.
              The context, which I should have explicated earlier, is merging many datasets some crosssectional and other longitudinal to obtain panel data for survival and/or nested case control/case cohort analysis... Typo aside, the first statement would indeed generate missingness as described.

              The lines

              egen tag = tag(id somevar) egen ndistinct = total(tag), by(id) are quite nifty: although the subsequent bysort id (somevar) : replace somevar = somevar[_N] if missing(somevar) & ndistinct == 1 wouldn't work for me, I can add

              egen avsomevar= mean(ndistinct)


              and then proceeded with


              foreach var of varlist firstvar-lastvar {
              egen tag`var' = tag(id `var')
              egen nm`var' = total(tag`var'), by(id)
              egen av`var'= mean(nm`var')
              drop tag`var' nm`var'
              }
              sum av*

              *time-dependent var should yield av`var' >1 and time-independent av`var'<1, assuming each var is coming from one dataset or recorded at one visit

              foreach var of varlist firstvar-lastvar {
              sort id `var'
              by id: carryforward `var' if missing(`var') & av`var'<=1 , replace
              drop av`var'
              }


              this appears to run correctly, albeit it took ~8 minutes. So in general I was wondering, many people might have the same need, perhaps there is a better way to do this?
              Thanks again.

              Comment


              • #8
                wouldn't work for me
                is not something I can reply to in detail. There is no example dataset or error report which would clarify that claim.

                Coincidentally, I was working on a stripolate program earlier today. It's not been banged on at all really, but a first test script may show what it does. Most of the code is syntax and problem checks.

                By design at present it will only allow interpolation (within groups) with just one distinct non-missing value present. I'm imagining a force option that will allow override of that with forward or backward options.


                Code:
                *! stripolate 1.0.0 NJC 14dec2016
                * mipolate 1.2.0 NJC 2sep2015
                * 1.1.0 NJC 27aug2015
                * 1.0.0 NJC 20jul2015
                * ipolate 1.3.3  21sep2004
                program stripolate, byable(onecall) sort
                    version 10          
                    syntax varlist(min=2 max=2) [if] [in], ///
                    GENerate(string) ///
                    [ BY(varlist)    ///
                    Forward          ///
                    Backward         ///
                    Groupwise ]
                
                    // syntax checks
                    tokenize `varlist'
                    args stry x
                
                    capture confirm string var `stry'
                    if _rc {
                        di as err "{p}stripolate is for interpolation of " ///
                        "string variables only: try mipolate (SSC)?{p_end}"
                        exit _rc
                    }
                
                    capture confirm numeric var `x'
                    if _rc {
                        di as err "{p}stripolate is for interpolation with " ///
                        "respect to a numeric variable{p_end}"
                        exit _rc
                    }
                
                    if _by() {
                        if "`by'" != "" {
                            di as err /*
                            */ "option by() may not be combined with by prefix"
                            exit 190
                        }
                        local by "`_byvars'"
                    }
                
                    local nopts : word count `forward' `backward' `groupwise'
                
                    if `nopts' != 1 {
                        di as err "must specify one interpolation method"
                        exit 198
                    }
                
                    confirm new var `generate'
                
                    quietly {
                        // anything to do?
                        marksample touse, novarlist  
                        replace `touse' = 0 if `x' >= .
                        count if `touse'
                        if r(N) == 0 error 2000
                
                        count if missing(`stry') & `touse'
                        if r(N) == 0 {
                            di as txt "{p}nothing to do; " ///
                            "no missing `stry' and/or missing `x'{p_end}"
                            exit 0
                        }
                
                        // uniqueness check
                        tempvar OK z  
                        bysort `touse' `by' `x' (`stry') : ///
                            gen byte `OK' = missing(`stry') | `stry' == `stry'[_N]
                        count if `touse' & !`OK'
                        if r(N) {
                            di as err "different string values for same `x'"
                            exit 498
                        }
                
                        if "`forward'`backward'" != "" {
                            clonevar `z' = `stry'
                
                            if "`forward'" != "" {
                                bysort `touse' `by' (`x') : ///
                                replace `z' = `z'[_n-1] if `touse' & missing(`z' )
                            }
                            else {
                                tempvar negx
                                gen double `negx' = -`x'
                                bysort `touse' `by' (`negx'): ///
                                replace `z' = `z'[_n-1] if `touse' & missing(`z')
                            }
                        }
                
                        if "`groupwise'" != "" {
                            clonevar `z' = `stry'
                            bysort `touse' `by' (`stry') : ///
                            replace `z' = `z'[_N] if missing(`z') & `touse'  
                        }
                
                        rename `z' `generate'
                        compress `generate'
                        count if missing(`generate')
                    }
                
                    if r(N) > 0 {
                        if r(N) != 1 local pl "s"
                        di as txt "(" r(N) `" missing value`pl' generated)"'
                    }
                end
                Code:
                clear
                set obs 10
                gen id = cond(_n <= 5, 1, 2)
                bysort id: gen time = _n
                gen foo = "A" if inlist(_n, 2, 4)
                replace foo = "B" if inlist(_n, 6, 8)
                list, sepby(id)  
                
                stripolate foo time, by(id) gen(barf) forward
                stripolate foo time, by(id) gen(barb) backward
                stripolate foo time, by(id) gen(barg) groupwise
                
                list, sepby(id)
                
                     +--------------------------------------+
                     | id   time   foo   barf   barb   barg |
                     |--------------------------------------|
                  1. |  1      1                   A      A |
                  2. |  1      2     A      A      A      A |
                  3. |  1      3            A      A      A |
                  4. |  1      4     A      A      A      A |
                  5. |  1      5            A             A |
                     |--------------------------------------|
                  6. |  2      1     B      B      B      B |
                  7. |  2      2            B      B      B |
                  8. |  2      3     B      B      B      B |
                  9. |  2      4            B             B |
                 10. |  2      5            B             B |
                     +--------------------------------------+
                Last edited by Nick Cox; 14 Dec 2016, 12:19.

                Comment


                • #9
                  Thank you very much for addressing this likely common need. I will look at stripolate to the best of my limited resoruces

                  replace somevar = somevar[_N] if missing(somevar) & ndistinct == 1


                  wouldn't work for me simply because some time-dependent variables only have one value for some individuals, and time invariant variable can have one value or be all missing ( there shoudl be not more than one value though...unless there's mistakes). So to try to distinguish time-fixed and time-variant vars I have to look both at the average missingness and variability within groups

                  Thanks again.

                  Comment


                  • #10
                    Thanks for coming back on that but I still don't understand. If there is one just distinct non-missing value, the idea should work.

                    Consider this. For id 1, the one non-missing value is spread to all observations. For id 2, the procedure doesn't go through.

                    Code:
                    clear 
                    set obs 10 
                    gen id = cond(_n <= 5, 1, 2)  
                    gen t = _n 
                    gen foo = "A" in 3 
                    replace foo = "B" in 7 
                    replace foo = "C" in 9 
                    list 
                    
                         +---------------+
                         | id    t   foo |
                         |---------------|
                      1. |  1    1       |
                      2. |  1    2       |
                      3. |  1    3     A |
                      4. |  1    4       |
                      5. |  1    5       |
                         |---------------|
                      6. |  2    6       |
                      7. |  2    7     B |
                      8. |  2    8       |
                      9. |  2    9     C |
                     10. |  2   10       |
                         +---------------+
                    
                    egen tag = tag(foo) 
                    egen ndistinct = total(tag), by(id) 
                    
                    bysort id (foo) : replace foo = foo[_N] if missing(foo) & ndistinct == 1 
                    
                    sort id t 
                    list, sepby(id) 
                    
                         +--------------------------------+
                         | id    t   foo   tag   ndisti~t |
                         |--------------------------------|
                      1. |  1    1     A     0          1 |
                      2. |  1    2     A     0          1 |
                      3. |  1    3     A     1          1 |
                      4. |  1    4     A     0          1 |
                      5. |  1    5     A     0          1 |
                         |--------------------------------|
                      6. |  2    6           0          2 |
                      7. |  2    7     B     1          2 |
                      8. |  2    8           0          2 |
                      9. |  2    9     C     1          2 |
                     10. |  2   10           0          2 |
                         +--------------------------------+

                    Comment


                    • #11
                      Yes, but if you only had id==1 with only one obs filled in, the code cannot discriminate if foo is time varying, or time -invariant, so it would fill in either case. Thus I had to look at the totality of the observations, assuming (hoping) that
                      av`var'>1 for time varying
                      av`var'<=1 for time invariant
                      then obviously I went to check the codebooks, and out of several hundred vars, it looks like in fact only two time varying vars with extreme missingness have av`var'<=1
                      It would be expedient if this and other important properties could be coded in the name or otherwise intrinsically associated to a var.
                      Again, I thank you very much. No matter how many good manuals I follow, I seem to be always behind the curve.
                      I have the feeling many of us who have to be our own statisticians can exhaust so much effort in manipulating data before even getting to the analyses.

                      Comment


                      • #12
                        Sorry, but I can't push this any further, as I see no code examples with data here that could possibly tell me where my code is wrong or incomplete or doesn't do what you want.

                        Word pictures really don't help much on the whole.

                        Incidentally, I am not a statistician. I have no formal qualifications whatsoever in statistics. I agree that data management is crucial. I spend a lot of my time on it.

                        Comment


                        • #13
                          The stripolate program is now posted on SSC as part of the mipolate package. See http://www.statalist.org/forums/foru...-interpolation

                          Comment


                          • #14
                            You can also try fillmissing (from SSC) for filling missing values by groups with last, first, any, previous, next, mean, median, or other options. Details are provided in the help file and here https://fintechprofessor.com/2019/12...lues-in-stata/
                            Code:
                            ssc install fillmissing
                            help fillmissing
                            Regards
                            --------------------------------------------------
                            Attaullah Shah, PhD.
                            Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
                            FinTechProfessor.com
                            https://asdocx.com
                            Check out my asdoc program, which sends outputs to MS Word.
                            For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

                            Comment

                            Working...
                            X