Filling missing strings in panel data

Manni Hitzel

Join Date: Dec 2016

Posts: 7
#1

Filling missing strings in panel data

05 Dec 2016, 03:51

Hello,

I want to fill missing strings by using the last observation.
For values I can do it with a command like

Code:

replace dummy=dummy[_n-1] if dummy==.

However, if i want to do it with strings, Stata reports r(109) - type mismatch. Hence my question is how to do the filling with strings in a dataset like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str22 countryname str3 ifs float q_date_all "Dominican Republic" "DOM" 212 "" "" 213 "" "" 214 "" "" 215 "Dominican Republic" "DOM" 216 "" "" 217 "" "" 218 "" "" 219 "Ecuador" "ECU" 60 "" "" 61 "" "" 62 "" "" 63 "Ecuador" "ECU" 64 end format %tq q_date_all

E.g. I want to fill the countryname for the 2-4 line with "Dominican Republic" from line 1, lines 6-8 with line 5, etc. Thanks you for your help in advance!

Last edited by Manni Hitzel; 05 Dec 2016, 03:53.
Tags: missing, panel data
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

05 Dec 2016, 04:23

All is explained at http://www.stata.com/support/faqs/da...issing-values/ but simply

Code:

dummy==.

should be

Code:

dummy == ""

or

Code:

missing(dummy)
Comment
Manni Hitzel

Join Date: Dec 2016

Posts: 7
#3

05 Dec 2016, 05:26

Thanks a lot, the " " was my mistake
Comment
Nazzarena

Join Date: Aug 2014

Posts: 60
#4

12 Dec 2016, 12:24

I have a more complicate issue
I would like to fill in only time independent variables, and obviously leave time-dependent values alone, even if missing.
the following returns an invalid syntax:

by(id): replace somevar=somevar[_n-1] if somevar[_n-1]==somevar[_n-2]

the following workaround..works

gen somevar2 = somevar[_n-1] if id==id[_n-1]
replace somevar=var2 if somevar==.
drop somevar2

but cannot run for strings. When writing a foreach loop, I therefore have to prefix a capture, run for numeric and repeat for strings
However, the code cannot discriminate time-fixed from time-varying variables which happen to have missing values
If I add a test for that:

capture noisily replace `var'=`var'2 if `var'==. & `var'[_n-1] == `var'2[_n-1]

the id with 2 observations (which perhaps are the source of the first error?) will have missing time fixed variables in the second observation and still there can be time-fixed variable having consecutively the same values

Is there a way to do this? carryforward doesn't change the issue i suppose

Thanks!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

12 Dec 2016, 12:44

The first statement fails, I guess, because you are using option-like syntax where it doesn't apply.

Code:

by id: replace somevar=somevar[_n-1] if somevar[_n-1]==somevar[_n-2]

would be legal if data were in the right sort order.

Code:

capture noisily replace `var'=`var'2 if `var'==. & `var'[_n-1] == `var'2[_n-1]

is a little difficult to discuss because you don't define your local macro.

But your context is clearly panel data management and for that a by: prefix (explicitly or implicitly) is essential unless you want to try to re-invent it the hard way.

You can check for the number of distinct non-missing values and then use the sole non-missing string when there is one.

Code:

egen tag = tag(id somevar) egen ndistinct = total(tag), by(id) * if somevar is string, non-missing values are sorted to the end of each block bysort id (somevar) : replace somevar = somevar[_N] if missing(somevar) & ndistinct == 1
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

12 Dec 2016, 13:03

The syntax error in your first attempt arises because you have enclosed id in parentheses. If you go -by id:...-, the command will run. It may not, however, be what you want. When _n == 1, both somevar[_n-1] and somevar[_n-2] are non-existent, and Stata will treat those as missing values. Consequently they will be equal. The -if- condition therefore will always replace the first value by a missing value. If the second value is also missing, then the second observation will, again, be replaced by a missing value, and, as you can see, we are setting up for a cascade of missing values down the line, which is not what you want.

It sounds like you want to loop over a bunch of variables, some string and some numeric. You want to fill in missing values with preceding values if the two preceding values agreed (which is not the same as saying that these variables are time invariant), and you want to leave non-missing values as they are. Your data must contain some sort of time variable that orders the data within id. I'll assume that variable is called year.

Code:

foreach v of varlist whatever { display `"`v'"' by id (year): replace `v' = `v'[_n-1] if missing(`v') /// & `v'[_n-1] == `v'[_n-2] }

Note that the use of the -missing()- function works with both string and numeric variables, so a single line of code covers both types.

Added: crossed with Nick's response, which makes essentially the same points.
Comment
Nazzarena

Join Date: Aug 2014

Posts: 60
#7

14 Dec 2016, 12:01

Thank you, this is incredibly useful as usual.
The context, which I should have explicated earlier, is merging many datasets some crosssectional and other longitudinal to obtain panel data for survival and/or nested case control/case cohort analysis... Typo aside, the first statement would indeed generate missingness as described.

The lines

egen tag = tag(id somevar) egen ndistinct = total(tag), by(id) are quite nifty: although the subsequent bysort id (somevar) : replace somevar = somevar[_N] if missing(somevar) & ndistinct == 1 wouldn't work for me, I can add

egen avsomevar= mean(ndistinct)

and then proceeded with

foreach var of varlist firstvar-lastvar {
egen tag`var' = tag(id `var')
egen nm`var' = total(tag`var'), by(id)
egen av`var'= mean(nm`var')
drop tag`var' nm`var'
}
sum av*

*time-dependent var should yield av`var' >1 and time-independent av`var'<1, assuming each var is coming from one dataset or recorded at one visit

foreach var of varlist firstvar-lastvar {
sort id `var'
by id: carryforward `var' if missing(`var') & av`var'<=1 , replace
drop av`var'
}

this appears to run correctly, albeit it took ~8 minutes. So in general I was wondering, many people might have the same need, perhaps there is a better way to do this?
Thanks again.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

14 Dec 2016, 12:08

wouldn't work for me

is not something I can reply to in detail. There is no example dataset or error report which would clarify that claim.

Coincidentally, I was working on a stripolate program earlier today. It's not been banged on at all really, but a first test script may show what it does. Most of the code is syntax and problem checks.

By design at present it will only allow interpolation (within groups) with just one distinct non-missing value present. I'm imagining a force option that will allow override of that with forward or backward options.

Code:

*! stripolate 1.0.0 NJC 14dec2016
* mipolate 1.2.0 NJC 2sep2015
* 1.1.0 NJC 27aug2015
* 1.0.0 NJC 20jul2015
* ipolate 1.3.3  21sep2004
program stripolate, byable(onecall) sort
    version 10          
    syntax varlist(min=2 max=2) [if] [in], ///
    GENerate(string) ///
    [ BY(varlist)    ///
    Forward          ///
    Backward         ///
    Groupwise ]

    // syntax checks
    tokenize `varlist'
    args stry x

    capture confirm string var `stry'
    if _rc {
        di as err "{p}stripolate is for interpolation of " ///
        "string variables only: try mipolate (SSC)?{p_end}"
        exit _rc
    }

    capture confirm numeric var `x'
    if _rc {
        di as err "{p}stripolate is for interpolation with " ///
        "respect to a numeric variable{p_end}"
        exit _rc
    }

    if _by() {
        if "`by'" != "" {
            di as err /*
            */ "option by() may not be combined with by prefix"
            exit 190
        }
        local by "`_byvars'"
    }

    local nopts : word count `forward' `backward' `groupwise'

    if `nopts' != 1 {
        di as err "must specify one interpolation method"
        exit 198
    }

    confirm new var `generate'

    quietly {
        // anything to do?
        marksample touse, novarlist  
        replace `touse' = 0 if `x' >= .
        count if `touse'
        if r(N) == 0 error 2000

        count if missing(`stry') & `touse'
        if r(N) == 0 {
            di as txt "{p}nothing to do; " ///
            "no missing `stry' and/or missing `x'{p_end}"
            exit 0
        }

        // uniqueness check
        tempvar OK z  
        bysort `touse' `by' `x' (`stry') : ///
            gen byte `OK' = missing(`stry') | `stry' == `stry'[_N]
        count if `touse' & !`OK'
        if r(N) {
            di as err "different string values for same `x'"
            exit 498
        }

        if "`forward'`backward'" != "" {
            clonevar `z' = `stry'

            if "`forward'" != "" {
                bysort `touse' `by' (`x') : ///
                replace `z' = `z'[_n-1] if `touse' & missing(`z' )
            }
            else {
                tempvar negx
                gen double `negx' = -`x'
                bysort `touse' `by' (`negx'): ///
                replace `z' = `z'[_n-1] if `touse' & missing(`z')
            }
        }

        if "`groupwise'" != "" {
            clonevar `z' = `stry'
            bysort `touse' `by' (`stry') : ///
            replace `z' = `z'[_N] if missing(`z') & `touse'  
        }

        rename `z' `generate'
        compress `generate'
        count if missing(`generate')
    }

    if r(N) > 0 {
        if r(N) != 1 local pl "s"
        di as txt "(" r(N) `" missing value`pl' generated)"'
    }
end

Code:

clear
set obs 10
gen id = cond(_n <= 5, 1, 2)
bysort id: gen time = _n
gen foo = "A" if inlist(_n, 2, 4)
replace foo = "B" if inlist(_n, 6, 8)
list, sepby(id)  

stripolate foo time, by(id) gen(barf) forward
stripolate foo time, by(id) gen(barb) backward
stripolate foo time, by(id) gen(barg) groupwise

list, sepby(id)

     +--------------------------------------+
     | id   time   foo   barf   barb   barg |
     |--------------------------------------|
  1. |  1      1                   A      A |
  2. |  1      2     A      A      A      A |
  3. |  1      3            A      A      A |
  4. |  1      4     A      A      A      A |
  5. |  1      5            A             A |
     |--------------------------------------|
  6. |  2      1     B      B      B      B |
  7. |  2      2            B      B      B |
  8. |  2      3     B      B      B      B |
  9. |  2      4            B             B |
 10. |  2      5            B             B |
     +--------------------------------------+

Last edited by Nick Cox; 14 Dec 2016, 12:19.

Comment

Nazzarena

Join Date: Aug 2014

Posts: 60
#9

14 Dec 2016, 13:46

Thank you very much for addressing this likely common need. I will look at stripolate to the best of my limited resoruces

replace somevar = somevar[_N] if missing(somevar) & ndistinct == 1

wouldn't work for me simply because some time-dependent variables only have one value for some individuals, and time invariant variable can have one value or be all missing ( there shoudl be not more than one value though...unless there's mistakes). So to try to distinguish time-fixed and time-variant vars I have to look both at the average missingness and variability within groups

Thanks again.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

#10

14 Dec 2016, 13:56

Thanks for coming back on that but I still don't understand. If there is one just distinct non-missing value, the idea should work.

Consider this. For id 1, the one non-missing value is spread to all observations. For id 2, the procedure doesn't go through.

Code:

clear 
set obs 10 
gen id = cond(_n <= 5, 1, 2)  
gen t = _n 
gen foo = "A" in 3 
replace foo = "B" in 7 
replace foo = "C" in 9 
list 

     +---------------+
     | id    t   foo |
     |---------------|
  1. |  1    1       |
  2. |  1    2       |
  3. |  1    3     A |
  4. |  1    4       |
  5. |  1    5       |
     |---------------|
  6. |  2    6       |
  7. |  2    7     B |
  8. |  2    8       |
  9. |  2    9     C |
 10. |  2   10       |
     +---------------+

egen tag = tag(foo) 
egen ndistinct = total(tag), by(id) 

bysort id (foo) : replace foo = foo[_N] if missing(foo) & ndistinct == 1 

sort id t 
list, sepby(id) 

     +--------------------------------+
     | id    t   foo   tag   ndisti~t |
     |--------------------------------|
  1. |  1    1     A     0          1 |
  2. |  1    2     A     0          1 |
  3. |  1    3     A     1          1 |
  4. |  1    4     A     0          1 |
  5. |  1    5     A     0          1 |
     |--------------------------------|
  6. |  2    6           0          2 |
  7. |  2    7     B     1          2 |
  8. |  2    8           0          2 |
  9. |  2    9     C     1          2 |
 10. |  2   10           0          2 |
     +--------------------------------+

Comment

Nazzarena

Join Date: Aug 2014

Posts: 60
#11

14 Dec 2016, 15:36

Yes, but if you only had id==1 with only one obs filled in, the code cannot discriminate if foo is time varying, or time -invariant, so it would fill in either case. Thus I had to look at the totality of the observations, assuming (hoping) that
av`var'>1 for time varying
av`var'<=1 for time invariant
then obviously I went to check the codebooks, and out of several hundred vars, it looks like in fact only two time varying vars with extreme missingness have av`var'<=1
It would be expedient if this and other important properties could be coded in the name or otherwise intrinsically associated to a var.
Again, I thank you very much. No matter how many good manuals I follow, I seem to be always behind the curve.
I have the feeling many of us who have to be our own statisticians can exhaust so much effort in manipulating data before even getting to the analyses.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#12

14 Dec 2016, 15:47

Sorry, but I can't push this any further, as I see no code examples with data here that could possibly tell me where my code is wrong or incomplete or doesn't do what you want.

Word pictures really don't help much on the whole.

Incidentally, I am not a statistician. I have no formal qualifications whatsoever in statistics. I agree that data management is crucial. I spend a lot of my time on it.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#13

21 Dec 2016, 12:22

The stripolate program is now posted on SSC as part of the mipolate package. See http://www.statalist.org/forums/foru...-interpolation
Comment
Attaullah Shah

Join Date: Aug 2014

Posts: 1669
#14

19 Nov 2020, 22:23

You can also try fillmissing (from SSC) for filling missing values by groups with last, first, any, previous, next, mean, median, or other options. Details are provided in the help file and here https://fintechprofessor.com/2019/12...lues-in-stata/

Code:

ssc install fillmissing help fillmissing

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.
Comment

Announcement