Fill up values by first non-missing in group

Milan Quentel

Join Date: Nov 2016

Posts: 52
#1

Fill up values by first non-missing in group

09 Jan 2017, 06:34

I would like to fill up values for a variable, say number, with the first (and only) non-missing number in the same group (captured by the group identifier id) such that

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte id double number 1 23 1 . 1 . 2 12 2 . 3 5 3 . 3 . end

will be completed to

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte id double number 1 23 1 23 1 23 2 12 2 12 3 5 3 5 3 5 end

I do realize that this question has been asked and answered before. I have found some useful initial help here and here. I ended up with three different versions of code:

Code:

bysort id : replace number = number[_n-1] if missing(number) & _n > 1 bysort id : replace number = number[1] bysort id (number): replace number=number[_N]

However, neither of these versions would produce the result that I am looking for (see table above). For some ids (say id=1 and id=3) it worked, for others it did not (I got only missing number values for id=2 despite the fact that the number was nonmissing for id=2 in the original data).

Do you have some other suggestions I could try? What could be possible potential sources of error? What additional information could I look for / provide to make the error hunting feasible?

Many thanks,
Milan
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35782
#2

09 Jan 2017, 06:58

You cite the FAQ http://www.stata.com/support/faqs/da...issing-values/ but what you say makes limited sense otherwise.

Let's peel off first

Code:

bysort id (number): replace number=number[_N]

That would sort the highest value observed in each block (which could be missing) to the end of each block and use that to overwrite all values in the same block, regardless of whether values were missing. That's not a good idea at all. It's not a solution for your stated problem.

Code:

bysort id: replace number=number[1] bysort id : replace number = number[_n-1] if missing(number) & _n > 1

are both solutions for your stated problem.

You seem to be implying that they don't work for your example, but that's not so.

Code:

. clear . input byte id double number id number 1. 1 23 2. 1 . 3. 1 . 4. 2 12 5. 2 . 6. 3 5 7. 3 . 8. 3 . 9. end . . bysort id : replace number = number[_n-1] if missing(number) & _n > 1 (5 real changes made) . l, sepby(id) +-------------+ | id number | |-------------| 1. | 1 23 | 2. | 1 23 | 3. | 1 23 | |-------------| 4. | 2 12 | 5. | 2 12 | |-------------| 6. | 3 5 | 7. | 3 5 | 8. | 3 5 | +-------------+

I can't see any reason to doubt a standard method for what it fairly claims to do, copy downwards. Naturally, that method won't copy upwards as well, but that's documented. See also mipolate (SSC).
Comment
Milan Quentel

Join Date: Nov 2016

Posts: 52
#3

09 Jan 2017, 07:17

Sorry, I should have clarified:

The above code is only pseudo-code that I wrote to generalise the problem. Should I upload some actual observations? I have tried the first version because it was recommended in the other linked thread. As to the other two versions, for my data they do not work properly.

Thank you for mentioning mipolate (SSC). However, it would not work in my case because my id is a string variable.

Copy downwards (with carryforward (SSC)) would not work because sometimes I have all values missing and

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte id double number 1 23 1 . 1 . 2 . 2 . 3 5 3 . 3 . end

would become

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte id double number 1 23 1 23 1 23 2 23 2 23 3 5 3 5 3 5 end

(Probably should have included this case in the pseudo code.)

Thank you for the help and sorry for all possible confusion.

Last edited by Milan Quentel; 09 Jan 2017, 07:20.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35782
#4

09 Jan 2017, 07:31

So, you're now claiming that the problem is not the examples you showed --- but the examples you didn't show us. Do show us at least one you don't understand.

Also, it seems that your identifier is string. But that's irrelevant to the techniques mentioned here or used by mipolate (SSC). Applying within a framework by stringid: (or whatever) will work. Did you try it?

I am not responsible for carryforward (SSC), but you would certainly get nonsense results if you ignored the identifiers. That makes no sense. Also, no program whatsoever can validly interpolate from nothing else whatsoever. Your example for id = 2 shows no information.

Incidentally the code you attribute to http://www.stata.com/statalist/archi.../msg01239.html is there used for a completely different purpose.

Last edited by Nick Cox; 09 Jan 2017, 07:45.
Comment

Milan Quentel

Join Date: Nov 2016
Posts: 52

09 Jan 2017, 08:24

So here is an example from my data where

Code:

bysort company : replace turnover = turnover[_n-1] if missing(turnover) & _n > 1

does not work. Note that company is the id (in string format) and turnover is the number to be replicated.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str144 company double turnover str10 DMC_id
"2016 GCG LIMITED" 1968878.5682201385 "P113682000"
"2016 GCG LIMITED"                  . "P113682000"
"2016 GCG LIMITED"                  . "P012939285"
"2016 GCG LIMITED"                  . "P012939285"
"2016 GCG LIMITED"                  . "P097765326"
"2016 GCG LIMITED"                  . "P355973038"
"2016 GCG LIMITED"                  . "P048664390"
end

is completed as

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str144 company double turnover str10 DMC_id
"2016 GCG LIMITED"                  . "P113682000"
"2016 GCG LIMITED"                  . "P012939285"
"2016 GCG LIMITED"                  . "P097765326"
"2016 GCG LIMITED" 1968878.5682201385 "P113682000"
"2016 GCG LIMITED" 1968878.5682201385 "P048664390"
"2016 GCG LIMITED" 1968878.5682201385 "P012939285"
"2016 GCG LIMITED" 1968878.5682201385 "P355973038"
end

Have you / has anyone seen something like that before? What could be potential reasons?

Thank you also, Nick Cox, for your suggestion to use mipolate together with by. I tried

Code:

bysort company : mipolate company turnover, gen(ly1)

which fails because company is a string variable and

Code:

bysort company : mipolate turnover, gen(ly1)

which fails because I need to specify at least two variables.

I also tried encoding my string id variable and then using mipolate but that, too, produced rather weird results.

Could you please explain how mipolate could be used together with the by option?

Thank you for all your help.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35782

09 Jan 2017, 08:41

Your example works for me:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str144 company double turnover str10 DMC_id
"2016 GCG LIMITED" 1968878.5682201385 "P113682000"
"2016 GCG LIMITED"                  . "P113682000"
"2016 GCG LIMITED"                  . "P012939285"
"2016 GCG LIMITED"                  . "P012939285"
"2016 GCG LIMITED"                  . "P097765326"
"2016 GCG LIMITED"                  . "P355973038"
"2016 GCG LIMITED"                  . "P048664390"
end

bysort company : replace turnover = turnover[_n-1] if missing(turnover) 

list 

     +-------------------------------------------+
     |          company    turnover       DMC_id |
     |-------------------------------------------|
  1. | 2016 GCG LIMITED   1968878.6   P113682000 |
  2. | 2016 GCG LIMITED   1968878.6   P113682000 |
  3. | 2016 GCG LIMITED   1968878.6   P012939285 |
  4. | 2016 GCG LIMITED   1968878.6   P012939285 |
  5. | 2016 GCG LIMITED   1968878.6   P097765326 |
     |-------------------------------------------|
  6. | 2016 GCG LIMITED   1968878.6   P355973038 |
  7. | 2016 GCG LIMITED   1968878.6   P048664390 |
     +-------------------------------------------+

Your attempts at using mipolate both fail because you don't meet its requirements. The variable named after mipolate is that you are interpolating and that must be numeric and can't be the identifier you use to define groups. You need a sequence variable even if it is artificial.

This would work, for example,

Code:

 
 * Example generated by -dataex-. To install: ssc install dataex clear input str144 company double turnover str10 DMC_id "2016 GCG LIMITED" 1968878.5682201385 "P113682000" "2016 GCG LIMITED"                  . "P113682000" "2016 GCG LIMITED"                  . "P012939285" "2016 GCG LIMITED"                  . "P012939285" "2016 GCG LIMITED"                  . "P097765326" "2016 GCG LIMITED"                  . "P355973038" "2016 GCG LIMITED"                  . "P048664390" end bysort company : gen seq = _n 

by company: mipolate turnover seq, gen(turnover2) forward 

list 

     +-------------------------------------------------------------+
     |          company    turnover       DMC_id   seq   turnover2 |
     |-------------------------------------------------------------|
  1. | 2016 GCG LIMITED   1968878.6   P113682000     1   1968878.6 |
  2. | 2016 GCG LIMITED           .   P113682000     2   1968878.6 |
  3. | 2016 GCG LIMITED           .   P012939285     3   1968878.6 |
  4. | 2016 GCG LIMITED           .   P012939285     4   1968878.6 |
  5. | 2016 GCG LIMITED           .   P097765326     5   1968878.6 |
     |-------------------------------------------------------------|
  6. | 2016 GCG LIMITED           .   P355973038     6   1968878.6 |
  7. | 2016 GCG LIMITED           .   P048664390     7   1968878.6 |
     +-------------------------------------------------------------+

Comment

Milan Quentel

Join Date: Nov 2016

Posts: 52
#7

09 Jan 2017, 10:12

I have no idea why the example works if taken on its own but doesn't work within my larger dataset.

The idea with the sequence variable, bysort and mipolate, however, works perfectly fine. Thank you very much!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35782
#8

09 Jan 2017, 10:19

Test for equality of strings that you think should be equal. Small differences of spelling or punctuation or hidden characters are easily fatal.
Comment
Milan Quentel

Join Date: Nov 2016

Posts: 52
#9

09 Jan 2017, 10:23

Thank you for the advice. Fortunately, I can guarantee that the identifying string is equal because of the way it was constructed.

One more question: What happens if the variable to be replicated (in my example turnover) is a string variable? Is there any way to deal with such a situation within the mipolate command?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35782
#10

09 Jan 2017, 10:28

There is a related solution, already documented.

http://www.statalist.org/forums/foru...-in-panel-data

See http://www.statalist.org/forums/foru...-interpolation for stripolate
Comment
Milan Quentel

Join Date: Nov 2016

Posts: 52
#11

09 Jan 2017, 14:30

Thank you, very much. Stripolate works perfectly.
Comment
Victoria Lacaze

Join Date: Oct 2016

Posts: 4
#12

19 Jan 2017, 22:59

Hello,

I am working with an unbalanced panel dataset and I used the -fillin- command to create a balanced version of it:

fillin timevar id

Now I need to replace the missing values (the only ones are those of _fillin==1), but after trying the following, I get the "different non-missing values of variable x" error message (r498):

local varlist Exporter Importer Comm distcap contig colony Exp_gdp Exp_pop Imp_gdp Imp_pop ldlock
foreach var in `varlist' {
mipolate `var' id, gen(n`var') groupwise
}

Each id value is unique for each Exporter-Importer-Comm possible combination. The remaining variables are dyadic, so they have unique values for each Exporter-Importer possibility, regardless of Commodity or timevar.
I haven´t used -mipolate- command before, so any comments would be really helpful.

Thanks!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35782
#13

20 Jan 2017, 00:29

See the help for mipolate (SSC):

groupwise specifies that non-missing values be copied to
missing values if, and only if, just one distinct
non-missing value occurs in each group. Thus a group of
values ., 42, ., . qualifies as 42 is not missing and is
the only non-missing value in the group. Hence the missing
values in the group will be replaced with 42 in the new
variable. By the same rules 42, ., 42, . qualifies but 42,
., 43, . does not. Normally, but not necessarily, this
option is used in conjunction with by:, which is how groups
are specified; otherwise the (single) group is the entire
set of observations being used.

So, your data don't satisfy the criterion for this kind of interpolation.

What's most obvious is that you missed out a by() option so you are telling mipolate that there is just one distinct non-missing value for each variable in the entire dataset. That will typically be quite wrong.

I imagine that you need

Code:

by(Exporter Importer Comm)
Comment
Victoria Lacaze

Join Date: Oct 2016

Posts: 4
#14

20 Jan 2017, 18:40

Thank you so much!
Comment
Victoria Lacaze

Join Date: Oct 2016

Posts: 4
#15

20 Jan 2017, 20:32

The code that finally worked for my dataset is:
foreach var in `varlist' {
bysort id: mipolate `var' id, gen(n`var') nearest
replace `var'=n`var' if `var'==.
drop n`var'
}
Thank you again
Comment

Announcement