Using Mata to fill in missing entries

Sebastiaan Maes

Join Date: Apr 2017

Posts: 3
#1

Using Mata to fill in missing entries

03 Apr 2017, 06:07

Let's say I have a dataset containing a thousand variables that have only one nonmissing entry per id. Up till now, I have used the following Stata code to fill in the missing entries, which are set to the variable's unique observation (by id).

Code:

forvalues i = 1/1000 { sort id var_`i' by id (var_`i'): gen newvar_`i' = var_`i'[_N] }

Unfortunately, this procedure is quite slow, especially when (very) large datasets are employed. I was wondering whether it would make sense to rewrite this snippet in Mata; would there be a significant gain in speed?
Tags: None
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

03 Apr 2017, 09:16

I'm a bit confused by what your code does, you create a new variable equal to the largest value of a particular variable? Also, which part of the loop takes the longest? The sort or the generate command? You can time this using either the standard timer on/off functionality, or through -timeit- from ssc (e.g. timeit 1: sort id var_`i').
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

03 Apr 2017, 10:28

Welcome to Statalist, Sebastian.

Like Jesse, I don't understand how your original example works. Have you neglected to tell us that your variables are all strings? Because for numeric variables, missing values sort high, and the only nonmissing value will be var[1]. But for strings, null strings sort low and the only nonmissing value will indeed be in var[_N].

Please take the tie to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. It would be particularly helpful to post a small hand-made example, perhaps with just a few variables and observations, showing the data before the process and how you expect it to look after the process. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.

Added in edit: it is not clear to me that this problem at all needs Mata. You might consider posting an expanded explanation of your problem, following the guidance in the FAQ, to the General Forum where it will get more attention.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

03 Apr 2017, 10:48

The typical "trick" in Stata to be able to target a specific observation using explicit subscripting is to use an observation index variable (whose value is the same as _n) and use summarize. Something like:

Code:

sysuse auto, clear gen long obs = _n sum obs if make == "Ford Fiesta", meanonly dis price[`r(min)']

Of course this only works if what you are looking for occurs only once in the data. Anyway, this technique won't work here because the observation must be located within id groups. You can adapt the technique to work by groups, and the fastest solution will be to use a running sum to carry forward the observation index where the condition is true to the last observation within the id group.

Code:

* fake data, 3 id, 10 variables, obs between 500 and 1000 clear set seed 13243 set obs 3 gen id = _n gen N = runiformint(500, 1000) expand N forvalues i=1/10 { bysort id: gen pick1 = runiformint(1, N) qui by id: gen v`i' = runiform() if pick1[1] == _n drop pick1 } bysort id: gen long obs = _n by id: sum forvalues i=1/10 { by id: gen pick = sum(obs / !mi(v`i')) by id: gen vfill`i' = v`i'[pick[_N]] drop pick }
Comment
Sebastiaan Maes

Join Date: Apr 2017

Posts: 3
#5

04 Apr 2017, 03:33

Thank you for your answers. I forgot to mention that before I enter the loop I replace all missing values by zeros (these thousand variables are positive by construction). But as you suggest, this would be an equivalent but more efficient way:

Code:

forvalues i = 1/1000 { sort id var_`i' by id (var_`i'): gen newvar_`i' = var_`i'[1] }

Thanks for your elaborate example, Robert. I will try to implement this later today and will post some benchmarking results.
Comment
Sebastiaan Maes

Join Date: Apr 2017

Posts: 3
#6

05 Apr 2017, 04:49

A quick follow-up: In my application (and using Stata 13 MP4) Robert's solution is about 5 times faster than what I proposed. Many thanks again!
Comment

Announcement

Using Mata to fill in missing entries

Comment

Comment

Comment

Comment

Comment