Forcing Loop to ignore 'no observation error'

Lisa Feliks

Join Date: Apr 2016

Posts: 11
#1


12 Apr 2016, 11:08

Hi!

I am trying to do a regression imputation on my education variable (I know that multiple imputation is better, but because of my deadline I have to use this approach).
I got some help writing the loop, because I found it really hard. It works for the most part, but I keep getting an error.

First:

egen lw = group(S003 wave)
summ lw
local Nlw = r(max)
forvalues i = 1/`Nlw' {
    summ educ if lw==`i'
    if (r(N) <= 20) {
        break
    }
    regress educ X047 X003 female if lw==`i'
    predict educ_hat if lw==`i', xb
    replace educ = educ_hat if missing(educ) & lw==`i'
    drop educ_hat
}

When I run it, I get the error 'no observations'. I looked it up and saw that many others solved it by using 'capture'. I tried it, but it doesn't seem to work as I get the error 'no last observations' (when capture is placed before regress).

The second part of the syntax gets the same error:

egen land = group(S003)
summ land
local Nland = r(max)
forvalues i = 1/`Nland' {
    summ educ if land==`i'
    if (r(N) <= 20) {
        break
    }
    regress educ (c.X047 c.X003 c.female)##c.wave if land==`i'
    predict educ_hat if land==`i', xb
    replace educ = educ_hat if missing(educ) & land==`i'
    drop educ_hat
}

The first syntax should impute missing values on education in countries (S003) where only some of the respondents have missing values on this variable. The second syntax should impute education for respondents in countries where nobody has a value on this variable.

I really hope someone is able to help me!

Best,

Lisa

Nick Cox

Join Date: Mar 2014
Posts: 35633
#2

12 Apr 2016, 11:19

You want to exclude groups with <= 20 non-missing values of educ from your regressions. You can do that outside the main loop. Here's a sketch.

Code:

egen npresent = count(educ), by(S003)
egen land = group(S003) if npresent > 20 
su land, meanonly 

forvalues i = 1/`r(max)' { 
    regress educ (c.X047 c.X003 c.female)##c.wave if land==`i'
    predict educ_hat if land==`i', xb
    replace educ = educ_hat if missing(educ) & land==`i'
    drop educ_hat
}

This doesn't account for missings in your predictors. That code might start

Code:

 
egen npresent1 = rownonmiss(educ X047 X003 female wave) 
egen npresent2 = total(npresent1 == 5), by(S003) 
egen land = group(S003) if npresent2 > 20


Lisa Feliks

Join Date: Apr 2016

Posts: 11
#3

12 Apr 2016, 11:40

Thank you for your quick reply, Nick.

I think that the condition on fewer than 20 non-missing values is not that important, as I don't think that there are any groups with fewer than 20. Unfortunately, the first syntax still gave the error r(2000) 'no observations'. I also ran it without the < 20 condition, and it still gave the error. It does run for some time, but it stops at the same place every time.

Oddly, when using the second code before the loop, I get an 'invalid syntax' error (at the end). Not sure what happened there.

Any thoughts?

Nick Cox

Join Date: Mar 2014

Posts: 35633
#4

12 Apr 2016, 11:49

Something may be string that should not be. Try

Code:

describe educ X047 X003 female wave
summarize educ X047 X003 female wave

Lisa Feliks

Join Date: Apr 2016

Posts: 11
#5

12 Apr 2016, 12:02

I'm afraid not.
I ran the syntax you gave me with capture before the regress. Then I get the error: last estimates not found r(301);
When I run it with capture before predict it says: educ_hat not found r(111);

It's always at the exact same spot.

variable name   type    format     label      variable label
---------------------------------------------------------------
educ            float   %9.0g
X047            byte    %8.0g      X047       Scale of incomes
X003            int     %8.0g      X003       Age
female          byte    %9.0g      female     RECODE of X001 (Sex)
wave            float   %9.0g      wavel

.
. summarize educ X047 X003 female wave

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        educ |     400103    4.740372     2.183134         1          8
        X047 |     366965    4.654714     2.347977         1         10
        X003 |     472528    42.40025     16.59306        18        108
      female |     473392    .5256257     .4993434         0          1
        wave |     473780    4.253424     1.291519         2          6

Nick Cox

Join Date: Mar 2014
Posts: 35633
#6

12 Apr 2016, 12:09

How many parameters are you estimating there? You have lots of interactions.

This code is more defensive:

Code:

egen npresent1 = rownonmiss(educ X047 X003 female wave) 
egen npresent2 = total(npresent1 == 5), by(S003) 
egen land = group(S003) if npresent2 > 20
su land, meanonly

forvalues i = 1/`r(max)' { 
    capture regress educ (c.X047 c.X003 c.female)##c.wave if land==`i'
    if _rc == 0 { 
        predict educ_hat if land==`i', xb
        replace educ = educ_hat if missing(educ) & land==`i'
        drop educ_hat
    }
}


Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#7

12 Apr 2016, 12:11

The number of observations in the data set for these variables doesn't tell us what's happening in the loop. Each time through the loop, all that is relevant is the observations for the current value of variable land. Remember also that for the purposes of any regression command, an observation only counts if it has non-missing values for every variable in the command. If even one variable is missing, the observation is not usable. So, you need to find out which value of land is causing the problem. Something like this:

Code:

by land, sort: egen n_observations = total(!missing(educ, X047, X003, female, wave))
levelsof land if n_observations == 0

Actually, although you won't get the "no observations" error message, you also won't get any usable results if n_observations < 6 (because there will be no denominator degrees of freedom left). So you may as well deal with those while you're at it. Taking it a step further, even with 6 observations, although you will get coefficients and standard errors, it's hard to take that kind of analysis seriously. So you might want to consider filtering out at some more stringent level for n_observations.
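
A minimal sketch of such a filter, assuming a threshold of 30 complete cases per country. The threshold and the names minobs, n_complete and land_ok are illustrative only and do not come from this thread.

Code:

* Hypothetical threshold; adjust to taste
local minobs = 30

* Count complete cases per country (missing() takes a comma-separated list)
egen n_complete = total(!missing(educ, X047, X003, female, wave)), by(S003)

* Number only the countries that clear the threshold; the others get missing
egen land_ok = group(S003) if n_complete >= `minobs'

* List the countries that were filtered out
levelsof S003 if n_complete < `minobs'

Because minobs is a local macro, these lines need to be run together, for example from a do-file.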

Lisa Feliks

Join Date: Apr 2016

Posts: 11
#8

12 Apr 2016, 12:42

I am doing a 3 level mixed multilevel model. I have cross-sectional data on 6 time points (which I call wave, because it's short).
All individuals are nested in a wave*country, which are nested in countries. In the model I run I have no interactions.

The assistant professor that helped me did the interaction with wave to get a value for the other waves in which education is missing (most of the countries participated multiple times).

The syntax didn't run, but when I added 'break' it worked without an error! (I put it in a box, as a double check.)

Code:

egen npresent2 = total(npresent1 == 5), by(S003)
egen land = group(S003) if npresent2 > 20
summ land
forvalues i = 1/`r(max)' {
    capture regress educ (c.X047 c.X003 c.female)##c.wave if land==`i'
    if _rc == 0 {
        break
        predict educ_hat if land==`i', xb
        replace educ = educ_hat if missing(educ) & land==`i'
        drop educ_hat
    }
}

I did get an interesting summarize result, namely a negative value for the minimum. Is this possible (the original variable goes from 1 to 8)?

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        educ |     462450    4.671427     2.095803 -.6107745   10.06834

I think it worked for (almost) all, as I only have 10,000 respondents with missing values (first I had 80,000). I know I'm not able to impute for every missing value because X047 is missing for some respondents. And as I have 470000 respondents, 10,000 seems reasonable. However, I do have 1,124 respondents left that have a missing on education, but no missing on female, X003 and X047.

Nick Cox

Join Date: Mar 2014

Posts: 35633
#9

12 Apr 2016, 12:57

I don't understand the problem with the code I suggested, nor your fix.

-if _rc == 0- and -break- are incompatible: if you want to break out, that should happen after a non-zero return code, not a zero one.
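
For skipping a group inside a loop, the Stata command is -continue- (and -continue, break- to leave the loop altogether). A sketch only, assuming land has already been created as in the code earlier in this thread. The local name Nland and the message text are illustrative.

Code:

su land, meanonly
local Nland = r(max)

forvalues i = 1/`Nland' {
    capture regress educ (c.X047 c.X003 c.female)##c.wave if land==`i'
    if _rc != 0 {
        * regress failed for this group: report it and move on
        display "land `i' skipped, return code " _rc
        continue
    }
    predict educ_hat if land==`i', xb
    replace educ = educ_hat if missing(educ) & land==`i'
    drop educ_hat
}

This behaves like the -if _rc == 0 { ... }- version, but also reports which groups were skipped.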

Negative values don't surprise me. You're fitting a hyperplane and nothing constrains predictions to the observed range. Presumably you are also replacing with real numbers when you have integer codes.
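
One possible way to keep imputed values on the original 1 to 8 integer scale is to round and clip the predictions before the replace. This is only a sketch under that assumption, not something the loop above does, and whether rounding is appropriate for the analysis is a separate question. These lines would replace the predict, replace and drop lines inside the loop.

Code:

predict educ_hat if land==`i', xb
* max() and min() ignore missing arguments, so missing predictions
* must be excluded explicitly or they would be set to 1
replace educ = min(max(round(educ_hat), 1), 8) if missing(educ) & !missing(educ_hat) & land==`i'
drop educ_hat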

Last edited by Nick Cox; 12 Apr 2016, 13:25.

Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#10

12 Apr 2016, 13:01

So, if I understand #8, the purpose of this code is to impute a value for the variable educ when it is missing, using a regression on variables X047, X003, female, and wave. You want to later use this data to run a multi-level model.

The inclusion of -break- in the code should not have affected the running of this code: all it does is enable you to interrupt Stata by pressing the Break key. I am totally baffled why this made any difference at all. I am almost certain that if we saw the exact code that was run before and after there would be some other change that actually made the difference.

Anyway, the code as written should accomplish your stated purpose, at least for those observations where none of X047, X003, female, and wave are missing, and where no other problems are encountered in carrying out the regression. Since your variable educ now contains imputed values, it is not surprising that some of them are out of the range of the original variable. That is not a problem.

All of that said, this way of handling missing data is discouraged today. Single imputations of missing values result in misleading underestimation of the variation in the data, which in turn leads to incorrect estimates of the models built on them. If you believe that a regression-based imputation of missing values is reasonable for your data, you should be doing multiple imputation. Stata has a whole suite of commands (-help mi-) for doing multiple imputation, and you should familiarize yourself with the corresponding manual section. It is something of a heavy lift, to be sure, but if you are doing this work for purposes of publication or for a dissertation, I would think that reviewers/committee will not let you get away with single imputation in this way. I think it is fair to say that single imputation can be worse than doing nothing at all about missing data and just analyzing complete cases.
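
A rough sketch of what the -mi- workflow could look like, under several assumptions: y is a hypothetical name for the outcome, lw is the country-wave identifier from #1, and the choice of 20 imputations, the seed, and the predictors in the imputation model are all arbitrary. It is not a recommended specification for these data.

Code:

* Declare the mi style and the variable to be imputed
mi set wide
mi register imputed educ

* Impute educ by linear regression.
* force lets mi proceed when predictors are themselves missing;
* educ then stays missing in those observations
mi impute regress educ X003 female i.wave, add(20) rseed(12345) force

* Fit the analysis model in each imputed data set and pool the results
mi estimate: mixed y educ X003 female || S003: || lw:

X047 is left out of the imputation model here because it has many missing values itself; imputing several variables jointly would call for something like -mi impute chained-.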

Now, multiple imputation adds substantially to the amount of time it takes to do the computations, and multi-level models, particularly if they are non-linear, can be very time consuming on large data sets even without that. So I would certainly agree with using your single-imputation approach as a first step, and getting the multi-level model set up, running, and correct using that first. But once you have that working, you really should move to an MI-based approach, or perhaps just stick with analyzing complete cases. See http://www.statisticalhorizons.com/w...ap-Allison.pdf for a very good review of different approaches to missing data and their strengths and weaknesses.

Lisa Feliks

Join Date: Apr 2016

Posts: 11
#11

12 Apr 2016, 13:26

I was writing a response when you gave a perfect summary of my problem, Clyde.

To come back to you and Nick, I ran the syntax without break and it worked. I have no idea why it didn't work before. I think Stata was tired of me trying to run things.

I am aware of the underestimation. I am writing my research master thesis at the moment. My deadline is in a month, and therefore I am not able to do multiple imputation. When I decide to publish, I will indeed use multiple imputation.

(my laptop battery is empty, I will continue my response in half an hour)

Lisa Feliks

Join Date: Apr 2016

Posts: 11
#12

12 Apr 2016, 14:08

Clyde, I was wondering what stringent level for n_observations you would advise (that is instead of the 20, right?).

I ran my model with the imputed education, so thank you Nick for the syntax! However, I am quite surprised how different my results are compared to the results with partly imputed/no imputation of education.

I might consider just sticking with the complete cases, and use MI when publishing. But I have to talk to my supervisor about this decision. It is just really difficult to decide, because if I only do my analyses on cases that have no missing on education, I will lose quite a few respondents/countries.

Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#13

12 Apr 2016, 14:17

My remark in #7 about a more stringent threshold for n_observations was made before I was aware that the regressions were being used for single imputation purposes. So I was thinking more about issues of statistical power. But modeling for imputation is a variant of predictive modeling. So additional issues like over-fitting come into play. There are various rules of thumb about that and I'm not sure how much science there is behind any of them. But people will say that you need a minimum of a certain number of observations per predictor. With 4 predictors in the model, depending on whose rule you like, you would want anywhere from 40 (10 obs per predictor) to 200 (50 obs per predictor) observations for each group. Given the large size of your data set, that doesn't actually seem like all that much to ask, although I can imagine that some specific values of the variable land might have fewer observations than that.

Lisa Feliks

Join Date: Apr 2016

Posts: 11
#14

12 Apr 2016, 14:31

I changed the 20 to 200 and I don't think anything changed. So I guess that is nice.

Can I ask you something different? My supervisor made a comment about education, as it is causing this many problems. Most of my hypotheses are on the country*wave level and education is a control variable (the literature states that it has a stronger effect than the macro variables, and is therefore more important). He said that, as we are mostly interested in the macro level effects between countries, not including education as a control variable would also be a possibility. Not sure if this is your expertise, but I was curious to hear what you think. It seems like an easy way out, but maybe a better idea than this imputation? (Or maybe only include the cases that are non-missing on education.)

Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#15

12 Apr 2016, 14:48

Well, I'm an epidemiologist, so this is not my area of expertise. If your hypothesis is of the form "antecedent X is associated with outcome Y" and if education is associated with both X and Y, then omitting it will cause bias. You've indicated that it's strongly associated with outcome Y, at least according to the literature. So the question is whether it is also associated with antecedent X. If it is, then you have no choice really but to keep it in the model. If it's not you can omit it: the price you pay is that the residual variance of your outcome is larger than it needs to be, which could reduce your statistical power. These are general principles, not specific to any particular discipline.

By the way, the issue of omitted variable bias (aka confounding) is a sample-level issue. So even though the literature says education is associated with outcome Y, if there is something about the sampling in your data that causes Y to be independent of education, then, that, too eliminates the necessity of including education.
