Aggregating several variables with some missing observations

Sam Volpe

Join Date: Jun 2021

Posts: 36
#1

Aggregating several variables with some missing observations

11 Jul 2021, 21:26

Hello Statalisters,

I am creating an index with several made up of several indicators and some of these indicators have missing observations. If I aggregate 3 variables a, b, and c, even if a and b are totally complete, if c has any missing observations, the aggregate variable seems to completely ignore the data from a and b, and that observation year is only counted as missing. I am losing so much information because of this.
When I was looking up how to deal with missing observations, I saw the command collapse repeatedly mentioned, but I don't think it's relevant here, since I am not looking to only use summary statistics.
So far I am using averaging with arbitary weights. I even just added all the variables together to see if the problem was not the method but the numbers themselves and they were just balancing each other out, but even when I simply add variables with all positive observations or missing observations, the missing variable seems to overpower any other information.

How would you recommend aggregating variables/creating a composite variable when some observations are missing?
Tags: aggregation, averaging, Composite variable, missing observations, missing values
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17729
#2

11 Jul 2021, 23:26

Sam:
as per FAQ, please show exactly what you typed and what Stata gave you back. Thanks.

Kind regards,
Carlo
(Stata 19.0)
Comment
Sam Volpe

Join Date: Jun 2021

Posts: 36
#3

12 Jul 2021, 11:46

Hi Carlo,

my bad. Here is the general concept of code I'm using:

gen newvar = weight*a + weight*b + weight*c
where a, b, c are existing variables

example code:
gen societal_division = (1/3)*v2clpolcl + (1/3)*v2cacamps + (1/3)*v2peapsecon

And here are the results. As you can see for row 112, I would expect societal_division to come to -2.145.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35761
#4

12 Jul 2021, 12:29

Carlo Lazzaro was alluding to https://www.statalist.org/forums/help#stata which explains our longstanding request to use dataex. Even when easily readable, as your image is to me, images have a big disadvantage: more experienced users can't copy and paste to give you workable code (or indeed, experiment themselves to derive workable code). Again, please see the link for the explanation.

When you ask for

(1/3) some number + (1/3) another number + (1/3) missing

Stata does not read your mind as wanting it to do instead

(1/2) some number + (1/2) another number

It's following your instruction and missing values mess up your calculation. What you need is better done as

Code:

egen societal_division = rowmean(v2clpolcl v2cacamps v2peapsecon)

which will ignore missing values to the extent possible. ("Missing values" is more precise than "missing observations" as in Stata an observation is, in other terms, an entire row or record with values for one or more variables in the dataset.)

Last edited by Nick Cox; 12 Jul 2021, 12:34.
1 like
Comment
Eric Collins

Join Date: Jul 2021

Posts: 5
#5

12 Jul 2021, 12:31

Sam: I was able to recreate your problem using the auto.dta dataset.

Code:

sysuse auto.dta keep mpg rep78 headroom replace headroom = . gen example = mpg + rep78 + headroom

produces

Code:

(74 missing values generated)

However, when I used egen, it ignored the missing values

Code:

egen example2 = rowtotal(mpg rep78 headroom)

If you need to add arbitrary weights, you could store that in another variable and do

Code:

egen varname = rowtotal(varlist of weighted vars)
Comment
Sam Volpe

Join Date: Jun 2021

Posts: 36
#6

13 Jul 2021, 12:29

Hi Nick Cox

Though it may not seem like it from the rudimentary level of my Stata questions, I cannot tell you how many of your responses and pages and blog posts I've read. I apologize for not following proper Statalist procedure in this post, and I won't make the same mistake again. I really appreciate your advice on how to better communicate with Stata.
2 likes
Comment
Sam Volpe

Join Date: Jun 2021

Posts: 36
#7

13 Jul 2021, 12:31

Eric Collins thank you so much for going out of your way to recreate my problem. Knowing this code that you have so clearly demonstrated is going to really help me going forward. Best wishes to you!
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30154
#8

13 Jul 2021, 13:49

Sam Volpe I hope you realize that by using the solution(s) suggested above you are implicitly treating the absence of a value for a given variable to the value of that variable being zero. Perhaps the meaning of your variables is such that this is a reasonable thing to do. But think about that before proceeding. If your variables are not defined in such a way that variable = 0 is equivalent (in substantive real world terms) to variable = missing value, then you are mangling your index.
2 likes
Comment
Sam Volpe

Join Date: Jun 2021

Posts: 36
#9

15 Jul 2021, 12:59

Clyde Schechter I have noticed that, with

Code:

egen varname = rowtotal(varlist)

. But with Nick Cox's solution,

Code:

egen varname = rowmean(varlist)

, this doesn't seem to be the case, the description says it ignores missing values. Am I interpreting the distinction correctly? However, with either method, I cannot alter the weights of each variable. So it appears I should follow Eric Collins' solution to create new variables, where I multiply each variable by the weight I prefer, and then rowmean them?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30154
#10

15 Jul 2021, 14:13

Part of your post is cut off, after the first code block, so I don't really know quite what you are asking about.

Say we have three variables, and their values in some observation are 3, 4, and missing. The result of -rowtotal- will be 3+4 = 7. Notice that this is the same result you would get if the third value were 0 instead of missing. As for -rowmean-, it calculates (3 + 4)/(1+1) = 3.5, whereas if the third value were 0 instead of missing, you would get (3 + 4 + 0)/(1+1+1) = 2.333... So rowmean ignores missing values in that sense. But that sense is a little misleading. What if the third value, instead of being missing, were 3.5. Then -rowmean()- would calculate (3 + 4 + 3.5)/(1+1+1) = 10.5/3 = 3.5--the same result we got with a missing value. So, -rowmean- implicitly treats all missing values as if they were equal to the mean of the non-missing values.

Eric Collins is correct that to do a weighted calculation with either -rowtotal- or -rowmean- you have to first calculate weighted variables. That doesn't change any of the considerations from the previous paragraph, though. A weighted missing value is still a missing value. So -rowtotal- still treats missing as equivalent to zero, and -rowmean- treats missing as equivalent to the mean values of the non-missing weighted variables.

The main point is this: missing values are always a problem. There are no good solutions to missing data; one always seeks the least bad solution. Now, if the variable going into your index really are different imperfect indicators of a common underlying construct, then it may be reasonable to believe that the mean of the weighted values of the non-missing variables is a reasonable proxy for the actual value we would see if it weren't missing. In fact, explicitly adopting this belief and using it has a name: it is called ipsative mean imputation, and it is widely used. When the underlying assumption that the weighted mean values of the different variables should be roughly equal for most observations is plausible, it can be reasonable to use it. Of course, it is just an approximation, and it also has the drawback of resulting in an index variable whose variance is artificially reduced. But it can be workable.
1 like
Comment
Sam Volpe

Join Date: Jun 2021

Posts: 36
#11

15 Jul 2021, 14:32

Clyde Schechter Even if you didn't see my whole comment, you answered every question I had. Thank you, that's very interesting. I feel better knowing that this is a commonly frustrating problem and that there are names for how to deal with it & think about it. I think knowing the term ipsative mean imputation will help me while I'm writing my paper about this, in my justifications. I'll do further research on that. Your explanation is very clear and thorough, I appreciate you walking me through this.
Comment

Announcement

Aggregating several variables with some missing observations

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment