Standardize around mean of subset

paulvonhippel

Join Date: Apr 2014

Posts: 502
#1

Standardize around mean of subset

18 Jan 2017, 20:55

I have two variables, score and year, representing test scores for year=2005,...,2013. If I want to standardize the test scores, the simplest way to do it is
egen zscore = std(score)
but that standardizes them using the mean and SD of all years together. I would prefer to standardize them using the mean and SD of the first year, 2005. What's the simplest way to do that?

Bonus round: my data are multiply imputed. Is there still a simple alternative to this?
mi passive: egen zscore = std(score)
paulvonhippel

Join Date: Apr 2014

Posts: 502
#2

18 Jan 2017, 21:26

Wait, the simple version of my question, but not the mi version, is answered here: http://www.stata.com/statalist/archi.../msg00010.html
Without mi, the answer looks like this:
su score if year==2005
egen zscore = (score - r(mean)) / r(sd)

In mi data, I can standardize using the mean and sd of a single imputed dataset like this.
mi xeq: su score if year==2005
mi passive: egen zscore = (score - r(mean)) / r(sd)
But standardizing around the mean and SD of the multiple imputations is trickier. Suggestions appreciated.
daniel klein

Join Date: Mar 2014

Posts: 3859
#3

18 Jan 2017, 22:52

su score if year==2005
egen zscore = (score - r(mean)) / r(sd)

In the second line use generate instead of egen. What is really interesting is the question you pose.
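For the record, here is a minimal sketch of the non-mi version with generate. It uses Stata's shipped auto data rather than the score/year data from the question, so price stands in for score and foreign == 0 stands in for year == 2005; those names are assumptions for illustration only.

```stata
* Standardize price around the mean and SD of the domestic cars only
sysuse auto, clear
summarize price if foreign == 0
generate zprice = (price - r(mean)) / r(sd)

* Within the reference subset, zprice should have mean 0 and SD 1 (up to rounding)
summarize zprice if foreign == 0
* In the other group it will generally not
summarize zprice if foreign == 1
```

egen fails here because it expects one of its own functions, such as std(), on the right-hand side rather than an arbitrary expression.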

In mi data, I can standardize using the mean and sd of a single imputed dataset like this.
mi xeq: su score if year==2005
mi passive: [e]gen zscore = (score - r(mean)) / r(sd)
But standardizing around the mean and SD of the multiple imputations is trickier. Suggestions appreciated.

Technically, this will not work, because r(mean) and r(sd) will contain the respective values from the last imputed dataset only. Also, the non-imputed dataset is used by mi xeq if not specified otherwise. I do not think this is what you want. I am not sure what you want, however, and I am not sure if it would be valid either. I hope to learn something here, too.

You could say you want a point estimate of the MI mean and the MI SD. Using Rubin's rules it is easy to get those. The mean will be given by a simple summarize, since the mean of the M means is just the mean over the n x M observations. Assuming that the SD is approximately normal, we can calculate the mean of the M dataset-specific SDs as

Code:

mi query
local M = r(M)
scalar sd_mi = 0
mi xeq 1/`M' : summarize score if (year == 2005) ; scalar sd_mi = sd_mi + r(sd)
scalar sd_mi = sd_mi/`M'
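Spelled out, the pooled quantities this approach is aiming at can be written as follows. The notation is an assumption made for illustration: \bar{x}_m and s_m are the mean and SD of score among the year == 2005 observations in imputed dataset m, and x_i^{(m)} is the score of observation i in dataset m.

```latex
\bar{x}_{\mathrm{MI}} = \frac{1}{M}\sum_{m=1}^{M} \bar{x}_m ,
\qquad
\bar{s}_{\mathrm{MI}} = \frac{1}{M}\sum_{m=1}^{M} s_m ,
\qquad
z_i^{(m)} = \frac{x_i^{(m)} - \bar{x}_{\mathrm{MI}}}{\bar{s}_{\mathrm{MI}}}
```

Every imputed dataset is then scaled by the same two constants, so within any single dataset the 2005 scores will generally not have exactly mean 0 and SD 1.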

However, is this really the way we would standardize the variables? Think about the way the combined estimation results will be obtained later. You would run the estimation command on each of the imputed datasets, then combine coefficients and standard errors. If we use the procedure outlined above, then for each individual estimation the coefficients will not be based on a standardized (mean 0, unit SD) variable. Should this not be the case? Intuitively I would standardize the variable using the dataset-specific mean and SD

Code:

mi query
local M = r(M)
generate zscore = .
mi xeq 1/`M' : summarize score if (year == 2005) ; replace zscore = (score-r(mean))/r(sd)

(Note that it would be much harder to replace the original score variable. Just do not do it and create a new one, as shown above. Or ask for more details if interested in replacing the original.)

What do you think?

Best
Daniel

Last edited by daniel klein; 18 Jan 2017, 22:55.
paulvonhippel

Join Date: Apr 2014

Posts: 502
#4

19 Jan 2017, 09:55

Your idea of standardizing each imputed dataset separately makes a lot of sense. Then how about this?

mi xeq: summarize score if year==2005; replace zscore = (score - r(mean)) / r(sd)

This is working for me. By default mi xeq iterates through each dataset, starting with the non-imputed dataset (m=0) and continuing through all the imputed datasets (m=1,2,....). It doesn't just run on m=0 as you implied earlier. I bring this up only because it explains why my code above works.
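A quick way to see which datasets mi xeq visits is to run it with and without a numlist. This is only a sketch on Stata's shipped auto data; the imputation model, the seed, and the choice of rep78 as the incomplete variable are arbitrary assumptions, made so that there is something to loop over.

```stata
* Build a small mi dataset with M = 3 imputations of rep78
sysuse auto, clear
mi set flong
mi register imputed rep78
mi impute regress rep78 price mpg, add(3) rseed(12345)

* No numlist: runs on m=0 (incomplete rep78) and then on m=1, 2, 3
mi xeq: summarize rep78

* With a numlist: runs on the imputed datasets only
mi xeq 1/3: summarize rep78
```

The output is labelled by dataset (m=0 data, m=1 data, and so on), and the number of observations reported for rep78 should be smaller in m=0 than in the imputed datasets.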
paulvonhippel

Join Date: Apr 2014

Posts: 502
#5

19 Jan 2017, 10:06

Wait, I'm sorry. My solution didn't quite work:

mi xeq: summarize score if year==2005; replace zscore = (score - r(mean)) / r(sd)

It appears the values of r(mean) and r(sd) are not being refreshed each time through the loop.

Hm....
daniel klein

Join Date: Mar 2014

Posts: 3859
#6

19 Jan 2017, 10:30

I did phrase this one poorly. I did not want to imply that mi xeq uses only the m=0 dataset, but that it includes this dataset. This is relevant when you are adding scalars, as in the first approach that I have shown, where you want only the values from the complete datasets. With your code, it does not matter whether m=0 is used or not (aside from a few seconds that could be saved by skipping it).

Anyway, what makes you think that your code does not work? Can you show the complete code (including the generate command that you use in the beginning) and (part of the) output that leads to your conclusion?

Best
Daniel
paulvonhippel

Join Date: Apr 2014

Posts: 502
#7

19 Jan 2017, 11:21

OK, now I think I've got this. I was wrong to say before that the values of r(mean) and r(sd) do not refresh each time through the loop. They do. My mistake was that I hadn't registered the standardized variable as passive. Once I do that, the solution works.

Let me illustrate with a reproducible example. The HSB data are available on the web:

use http://www.ats.ucla.edu/stat/stata/s...ation/hsb2_mar, clear

They contain incomplete reading and math scores for boys and girls. I impute them as follows:

mi set wide
mi register imputed read write female
mi impute mvn read write female, add(3)

Now I want to standardize the variable read around the mean for girls. The standardized variable will be called zread. The first thing I need to do is register zread as passive. But I can't do that because it doesn't exist yet. So I simultaneously initialize it and register it as passive, like this:

mi passive: gen zread = .

This is the step that I missed before. It seems a little convoluted and I wonder if there is a way around it. If you run describe, you can see that in addition to a variable named zread, there are now variables named _1_zread, _2_zread, _3_zread, just waiting to be filled with passively imputed values. Now the rest of my solution works:

mi xeq: summ read if female==0; return list; gen zread = (read - r(mean)) / r(sd)

The "return list" verifies that the values of r(mean) and r(sd) refresh each time through the loop. I was wrong before to say they don't.

To verify my solution, I separately check the mean of the standardized variable for boys and for girls. For girls the mean is 0; for boys it is -0.03. That's what I expected.

mi estimate: mean zread if female==0
mi estimate: mean zread if female==1
paulvonhippel

Join Date: Apr 2014

Posts: 502
#8

19 Jan 2017, 11:25

While I now have a working solution, I continue to think that all the mi register and mi xeq business adds an unnecessary layer of complexity, which frequently trips me up.

The imputation tools in SAS work quite nicely without that layer. This is one of the few situations where SAS is simpler and more elegant than Stata.
paulvonhippel

Join Date: Apr 2014

Posts: 502
#9

19 Jan 2017, 11:29

Here's a minor correction to my example. The mi xeq line should use replace instead of gen, since the variable zread was previously generated. The corrected code follows:

use http://www.ats.ucla.edu/stat/stata/s...ation/hsb2_mar, clear
mi set wide
mi register imputed read write female
mi impute mvn read write female, add(3)

mi passive: gen zread = .
mi xeq: summ read if female==0; return list; replace zread = (read - r(mean)) / r(sd)
mi estimate: mean zread if female==0
mi estimate: mean zread if female==1
paulvonhippel

Join Date: Apr 2014

Posts: 502
#10

19 Jan 2017, 11:49

Unfortunately, my solution is still not quite right. When I run the following, the last two lines show that the standardized variable does not have a mean of 0 for girls, as it should. It has a mean of 0 in the non-imputed dataset (m=0) but not in the others. I'm not sure what's going on here. I swear this was working a minute ago.

use http://www.ats.ucla.edu/stat/stata/s...ation/hsb2_mar, clear
mi set wide
mi register imputed read write female
mi impute mvn read write female, add(3)

mi passive: gen zread = .
mi xeq: summ read if female==1; replace zread = (read - r(mean)) / r(sd)

mi estimate: mean zread if female==1
mi xeq: summ zread if female==1
paulvonhippel

Join Date: Apr 2014

Posts: 502
#11

19 Jan 2017, 11:51

Unfortunately, my solution is still not quite right. When I run the following, the last two lines show that the standardized variable does not have a mean of 0 for girls, as it should. It has a mean of 0 in the non-imputed dataset (m=0) but not in the others. I'm not sure what's going on here. I swear this was working a minute ago.

use "http://www.ats.ucla.edu/stat/stata/seminars/missing_data/Multiple_imputation/hsb2_mar", clear
mi set wide
mi register imputed read write female
mi impute mvn read write female, add(3)

mi passive: gen zread = .
mi xeq: summ read if female==1; return list; replace zread = (read - r(mean)) / r(sd)

mi estimate: mean zread if female==1
mi xeq: summ zread if female==1
daniel klein

Join Date: Mar 2014

Posts: 3859
#12

19 Jan 2017, 12:44

Paul, the problem with the current approach is that the z* variables are super-varying and should therefore not be registered at all. They should also be stored in flong or flongsep style. I did not make this clear in my first answer. Here is your example, modified:

Code:

use "http://www.ats.ucla.edu/stat/stata/seminars/missing_data/Multiple_imputation/hsb2_mar", clear
mi set flong // not wide or mlong
mi register imputed read write female
mi impute mvn read write female, add(3)
gen zread = . // do not register zread
mi xeq: summ read if female==1; return list; replace zread = (read - r(mean)) / r(sd)
mi estimate: mean zread if female==1
mi xeq: summ zread if female==1
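The same pattern can be tried without downloading anything. The following is a sketch only, on Stata's shipped auto data: rep78 plays the role of read, foreign == 0 plays the role of the reference group, and the imputation model and seed are arbitrary assumptions chosen for illustration.

```stata
* Offline sketch of per-dataset standardization in flong style
sysuse auto, clear
mi set flong
mi register imputed rep78
mi impute regress rep78 price mpg, add(3) rseed(12345)

* zrep is deliberately left unregistered, following the advice above
generate zrep = .

* In each imputed dataset, standardize around that dataset's own
* mean and SD of rep78 among domestic cars
mi xeq 1/3: summarize rep78 if foreign == 0; replace zrep = (rep78 - r(mean)) / r(sd)

* Check: within each imputed dataset, zrep for domestic cars
* should have mean 0 and SD 1 (up to rounding)
mi xeq 1/3: summarize zrep if foreign == 0
```

Restricting mi xeq to 1/3 leaves zrep missing in m=0, which does not matter for estimation on the imputed datasets.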

I do agree that Stata's mi suite seems clumsy at times and that it makes some things harder to do. On the other hand, the whole business of registering variables and using extra commands can help avoid mistakes that are easily made without noticing.

Best
Daniel
paulvonhippel

Join Date: Apr 2014

Posts: 502
#13

19 Jan 2017, 13:35

Thank you. I'm not sure I get the idea of a "super-varying" variable. From the documentation: "in the wide and mlong styles, there is simply no place to store super-varying values." How is that true here? In wide format there's a column called zread, and columns called _1_zread, _2_zread, _3_zread. What's the problem? Why does it matter if zread != _1_zread in some rows?
daniel klein

Join Date: Mar 2014

Posts: 3859
#14

20 Jan 2017, 00:41

Stata considers it a likely error if a registered variable differs across the M datasets in the complete observations. This makes a lot of sense, since we would normally not expect this to happen.

Concerning mlong style, the problem is that you cannot store the M different values that arise from standardizing using M means and their respective SDs in the complete observations, because these are not repeated. Hence, only one value per variable can be stored for the complete cases. For the wide style I think you are raising a valid point. Here the complete cases are repeated and could be assigned different values. It would be interesting to hear from StataCorp why this should pose a problem.

[EDIT]
It is likely an error in the documentation. Technically, it seems that there is no problem. You can stay in wide style. The important thing is not to register the zread variable. Watch:

Code:

use "http://www.ats.ucla.edu/stat/stata/seminars/missing_data/Multiple_imputation/hsb2_mar", clear
mi set wide
mi register imputed read write female
mi impute mvn read write female, add(3)
gen zread = . // do not register zread
mi xeq: summ read if female!=0; return list; replace zread = (read - r(mean)) / r(sd)
mi estimate: mean zread if female!=0
mi xeq: summ zread if female!=0

I have replaced female==1 with female!=0 because female has missing values that are imputed using the multivariate normal approach, which results in values other than 0 and 1. This should not make much of a difference for the example.
[/EDIT]

Best
Daniel

Last edited by daniel klein; 20 Jan 2017, 00:47.
