  • Standardize around mean of subset

    I have two variables, score and year, representing test scores for year=2005,...,2013. If I want to standardize the test scores, the simplest way to do it is
    egen zscore = std(score)
    but that standardizes them using the mean and SD of all years together. I would prefer to standardize them using the mean and SD of the first year, 2005. What's the simplest way to do that?

    Bonus round: my data are multiply imputed. Is there still a simple alternative to this?
    mi passive: egen zscore = std(score)

  • #2
    Wait, the simple version of my question, but not the mi version, is answered here: http://www.stata.com/statalist/archi.../msg00010.html
    Without mi, the answer looks like this:
    su score if year==2005
    egen zscore = (score - r(mean)) / r(sd)

    In mi data, I can standardize using the mean and sd of a single imputed dataset like this.
    mi xeq: su score if year==2005
    mi passive: egen zscore = (score - r(mean)) / r(sd)
    But standardizing around the mean and SD of the multiple imputations is trickier. Suggestions appreciated.



    • #3
      su score if year==2005
      egen zscore = (score - r(mean)) / r(sd)
      In the second line use generate instead of egen. What is really interesting is the question you pose.
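      A minimal sketch of the corrected non-mi version, run on the auto dataset that ships with Stata as a stand-in (price and foreign are that dataset's variables, not the poster's score and year):

      ```stata
      * Stand-in data: the auto dataset shipped with Stata (not the poster's data)
      sysuse auto, clear

      * Mean and SD of the reference subset only (here: domestic cars)
      summarize price if foreign == 0

      * generate, not egen: egen expects one of its own functions, such as std(),
      * on the right-hand side
      generate zprice = (price - r(mean)) / r(sd)

      * Within the reference subset the new variable should have mean 0 and SD 1
      summarize zprice if foreign == 0
      ```

      The second summarize is only a check; the other group (foreign == 1) will generally not have mean 0 or SD 1.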

      In mi data, I can standardize using the mean and sd of a single imputed dataset like this.
      mi xeq: su score if year==2005
      mi passive: [e]gen zscore = (score - r(mean)) / r(sd)
      But standardizing around the mean and SD of the multiple imputations is trickier. Suggestions appreciated.
      Technically, this will not work, because r(mean) and r(sd) will contain the respective values from the last imputed dataset only. Also the non-imputed dataset is used by mi xeq if not specified otherwise. I do not think this is what you want. I am not sure what you want, however, and I am not sure if it would be valid either. I hope to learn something here, too.

      You could say you want a point estimate of the MI mean and the MI SD. Using Rubin's rules it is easy to get those. The mean will be given by a simple summarize, since the mean of the M means is just the mean over the n x M observations. Assuming that the SD is approximately normal, we can calculate the mean of the M dataset-specific SDs as

      Code:
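      * M = number of imputations
      mi query
      local M = r(M)
      * accumulate the SD from each imputed dataset (m=1,...,M; m=0 is skipped)
      scalar sd_mi = 0
      mi xeq 1/`M' : summarize score if (year == 2005) ; scalar sd_mi = sd_mi + r(sd)
      * average of the M dataset-specific SDs
      scalar sd_mi = sd_mi/`M'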
      However, is this really the way we would standardize the variables? Think about the way the combined estimation results will be obtained later. You would run the estimation command on each of the imputed datasets, then combine coefficients and standard errors. If we use the procedure outlined above, then for each individual estimation the coefficients will not be based on a standardized (mean 0, unit SD) variable. Should this not be the case? Intuitively I would standardize the variable using the dataset-specific mean and SD

      Code:
      mi query
      local M = r(M)
      generate zscore = .
      mi xeq 1/`M' : summarize score if (year == 2005) ; replace zscore = (score-r(mean))/r(sd)
      (Note that it would be much harder to replace the original score variable. Just do not do it and create a new one, as shown above. Or ask for more details if interested in replacing the original.)
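
      For reference, the two candidate standardizations in this post can be written side by side. The notation is introduced here only for illustration and does not appear in the thread: x_i^{(m)} is the score of observation i in imputed dataset m, and \bar{x}_m and s_m are the mean and SD of score among the year 2005 observations in that dataset.

      ```latex
      % Approach 1: one averaged mean and one averaged SD shared by all M datasets
      \[
      \bar{x}_{\mathrm{MI}} = \frac{1}{M}\sum_{m=1}^{M} \bar{x}_m, \qquad
      s_{\mathrm{MI}} = \frac{1}{M}\sum_{m=1}^{M} s_m, \qquad
      z_i^{(m)} = \frac{x_i^{(m)} - \bar{x}_{\mathrm{MI}}}{s_{\mathrm{MI}}}
      \]

      % Approach 2: each dataset standardized by its own mean and SD
      \[
      z_i^{(m)} = \frac{x_i^{(m)} - \bar{x}_m}{s_m}
      \]
      ```

      Only under the second version does the year 2005 subset have mean 0 and SD 1 within every imputed dataset.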

      What do you think?

      Best
      Daniel
      Last edited by daniel klein; 18 Jan 2017, 22:55.



      • #4
        Your idea of standardizing each imputed dataset separately makes a lot of sense. Then how about this?

        mi xeq: summarize score if year==2005; replace zscore = (score - r(mean)) / r(sd)

        This is working for me. By default mi xeq iterates through each dataset, starting with the non-imputed dataset (m=0) and continuing through all the imputed datasets (m=1,2,....). It doesn't just run on m=0 as you implied earlier. I bring this up only because it explains why my code above works.



        • #5
          Wait, I'm sorry. My solution didn't quite work:

          mi xeq: summarize score if year==2005; replace zscore = (score - r(mean)) / r(sd)

          It appears the values of r(mean) and r(sd) are not being refreshed each time through the loop.

          Hm....



          • #6
            I did phrase this one poorly. I did not want to imply that mi xeq uses only the m=0 dataset, but that it includes this dataset. This is relevant when you are adding scalars, as in the first approach that I have shown, where you want only the values from the complete datasets. With your code, it does not matter whether m=0 is used or not (aside from a few seconds that could be saved by skipping it).

            Anyway, what makes you think that your code does not work? Can you show the complete code (including the generate command that you use in the beginning) and (part of the) output that leads to your conclusion?

            Best
            Daniel



            • #7
              OK, now I think I've got this. I was wrong to say before that the values of r(mean) and r(sd) do not refresh each time through the loop. They do. My mistake was that I hadn't registered the standardized variable as passive. Once I do that, the solution works.

              Let me illustrate with a reproducible example. The HSB data are available on the web:

              use http://www.ats.ucla.edu/stat/stata/s...ation/hsb2_mar, clear

              They contain incomplete reading and math scores for boys and girls. I impute them as follows:

              mi set wide
              mi register imputed read write female
              mi impute mvn read write female, add(3)


              Now I want to standardize the variable read around the mean for girls. The standardized variable will be called zread. The first thing I need to do is register zread as passive. But I can't do that because it doesn't exist yet. So I simultaneously initialize it and register it as passive, like this:

              mi passive: gen zread = .

              This is the step that I missed before. It seems a little convoluted and I wonder if there is a way around it. If you run describe, you can see that in addition to a variable named zread, there are now variables named _1_zread, _2_zread, _3_zread, just waiting to be filled with passively imputed values. Now the rest of my solution works:

              mi xeq: summ read if female==0; return list; gen zread = (read - r(mean)) / r(sd)

              The "return list" verifies that the values of r(mean) and r(sd) refresh each time through the loop. I was wrong before to say they don't.

              To verify my solution, I separately check the mean of the standardized variable for boys and for girls. For girls the mean is 0; for boys it is -0.03. That's what I expected.

              mi estimate: mean zread if female==0
              mi estimate: mean zread if female==1


              • #8
                While I now have a working solution, I continue to think that all the mi register and mi xeq business adds an unnecessary layer of complexity, which frequently trips me up.

                The imputation tools in SAS work quite nicely without that layer. This is one of the few situations where SAS is simpler and more elegant than Stata.



                • #9
                  Here's a minor correction to my example. The mi xeq line should use replace instead of gen, since the variable zread was previously generated. The corrected code follows:

                  use http://www.ats.ucla.edu/stat/stata/s...ation/hsb2_mar, clear
                  mi set wide
                  mi register imputed read write female
                  mi impute mvn read write female, add(3)

                  mi passive: gen zread = .
                  mi xeq: summ read if female==0; return list; replace zread = (read - r(mean)) / r(sd)
                  mi estimate: mean zread if female==0
                  mi estimate: mean zread if female==1



                  • #10
                    Unfortunately, my solution is still not quite right. When I run the following, the last two lines show that the standardized variable does not have a mean of 0 for girls, as it should. It has a mean of 0 in the non-imputed dataset (m=0) but not in the others. I'm not sure what's going on here. I swear this was working a minute ago.

                    use http://www.ats.ucla.edu/stat/stata/s...ation/hsb2_mar, clear
                    mi set wide
                    mi register imputed read write female
                    mi impute mvn read write female, add(3)

                    mi passive: gen zread = .
                    mi xeq: summ read if female==1; replace zread = (read - r(mean)) / r(sd)

                    mi estimate: mean zread if female==1
                    mi xeq: summ zread if female==1





                    • #11
                      Unfortunately, my solution is still not quite right. When I run the following, the last two lines show that the standardized variable does not have a mean of 0 for girls, as it should. It has a mean of 0 in the non-imputed dataset (m=0) but not in the others. I'm not sure what's going on here. I swear this was working a minute ago.

                      use "http://www.ats.ucla.edu/stat/stata/seminars/missing_data/Multiple_imputation/hsb2_mar", clear
                      mi set wide
                      mi register imputed read write female
                      mi impute mvn read write female, add(3)

                      mi passive: gen zread = .
                      mi xeq: summ read if female==1; return list; replace zread = (read - r(mean)) / r(sd)

                      mi estimate: mean zread if female==1
                      mi xeq: summ zread if female==1



                      • #12
                        Paul, the problem with the current approach is that the z* variables are super-varying and should therefore not be registered at all. They should also be stored in flong or flongsep style. I did not make this clear in my first answer. Here is your example, modified:

                        Code:
                        use "http://www.ats.ucla.edu/stat/stata/seminars/missing_data/Multiple_imputation/hsb2_mar", clear
                        mi set flong // not wide or mlong
                        mi register imputed read write female
                        mi impute mvn read write female, add(3)
                        
                        gen zread = . // do not register zread
                        mi xeq: summ read if female==1; return list; replace zread = (read - r(mean)) / r(sd)
                        
                        mi estimate: mean zread if female==1
                        mi xeq: summ zread if female==1
                        I do agree that Stata's mi suite seems clumsy at times and that it makes some things harder to do. On the other hand, all the registering and the extra commands can help avoid mistakes that are otherwise easily made without noticing.
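
                        For anyone who cannot reach the UCLA file, here is a self-contained sketch of the same steps on simulated data. Everything in it is an assumption made for illustration: the simulated values, the seed, and the use of mi impute regress with a fully observed female in place of mi impute mvn.

                        ```stata
                        * Simulated stand-in data; all names and numbers are hypothetical
                        clear
                        set seed 2017
                        set obs 200
                        generate byte female = runiform() < 0.5
                        generate read = rnormal(50, 10) + 2*female
                        replace read = . if runiform() < 0.2

                        * flong style keeps a full copy of the data for every m
                        mi set flong
                        mi register imputed read
                        mi impute regress read female, add(3) rseed(2017)

                        * zread is deliberately left unregistered
                        generate zread = .
                        mi xeq: summarize read if female == 1; replace zread = (read - r(mean)) / r(sd)

                        * Check: mean and SD of zread among female == 1 within each imputed dataset
                        mi xeq 1/3: summarize zread if female == 1
                        ```

                        If the approach behaves as described in this post, each of the three summaries should report a mean of about 0 and an SD of about 1.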

                        Best
                        Daniel



                        • #13
                          Thank you. I'm not sure I get the idea of a "super-varying" variable. From the documentation: "in the wide and mlong styles, there is simply no place to store super-varying values." How is that true here? In wide format there's a column called zread, and columns called _1_zread, _2_zread, _3_zread. What's the problem? Why does it matter if zread != _1_zread in some rows?



                          • #14
                            Stata considers it a likely error if a registered variable differs across the M datasets in the complete observations. This makes a lot of sense, since we would normally not expect this to happen.

                            Concerning mlong style, the problem is that you cannot store the M different values that arise from standardizing with M means and their respective SDs in the complete observations, because these observations are not repeated. Hence, only one value per variable can be stored for the complete cases. For the wide style, I think you are raising a valid point. Here the complete cases are repeated and could be assigned different values. It would be interesting to hear from StataCorp why this should pose a problem.

                            [EDIT]
                            It is likely an error in the documentation. Technically, it seems that there is no problem. You can stay in wide style. The important thing is not to register the zread variable. Watch

                            Code:
                            use "http://www.ats.ucla.edu/stat/stata/seminars/missing_data/Multiple_imputation/hsb2_mar", clear
                            mi set wide
                            mi register imputed read write female
                            mi impute mvn read write female, add(3)
                            
                            gen zread = . // do not register zread
                            mi xeq: summ read if female!=0; return list; replace zread = (read - r(mean)) / r(sd)
                            
                            mi estimate: mean zread if female!=0
                            mi xeq: summ zread if female!=0
                            I have replaced female==1 with female!=0 because female has missing values that are imputed using the multivariate normal approach, which results in values other than 0 and 1. This should not make much of a difference for the example.
                            [/EDIT]


                            Best
                            Daniel
                            Last edited by daniel klein; 20 Jan 2017, 00:47.
