  • Egen calculation, means/sd

    Hi everyone,

    I'm trying to manually calculate a value for each household in a survey. Using three dummy variables in the dataset (X, Y, Z), I'd like to generate a new variable that takes on the following value for each household:

    Code:
    egen index = 0.02*((X-mean(X))/sd(X)) + 0.05*((Y-mean(Y))/sd(Y)) + 0.04*((Z-mean(Z))/sd(Z))
    where 0.02, 0.05, and 0.04 are constants I've defined; X, Y, and Z are either 0 or 1, depending on the household; and the means and standard deviation should be the weighted ones obtained using sum with analytic weights:

    Code:
    summarize X Y Z [aw=wgt]
    However, the egen code returns "unknown function" errors.

    Thank you.


    EDIT:

    As an aside, the reason that I’m doing this is because I used factor analysis on an older dataset to calculate the first principal components (which are the 0.02, 0.05, and 0.04).

    I'm now trying to calculate index scores (which is the equation above) for households in a more recent survey, but using the principal components I obtained for the older survey. In the older survey I simply used the "predict" function after having run the PCA analysis to calculate the scores.

    Please let me know if there is an easier solution to this problem. I'm trying to avoid having to merge the two files, which would be a bit tedious given how differently the datasets are organized.
    Last edited by Olivier Hoya; 26 Aug 2016, 11:01.

  • #2
    Unlike -gen-, -egen- does not handle arbitrary complex expressions to the right of the equal sign. You have to make separate variables with the means and sds of X, Y, and Z, first, and then combine them. Well, actually, it's a bit simpler than that because you don't really need the means and sds, as you use them only to do standardization to mean 0, sd 1. There's a separate -egen- function for that. Try this:

    Code:
    foreach v of varlist X Y Z {
        egen `v'_std = std(`v') // NOTE std(), NOT sd()
    }
    gen index = 0.02*X_std + 0.05*Y_std + 0.04*Z_std // NOTE gen, NOT egen



    • #3
      Thank you so much for the input, which has gotten me quite a bit along the way. I adapted your code slightly to

      Code:
      foreach v of global <macro name here> {
                  ...
      }
      Since I've defined the variables in a global macro. However, a small issue remains: the mean, sd, and std values that egen derives are not weighted.

      For example,

      Code:
      egen X_mean = mean(X)
      generates the mean value from

      Code:
      mean X
      as opposed to the weighted mean from

      Code:
      summarize X [aw=wgt]
      Similarly, the std() function in egen gives me the standardized value based on the unweighted mean/sd, not the weighted ones. Any solution to this issue?

      Thank you.
      Last edited by Olivier Hoya; 26 Aug 2016, 11:50.



      • #4
        The question is morphing visibly. In #1 there were 3 variables; now all of a sudden there are 20-odd. No matter. Your main loop is something like


        Code:
        quietly foreach v of global whatever { 
             su `v' [aw=wgt] 
             gen `v'_std = (`v' - r(mean)) / r(sd) 
        }
        followed by whatever it is that you want to do (I've lost track of that).
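
        For the three-variable version in #1, the whole thing might look like the sketch below. It assumes X, Y, Z and wgt are already in memory and uses the constants quoted in #1; the local macro name vars is only illustrative.

        Code:
        * Illustrative local macro holding the variable list (name is an assumption)
        local vars X Y Z

        quietly foreach v of local vars {
            * -summarize- with aweights leaves the weighted mean and sd in r()
            summarize `v' [aw=wgt]
            generate `v'_std = (`v' - r(mean)) / r(sd)
        }

        * Combine the standardized variables with the constants from #1
        generate index = 0.02*X_std + 0.05*Y_std + 0.04*Z_std
        With 20-odd variables the last line simply gets longer, one term per variable.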



        • #5
          While we're at it, I'll add my customary plea against the use of global macros here.

          When you define or modify a global macro, that action reverberates through every program currently in memory that uses a global macro of the same name. If that other program was depending on the contents of that global macro being whatever it was before your action, then you have broken that program. The consequences of that are potentially far-reaching and such bugs are almost inevitably difficult to track down and fix. Similarly if one of those other programs chooses to modify its global macro of that name, that change will appear in your current program as well, without any warning, until something bizarre happens as a result. And it is very hard to know why or how that happened. Moreover, unless you stop and run -program dir-, you won't even know what other programs are in memory that might cause the problem!

          For this reason, it is best to avoid global macros as much as possible. Local macros do not suffer this problem because a local macro of a given name in one program or do file or command-window session has nothing to do with similarly named local macros elsewhere. They are, in a word, local!

          People offer two common reasons for preferring global macros and using them for routine tasks like holding lists of variables to analyze in a loop:

          1. The do file is long, the global macro is defined in one place, but it is used in another place far away, and you don't want to have to run all the code in between when you are developing/testing your code.

          2. There is a list of variables, or other parameters of your analysis, that you want to use in several do-files of a project and you want to be sure that the list is defined exactly the same way in each.

          The response to #2 is easy: use the -include- command. It's one of the niftiest, yet least-known features of Stata.
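
          A minimal sketch of that approach, assuming a hypothetical shared file called project_varlists.do and an illustrative macro name indexvars. Unlike -do- or -run-, -include- executes the file as if its lines were typed into the calling do-file, so local macros defined there remain visible afterwards.

          Code:
          * --- contents of project_varlists.do (hypothetical file name) ---
          local indexvars X Y Z

          * --- near the top of each analysis do-file in the project ---
          include project_varlists.do

          * The local defined in the included file is available here
          foreach v of local indexvars {
              summarize `v' [aw=wgt]
          }
          Changing the list then means editing one file, and every do-file that includes it picks up the same definition.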

          The response to #1 has several parts:

          a) If your do-file is so long that this is really a problem, your do-file probably needs to be restructured anyway. Break your problem up into shorter segments in separate do-files and have a top-level do-file that calls them in proper sequence. This is better programming practice in a number of respects.

          b) Maybe the do-file isn't physically that long, but it includes some very computationally intensive code in the middle that takes a long time to run. Fair enough. If you're in the developmental/testing stages, just comment out the slow stuff. As long as it doesn't redefine the macros involved, that will cause no problems.

          c) If a) really doesn't appeal to you and b) doesn't appeal or doesn't apply, you can always just copy the local macro definitions over just before the place you want to use them.

          There are some situations where a global macro is really needed and a local just won't do, precisely because its permanence and ability to access it from several different places is critical. But such situations are uncommon. I've been using Stata since 1994 and have only had to resort to global macros a handful of times.



          • #6
            Dear Nick and Clyde, thank you very much for the assistance.

            The solution proposed by Nick worked perfectly. And you are right that the question did change slightly -- I initially thought it was easier to describe the problem in terms of three variables (X, Y and Z), since the same solution applied whether it was three or twenty variables. Sorry about any confusion.

            And thank you, Clyde, for the very helpful information about global vs local macros. I will be sure to use local macros in the future to avoid any memory problems.
