Normal scores transformation

nasermakarem

Join Date: Aug 2014

Posts: 3
#1

18 Aug 2014, 11:19

Hello,
I need to do a normal scores transformation (Van der Waerden), but I cannot find the command. Is it doable in Stata?
ben earnhart

Join Date: May 2014

Posts: 1027
#2

18 Aug 2014, 11:28

Dunno Van der Waerden, but try:

Code:

egen newvar=std(oldvar)
nasermakarem

Join Date: Aug 2014

Posts: 3
#3

18 Aug 2014, 11:36

That was super quick! Thank you so much Ben!
ben earnhart

Join Date: May 2014

Posts: 1027
#4

18 Aug 2014, 11:39

BTW: try help egen. You will see a list of some of the most important and useful Stata functions.
Maarten Buis

Join Date: Mar 2014

Posts: 3445
#5

18 Aug 2014, 12:49

The egen function std() won't give you the normal scores transformation for the Van der Waerden test (http://en.wikipedia.org/wiki/Van_der_Waerden_test).

Here is an example of how to compute these values:

Code:

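sysuse auto, clear
// flag observations with a missing value on either variable
gen byte miss = missing(rep78, price)
// rank price within each category of rep78, complete cases only
bysort miss rep78 : egen rank = rank(price) if miss == 0
// turn ranks into plotting positions strictly between 0 and 1
by miss rep78 : gen pp = rank / ( _N + 1 )
// normal scores: standard normal quantiles of the plotting positions
gen normscore = invnormal(pp)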
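
A hedged aside: the example above ranks price within each category of rep78. The test described on the linked Wikipedia page ranks the pooled sample instead and then compares the mean scores across groups. A minimal sketch of that pooled variant, assuming the same auto data; the names rank_all, pp_all and vdw are made up for this illustration:

```stata
sysuse auto, clear
// rank price over all complete cases, ignoring the grouping variable
egen rank_all = rank(price) if !missing(price, rep78)
// number of complete cases, used as N in rank/(N + 1)
quietly count if !missing(rank_all)
scalar n_used = r(N)
gen pp_all = rank_all / (n_used + 1)
// pooled normal scores
gen vdw = invnormal(pp_all)
// compare the mean scores across the categories of rep78
tabstat vdw, by(rep78) statistics(n mean)
```

Which variant is appropriate depends on what the scores are used for; this is a sketch, not a full implementation of the test.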

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
ben earnhart

Join Date: May 2014

Posts: 1027
#6

18 Aug 2014, 13:19

Ah, so I stand corrected. If I understand things right, *if* his variable were normally distributed, then egen newvar=std(oldvar) would work. But if it's not normally distributed, then he needs to use the more complex approach.

Maarten Buis

Join Date: Mar 2014

Posts: 3445
#7

19 Aug 2014, 01:23

Unfortunately, not quite. Below I created a variable by drawing from a normal distribution, but our results are somewhat different. The difference is that normal scores are forced to follow a normal distribution, while standardized values maintain the randomness (including the deviations) that occur when you draw random samples. Which one is right depends on what nasermakarem wants to do with this variable.

Code:

. // create some example data
. clear

. set seed 123

. set obs 10
obs was 0, now 10

. gen x = rnormal()

.
. egen ben = std(x)

. egen rank = rank(x)

. gen pp = rank/( _N + 1 )

. gen maarten = invnormal(pp)

. drop rank pp

. list

     +-----------------------------------+
     |         x         ben     maarten |
     |-----------------------------------|
  1. |   2.08619     1.56843    1.335178 |
  2. | -.3528706   -.6759967   -.6045853 |
  3. |  .3006571   -.0746197   -.1141853 |
  4. |  .8069299    .3912531    .3487557 |
  5. |  .1201693   -.2407048   -.3487557 |
     |-----------------------------------|
  6. |  .9289025    .5034924    .6045854 |
  7. |  1.670093    1.185537    .9084579 |
  8. | -1.552079   -1.779509   -1.335178 |
  9. |  .5268717    .1335432    .1141853 |
 10. | -.7173868   -1.011425   -.9084579 |
     +-----------------------------------+
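
As a quick check of the arithmetic: the largest of the 10 values has rank 10, so its plotting position is 10/11, and the smallest has rank 1, so its plotting position is 1/11. A small sketch (the scalar names are made up for this illustration):

```stata
// normal scores for the largest and smallest of 10 observations
scalar top    = invnormal(10/11)
scalar bottom = invnormal(1/11)
display top
display bottom
```

These should agree with the 1.335178 and -1.335178 listed for observations 1 and 8 in the maarten column. The scores depend only on the ranks and the sample size, not on the actual values of x.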


nasermakarem

Join Date: Aug 2014

Posts: 3
#8

19 Aug 2014, 02:28

I really appreciate your input, pals. Let me give you the whole story so that you might be able to help me better. I am dealing with a huge dataset with ~24000 observations. I need to run a regression, but diagnostics indicate that I have heteroscedasticity, autocorrelation, non-normality, and non-linearity. To solve the last two, I suppose I need to transform the data, and as I have negative and zero values, I think the right transformation is normal scores. Now, does Ben's approach suit me, or Maarten's?
Maarten Buis

Join Date: Mar 2014

Posts: 3445
#9

19 Aug 2014, 06:27

Non-normality refers to non-normality of the residuals, not non-normality of the marginal distribution of the dependent variable. So neither transformation will solve your problem:

Ben's won't change the shape of the distribution at all; it is just a linear transformation of the existing variable. As a consequence it will only change the mean and standard deviation, but not the underlying distribution.

Mine changes the marginal distribution of the dependent variable to a normal distribution, but what you want to be normal is the residuals.
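
A minimal sketch of both points, assuming the auto data and a made-up model (price on weight and mpg); the variable and scalar names are for illustration only. The first part shows that standardizing leaves the skewness untouched, the second part pulls out the residuals, which are what the normality assumption is about:

```stata
sysuse auto, clear

// standardizing is a linear rescaling: mean 0, sd 1, same shape
egen double zprice = std(price)
quietly summarize price, detail
scalar skew_raw = r(skewness)
quietly summarize zprice, detail
scalar skew_std = r(skewness)
scalar mean_std = r(mean)
scalar sd_std   = r(sd)
display skew_raw "  " skew_std

// the normality assumption concerns these residuals, not price itself
regress price weight mpg
predict double res, residuals
summarize res, detail
```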

Moreover, with that many observations you don't need to worry about the normality of the residuals, e.g.: http://www.talkstats.com/showthread....ht=#post155817

Non-linearity of the effect is a problem with a much higher priority than normality of the residuals, but I often find it easier to address that with transformations of the independent variables (I like linear splines, see help mkspline, but there are many alternatives).
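
A minimal sketch of a linear spline, assuming the auto data; the knots at 2500 and 3500 and the names w1, w2 and w3 are arbitrary choices for illustration, not a recommendation:

```stata
sysuse auto, clear
// split weight into three linear pieces with knots at 2500 and 3500
mkspline w1 2500 w2 3500 w3 = weight
// each coefficient is the slope of price on weight within its segment
regress price w1 w2 w3
```

Without the marginal option the pieces add up to the original variable, so the coefficients are segment-specific slopes rather than changes in slope.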

Once you have solved the non-linearity problem, you should inspect the residuals again for possible heteroscedasticity and autocorrelation.
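
A hedged sketch of what that re-inspection could look like, using example datasets that ship with Stata; the models are made up for illustration. The autocorrelation test needs data that have been declared as time series with tsset, so it is shown on a different dataset:

```stata
// heteroscedasticity: Breusch-Pagan / Cook-Weisberg test after the spline model
sysuse auto, clear
mkspline w1 2500 w2 3500 w3 = weight
regress price w1 w2 w3 mpg
estat hettest
scalar bp_p = r(p)
// one common response: keep the model, use robust standard errors
regress price w1 w2 w3 mpg, vce(robust)

// autocorrelation: Breusch-Godfrey test, requires tsset data
sysuse uslifeexp, clear
tsset year
regress le year
estat bgodfrey
```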

