Z-Transformation for Population Data

Stephan Krayter

Join Date: Dec 2018

Posts: 8
#1

20 Aug 2020, 05:59

In Excel, the standard deviation used for a z-transformation can be calculated in two ways: with the formula for sample data or with the formula for population data. The difference is whether the sum of squared deviations is divided by N-1 (sample) or by N (population). The two formulas give slightly different standard deviations and therefore slightly different z-scores.

In Stata, egen newvar = std(oldvar) treats the data as a sample. I found this out by comparing both Excel versions with the Stata result: the Stata z-transformation matches the Excel z-transformation for sample data. However, I can't find a way to change this with the options of the egen command. For my purposes, I need the z-transformation for population data. A workaround would be to import the values calculated in Excel into Stata, or to live with the small differences, but neither is an ideal solution.

Maybe someone knows a solution for my problem. I would appreciate the help.

Best, Stephan

Nick Cox

Join Date: Mar 2014
Posts: 35698

20 Aug 2020, 07:09

There is no such option in egen. But the sensible alternative really isn't relying on another program to do it.

You could clone the code for std() and write your own egen function (under a different name).

Or you could just calculate your own correction factor. Here is a way to do that.

Code:

. sysuse auto , clear
(1978 Automobile Data)

. egen z = std(mpg), by(rep78)

. egen count = count(mpg), by(rep78)

. gen z_pop = z * sqrt(count / (count - 1))

. 
. egen sd = sd(mpg), by(rep78) 

. gen sd_pop = sd * sqrt((count - 1) / count)

. tabdisp rep78, c(count sd sd_pop)

----------------------------------------------
Repair    |
Record    |
1978      |      count          sd      sd_pop
----------+-----------------------------------
        1 |          2     4.24264           3
        2 |          8    3.758324     3.51559
        3 |         30    4.141325    4.071718
        4 |         18     4.93487    4.795831
        5 |         11    8.732385    8.326002
        . |          5     5.07937    4.543127
----------------------------------------------

. 
. 
. list mpg if rep78 == 1

     +-----+
     | mpg |
     |-----|
 40. |  24 |
 48. |  18 |
     +-----+

. list mpg count z* sd* if rep78 == 1 

     +----------------------------------------------------+
     | mpg   count           z   z_pop        sd   sd_pop |
     |----------------------------------------------------|
 40. |  24       2    .7071068       1   4.24264        3 |
 48. |  18       2   -.7071068      -1   4.24264        3 |
     +----------------------------------------------------+

.

The last example allows a check. In the auto data there are two observations for repair record 1. So by mental arithmetic, the mean of 18 and 24 is 21, each deviation has absolute value 3, and so the root mean square deviation is just 3 (dividing by the number of observations, 2, whether you call that the sample size or the population size). Hence each standardized value is 1 or -1. That mental arithmetic result matches the result of the two-line correction earlier.
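
The correction factor used above follows directly from the two definitions. As a short derivation, writing n for the number of non-missing values in a group, x-bar for their mean, s for the sample standard deviation and sigma for the population standard deviation (this notation is chosen here for illustration and is not Stata's):

```latex
\[
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}
       = s\,\sqrt{\frac{n-1}{n}},
\]
\[
z_{\mathrm{pop}} = \frac{x_i-\bar{x}}{\sigma}
                 = \frac{x_i-\bar{x}}{s}\,\sqrt{\frac{n}{n-1}}
                 = z\,\sqrt{\frac{n}{n-1}}.
\]
```

For repair record 1, n = 2, so the factor is the square root of 2, and .7071068 times that is 1, as in the listing.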

Note that it is a good idea to use count() here, as it counts only non-missing values, which is what you want.
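
If you would rather compute the population z-score directly than rescale the output of std(), something along the following lines should work. This is only a minimal sketch on the auto data: the variable names m, s, n, z_pop_all and z_pop_grp are made up for illustration, and it relies on summarize leaving the mean, sample SD and non-missing count in r(mean), r(sd) and r(N).

```stata
sysuse auto, clear

* Ungrouped case: summarize stores the mean, the sample SD (N-1 divisor)
* and the number of non-missing values in r()
quietly summarize mpg
generate z_pop_all = (mpg - r(mean)) / (r(sd) * sqrt((r(N) - 1) / r(N)))

* Grouped case: build mean, sample SD and non-missing count per group,
* then divide the deviation by the population SD
egen double m = mean(mpg), by(rep78)
egen double s = sd(mpg), by(rep78)
egen n = count(mpg), by(rep78)
generate z_pop_grp = (mpg - m) / (s * sqrt((n - 1) / n))

* Check against the mental arithmetic: the values should be 1 and -1
list mpg n z_pop_grp if rep78 == 1
```

In a group with only one non-missing value the sample SD is missing, so the z-score comes out missing as well, which seems the right answer given that the population SD would be zero there.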

Stephan Krayter

Join Date: Dec 2018

Posts: 8
#3

03 Sep 2020, 03:10

Sorry for getting back to you so late, but thank you very much. That helped a lot.