
  • I am analyzing data from 30 districts, focusing on 8 questions (c01 to c08). Here is the dataset I am working on:

    district c01 c02 c03 c04 c05 c06 c07 c08 Population
    District 1 0 1 0 0 0 1 0 0 892
    District 2 0 0 0 1 0 1 0 0 138
    District 3 0 1 0 1 0 0 0 1 923
    District 4 1 1 0 1 1 0 0 0 887
    District 5 0 1 1 0 1 1 0 0 514
    District 6 1 0 1 1 1 0 1 1 578
    District 7 0 0 0 1 1 0 1 0 393
    District 8 1 0 0 1 0 1 0 0 566
    District 9 0 0 0 0 1 0 0 1 514
    District 10 1 0 0 0 0 0 0 1 770
    District 11 0 0 0 0 1 1 1 1 207
    District 12 1 1 0 1 0 0 0 1 625
    District 13 0 1 0 0 0 1 1 0 550
    District 14 1 0 1 1 0 0 0 0 596
    District 15 1 1 0 0 1 0 0 1 250
    District 16 1 1 1 0 1 1 0 0 481
    District 17 1 1 1 0 1 0 1 0 553
    District 18 1 0 1 1 0 0 1 1 652
    District 19 1 0 0 0 0 1 0 0 503
    District 20 0 1 1 1 0 0 0 1 234
    District 21 1 1 0 1 0 1 0 1 883
    District 22 1 1 0 0 0 1 0 0 344
    District 23 1 1 0 0 1 0 0 0 238
    District 24 1 1 1 1 0 1 0 1 944
    District 25 1 0 1 1 1 0 1 0 940
    District 26 0 0 0 1 0 1 1 0 730
    District 27 0 0 0 0 1 1 0 0 890
    District 28 1 1 1 1 1 0 0 0 916
    District 29 0 0 0 1 0 0 1 1 959
    District 30 0 1 0 1 0 1 0 1 404
    Code:

    recode c01 (1=1 "1") (2=0 "0"), generate(binary_c01)
    recode c02 (1=1 "1") (2=0 "0"), generate(binary_c02)
    recode c03 (1=1 "1") (2=0 "0"), generate(binary_c03)
    recode c04 (1=1 "1") (2=0 "0"), generate(binary_c04)
    recode c05 (1=1 "1") (2=0 "0"), generate(binary_c05)
    recode c06 (1=1 "1") (2=0 "0"), generate(binary_c06)
    recode c07 (1=1 "1") (2=0 "0"), generate(binary_c07)
    recode c08 (1=1 "1") (2=0 "0"), generate(binary_c08)
    egen total_sum = rowtotal(binary_c01 binary_c02 binary_c03 binary_c04 binary_c05 binary_c06 binary_c07 binary_c08)
    gen R_Sum = total_sum / 8
    collapse (mean) R_Sum , by(district)
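
    The eight repetitive -recode- lines above can be written more compactly as a loop (a sketch using the same variable names):

    ```stata
    * Equivalent to the eight recode commands above, written as a loop
    foreach v of varlist c01-c08 {
        recode `v' (1=1 "1") (2=0 "0"), generate(binary_`v')
    }
    ```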
    I saved the dataset, which includes columns for district and R_Sum, then created an Excel file with an additional column for Population. This resulted in a dataset with three columns: district, R_Sum, and Population. I then imported the updated file into Stata; the commands below operate on this imported file.
    Code:
    gen aggregate_population = R_Sum * Population
    **Normalized R_Sum**
    egen R_Sum_min = min(R_Sum)
    egen R_Sum_max = max(R_Sum)
    gen R_Sum_normalized = (R_Sum - R_Sum_min) / (R_Sum_max - R_Sum_min)
    **Normalized aggregate_population**
    egen aggregate_population_min = min(aggregate_population)
    egen aggregate_population_max = max(aggregate_population)
    gen aggregate_population_normalized = (aggregate_population - aggregate_population_min) / (aggregate_population_max - aggregate_population_min)
    **Listing**
    list district R_Sum aggregate_population R_Sum_normalized aggregate_population_normalized
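
    As an aside, the same min-max normalization can be done without creating separate min/max variables, using the results that -summarize- leaves behind (a sketch with the same variable names; agg_pop_norm2 is just an alternative name to avoid clashing with the variable above):

    ```stata
    * Min-max normalization using r(min) and r(max) saved by -summarize-
    summarize aggregate_population
    gen agg_pop_norm2 = (aggregate_population - r(min)) / (r(max) - r(min))
    ```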
    The code runs without errors, and I observe the following:
    • One district has a normalized aggregate population value of 1.
    • One district has a normalized aggregate population value of 0.
    • The remaining districts have values between 0 and 1 after normalization.
    I understand that normalization should result in values between 0 and 1, but here one district is exactly 1 and another is exactly 0. I also want to confirm whether this is the correct approach for normalizing aggregate population values.

    Any advice or improvements to my approach would be greatly appreciated.

    Thank you for your assistance

  • #2
    Is it possible to analyze this dataset using the z-score method instead of normalization? And what if a district's value turns out to be greater than 2? Clyde Schechter
    Last edited by aafaque ali; 14 Jul 2024, 05:48.



    • #3
      Re #1.
      I saved the dataset, which includes columns for district and R_Sum, then created an Excel file with an additional column for Population. This resulted in a dataset with three columns: district, R_Sum, and Population. I then imported the updated file into Stata; the commands below operate on this imported file.


      This is bad data management practice. It opens up too many opportunities for errors to creep in. Also, the work done within Excel is undocumentable and so you do not have an audit trail for your work. If you are just playing around for fun, I suppose that's fine. But if there is some serious purpose to your work and you intend for somebody to take the results seriously, this approach is not acceptable.

      Moreover, since you only have one observation for each district, the -collapse- command does not change the values of R_Sum: the mean of a single number is just that same number. So all you needed to do, instead of -collapse- followed by a diversion into undocumented data management in Excel, is -keep district R_Sum Population-.
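
      For example, the whole step can stay inside Stata, where the do-file documents it (the -merge- line is a hypothetical sketch for the case where Population sat in a separate file keyed on district):

      ```stata
      * Population is already in the original dataset, so no Excel round trip is needed:
      keep district R_Sum Population

      * If Population were instead stored in a separate file, -merge- keeps the
      * work in Stata (population.dta is a hypothetical file keyed on district):
      * merge 1:1 district using population.dta
      ```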

      Re #2.
      Yes, of course, you can do it using Z-scores instead.

      It makes no difference. The Z-score is just a re-scaled, re-located version of the normalized variable in your original approach. In fact, if you try it both ways and correlate the two variables, you will see that r = 1.0 exactly. So any further analysis done either way will produce the same results: regression coefficients will change by a scale factor, and the constant term will differ, but any substantive conclusions will be absolutely identical. The choice between them is simply a matter of taste. My own preference is the 0-1 normalization because I think it is easier for people to understand and interpret than the z-score, but reasonable people can differ about that. Look at the following to see a simple illustration of this:
      Code:
      . clear*
      
      .
      . sysuse auto
      (1978 automobile data)
      
      .
      . summ mpg
      
          Variable |        Obs        Mean    Std. dev.       Min        Max
      -------------+---------------------------------------------------------
               mpg |         74     21.2973    5.785503         12         41
      
      . gen mpg_norm = (mpg - `r(min)')/(`r(max)' - `r(min)')
      
      . gen mpg_z = (mpg - `r(mean)')/(`r(sd)')
      
      .
      . display (`r(max)'-`r(min)')/`r(sd)'
      5.0125285
      
      .
      .
      . graph twoway scatter mpg_norm mpg_z
      
      . corr mpg_norm mpg_z
      (obs=74)
      
                   | mpg_norm    mpg_z
      -------------+------------------
          mpg_norm |   1.0000
             mpg_z |   1.0000   1.0000
      
      
      .
      . regress price mpg_norm
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449477         1   139449477   Prob > F        =    0.0000
          Residual |   495615919        72  6883554.43   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
          mpg_norm |  -6927.936   1539.224    -4.50   0.000    -9996.323   -3859.549
             _cons |   8386.329    580.115    14.46   0.000      7229.89    9542.767
      ------------------------------------------------------------------------------
      
      . estimates store normalized
      
      . regress price mpg_z
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449471         1   139449471   Prob > F        =    0.0000
          Residual |   495615925        72  6883554.52   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             mpg_z |  -1382.124   307.0753    -4.50   0.000    -1994.268   -769.9805
             _cons |   6165.257   304.9935    20.21   0.000     5557.263     6773.25
      ------------------------------------------------------------------------------
      
      . estimates store zscores
      
      .
      . suest normalized zscores
      
      Simultaneous results for normalized, zscores                Number of obs = 74
      
      ----------------------------------------------------------------------------------
                       |               Robust
                       | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
      -----------------+----------------------------------------------------------------
      normalized_mean  |
              mpg_norm |  -6927.936   1655.377    -4.19   0.000    -10172.42   -3683.456
                 _cons |   8386.329   709.8317    11.81   0.000     6995.084    9777.573
      -----------------+----------------------------------------------------------------
      normalized_lnvar |
                 _cons |   15.74465   .2176038    72.35   0.000     15.31815    16.17114
      -----------------+----------------------------------------------------------------
      zscores_mean     |
                 mpg_z |  -1382.124   330.2479    -4.19   0.000    -2029.398   -734.8499
                 _cons |   6165.257   302.8973    20.35   0.000     5571.589    6758.924
      -----------------+----------------------------------------------------------------
      zscores_lnvar    |
                 _cons |   15.74465   .2176038    72.35   0.000     15.31815    16.17114
      ----------------------------------------------------------------------------------
      
      . display _b[normalized_mean:mpg_norm]/_b[zscores_mean:mpg_z]
      5.0125287
      Notice that the max − min difference is 5.0125285 times the standard deviation, and this same scale factor (apart from rounding error in the 7th decimal place) relates the coefficients when the two transformed mpg variables are used in the two regressions. Test statistics are the same either way.
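
      In symbols, with m and s the mean and standard deviation of x:

      ```latex
      x_{\text{norm}} = \frac{x - \min}{\max - \min}, \qquad
      z = \frac{x - m}{s}
      \quad\Longrightarrow\quad
      x_{\text{norm}} = \frac{s}{\max - \min}\,z + \frac{m - \min}{\max - \min},
      ```

      so the normalized variable is a linear function of z with positive slope, which is why their correlation is exactly 1 and the two regression coefficients differ only by the factor (max − min)/s.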

      In truth, if the original variable has units that people will generally recognize and understand, I don't do either of these transformations--I use the variable as is.

      As for what to do if you get a z-score > 2: nothing! Check the original data point to see if it is correct. Unless you have reason to believe the value is an error (in which case you should either fix it or replace it with a missing value), a z-score > 2 calls for no special handling at all.
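
      If you do want to flag observations with |z| > 2 for a data check, here is a minimal sketch (variable names follow the earlier posts; -egen, std()- standardizes to mean 0, sd 1):

      ```stata
      * Standardize, then list observations more than 2 SDs from the mean for inspection
      egen aggregate_population_z = std(aggregate_population)
      list district aggregate_population aggregate_population_z ///
          if abs(aggregate_population_z) > 2
      ```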



      • #4
        Dear Clyde Schechter, here is an updated query. I am analyzing data from 30 districts, focusing on 8 questions (c01 to c08). Here is the dataset I am working on:

        district c01 c02 c03 c04 c05 c06 c07 c08
        District 1 2 1 2 2 2 1 2 2
        District 2 2 2 2 1 2 1 2 2
        District 3 2 1 2 1 2 2 2 1
        District 4 1 1 2 1 1 2 2 2
        District 5 2 1 1 2 1 1 2 2
        District 6 1 2 1 1 1 2 1 1
        District 7 2 2 2 1 1 2 1 2
        District 8 1 2 2 1 2 1 2 2
        District 9 2 2 2 2 1 2 2 1
        District 10 1 2 2 2 2 2 2 1
        District 11 2 2 2 2 1 1 1 1
        District 12 1 1 2 1 2 2 2 1
        District 13 2 1 2 2 2 1 1 2
        District 14 1 2 1 1 2 2 2 2
        District 15 1 1 2 2 1 2 2 1
        District 16 1 1 1 2 1 1 2 2
        District 17 1 1 1 2 1 2 1 2
        District 18 1 2 1 1 2 2 1 1
        District 19 1 2 2 2 2 1 2 2
        District 20 2 1 1 1 2 2 2 1
        District 21 1 1 2 1 2 1 2 1
        District 22 1 1 2 2 2 1 2 2
        District 23 1 1 2 2 1 2 2 2
        District 24 1 1 1 1 2 1 2 1
        District 25 1 2 1 1 1 2 1 2
        District 26 2 2 2 1 2 1 1 2
        District 27 2 2 2 2 1 1 2 2
        District 28 1 1 1 1 1 2 2 2
        District 29 2 2 2 1 2 2 1 1
        District 30 2 1 2 1 2 1 2 1
        Code:

        recode c01 (1=1 "1") (2=0 "0"), generate(binary_c01)
        recode c02 (1=1 "1") (2=0 "0"), generate(binary_c02)
        recode c03 (1=1 "1") (2=0 "0"), generate(binary_c03)
        recode c04 (1=1 "1") (2=0 "0"), generate(binary_c04)
        recode c05 (1=1 "1") (2=0 "0"), generate(binary_c05)
        recode c06 (1=1 "1") (2=0 "0"), generate(binary_c06)
        recode c07 (1=1 "1") (2=0 "0"), generate(binary_c07)
        recode c08 (1=1 "1") (2=0 "0"), generate(binary_c08)
        egen total_sum = rowtotal(binary_c01 binary_c02 binary_c03 binary_c04 binary_c05 binary_c06 binary_c07 binary_c08)
        gen R_Sum = total_sum / 8
        collapse (mean) R_Sum , by(district)
        I exported the resulting dataset, which includes columns for district and R_Sum, to an Excel sheet, and then added a column for Population (the Population data is given below). This resulted in a dataset with three columns: district, R_Sum, and Population. I then imported the updated file into Stata; the commands below operate on this imported file.
        Population (in district order): 892, 138, 923, 887, 514, 578, 393, 566, 514, 770, 207, 625, 550, 596, 250, 481, 553, 652, 503, 234, 883, 344, 238, 944, 940, 730, 890, 916, 959, 404
        Code:
        gen aggregate_population = R_Sum * Population
        **Normalized R_Sum**
        egen R_Sum_min = min(R_Sum)
        egen R_Sum_max = max(R_Sum)
        gen R_Sum_normalized = (R_Sum - R_Sum_min) / (R_Sum_max - R_Sum_min)
        **Normalized aggregate_population**
        egen aggregate_population_min = min(aggregate_population)
        egen aggregate_population_max = max(aggregate_population)
        gen aggregate_population_normalized = (aggregate_population - aggregate_population_min) / (aggregate_population_max - aggregate_population_min)
        **Listing**
        list district R_Sum aggregate_population R_Sum_normalized aggregate_population_normalized
        The code runs without errors, and I observe the following:
        • One district has a normalized aggregate population value of 1.
        • One district has a normalized aggregate population value of 0.
        • The remaining districts have values between 0 and 1 after normalization.
        I understand that normalization should result in values between 0 and 1, but here one district is exactly 1 and another is exactly 0. I also want to confirm whether this is the correct approach for normalizing aggregate population values.

        Any advice or improvements to my approach would be greatly appreciated.

        Thank you for your assistance



        • #5
          Which approach is better for this work?



          • #6
            Before we can meaningfully help you, you need to present your dataset properly, using the dataex command. You also need to read the FAQ to learn how to present code so that we can replicate your issue. When I ask questions, I almost always present my data first and clearly explain what I am confused about or not understanding: I say what I want Stata to do, or what it is doing that I don't want it to do.
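
            For example (dataex ships with current versions of Stata; on older versions it can be installed from SSC):

            ```stata
            * Install dataex if it is not already available, then run it so the
            * data can be pasted into a post and replicated exactly:
            capture which dataex
            if _rc ssc install dataex
            dataex district c01-c08 Population
            ```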



            • #7
              Re #4 and #5, please see my responses in #3. Nothing you say in #4 and #5 changes what I said there.

