
  • I am analyzing data from 30 districts, focusing on 8 questions (c01 to c08). Here is the dataset I am working on:

    district c01 c02 c03 c04 c05 c06 c07 c08 Population
    District 1 0 1 0 0 0 1 0 0 892
    District 2 0 0 0 1 0 1 0 0 138
    District 3 0 1 0 1 0 0 0 1 923
    District 4 1 1 0 1 1 0 0 0 887
    District 5 0 1 1 0 1 1 0 0 514
    District 6 1 0 1 1 1 0 1 1 578
    District 7 0 0 0 1 1 0 1 0 393
    District 8 1 0 0 1 0 1 0 0 566
    District 9 0 0 0 0 1 0 0 1 514
    District 10 1 0 0 0 0 0 0 1 770
    District 11 0 0 0 0 1 1 1 1 207
    District 12 1 1 0 1 0 0 0 1 625
    District 13 0 1 0 0 0 1 1 0 550
    District 14 1 0 1 1 0 0 0 0 596
    District 15 1 1 0 0 1 0 0 1 250
    District 16 1 1 1 0 1 1 0 0 481
    District 17 1 1 1 0 1 0 1 0 553
    District 18 1 0 1 1 0 0 1 1 652
    District 19 1 0 0 0 0 1 0 0 503
    District 20 0 1 1 1 0 0 0 1 234
    District 21 1 1 0 1 0 1 0 1 883
    District 22 1 1 0 0 0 1 0 0 344
    District 23 1 1 0 0 1 0 0 0 238
    District 24 1 1 1 1 0 1 0 1 944
    District 25 1 0 1 1 1 0 1 0 940
    District 26 0 0 0 1 0 1 1 0 730
    District 27 0 0 0 0 1 1 0 0 890
    District 28 1 1 1 1 1 0 0 0 916
    District 29 0 0 0 1 0 0 1 1 959
    District 30 0 1 0 1 0 1 0 1 404
    Code:

    recode c01 (1=1 "1") (2=0 "0"), generate(binary_c01)
    recode c02 (1=1 "1") (2=0 "0"), generate(binary_c02)
    recode c03 (1=1 "1") (2=0 "0"), generate(binary_c03)
    recode c04 (1=1 "1") (2=0 "0"), generate(binary_c04)
    recode c05 (1=1 "1") (2=0 "0"), generate(binary_c05)
    recode c06 (1=1 "1") (2=0 "0"), generate(binary_c06)
    recode c07 (1=1 "1") (2=0 "0"), generate(binary_c07)
    recode c08 (1=1 "1") (2=0 "0"), generate(binary_c08)
    egen total_sum = rowtotal(binary_c01 binary_c02 binary_c03 binary_c04 binary_c05 binary_c06 binary_c07 binary_c08)
    gen R_Sum = total_sum / 8
    collapse (mean) R_Sum , by(district)
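
    The eight repetitive -recode- lines above can be written more compactly as a loop (a sketch using the same variable names):

    ```stata
    * Equivalent to the eight recode commands above, written as a loop
    foreach v of varlist c01-c08 {
        recode `v' (1=1 "1") (2=0 "0"), generate(binary_`v')
    }
    ```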
    I saved the dataset, which includes columns for district and R_Sum, then created an Excel file with an additional column for Population. This resulted in a dataset with three columns: district, R_Sum, and Population. I then imported the updated file into Stata; the commands below operate on this imported file.
    Code:
    gen aggregate_population = R_Sum * Population
    **Normalized R_Sum**
    egen R_Sum_min = min(R_Sum)
    egen R_Sum_max = max(R_Sum)
    gen R_Sum_normalized = (R_Sum - R_Sum_min) / (R_Sum_max - R_Sum_min)
    **Normalized aggregate_population**
    egen aggregate_population_min = min(aggregate_population)
    egen aggregate_population_max = max(aggregate_population)
    gen aggregate_population_normalized = (aggregate_population - aggregate_population_min) / (aggregate_population_max - aggregate_population_min)
    **Listing**
    list district R_Sum aggregate_population R_Sum_normalized aggregate_population_normalized
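
    As an aside, the same min-max normalization can be done without creating separate min/max variables, using the results that -summarize- leaves behind (a sketch with the same variable names; agg_pop_norm2 is just an alternative name to avoid clashing with the variable above):

    ```stata
    * Min-max normalization using r(min) and r(max) saved by -summarize-
    summarize aggregate_population
    gen agg_pop_norm2 = (aggregate_population - r(min)) / (r(max) - r(min))
    ```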
    The code runs without errors, and I observe the following:
    • One district has a normalized aggregate population value of 1.
    • One district has a normalized aggregate population value of 0.
    • The remaining districts have values between 0 and 1 after normalization.
    I understand that normalization should result in values between 0 and 1, but here one district is exactly 1 and another is exactly 0. I also want to confirm whether this is the correct approach for normalizing aggregate population values.

    Any advice or improvements to my approach would be greatly appreciated.

    Thank you for your assistance

  • #2
    Is it possible to analyze this dataset using the z-score method instead of normalization? And what if a district's value turns out to be greater than 2? Clyde Schechter
    Last edited by aafaque ali; 14 Jul 2024, 05:48.



    • #3
      Re #1.
      I saved the dataset, which includes columns for district and R_Sum, then created an Excel file with an additional column for Population. This resulted in a dataset with three columns: district, R_Sum, and Population. I then imported the updated file into Stata; the commands below operate on this imported file.


      This is bad data management practice. It opens up too many opportunities for errors to creep in. Also, the work done within Excel is undocumentable and so you do not have an audit trail for your work. If you are just playing around for fun, I suppose that's fine. But if there is some serious purpose to your work and you intend for somebody to take the results seriously, this approach is not acceptable.

      Moreover, since you only have one observation for each district, the -collapse- command does not change the values of R_Sum: the mean of a single number is just that same number. So all you needed to do, instead of -collapse- followed by a diversion into undocumented data management in Excel, is -keep district R_Sum Population-.
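
      For example, the whole step can stay inside Stata, where the do-file documents it (the -merge- line is a hypothetical sketch for the case where Population sat in a separate file keyed on district):

      ```stata
      * Population is already in the original dataset, so no Excel round trip is needed:
      keep district R_Sum Population

      * If Population were instead stored in a separate file, -merge- keeps the
      * work in Stata (population.dta is a hypothetical file keyed on district):
      * merge 1:1 district using population.dta
      ```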

      Re #2.
      Yes, of course, you can do it using Z-scores instead.

      It makes no difference. The Z-score is just a re-scaled, re-located version of the normalized variable in your original approach. In fact, if you try it both ways and correlate the two variables, you will see that r = 1.0 exactly. So any further analysis done either way will produce the same results: regression coefficients will change by a scale factor, and the constant term will differ, but any substantive conclusions will be absolutely identical. The choice between them is simply a matter of taste. My own preference is the 0-1 normalization because I think it is easier for people to understand and interpret than the z-score, but reasonable people can differ about that. Look at the following to see a simple illustration of this:
      Code:
      . clear*
      
      .
      . sysuse auto
      (1978 automobile data)
      
      .
      . summ mpg
      
          Variable |        Obs        Mean    Std. dev.       Min        Max
      -------------+---------------------------------------------------------
               mpg |         74     21.2973    5.785503         12         41
      
      . gen mpg_norm = (mpg - `r(min)')/(`r(max)' - `r(min)')
      
      . gen mpg_z = (mpg - `r(mean)')/(`r(sd)')
      
      .
      . display (`r(max)'-`r(min)')/`r(sd)'
      5.0125285
      
      .
      .
      . graph twoway scatter mpg_norm mpg_z
      
      . corr mpg_norm mpg_z
      (obs=74)
      
                   | mpg_norm    mpg_z
      -------------+------------------
          mpg_norm |   1.0000
             mpg_z |   1.0000   1.0000
      
      
      .
      . regress price mpg_norm
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449477         1   139449477   Prob > F        =    0.0000
          Residual |   495615919        72  6883554.43   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
          mpg_norm |  -6927.936   1539.224    -4.50   0.000    -9996.323   -3859.549
             _cons |   8386.329    580.115    14.46   0.000      7229.89    9542.767
      ------------------------------------------------------------------------------
      
      . estimates store normalized
      
      . regress price mpg_z
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449471         1   139449471   Prob > F        =    0.0000
          Residual |   495615925        72  6883554.52   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             mpg_z |  -1382.124   307.0753    -4.50   0.000    -1994.268   -769.9805
             _cons |   6165.257   304.9935    20.21   0.000     5557.263     6773.25
      ------------------------------------------------------------------------------
      
      . estimates store zscores
      
      .
      . suest normalized zscores
      
      Simultaneous results for normalized, zscores                Number of obs = 74
      
      ----------------------------------------------------------------------------------
                       |               Robust
                       | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
      -----------------+----------------------------------------------------------------
      normalized_mean  |
              mpg_norm |  -6927.936   1655.377    -4.19   0.000    -10172.42   -3683.456
                 _cons |   8386.329   709.8317    11.81   0.000     6995.084    9777.573
      -----------------+----------------------------------------------------------------
      normalized_lnvar |
                 _cons |   15.74465   .2176038    72.35   0.000     15.31815    16.17114
      -----------------+----------------------------------------------------------------
      zscores_mean     |
                 mpg_z |  -1382.124   330.2479    -4.19   0.000    -2029.398   -734.8499
                 _cons |   6165.257   302.8973    20.35   0.000     5571.589    6758.924
      -----------------+----------------------------------------------------------------
      zscores_lnvar    |
                 _cons |   15.74465   .2176038    72.35   0.000     15.31815    16.17114
      ----------------------------------------------------------------------------------
      
      . display _b[normalized_mean:mpg_norm]/_b[zscores_mean:mpg_z]
      5.0125287
      Notice that the max − min difference is 5.0125285 times the standard deviation, and this same scale factor (apart from rounding error in the 7th decimal place) relates the coefficients when the two transformed mpg variables are used in the two regressions. Test statistics are the same either way.
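
      In symbols, with m and s the mean and standard deviation of x:

      ```latex
      x_{\text{norm}} = \frac{x - \min}{\max - \min}, \qquad
      z = \frac{x - m}{s}
      \quad\Longrightarrow\quad
      x_{\text{norm}} = \frac{s}{\max - \min}\,z + \frac{m - \min}{\max - \min},
      ```

      so the normalized variable is a linear function of z with positive slope, which is why their correlation is exactly 1 and the two regression coefficients differ only by the factor (max − min)/s.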

      In truth, if the original variable has units that people will generally recognize and understand, I don't do either of these transformations--I use the variable as is.

      As for what to do if you get a z-score > 2: nothing! Check the original data point to see if it is correct. Unless you have reason to believe the value is an error (in which case you should either fix it or replace it with a missing value), a z-score > 2 calls for no special handling at all.
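
      If you do want to flag observations with |z| > 2 for a data check, here is a minimal sketch (variable names follow the earlier posts; -egen, std()- standardizes to mean 0, sd 1):

      ```stata
      * Standardize, then list observations more than 2 SDs from the mean for inspection
      egen aggregate_population_z = std(aggregate_population)
      list district aggregate_population aggregate_population_z ///
          if abs(aggregate_population_z) > 2
      ```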



      • #4
        Dear Clyde Schechter, here is an updated query. I am analyzing data from 30 districts, focusing on 8 questions (c01 to c08). Here is the dataset I am working on:

        district c01 c02 c03 c04 c05 c06 c07 c08
        District 1 2 1 2 2 2 1 2 2
        District 2 2 2 2 1 2 1 2 2
        District 3 2 1 2 1 2 2 2 1
        District 4 1 1 2 1 1 2 2 2
        District 5 2 1 1 2 1 1 2 2
        District 6 1 2 1 1 1 2 1 1
        District 7 2 2 2 1 1 2 1 2
        District 8 1 2 2 1 2 1 2 2
        District 9 2 2 2 2 1 2 2 1
        District 10 1 2 2 2 2 2 2 1
        District 11 2 2 2 2 1 1 1 1
        District 12 1 1 2 1 2 2 2 1
        District 13 2 1 2 2 2 1 1 2
        District 14 1 2 1 1 2 2 2 2
        District 15 1 1 2 2 1 2 2 1
        District 16 1 1 1 2 1 1 2 2
        District 17 1 1 1 2 1 2 1 2
        District 18 1 2 1 1 2 2 1 1
        District 19 1 2 2 2 2 1 2 2
        District 20 2 1 1 1 2 2 2 1
        District 21 1 1 2 1 2 1 2 1
        District 22 1 1 2 2 2 1 2 2
        District 23 1 1 2 2 1 2 2 2
        District 24 1 1 1 1 2 1 2 1
        District 25 1 2 1 1 1 2 1 2
        District 26 2 2 2 1 2 1 1 2
        District 27 2 2 2 2 1 1 2 2
        District 28 1 1 1 1 1 2 2 2
        District 29 2 2 2 1 2 2 1 1
        District 30 2 1 2 1 2 1 2 1
        Code:

        recode c01 (1=1 "1") (2=0 "0"), generate(binary_c01)
        recode c02 (1=1 "1") (2=0 "0"), generate(binary_c02)
        recode c03 (1=1 "1") (2=0 "0"), generate(binary_c03)
        recode c04 (1=1 "1") (2=0 "0"), generate(binary_c04)
        recode c05 (1=1 "1") (2=0 "0"), generate(binary_c05)
        recode c06 (1=1 "1") (2=0 "0"), generate(binary_c06)
        recode c07 (1=1 "1") (2=0 "0"), generate(binary_c07)
        recode c08 (1=1 "1") (2=0 "0"), generate(binary_c08)
        egen total_sum = rowtotal(binary_c01 binary_c02 binary_c03 binary_c04 binary_c05 binary_c06 binary_c07 binary_c08)
        gen R_Sum = total_sum / 8
        collapse (mean) R_Sum , by(district)
        I exported the resulting dataset, which includes columns for district and R_Sum, to an Excel sheet, and then added a column for Population (the Population data is given below). This resulted in a dataset with three columns: district, R_Sum, and Population. I then imported the updated file into Stata; the commands below operate on this imported file.
        Population (in district order): 892, 138, 923, 887, 514, 578, 393, 566, 514, 770, 207, 625, 550, 596, 250, 481, 553, 652, 503, 234, 883, 344, 238, 944, 940, 730, 890, 916, 959, 404
        Code:
        gen aggregate_population = R_Sum * Population
        **Normalized R_Sum**
        egen R_Sum_min = min(R_Sum)
        egen R_Sum_max = max(R_Sum)
        gen R_Sum_normalized = (R_Sum - R_Sum_min) / (R_Sum_max - R_Sum_min)
        **Normalized aggregate_population**
        egen aggregate_population_min = min(aggregate_population)
        egen aggregate_population_max = max(aggregate_population)
        gen aggregate_population_normalized = (aggregate_population - aggregate_population_min) / (aggregate_population_max - aggregate_population_min)
        **Listing**
        list district R_Sum aggregate_population R_Sum_normalized aggregate_population_normalized
        The code runs without errors, and I observe the following:
        • One district has a normalized aggregate population value of 1.
        • One district has a normalized aggregate population value of 0.
        • The remaining districts have values between 0 and 1 after normalization.
        I understand that normalization should result in values between 0 and 1, but here one district is exactly 1 and another is exactly 0. I also want to confirm whether this is the correct approach for normalizing aggregate population values.

        Any advice or improvements to my approach would be greatly appreciated.

        Thank you for your assistance



        • #5
          Which approach is better for this work?



          • #6
            Before we can meaningfully help you, you need to present your dataset properly, using the dataex command. You also need to read the FAQ to learn how to present code so that we can replicate your issue. When I ask questions, I almost always present my data first and clearly explain what I am confused about or not understanding: I say what I want Stata to do, or what it is doing that I don't want it to do.
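
            For example (dataex ships with current versions of Stata; on older versions it can be installed from SSC):

            ```stata
            * Install dataex if it is not already available, then run it so the
            * data can be pasted into a post and replicated exactly:
            capture which dataex
            if _rc ssc install dataex
            dataex district c01-c08 Population
            ```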



            • #7
              Re #4 and #5, please see my responses in #3. Nothing you say in #4 and #5 changes what I said there.

