  • Different n, R² and RMSE: Choosing between A) means in dataset / B) dataset of means

    Hello,
    This might be easily answered: which of the following options should I choose for a "correct" bivariate regression/scatter plot?
    n, R² and RMSE differ between options A and B.

    Option A: Variable containing mean values (in a full dataset with different ids)
    Starting with a panel dataset, I computed the mean of a variable for each year and applied aaplot (by Nick Cox, available from SSC).
    The dataset is sorted by year (called "relyear"), and the newly generated variable contains the same mean for every id within a year; the means differ only across years.
    Code:
    bys relyear: egen relgdp = mean(gdpcapPPP11)
    gen lnrelgdp = ln(relgdp)
    aaplot lnrelgdp relyear
    Results A: R²=96.6% n=8178 RMSE=0.1721926

    Option B: Dataset solely containing unique mean values (with corresponding variable averaged by)
    Instead of plotting the variables of interest in my "full" dataset, I collapsed the dataset and then applied aaplot (see above) again for comparison.
    Code:
    preserve
    collapse (mean) gdpcapPPP11, by(relyear)
    gen ln_gdp=ln(gdpcapPPP11)
    aaplot ln_gdp relyear
    restore
    Results B: R²=92.7% n=181 RMSE=0.3360293
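
    The two n values can be checked directly in the full dataset. A minimal sketch, assuming the variables from Option A already exist; yeartag is a hypothetical helper name, not part of the code above:

    ```stata
    * Rows used by option A: every id-year row that carries a year mean
    count if !missing(lnrelgdp, relyear)

    * Distinct years with a mean, as used by option B
    egen byte yeartag = tag(relyear)       // flags one row per distinct relyear
    count if yeartag & !missing(lnrelgdp)
    drop yeartag
    ```

    If the first count matches n in Results A and the second matches n in Results B, both fits use the same means but a different number of copies of each.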

    I tried this with other variables, too, and R² does not always fall from A to B. Of course, the variables ln_gdp and lnrelgdp contain the same value for each year. Yet the results are different. So: A or B?
    Thank you!

    ***************************
    If needed:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(relyear id) double gdpcapPPP11 float lnrelgdp
    -40  75                 . 7.214939
    -40   2                 . 7.214939
    -40 135                 . 7.214939
    -40  17 238.8545623609962 7.214939
    -40 214                 . 7.214939
    -40  41 1429.997975378749 7.214939
    -40 201                 . 7.214939
    -40  68 4377.438877160824 7.214939
    -40  40 1038.932546183966 7.214939
    -40 173 376.0451764821386 7.214939
    -40 189 696.2769275188422 7.214939
    -40 176                 . 7.214939
    -39  40 903.1257682167955 7.279641
    -39 189 692.4529850621859 7.279641
    -39 135                 . 7.279641
    -39   2                 . 7.279641
    -39  68 4974.194553731857 7.279641
    -39 176                 . 7.279641
    -39 214                 . 7.279641
    -39 201                 . 7.279641
    -39  17  245.198582661663 7.279641
    -39  75                 . 7.279641
    -39  41 1510.258194599068 7.279641
    -39 173 377.5727516869297 7.279641
    -38 135                 . 7.341884
    -38  17 256.7771744027178 7.341884
    -38 176                 . 7.341884
    -38 201                 . 7.341884
    -38 214                 . 7.341884
    -38  75                 . 7.341884
    -38  68 5285.614365754863 7.341884
    -38 173 389.0707432639699 7.341884
    -38  41 1547.834960294038 7.341884
    -38 189 715.3433022316342 7.341884
    -38   2                 . 7.341884
    -38  40 1067.059379239463 7.341884
    -37  68 5537.479760591066 7.357561
    -37 214                 . 7.357561
    -37  75                 . 7.357561
    -37 201                 . 7.357561
    -37  41 1446.477872611546 7.357561
    -37   2                 . 7.357561
    -37 135                 . 7.357561
    -37 189 690.1190825834448 7.357561
    -37  40 1094.006461863386 7.357561
    -37  17 250.1178520375817 7.357561
    -37 173 389.8434767407718 7.357561
    -37 176                 . 7.357561
    -36  68 5706.515462131701 7.370363
    -36 189 659.7443148239921 7.370363
    -36  41 1461.712348810607 7.370363
    -36  17 252.2618318136373 7.370363
    -36 214                 . 7.370363
    -36   2                 . 7.370363
    -36 201                 . 7.370363
    -36  75                 . 7.370363
    -36  40 1039.402910306577 7.370363
    -36 176                 . 7.370363
    -36 135                 . 7.370363
    -36 173 409.6221260042836 7.370363
    -35  17 257.9150775352896 7.036851
    -35  30                 . 7.036851
    -35   3                 . 7.036851
    -35  40 1021.608540068463 7.036851
    -35 189 651.0954092153226 7.036851
    -35  14 214.1369544700441 7.036851
    -35 176                 . 7.036851
    -35 190 349.0495545398963 7.036851
    -35 110 1273.484095797152 7.036851
    -35 129                 . 7.036851
    -35 173 432.2722889993854 7.036851
    -35 147 267.0736069476994 7.036851
    -35  75                 . 7.036851
    -35 108                 . 7.036851
    -35  73                 . 7.036851
    -35   2                 . 7.036851
    -35 201                 . 7.036851
    -35  68 6080.802586051083 7.036851
    -35  41 1474.627346963577 7.036851
    -35 143 1299.949248550549 7.036851
    -35 135                 . 7.036851
    -35 150                 . 7.036851
    -35 167 331.5780066816225 7.036851
    -35 214                 . 7.036851
    -34  30                 .  7.04516
    -34  68 6237.257473927168  7.04516
    -34   2                 .  7.04516
    -34  17 255.2815312488197  7.04516
    -34   3                 .  7.04516
    -34 135                 .  7.04516
    -34 176                 .  7.04516
    -34 129                 .  7.04516
    -34  73                 .  7.04516
    -34 201                 .  7.04516
    -34 143 1276.285209772381  7.04516
    -34 110 1276.445370545105  7.04516
    -34 190 387.3569051296755  7.04516
    -34 173 432.7347275014108  7.04516
    -34 167 310.6909401923209  7.04516
    -34  41 1454.059022341608  7.04516
    end
    format %ty relyear
    label values id id
    label def id 2 "AFG", modify
    label def id 3 "AGO", modify
    label def id 14 "BDI", modify
    label def id 17 "BFA", modify
    label def id 30 "BTN", modify
    label def id 40 "COD", modify
    label def id 41 "COG", modify
    label def id 68 "GAB", modify
    label def id 73 "GIN", modify
    label def id 75 "GNB", modify
    label def id 108 "LAO", modify
    label def id 110 "LBR", modify
    label def id 129 "MLI", modify
    label def id 135 "MOZ", modify
    label def id 143 "NGA", modify
    label def id 147 "NPL", modify
    label def id 150 "OMN", modify
    label def id 167 "RWA", modify
    label def id 173 "SLE", modify
    label def id 176 "SOM", modify
    label def id 189 "TCD", modify
    label def id 190 "TGO", modify
    label def id 201 "UGA", modify
    label def id 214 "YEM", modify

  • #2
    Option A will include the same country multiple times, while option B will include the country only once. If you do a regression on means, then your unit of analysis is typically the higher-level unit (the country), though exceptions may exist. If you want the country to be the unit of analysis, then you want each country to appear only once in your dataset, i.e. option B. However, this is only my guess of what you want. To really answer your own question, you need to ask yourself what you want the unit of analysis to be for your regression: country or country-year.
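
    A minimal sketch of how the two options relate, using the variable names from the question. It assumes id is never missing, so that (count) id gives the number of rows per relyear; nrows is a hypothetical name, and regress stands in for aaplot here. Note that the collapse in the question is by relyear, so in that code the choice of unit is between the year and the country-year row.

    ```stata
    preserve
    * One row per year, keeping the number of id-year rows behind each mean
    collapse (mean) gdpcapPPP11 (count) nrows = id, by(relyear)
    gen ln_gdp = ln(gdpcapPPP11)

    * Option B: each year counts once
    regress ln_gdp relyear

    * Each year counted nrows times; this should match option A
    regress ln_gdp relyear [fweight = nrows]
    restore
    ```

    If the weighted fit reproduces Results A, the gap between A and B comes only from how many times each year's mean enters the regression, which is exactly the unit-of-analysis choice.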
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------
