  • Regression reporting more observations than in dataset

    I am facing a peculiar issue in Stata (18.1): the number of observations reported in the regression output is much larger than both the number of observations in the dataset and the number actually used in the estimation. I am using a member-level household survey dataset for India. For example, when I use -describe-, Stata reports 513,366 observations and 170 variables. The data viewer shows the same, and basic summary statistics also give the same number of observations.

    Below is a data excerpt with the key variables used in the estimation:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 hhid byte(per_serialno relation_head) int age float female long total_expenditure_edu_amt float lang_discrepancy byte whether_same_grade
    "100001201" 1 1 44 0     . 0 .
    "100001201" 2 2 42 1     . 0 .
    "100001201" 3 5 20 1 29460 1 0
    "100001201" 4 5 18 1 51750 1 0
    "100001202" 1 1 43 0     . 0 .
    "100001202" 2 2 38 1     . 0 .
    "100001202" 3 5 19 0 26940 1 0
    "100001202" 4 7 65 0     . 0 .
    "100001202" 5 7 61 1     . 0 .
    "100001301" 1 1 32 0     . 0 .
    "100001301" 2 2 30 1     . 0 .
    "100001301" 3 5  5 0  6780 1 0
    "100001302" 1 1 46 0     . 0 .
    "100001302" 2 2 40 1     . 0 .
    "100001302" 3 5 16 0  3060 1 0
    "100001302" 4 5 14 1  2130 1 0
    "100001303" 1 1 46 0     . 0 .
    "100001303" 2 2 40 1     . 0 .
    "100001303" 3 5 16 0  3060 1 0
    "100001303" 4 5 14 1  2130 1 0
    "100001304" 1 1 82 1     . 0 .
    "100001304" 2 3 48 0     . 0 .
    "100001304" 3 4 45 1     . 0 .
    "100001304" 4 6 10 0  9190 1 0
    "100001304" 5 4 42 1     . 0 .
    "100001401" 1 1 58 0     . 0 .
    "100001401" 2 2 52 1     . 0 .
    "100001401" 3 5 29 0     . 0 .
    "100001402" 1 1 38 0     . 0 .
    "100001402" 2 2 34 1     . 0 .
    "100001402" 3 8 46 1     . 0 .
    "100011101" 1 1 65 0     . 0 .
    "100011101" 2 2 58 1     . 0 .
    "100011101" 3 5 25 0     . 0 .
    "100011101" 4 5 23 0     . 0 .
    "100011101" 5 5 21 0     . 0 .
    "100011101" 6 5 22 1     . 0 .
    "100011101" 7 5 19 1  7400 0 0
    "100011201" 1 1 56 0     . 0 .
    "100011201" 2 2 54 1     . 0 .
    "100011201" 3 5 23 0     . 0 .
    "100011201" 4 5 21 0 13000 1 0
    "100011201" 5 5 19 0  4470 0 0
    "100011201" 6 5 16 0  3270 0 0
    "100011201" 7 5 20 1  6370 0 0
    "100011201" 8 5 12 1   580 0 0
    "100011301" 1 1 36 0     . 0 .
    "100011301" 2 2 30 1     . 0 .
    "100011301" 3 5 12 0  3650 0 0
    "100011301" 4 5 10 0   630 0 0
    "100011301" 5 5  7 0   610 0 0
    "100011302" 1 1 45 0     . 0 .
    "100011302" 2 2 40 1     . 0 .
    "100011302" 3 5 15 0  2190 0 0
    "100011302" 4 5 13 1   460 0 0
    "100011302" 5 8 40 0     . 0 .
    "100011303" 1 1 65 0     . 0 .
    "100011303" 2 2 60 1     . 0 .
    "100011303" 3 5 27 0     . 0 .
    "100011303" 4 5 22 0     . 0 .
    "100011303" 5 5 18 0  9550 1 0
    "100011303" 6 5 12 0   660 0 0
    "100011304" 1 1 45 0     . 0 .
    "100011304" 2 2 40 1     . 0 .
    "100011304" 3 5 20 0  8970 1 0
    "100011304" 4 5 18 1  7460 0 0
    "100011304" 5 5 14 1   680 0 0
    "100011401" 1 1 45 0     . 0 .
    "100011401" 2 2 39 1     . 0 .
    "100011402" 1 1 60 0     . 0 .
    "100011402" 2 2 51 1     . 0 .
    "100011402" 3 5 21 1     . 0 .
    "100021201" 1 1 50 0     . 0 .
    "100021201" 2 2 45 1     . 0 .
    "100021201" 3 5 20 0 93000 1 0
    "100021201" 4 7 95 0     . 0 .
    "100021202" 1 1 45 0     . 0 .
    "100021202" 2 2 42 1     . 0 .
    "100021202" 3 5 19 0 74000 1 0
    "100021301" 1 1 42 0     . 0 .
    "100021301" 2 2 38 1     . 0 .
    "100021301" 3 5 12 0 32300 1 0
    "100021301" 4 5  8 0 28800 1 0
    "100021302" 1 1 39 0     . 0 .
    "100021302" 2 2 36 1     . 0 .
    "100021302" 3 5 18 0     . 0 .
    "100021302" 4 5 14 1 18700 0 0
    "100021303" 1 1 42 0     . 0 .
    "100021303" 2 2 38 1     . 0 .
    "100021303" 3 5  8 1 26500 1 0
    "100021303" 4 5  4 1 24000 1 0
    "100021303" 5 5  2 0     . 0 .
    "100021304" 1 1 40 0     . 0 .
    "100021304" 2 2 38 1     . 0 .
    "100021304" 3 5 13 1 37000 1 0
    "100021304" 4 5  9 1 33500 1 0
    "100021304" 5 5  5 0 19000 1 0
    "100021401" 1 1 60 1     . 0 .
    "100021401" 2 5 30 0     . 0 .
    "100021402" 1 1 40 0     . 0 .
    end
    label values lang_discrepancy disc
    label def disc 0 "No", modify
    label def disc 1 "Yes", modify
    label values whether_same_grade grade
    However, when I run
    Code:
    reg whether_same_grade i.lang_discrepancy i.female age [fweight=rounded_weight], cluster(hhid)
    in my dataset, I get a regression output that reports 274,642,225 observations. It does, however, correctly report 83,939 clusters in hhid. I tried cutting down the number of variables and running a much simpler version of this, but I still cannot tell where this number is coming from. Any ideas on what might be causing this?

    Thanks in advance!
    Anirudh

  • #2
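    This happens because you used a frequency weight (fweight). A case with a frequency weight of 10 is treated as 10 identical cases, so the reported number of observations is the sum of the weights, not the number of rows. See this example: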

    Code:
    sysuse auto, clear
    
    reg mpg price
    
    quietly summarize rep78
    display r(sum)
    reg mpg price [fweight=rep78]
    Results:

    Code:
    . reg mpg price
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =     20.26
           Model |  536.541807         1  536.541807   Prob > F        =    0.0000
        Residual |  1906.91765        72  26.4849674   R-squared       =    0.2196
    -------------+----------------------------------   Adj R-squared   =    0.2087
           Total |  2443.45946        73  33.4720474   Root MSE        =    5.1464
    
    ------------------------------------------------------------------------------
             mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           price |  -.0009192   .0002042    -4.50   0.000    -.0013263   -.0005121
           _cons |   26.96417   1.393952    19.34   0.000     24.18538    29.74297
    ------------------------------------------------------------------------------
    
    .
    . quietly summarize rep78
    
    . display r(sum)
    235
    
    . reg mpg price [fweight=rep78]
    
          Source |       SS           df       MS      Number of obs   =       235
    -------------+----------------------------------   F(1, 233)       =     62.05
           Model |  1996.63469         1  1996.63469   Prob > F        =    0.0000
        Residual |  7497.09297       233  32.1763647   R-squared       =    0.2103
    -------------+----------------------------------   Adj R-squared   =    0.2069
           Total |  9493.72766       234  40.5714857   Root MSE        =    5.6724
    
    ------------------------------------------------------------------------------
             mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
           price |  -.0010481    .000133    -7.88   0.000    -.0013102   -.0007859
           _cons |    28.4131   .8982004    31.63   0.000     26.64347    30.18273
    ------------------------------------------------------------------------------
    The original data has 74 cases. With rep78 as the frequency weight, regress reports 235 observations, which is exactly the sum of rep78 (235): each car is counted rep78 times, and the 5 cars with a missing rep78 are dropped because their weight is missing.
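
    To confirm this, here is a minimal sketch on the same auto data. It compares the N that regress stores in e(N) with the number of rows in the estimation sample and with the sum of the weights over that sample. It then shows that physically duplicating rows with -expand- gives the same N. The display labels are only illustrative.

    ```stata
    sysuse auto, clear

    * Weighted regression: e(N) holds the reported number of observations
    quietly regress mpg price [fweight=rep78]
    display "N reported by regress: " e(N)

    * Rows actually used (cars with a nonmissing weight)
    count if e(sample)

    * Sum of the frequency weights over the estimation sample
    quietly summarize rep78 if e(sample)
    display "Sum of weights: " r(sum)

    * Equivalent unweighted fit: duplicate each row rep78 times.
    * Rows with missing rep78 are dropped first, because -expand-
    * would otherwise keep them as single copies.
    drop if missing(rep78)
    expand rep78
    regress mpg price
    ```

    On the original poster's data, the same check would use rounded_weight in place of rep78. If this explanation holds, the sum of rounded_weight over the estimation sample should equal the 274,642,225 shown in the output.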

    In short, your output is not peculiar. But do check the survey's technical documentation carefully to make sure your analysis applies its complex weighting scheme correctly.
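
    As a hedged illustration of why the weight type matters, the sketch below reruns the auto example with rep78 treated as a sampling weight (pweight). The reported N is then the number of rows used, while the coefficients match the fweight fit. Whether rounded_weight in the survey is really a sampling weight is an assumption here, and only the survey documentation can settle it. The commented-out command is a suggestion, not something that was run.

    ```stata
    sysuse auto, clear

    * Same weights, now treated as sampling weights:
    * N is the number of rows with a nonmissing weight, not the sum of the weights
    regress mpg price [pweight=rep78]
    display "N with pweight: " e(N)

    * The poster's model under the same assumption (not run here):
    * regress whether_same_grade i.lang_discrepancy i.female age ///
    *     [pweight=rounded_weight], vce(cluster hhid)
    ```

    Unlike frequency weights, sampling weights do not have to be integers, so rounding would not be needed if the survey supplies an unrounded weight.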

    Good luck.
