Regression reporting more observations than in dataset

Anirudh Tagat

Join Date: Apr 2014
Posts: 5

28 Jun 2023, 03:00

I am facing a peculiar issue in Stata (18.1): the number of observations reported in the regression output is much larger than both the number of observations in the dataset and the number actually used in the estimation. I am using a member-level household survey dataset for India. For example, -describe- reports 513,366 observations and 170 variables. The data viewer shows the same count, and basic summary statistics are computed on the same number of observations.

Below is a data excerpt with the key variables used in the estimation:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 hhid byte(per_serialno relation_head) int age float female long total_expenditure_edu_amt float lang_discrepancy byte whether_same_grade
"100001201" 1 1 44 0     . 0 .
"100001201" 2 2 42 1     . 0 .
"100001201" 3 5 20 1 29460 1 0
"100001201" 4 5 18 1 51750 1 0
"100001202" 1 1 43 0     . 0 .
"100001202" 2 2 38 1     . 0 .
"100001202" 3 5 19 0 26940 1 0
"100001202" 4 7 65 0     . 0 .
"100001202" 5 7 61 1     . 0 .
"100001301" 1 1 32 0     . 0 .
"100001301" 2 2 30 1     . 0 .
"100001301" 3 5  5 0  6780 1 0
"100001302" 1 1 46 0     . 0 .
"100001302" 2 2 40 1     . 0 .
"100001302" 3 5 16 0  3060 1 0
"100001302" 4 5 14 1  2130 1 0
"100001303" 1 1 46 0     . 0 .
"100001303" 2 2 40 1     . 0 .
"100001303" 3 5 16 0  3060 1 0
"100001303" 4 5 14 1  2130 1 0
"100001304" 1 1 82 1     . 0 .
"100001304" 2 3 48 0     . 0 .
"100001304" 3 4 45 1     . 0 .
"100001304" 4 6 10 0  9190 1 0
"100001304" 5 4 42 1     . 0 .
"100001401" 1 1 58 0     . 0 .
"100001401" 2 2 52 1     . 0 .
"100001401" 3 5 29 0     . 0 .
"100001402" 1 1 38 0     . 0 .
"100001402" 2 2 34 1     . 0 .
"100001402" 3 8 46 1     . 0 .
"100011101" 1 1 65 0     . 0 .
"100011101" 2 2 58 1     . 0 .
"100011101" 3 5 25 0     . 0 .
"100011101" 4 5 23 0     . 0 .
"100011101" 5 5 21 0     . 0 .
"100011101" 6 5 22 1     . 0 .
"100011101" 7 5 19 1  7400 0 0
"100011201" 1 1 56 0     . 0 .
"100011201" 2 2 54 1     . 0 .
"100011201" 3 5 23 0     . 0 .
"100011201" 4 5 21 0 13000 1 0
"100011201" 5 5 19 0  4470 0 0
"100011201" 6 5 16 0  3270 0 0
"100011201" 7 5 20 1  6370 0 0
"100011201" 8 5 12 1   580 0 0
"100011301" 1 1 36 0     . 0 .
"100011301" 2 2 30 1     . 0 .
"100011301" 3 5 12 0  3650 0 0
"100011301" 4 5 10 0   630 0 0
"100011301" 5 5  7 0   610 0 0
"100011302" 1 1 45 0     . 0 .
"100011302" 2 2 40 1     . 0 .
"100011302" 3 5 15 0  2190 0 0
"100011302" 4 5 13 1   460 0 0
"100011302" 5 8 40 0     . 0 .
"100011303" 1 1 65 0     . 0 .
"100011303" 2 2 60 1     . 0 .
"100011303" 3 5 27 0     . 0 .
"100011303" 4 5 22 0     . 0 .
"100011303" 5 5 18 0  9550 1 0
"100011303" 6 5 12 0   660 0 0
"100011304" 1 1 45 0     . 0 .
"100011304" 2 2 40 1     . 0 .
"100011304" 3 5 20 0  8970 1 0
"100011304" 4 5 18 1  7460 0 0
"100011304" 5 5 14 1   680 0 0
"100011401" 1 1 45 0     . 0 .
"100011401" 2 2 39 1     . 0 .
"100011402" 1 1 60 0     . 0 .
"100011402" 2 2 51 1     . 0 .
"100011402" 3 5 21 1     . 0 .
"100021201" 1 1 50 0     . 0 .
"100021201" 2 2 45 1     . 0 .
"100021201" 3 5 20 0 93000 1 0
"100021201" 4 7 95 0     . 0 .
"100021202" 1 1 45 0     . 0 .
"100021202" 2 2 42 1     . 0 .
"100021202" 3 5 19 0 74000 1 0
"100021301" 1 1 42 0     . 0 .
"100021301" 2 2 38 1     . 0 .
"100021301" 3 5 12 0 32300 1 0
"100021301" 4 5  8 0 28800 1 0
"100021302" 1 1 39 0     . 0 .
"100021302" 2 2 36 1     . 0 .
"100021302" 3 5 18 0     . 0 .
"100021302" 4 5 14 1 18700 0 0
"100021303" 1 1 42 0     . 0 .
"100021303" 2 2 38 1     . 0 .
"100021303" 3 5  8 1 26500 1 0
"100021303" 4 5  4 1 24000 1 0
"100021303" 5 5  2 0     . 0 .
"100021304" 1 1 40 0     . 0 .
"100021304" 2 2 38 1     . 0 .
"100021304" 3 5 13 1 37000 1 0
"100021304" 4 5  9 1 33500 1 0
"100021304" 5 5  5 0 19000 1 0
"100021401" 1 1 60 1     . 0 .
"100021401" 2 5 30 0     . 0 .
"100021402" 1 1 40 0     . 0 .
end
label values lang_discrepancy disc
label def disc 0 "No", modify
label def disc 1 "Yes", modify
label values whether_same_grade grade

However, when I run

Code:

reg whether_same_grade i.lang_discrepancy i.female age [fweight=rounded_weight], cluster(hhid)

on my dataset, the regression output reports 274,642,225 observations. It does, however, correctly report 83,939 clusters in hhid. I tried cutting down the number of variables and running a much simpler version of the model, but I cannot work out where this number is coming from. Any ideas on what might be causing this?

Thanks in advance!
Anirudh

Ken Chui

Join Date: Aug 2014
Posts: 1060

28 Jun 2023, 07:37

This is because you used a frequency weight (fweight). If a case has a frequency weight of 10, it is counted as 10 cases. See this example:

Code:

sysuse auto, clear

reg mpg price

quietly summarize rep78
display r(sum)
reg mpg price [fweight=rep78]

Results:

Code:

. reg mpg price

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =     20.26
       Model |  536.541807         1  536.541807   Prob > F        =    0.0000
    Residual |  1906.91765        72  26.4849674   R-squared       =    0.2196
-------------+----------------------------------   Adj R-squared   =    0.2087
       Total |  2443.45946        73  33.4720474   Root MSE        =    5.1464

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0009192   .0002042    -4.50   0.000    -.0013263   -.0005121
       _cons |   26.96417   1.393952    19.34   0.000     24.18538    29.74297
------------------------------------------------------------------------------

.
. quietly summarize rep78

. display r(sum)
235

. reg mpg price [fweight=rep78]

      Source |       SS           df       MS      Number of obs   =       235
-------------+----------------------------------   F(1, 233)       =     62.05
       Model |  1996.63469         1  1996.63469   Prob > F        =    0.0000
    Residual |  7497.09297       233  32.1763647   R-squared       =    0.2103
-------------+----------------------------------   Adj R-squared   =    0.2069
       Total |  9493.72766       234  40.5714857   Root MSE        =    5.6724

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       price |  -.0010481    .000133    -7.88   0.000    -.0013102   -.0007859
       _cons |    28.4131   .8982004    31.63   0.000     26.64347    30.18273
------------------------------------------------------------------------------

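The original data have 74 cases. With rep78 as a frequency weight, the regression reports 235 observations, which is exactly the sum of rep78 (the 5 cars with missing rep78 are dropped). In the same way, the number of observations in your output is the sum of rounded_weight over the estimation sample.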
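To make the "counted as 10 cases" idea concrete, here is a minimal sketch on the same auto data. It physically duplicates each row rep78 times with -expand- and then runs an unweighted regression, which should reproduce the fweight results. The scalar names b_fw and n_fw are only illustrative.

```stata
sysuse auto, clear

* Weighted fit: each car counts rep78 times (cars with missing rep78 are dropped)
quietly regress mpg price [fweight=rep78]
scalar b_fw = _b[price]
scalar n_fw = e(N)

* The same thing done by hand: keep rep78 physical copies of each row
drop if missing(rep78)
expand rep78

* Unweighted fit on the expanded data
quietly regress mpg price
display "N (fweight) = " n_fw "   N (expanded) = " e(N)
display "b (fweight) = " b_fw "   b (expanded) = " _b[price]
```

On your own data, running -quietly summarize rounded_weight if e(sample)- right after the regression and then -display %15.0fc r(sum)- should give the same figure as the reported number of observations.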

In short, your output is not peculiar. Do check the survey's technical documentation carefully, though, to make sure your analysis incorporates its complex weighting scheme correctly.
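If the documentation describes the weights as sampling (probability) weights, which I cannot tell from here, they would usually be specified as pweights rather than as rounded fweights. A rough sketch of the difference, again on the auto data and treating rep78 as a stand-in weight purely for illustration: the coefficients agree, but with pweight the number of observations stays at the number of rows, and the standard errors are the robust ones.

```stata
sysuse auto, clear

* Frequency weight: N is the sum of the weights
quietly regress mpg price [fweight=rep78]
scalar n_fw  = e(N)
scalar b_fw  = _b[price]
scalar se_fw = _se[price]

* Sampling weight: N is the number of rows with a nonmissing weight
quietly regress mpg price [pweight=rep78]
display "N:  fweight = " n_fw  "   pweight = " e(N)
display "b:  fweight = " b_fw  "   pweight = " _b[price]
display "se: fweight = " se_fw "   pweight = " _se[price]
```

In your model this would mean replacing [fweight=rounded_weight] with [pweight=...] using the unrounded weight variable, whatever it is called in your file. The cluster(hhid) option can stay as it is.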

Good luck.
