Regression reporting more observations than in dataset

Anirudh Tagat

Join Date: Apr 2014
Posts: 5

28 Jun 2023, 01:49

I am facing a peculiar issue in Stata (18.1): the number of observations reported in the regression output is much larger than both the number of observations in the dataset and the number actually used in the estimation. I am using a member-level household survey dataset for India. For example, -describe- reports 513,366 observations and 170 variables, and the data viewer shows the same. Basic summary statistics also give the same number of observations.

Below is a data excerpt with the key variables used in the estimation:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 hhid byte(per_serialno relation_head) int age float female long total_expenditure_edu_amt float lang_discrepancy byte whether_same_grade
"100001201" 1 1 44 0     . 0 .
"100001201" 2 2 42 1     . 0 .
"100001201" 3 5 20 1 29460 1 0
"100001201" 4 5 18 1 51750 1 0
"100001202" 1 1 43 0     . 0 .
"100001202" 2 2 38 1     . 0 .
"100001202" 3 5 19 0 26940 1 0
"100001202" 4 7 65 0     . 0 .
"100001202" 5 7 61 1     . 0 .
"100001301" 1 1 32 0     . 0 .
"100001301" 2 2 30 1     . 0 .
"100001301" 3 5  5 0  6780 1 0
"100001302" 1 1 46 0     . 0 .
"100001302" 2 2 40 1     . 0 .
"100001302" 3 5 16 0  3060 1 0
"100001302" 4 5 14 1  2130 1 0
"100001303" 1 1 46 0     . 0 .
"100001303" 2 2 40 1     . 0 .
"100001303" 3 5 16 0  3060 1 0
"100001303" 4 5 14 1  2130 1 0
"100001304" 1 1 82 1     . 0 .
"100001304" 2 3 48 0     . 0 .
"100001304" 3 4 45 1     . 0 .
"100001304" 4 6 10 0  9190 1 0
"100001304" 5 4 42 1     . 0 .
"100001401" 1 1 58 0     . 0 .
"100001401" 2 2 52 1     . 0 .
"100001401" 3 5 29 0     . 0 .
"100001402" 1 1 38 0     . 0 .
"100001402" 2 2 34 1     . 0 .
"100001402" 3 8 46 1     . 0 .
"100011101" 1 1 65 0     . 0 .
"100011101" 2 2 58 1     . 0 .
"100011101" 3 5 25 0     . 0 .
"100011101" 4 5 23 0     . 0 .
"100011101" 5 5 21 0     . 0 .
"100011101" 6 5 22 1     . 0 .
"100011101" 7 5 19 1  7400 0 0
"100011201" 1 1 56 0     . 0 .
"100011201" 2 2 54 1     . 0 .
"100011201" 3 5 23 0     . 0 .
"100011201" 4 5 21 0 13000 1 0
"100011201" 5 5 19 0  4470 0 0
"100011201" 6 5 16 0  3270 0 0
"100011201" 7 5 20 1  6370 0 0
"100011201" 8 5 12 1   580 0 0
"100011301" 1 1 36 0     . 0 .
"100011301" 2 2 30 1     . 0 .
"100011301" 3 5 12 0  3650 0 0
"100011301" 4 5 10 0   630 0 0
"100011301" 5 5  7 0   610 0 0
"100011302" 1 1 45 0     . 0 .
"100011302" 2 2 40 1     . 0 .
"100011302" 3 5 15 0  2190 0 0
"100011302" 4 5 13 1   460 0 0
"100011302" 5 8 40 0     . 0 .
"100011303" 1 1 65 0     . 0 .
"100011303" 2 2 60 1     . 0 .
"100011303" 3 5 27 0     . 0 .
"100011303" 4 5 22 0     . 0 .
"100011303" 5 5 18 0  9550 1 0
"100011303" 6 5 12 0   660 0 0
"100011304" 1 1 45 0     . 0 .
"100011304" 2 2 40 1     . 0 .
"100011304" 3 5 20 0  8970 1 0
"100011304" 4 5 18 1  7460 0 0
"100011304" 5 5 14 1   680 0 0
"100011401" 1 1 45 0     . 0 .
"100011401" 2 2 39 1     . 0 .
"100011402" 1 1 60 0     . 0 .
"100011402" 2 2 51 1     . 0 .
"100011402" 3 5 21 1     . 0 .
"100021201" 1 1 50 0     . 0 .
"100021201" 2 2 45 1     . 0 .
"100021201" 3 5 20 0 93000 1 0
"100021201" 4 7 95 0     . 0 .
"100021202" 1 1 45 0     . 0 .
"100021202" 2 2 42 1     . 0 .
"100021202" 3 5 19 0 74000 1 0
"100021301" 1 1 42 0     . 0 .
"100021301" 2 2 38 1     . 0 .
"100021301" 3 5 12 0 32300 1 0
"100021301" 4 5  8 0 28800 1 0
"100021302" 1 1 39 0     . 0 .
"100021302" 2 2 36 1     . 0 .
"100021302" 3 5 18 0     . 0 .
"100021302" 4 5 14 1 18700 0 0
"100021303" 1 1 42 0     . 0 .
"100021303" 2 2 38 1     . 0 .
"100021303" 3 5  8 1 26500 1 0
"100021303" 4 5  4 1 24000 1 0
"100021303" 5 5  2 0     . 0 .
"100021304" 1 1 40 0     . 0 .
"100021304" 2 2 38 1     . 0 .
"100021304" 3 5 13 1 37000 1 0
"100021304" 4 5  9 1 33500 1 0
"100021304" 5 5  5 0 19000 1 0
"100021401" 1 1 60 1     . 0 .
"100021401" 2 5 30 0     . 0 .
"100021402" 1 1 40 0     . 0 .
end
label values lang_discrepancy disc
label def disc 0 "No", modify
label def disc 1 "Yes", modify
label values whether_same_grade grade

However, when I run

Code:

reg whether_same_grade i.lang_discrepancy i.female age [fweight=rounded_weight], cluster(hhid)

in my dataset, the regression output reports 274,642,225 observations. It does, however, correctly report 83,939 clusters in hhid. I tried cutting down the number of variables and running a very simple version of the model, but I cannot tell where this number comes from. Any ideas on what might be causing this?

Thanks in advance!
Anirudh

Nick Cox

Join Date: Mar 2014

Posts: 35693
#2

28 Jun 2023, 02:26

It comes from your use of frequency weights.
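
To make this concrete, here is a minimal sketch that uses Stata's shipped auto dataset in place of the survey data. The weight variable w is hypothetical and set to 3 for every row, purely for illustration. With fweights, each row stands for that many identical observations, so the reported number of observations is the sum of the weights over the estimation sample.

```stata
* Sketch only: auto.dta stands in for the survey data; w is a made-up weight
sysuse auto, clear
generate int w = 3                  // hypothetical integer frequency weight
regress mpg weight [fweight = w]
display e(N)                        // sum of the weights: 74 rows x 3 = 222
count if e(sample)                  // rows actually used: 74
```

On the original data, running -summarize rounded_weight if e(sample)- right after the regression and then -display %15.0fc r(sum)- should, on this reasoning, reproduce the reported number of observations.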
daniel klein

Join Date: Mar 2014

Posts: 3847
#3

28 Jun 2023, 02:36

Here is another guess: If rounded_weight indicates weights rounded to integers, you are probably doing something wrong. Frequency weights should be constructed as integers in the first place; if your weights are not integers, you probably do not want fweights.
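
If the survey supplies sampling (inverse-probability) weights, one common alternative is to pass the unrounded weight as a pweight. The sketch below assumes a variable named svy_weight that holds the original, unrounded weight; that name is hypothetical and does not appear in this thread. With pweights, the reported number of observations is the number of rows in the estimation sample.

```stata
* svy_weight is a hypothetical name for the original, unrounded survey weight
* Count rows whose weight is not a whole number; 0 means it was integer all along
count if svy_weight != int(svy_weight)

* If the count is positive, a sampling weight is probably the better fit
regress whether_same_grade i.lang_discrepancy i.female age [pweight = svy_weight], vce(cluster hhid)
display e(N)                        // number of rows used, not the sum of the weights
```

Whether pweights are the right choice depends on how the survey documentation defines the weight, so this is a starting point rather than a recommendation.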
Anirudh Tagat

Join Date: Apr 2014

Posts: 5
#4

28 Jun 2023, 03:02

Thanks, Nick and Daniel, for the suggestion to look at the weights. I will take a closer look at the rounded_weight variable; I suspect it is indeed causing this issue.