  • Regression reporting more observations than in dataset

    I am facing a peculiar issue in Stata (18.1): the number of observations reported in the regression output is much larger than both the number of observations in the dataset and the number actually used in the estimation. I am using a member-level household survey dataset for India. For example, -describe- reports 513,366 observations and 170 variables, and the data viewer shows the same. Basic summary statistics also give the same number of observations.

    Below is a data excerpt with the key variables used in the estimation:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 hhid byte(per_serialno relation_head) int age float female long total_expenditure_edu_amt float lang_discrepancy byte whether_same_grade
    "100001201" 1 1 44 0     . 0 .
    "100001201" 2 2 42 1     . 0 .
    "100001201" 3 5 20 1 29460 1 0
    "100001201" 4 5 18 1 51750 1 0
    "100001202" 1 1 43 0     . 0 .
    "100001202" 2 2 38 1     . 0 .
    "100001202" 3 5 19 0 26940 1 0
    "100001202" 4 7 65 0     . 0 .
    "100001202" 5 7 61 1     . 0 .
    "100001301" 1 1 32 0     . 0 .
    "100001301" 2 2 30 1     . 0 .
    "100001301" 3 5  5 0  6780 1 0
    "100001302" 1 1 46 0     . 0 .
    "100001302" 2 2 40 1     . 0 .
    "100001302" 3 5 16 0  3060 1 0
    "100001302" 4 5 14 1  2130 1 0
    "100001303" 1 1 46 0     . 0 .
    "100001303" 2 2 40 1     . 0 .
    "100001303" 3 5 16 0  3060 1 0
    "100001303" 4 5 14 1  2130 1 0
    "100001304" 1 1 82 1     . 0 .
    "100001304" 2 3 48 0     . 0 .
    "100001304" 3 4 45 1     . 0 .
    "100001304" 4 6 10 0  9190 1 0
    "100001304" 5 4 42 1     . 0 .
    "100001401" 1 1 58 0     . 0 .
    "100001401" 2 2 52 1     . 0 .
    "100001401" 3 5 29 0     . 0 .
    "100001402" 1 1 38 0     . 0 .
    "100001402" 2 2 34 1     . 0 .
    "100001402" 3 8 46 1     . 0 .
    "100011101" 1 1 65 0     . 0 .
    "100011101" 2 2 58 1     . 0 .
    "100011101" 3 5 25 0     . 0 .
    "100011101" 4 5 23 0     . 0 .
    "100011101" 5 5 21 0     . 0 .
    "100011101" 6 5 22 1     . 0 .
    "100011101" 7 5 19 1  7400 0 0
    "100011201" 1 1 56 0     . 0 .
    "100011201" 2 2 54 1     . 0 .
    "100011201" 3 5 23 0     . 0 .
    "100011201" 4 5 21 0 13000 1 0
    "100011201" 5 5 19 0  4470 0 0
    "100011201" 6 5 16 0  3270 0 0
    "100011201" 7 5 20 1  6370 0 0
    "100011201" 8 5 12 1   580 0 0
    "100011301" 1 1 36 0     . 0 .
    "100011301" 2 2 30 1     . 0 .
    "100011301" 3 5 12 0  3650 0 0
    "100011301" 4 5 10 0   630 0 0
    "100011301" 5 5  7 0   610 0 0
    "100011302" 1 1 45 0     . 0 .
    "100011302" 2 2 40 1     . 0 .
    "100011302" 3 5 15 0  2190 0 0
    "100011302" 4 5 13 1   460 0 0
    "100011302" 5 8 40 0     . 0 .
    "100011303" 1 1 65 0     . 0 .
    "100011303" 2 2 60 1     . 0 .
    "100011303" 3 5 27 0     . 0 .
    "100011303" 4 5 22 0     . 0 .
    "100011303" 5 5 18 0  9550 1 0
    "100011303" 6 5 12 0   660 0 0
    "100011304" 1 1 45 0     . 0 .
    "100011304" 2 2 40 1     . 0 .
    "100011304" 3 5 20 0  8970 1 0
    "100011304" 4 5 18 1  7460 0 0
    "100011304" 5 5 14 1   680 0 0
    "100011401" 1 1 45 0     . 0 .
    "100011401" 2 2 39 1     . 0 .
    "100011402" 1 1 60 0     . 0 .
    "100011402" 2 2 51 1     . 0 .
    "100011402" 3 5 21 1     . 0 .
    "100021201" 1 1 50 0     . 0 .
    "100021201" 2 2 45 1     . 0 .
    "100021201" 3 5 20 0 93000 1 0
    "100021201" 4 7 95 0     . 0 .
    "100021202" 1 1 45 0     . 0 .
    "100021202" 2 2 42 1     . 0 .
    "100021202" 3 5 19 0 74000 1 0
    "100021301" 1 1 42 0     . 0 .
    "100021301" 2 2 38 1     . 0 .
    "100021301" 3 5 12 0 32300 1 0
    "100021301" 4 5  8 0 28800 1 0
    "100021302" 1 1 39 0     . 0 .
    "100021302" 2 2 36 1     . 0 .
    "100021302" 3 5 18 0     . 0 .
    "100021302" 4 5 14 1 18700 0 0
    "100021303" 1 1 42 0     . 0 .
    "100021303" 2 2 38 1     . 0 .
    "100021303" 3 5  8 1 26500 1 0
    "100021303" 4 5  4 1 24000 1 0
    "100021303" 5 5  2 0     . 0 .
    "100021304" 1 1 40 0     . 0 .
    "100021304" 2 2 38 1     . 0 .
    "100021304" 3 5 13 1 37000 1 0
    "100021304" 4 5  9 1 33500 1 0
    "100021304" 5 5  5 0 19000 1 0
    "100021401" 1 1 60 1     . 0 .
    "100021401" 2 5 30 0     . 0 .
    "100021402" 1 1 40 0     . 0 .
    end
    label values lang_discrepancy disc
    label def disc 0 "No", modify
    label def disc 1 "Yes", modify
    label values whether_same_grade grade

    However, when I run
    Code:
    reg whether_same_grade i.lang_discrepancy i.female age [fweight=rounded_weight], cluster(hhid)

    on my dataset, the regression output reports 274,642,225 observations. However, it correctly reports that I have 83,939 clusters in hhid. I tried cutting down the number of variables and running a very simple version of this, but I cannot tell where this number is coming from. Any ideas on what might be causing this?

    Thanks in advance!
    Anirudh

  • #2
    It comes from your use of frequency weights.
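
    A minimal sketch of the mechanism, using made-up toy data rather than the survey in #1: with [fweight=w], Stata treats each row as w identical observations, so the reported number of observations is the sum of the weights over the estimation sample, not the row count. The names y, x, and w are placeholders for illustration only.

    ```stata
    * Toy data (hypothetical, not the poster's survey): 5 rows, integer weights
    clear
    input y x w
    1 0 10
    0 1 20
    1 1 30
    0 0 25
    1 0 15
    end

    * Unweighted: e(N) is the number of rows used, here 5
    regress y x
    display e(N)

    * Frequency-weighted: e(N) is the sum of the weights,
    * here 10 + 20 + 30 + 25 + 15 = 100
    regress y x [fweight=w]
    display e(N)

    * The sum of the weights over the estimation sample matches e(N)
    quietly summarize w if e(sample)
    display r(sum)
    ```

    On the real data, the analogous check would be to run -summarize rounded_weight if e(sample)- right after the regression and compare r(sum) with the reported number of observations.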


    • #3
      Here is another guess: If rounded_weight indicates weights rounded to integers, you are probably doing something wrong. Frequency weights should be constructed as integers in the first place; if your weights are not integers, you probably do not want fweights.
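
      A hedged sketch of the distinction, again on made-up toy data with placeholder names y, x, and w: Stata rejects non-integer frequency weights outright, whereas probability (sampling) weights accept them and leave the reported number of observations equal to the row count. Whether pweights are the right choice for the survey in #1 is an assumption that depends on how its weights were constructed.

      ```stata
      * Toy data (hypothetical): non-integer, sampling-style weights
      clear
      input y x w
      1 0 10.4
      0 1 19.7
      1 1 30.2
      0 0 25.5
      1 0 14.9
      end

      * fweights must be integers, so this command is expected to fail;
      * -capture- stores the nonzero return code in _rc instead of stopping
      capture regress y x [fweight=w]
      display _rc

      * pweights accept non-integer weights; e(N) is the number of rows used
      regress y x [pweight=w]
      display e(N)
      ```

      Rounding such weights to integers only hides the error message; it does not turn sampling weights into genuine frequency counts.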


      • #4
        Thanks, Nick and Daniel, for the suggestion to look at the weights. I will take a closer look at the rounded_weight variable; I suspect it is indeed causing this issue.
