Why does the number of observations decrease when I increase the sample size?

Phuc Nguyen

Join Date: Mar 2017
Posts: 348

30 Aug 2021, 17:46

Hi all,

Today I ran into a strange situation: the number of observations shrinks when I expand the sample.

In particular, I count the observations for variables x1 and x2 in UNITEDS in my sample with

count if x1 != . & inlist(GEOGN, "UNITEDS")

count if x2 != . & inlist(GEOGN, "UNITEDS")

The results for these two variables are the same.

[Attached image: a.PNG]
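Equal separate counts do not by themselves tell me how many observations the regression can use, because an observation enters the estimation only if both variables are non-missing at once. A minimal sketch of that joint check, assuming the same variable names as above (the absorbed variables TYPE2, INDC32 and yr would also need to be non-missing, which this does not test):

```stata
* Count observations where x1 and x2 are BOTH non-missing for UNITEDS.
* missing(x1, x2) returns 1 if any of its arguments is missing.
count if !missing(x1, x2) & GEOGN == "UNITEDS"
```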

Then I run the regression of x1 on x2 for this country (UNITEDS):

Code:

. reghdfe x1 x2 if  inlist(GEOGN, "UNITEDS"), a(TYPE2 INDC32#yr)
(dropped 1013 singleton observations)
note: x2 is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)
(MWFE estimator converged in 14 iterations)
note: x2 omitted because of collinearity

HDFE Linear regression                            Number of obs   =     54,409
Absorbing 2 HDFE groups                           F(   0,  47843) =          .
                                                  Prob > F        =          .
                                                  R-squared       =     0.8063
                                                  Adj R-squared   =     0.7797
                                                  Within R-sq.    =     0.0000
                                                  Root MSE        =     0.3916

------------------------------------------------------------------------------
          x1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |          0  (omitted)
       _cons |   1.307023   .0016788   778.54   0.000     1.303733    1.310314
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
       TYPE2 |      6131           0        6131     |
   INDC32#yr |       450          15         435     |
-----------------------------------------------------+

However, when I run the regression on the bigger sample (more countries), the number of observations decreases drastically. What is the reason behind this shrinking, and what should I do in this case?

Code:

. reghdfe x1 x2 if  inlist(GEOGN, "CHINA" "UNITEDS" "INDONESIA" "RUSSIAN" "MEXICO" "JAPAN" "PHILIPPINES" "VIETNAM" "SOUTHKOREA") | inlist(GEOGN,"COLOMBIA" "CANADA" "P
> ERU" "MALAYSIA" "AUSTRALIA" "CHILE" "ECUADOR" "SINGAPORE" "NEWZEALAND"), a(TYPE2 INDC32#yr)
(dropped 194 singleton observations)
(MWFE estimator converged in 14 iterations)

HDFE Linear regression                            Number of obs   =     22,689
Absorbing 2 HDFE groups                           F(   1,  18715) =       0.07
                                                  Prob > F        =     0.7857
                                                  R-squared       =     0.7423
                                                  Adj R-squared   =     0.6876
                                                  Within R-sq.    =     0.0000
                                                  Root MSE        =     0.2734

------------------------------------------------------------------------------
          x1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   .0160276    .058948     0.27   0.786    -.0995158    .1315709
       _cons |   .7591069   .0458817    16.54   0.000     .6691746    .8490393
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
       TYPE2 |      3614           0        3614     |
   INDC32#yr |       374          15         359     |
-----------------------------------------------------+
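One thing that may be worth checking here: the country names in the inlist() calls above are not separated by commas, whereas the documented form is inlist(z, a, b, ...) with comma-separated arguments. I have not verified how Stata parses the comma-free version, so the following is only a sketch of the documented form. As far as I understand the documentation, string arguments are limited to 10 per call including the variable, which is why the list stays split in two.

```stata
* Comma-separated inlist(); each call has GEOGN plus 9 strings = 10 arguments.
* The /// line continuation works in a do-file, not in the Command window.
count if inlist(GEOGN, "CHINA", "UNITEDS", "INDONESIA", "RUSSIAN", "MEXICO", ///
                "JAPAN", "PHILIPPINES", "VIETNAM", "SOUTHKOREA")           ///
       | inlist(GEOGN, "COLOMBIA", "CANADA", "PERU", "MALAYSIA",           ///
                "AUSTRALIA", "CHILE", "ECUADOR", "SINGAPORE", "NEWZEALAND")
```

If this count is far above the 22,689 observations reported in the log, the if-condition rather than the estimator would be the first place to look.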

Update:
As suggested by Ken Chui, I applied another way of selecting a subsample of countries (https://www.statalist.org/forums/for...st2-in-my-code).

It turns out that the number of observations for the expanded sample is much larger:

Code:

gen include = 0
foreach ctry in CHINA UNITEDS INDONESIA RUSSIAN MEXICO JAPAN PHILIPPINES ///
                VIETNAM SOUTHKOREA COLOMBIA CANADA PERU MALAYSIA AUSTRALIA ///
                CHILE ECUADOR SINGAPORE NEWZEALAND{
    replace include = 1 if GEOGN == "`ctry'"   
}
reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr)

Code:

. reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr)
(dropped 2165 singleton observations)
(MWFE estimator converged in 13 iterations)

HDFE Linear regression                            Number of obs   =    232,994
Absorbing 2 HDFE groups                           F(   1, 209389) =      88.97
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.8183
                                                  Adj R-squared   =     0.7978
                                                  Within R-sq.    =     0.0004
                                                  Root MSE        =     0.3176

------------------------------------------------------------------------------
          x1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x2 |   .0282004   .0029897     9.43   0.000     .0223407    .0340601
       _cons |   1.079796   .0023016   469.15   0.000     1.075285    1.084307
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
       TYPE2 |     23169           0       23169     |
   INDC32#yr |       450          15         435     |
-----------------------------------------------------+
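To see which countries actually enter each estimation sample, a tabulation restricted to e(sample) can be run immediately after any of the regressions above. This is a small sketch, not a fix in itself; it relies only on GEOGN and on the e(sample) marker that estimation commands leave behind.

```stata
* Run right after a reghdfe call: list the countries in the estimation sample
tabulate GEOGN if e(sample)
```

Comparing this table across the inlist() version and the include == 1 version should show whether some countries were silently left out of the former.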

Last edited by Phuc Nguyen; 30 Aug 2021, 18:16. Reason: adding update about approach

George Ford

Join Date: Aug 2014

Posts: 3148
#2

31 Aug 2021, 17:46

Sounds like singletons.

Try

Code:

reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr) keepsingletons

Not to say that is correct, but see if it solves the N discrepancy.
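A short sketch of that comparison, assuming the include variable created in the first post. Note that the logs above report 1013, 194 and 2165 dropped singletons, so singletons may turn out to explain only a small part of the gap.

```stata
* Estimate with and without singletons and compare the sample sizes
reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr)
display "N with singletons dropped: " e(N)

reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr) keepsingletons
display "N with singletons kept:    " e(N)
```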