  • Why does the number of observations decrease when I increase the sample size?

    Hi all,

    Today I ran into a strange situation: the number of observations shrinks when I expand the sample.

    In particular, I count the observations for variables x1 and x2 in UNITEDS in my sample with

    count if x1 != . & inlist(GEOGN, "UNITEDS")

    count if x2 != . & inlist(GEOGN, "UNITEDS")

    The two counts are the same.

    [Image: a.PNG]

    Then I run the regression of x1 on x2 for this country (UNITEDS):

    Code:
    . reghdfe x1 x2 if  inlist(GEOGN, "UNITEDS"), a(TYPE2 INDC32#yr)
    (dropped 1013 singleton observations)
    note: x2 is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)
    (MWFE estimator converged in 14 iterations)
    note: x2 omitted because of collinearity
    
    HDFE Linear regression                            Number of obs   =     54,409
    Absorbing 2 HDFE groups                           F(   0,  47843) =          .
                                                      Prob > F        =          .
                                                      R-squared       =     0.8063
                                                      Adj R-squared   =     0.7797
                                                      Within R-sq.    =     0.0000
                                                      Root MSE        =     0.3916
    
    ------------------------------------------------------------------------------
              x1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              x2 |          0  (omitted)
           _cons |   1.307023   .0016788   778.54   0.000     1.303733    1.310314
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
           TYPE2 |      6131           0        6131     |
       INDC32#yr |       450          15         435     |
    -----------------------------------------------------+
    However, when I run the regression on the bigger sample (more countries), the number of observations decreases drastically. What is the reason for this shrinkage, and what should I do in this case?

    Code:
    . reghdfe x1 x2 if  inlist(GEOGN, "CHINA" "UNITEDS" "INDONESIA" "RUSSIAN" "MEXICO" "JAPAN" "PHILIPPINES" "VIETNAM" "SOUTHKOREA") | inlist(GEOGN,"COLOMBIA" "CANADA" "P
    > ERU" "MALAYSIA" "AUSTRALIA" "CHILE" "ECUADOR" "SINGAPORE" "NEWZEALAND"), a(TYPE2 INDC32#yr)
    (dropped 194 singleton observations)
    (MWFE estimator converged in 14 iterations)
    
    HDFE Linear regression                            Number of obs   =     22,689
    Absorbing 2 HDFE groups                           F(   1,  18715) =       0.07
                                                      Prob > F        =     0.7857
                                                      R-squared       =     0.7423
                                                      Adj R-squared   =     0.6876
                                                      Within R-sq.    =     0.0000
                                                      Root MSE        =     0.2734
    
    ------------------------------------------------------------------------------
              x1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              x2 |   .0160276    .058948     0.27   0.786    -.0995158    .1315709
           _cons |   .7591069   .0458817    16.54   0.000     .6691746    .8490393
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
           TYPE2 |      3614           0        3614     |
       INDC32#yr |       374          15         359     |
    -----------------------------------------------------+
    Update:
    As suggested by Ken Chui, I tried another way to select the subsample of countries (https://www.statalist.org/forums/for...st2-in-my-code).

    With this approach, the number of observations for the expanded sample is much larger:

    Code:
    gen include = 0
    foreach ctry in CHINA UNITEDS INDONESIA RUSSIAN MEXICO JAPAN PHILIPPINES ///
                    VIETNAM SOUTHKOREA COLOMBIA CANADA PERU MALAYSIA AUSTRALIA ///
                    CHILE ECUADOR SINGAPORE NEWZEALAND {
        replace include = 1 if GEOGN == "`ctry'"
    }
    reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr)
    Code:
    . reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr)
    (dropped 2165 singleton observations)
    (MWFE estimator converged in 13 iterations)
    
    HDFE Linear regression                            Number of obs   =    232,994
    Absorbing 2 HDFE groups                           F(   1, 209389) =      88.97
                                                      Prob > F        =     0.0000
                                                      R-squared       =     0.8183
                                                      Adj R-squared   =     0.7978
                                                      Within R-sq.    =     0.0004
                                                      Root MSE        =     0.3176
    
    ------------------------------------------------------------------------------
              x1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              x2 |   .0282004   .0029897     9.43   0.000     .0223407    .0340601
           _cons |   1.079796   .0023016   469.15   0.000     1.075285    1.084307
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
           TYPE2 |     23169           0       23169     |
       INDC32#yr |       450          15         435     |
    -----------------------------------------------------+
    Last edited by Phuc Nguyen; 30 Aug 2021, 18:16. Reason: adding update about approach

  • #2
    Sounds like singletons.

    Try
    Code:
    reghdfe x1 x2 if include == 1, a(TYPE2 INDC32#yr) keepsingletons
    Not to say that is correct, but see if it solves the N discrepancy.
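
    Separately, note that the inlist() calls in the second regression of post #1 have no commas between the country strings, whereas the documented form is inlist(z, a, b, ...). That may be why that run selected a different (and smaller) sample than the loop-based include flag. A minimal sketch for checking this, assuming the variables from post #1; include2 is a hypothetical name, and each inlist() call is kept short because, as far as I know, inlist() accepts only a limited number of string arguments (see help inlist):

    Code:
    ```stata
    * Flag built from inlist() with a comma between every string argument.
    * include2 is a hypothetical variable name used only for this check.
    generate byte include2 = ///
          inlist(GEOGN, "CHINA", "UNITEDS", "INDONESIA", "RUSSIAN", "MEXICO", "JAPAN") ///
        | inlist(GEOGN, "PHILIPPINES", "VIETNAM", "SOUTHKOREA", "COLOMBIA", "CANADA", "PERU") ///
        | inlist(GEOGN, "MALAYSIA", "AUSTRALIA", "CHILE", "ECUADOR", "SINGAPORE", "NEWZEALAND")

    * Cross-check against the loop-based flag from the update in post #1.
    * If both flags select the same rows, the off-diagonal cells are empty.
    tabulate include include2

    * List which countries the new flag actually selects.
    tabulate GEOGN if include2

    * Re-run the regression on the comma-separated selection.
    reghdfe x1 x2 if include2, a(TYPE2 INDC32#yr)
    ```
    If the two flags agree, this regression should use the same sample as the include == 1 run; if they differ, the second tabulate shows which countries went missing.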
