keeping the same number of observations over time

Karen Arulsamy

Join Date: Feb 2021
Posts: 71
#1


18 Nov 2022, 02:47

Hi all,

I am running a diff-in-diff on individuals who were present in all six waves of the data, looking at seven outcome variables. However, the number of observations differs across estimations because not every individual responds to every question in every wave. The diff-in-diff is run separately for each post period (July 2020, Sep 2020 and Jan 2021, each compared against a baseline period of 2017-2019), which makes it harder to save one estimation sample and then re-run the other estimations on it. Does anyone have any ideas whether it's possible to ensure the same number of observations in a setup like this? In the code below, my goal is to have a constant number of observations across (1) and (2), keeping only observations for individuals who responded to this particular question in both July 2020 and Jan 2021.

keep if inlist(wave,9,15,16,17,18,19)
bys pidp: keep if _N == 6

global controls_dx agecat_sec educ_ref race_main jbstat_ref cathhincome_ref mhvalue_ref_casenessc marstat_ref nchild015_ref hhsize_ref ///
gor_main imonth_final

* (1)
diff dfruit_simp, t(treatfem) p(July2020) cov($controls) cluster(pidp)

* (2)
diff dfruit_simp, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pidp byte wave float(dfruit_simp treatfem July2020 Jan2021 agecat_sec educ_ref race_main jbstat_ref cathhincome_ref mhvalue_ref_casenessc marstat_ref nchild015_ref hhsize_ref gor_main imonth_final)
   76165  9 1 1 0 0 0 . 1 1 3 0 1 1 3  5  4
   76165 15 1 1 1 . 1 . 1 1 3 0 1 1 3  5  7
   76165 16 . 1 . . 1 . 1 1 3 0 1 1 3  5  9
   76165 17 . 1 . . 1 . 1 1 3 0 1 1 3  5 11
   76165 18 1 1 . 1 1 . 1 1 3 0 1 1 3  5  1
   76165 19 . 1 . . 1 . 1 1 3 0 1 1 3  5  3
 1587125  9 1 1 0 0 2 . 0 1 2 0 0 0 2  1  9
 1587125 15 0 1 1 . 2 . 0 1 2 0 0 0 2  1  7
 1587125 16 . 1 . . 2 . 0 1 2 0 0 0 2  1  9
 1587125 17 . 1 . . 2 . 0 1 2 0 0 0 2  1 11
 1587125 18 1 1 . 1 2 . 0 1 2 0 0 0 2  1  1
 1587125 19 . 1 . . 2 . 0 1 2 0 0 0 2  1  3
 4849085  9 0 0 0 0 0 1 1 1 3 1 1 0 3 11  4
 4849085 15 1 0 1 . 1 1 1 1 3 1 1 0 3 11  7
 4849085 16 . 0 . . 1 1 1 1 3 1 1 0 3 11  9
 4849085 17 . 0 . . 1 1 1 1 . 1 1 0 3 11 11
 4849085 18 1 0 . 1 1 1 1 1 3 1 1 0 3 11  1
 4849085 19 . 0 . . 1 1 1 1 3 1 1 0 3 11  3
68002725  9 0 1 0 0 2 . 0 4 1 0 0 0 1  7  3
68002725 15 0 1 1 . 3 . 0 4 1 0 0 0 1  7  7
68002725 16 . 1 . . 3 . 0 4 1 0 0 0 1  7  9
68002725 17 . 1 . . 3 . 0 4 1 0 0 0 1  7 11
68002725 18 0 1 . 1 3 . 0 4 1 0 0 0 1  7  1
68002725 19 . 1 . . 3 . 0 4 1 0 0 0 1  7  3
68008847  9 1 1 0 0 2 0 1 1 1 0 0 0 1  1  3
68008847 15 1 1 1 . 2 0 1 1 . 0 0 0 1  1  7
68008847 16 . 1 . . 2 0 1 1 1 0 0 0 1  1  9
68008847 17 . 1 . . 2 0 1 1 1 0 0 0 1  1 11
68008847 18 0 1 . 1 2 0 1 1 . 0 0 0 1  1  1
68008847 19 . 1 . . 2 0 1 1 1 . 0 0 1  1  3
68010887  9 1 1 0 0 2 1 1 1 2 0 1 0 2  1  3
68010887 15 1 1 1 . 2 1 1 1 2 0 1 0 2  1  7
68010887 16 . 1 . . 2 1 1 1 2 0 1 0 2  1  9
68010887 17 . 1 . . 2 1 1 1 2 0 1 0 2  1 11
68010887 18 0 1 . 1 2 1 1 1 . 0 1 0 2  1  1
end

Many thanks
Karen


Andrew Musau

Join Date: Oct 2014
Posts: 10188
#2

18 Nov 2022, 08:33

You have to find the intersection of the estimation samples. I am not familiar with the command diff, but in case it tags the observations that are part of the estimation sample, the following may be what you want:

Code:

quietly diff dfruit_simp, t(treatfem) p(July2020) cov($controls) cluster(pidp)
g sample1= e(sample)
diff dfruit_simp, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)
g sample = sample1 & e(sample)
drop sample1
diff dfruit_simp if sample, t(treatfem) p(July2020) cov($controls) cluster(pidp)
diff dfruit_simp if sample, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)
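
A minimal sketch of the same idea on Stata's shipped auto data, in case it helps to see the mechanics without diff. The two regress calls and the variable names s1, s2 and common are illustrative assumptions, not part of the original problem. The point is only that e(sample) is saved after each model and the final runs are restricted to the overlap.

```stata
* Toy illustration on the auto data that ships with Stata (not the poster's data)
sysuse auto, clear

* Model 1 uses rep78, which has missing values
quietly regress price mpg rep78
gen byte s1 = e(sample)

* Model 2 uses weight, which has no missing values
quietly regress price mpg weight
gen byte s2 = e(sample)

* Off-diagonal cells are observations used by only one of the two models
tabulate s1 s2

* Keep only observations used by both models, then re-estimate on that overlap
gen byte common = s1 & s2
regress price mpg rep78 if common
regress price mpg weight if common
```

After the restriction, both regress calls should report the same number of observations.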


Karen Arulsamy

Join Date: Feb 2021

Posts: 71
#3

21 Nov 2022, 02:12

Thank you for your response, Andrew. When I do this, the diff-in-diff isn't estimated properly. An alternative is to set up the diff-in-diff manually as below, where I've also included individual fixed effects, but in this case everything is omitted from the model in the results.

xtset pidp wave

* (1)
xtreg dfruit_simp i.treatfem##i.July2020 $controls if sample, fe vce(cluster pidp)

* (2)
xtreg dfruit_simp treatfem##Jan2021 $controls, fe vce(cluster pidp)

There are 10,699 observations for (2) and 10,963 observations for (1). My goal is to have 10,699 observations for both (1) & (2) when I estimate these regressions separately.

Many thanks
Karen

Andrew Musau

Join Date: Oct 2014
Posts: 10188
#4

21 Nov 2022, 08:28

Yes, I missed a step to distribute the estimation sample over identifiers. Try this:

Code:

quietly diff dfruit_simp, t(treatfem) p(July2020) cov($controls) cluster(pidp)
g sample1= e(sample)
diff dfruit_simp, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)
g sample2= e(sample)
bys pidp: egen sample = max(sample1 & sample2)
drop sample?
diff dfruit_simp if sample, t(treatfem) p(July2020) cov($controls) cluster(pidp)
diff dfruit_simp if sample, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)


Karen Arulsamy

Join Date: Feb 2021

Posts: 71
#5

21 Nov 2022, 08:45

Thank you Andrew, I've just tried that now but it still doesn't work. It still runs the same estimation sample available for July 2020 and January 2021 respectively.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1379
#6

21 Nov 2022, 10:58

Karen Arulsamy

Could you try this in place of the line in #4 that generates sample?

Code:

gen byte sample = min(sample1, sample2)

Last edited by Hemanshu Kumar; 21 Nov 2022, 11:00.
Andrew Musau

Join Date: Oct 2014

Posts: 10188
#7

21 Nov 2022, 13:45

Originally posted by Karen Arulsamy

Thank you Andrew, I've just tried that now but it still doesn't work. It still runs the same estimation sample available for July 2020 and January 2021 respectively.

Seems that I cannot get my head around it without a data example. Perhaps Hemanshu Kumar's suggestion is what you need.

Andrew Musau

Join Date: Oct 2014
Posts: 10188
#8

21 Nov 2022, 15:21

I think I have figured out the issue with #4. We need to distribute the in-sample observations independently, then find the intersection. Try this one more time:

Code:

quietly diff dfruit_simp, t(treatfem) p(July2020) cov($controls) cluster(pidp)
bys pidp: egen sample1= max(e(sample))
diff dfruit_simp, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)
bys pidp: egen sample2= max(e(sample))
diff dfruit_simp if sample1 & sample2, t(treatfem) p(July2020) cov($controls) cluster(pidp)
diff dfruit_simp if sample1 & sample2, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)
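
To see what the person-level step does, here is a small sketch on made-up data. The flags s1 and s2 stand in for e(sample) after each model, and in1, in2 and both are hypothetical names. A person is kept only if at least one of their rows was used in each model.

```stata
* Made-up panel: s1 and s2 play the role of e(sample) after models 1 and 2
clear
input long id byte wave byte(s1 s2)
1  9 1 1
1 15 1 0
1 18 0 1
2  9 1 0
2 15 1 0
2 18 0 0
end

* Spread each flag to every row of the person, separately per model
bysort id (wave): egen byte in1 = max(s1)
bysort id (wave): egen byte in2 = max(s2)

* Intersection at the person level
gen byte both = in1 & in2
list, sepby(id)
```

Person 1 has rows in both samples, so all three of their rows get both = 1. Person 2 never enters the second sample, so both = 0 on every row.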


Karen Arulsamy

Join Date: Feb 2021

Posts: 71
#9

07 Dec 2022, 05:42

No go on this unfortunately, Andrew Musau and Hemanshu Kumar. Andrew Musau, I've shared the data in the first post, but do let me know if you need more data.

Andrew Musau

Join Date: Oct 2014
Posts: 10188

#10

07 Dec 2022, 14:48

For the data in #1, the observation count is 12 in each case. Note that diff is from SSC (FAQ Advice #12). Enclose a reproducible example that exhibits different observation counts.

Code:

. diff dfruit_simp, t(treatfem) p(July2020) cov($controls) cluster(pidp)

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 12
            Before         After    
   Control: 1              1           2
   Treated: 5              5           10
            6              6
--------------------------------------------------------
 Outcome var.   | dfrui~p | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         |
   Control      | -0.000  |         |         |
   Treated      | 0.800   |         |         |
   Diff (T-C)   | 0.800   | 0.230   | 3.48    | 0.018**
After           |         |         |         |
   Control      | 1.000   |         |         |
   Treated      | 0.600   |         |         |
   Diff (T-C)   | -0.400  | 0.281   | 1.42    | 0.214
                |         |         |         |
Diff-in-Diff    | -1.200  | 0.230   | 5.22    | 0.003***
--------------------------------------------------------
R-square:    0.25
* Means and Standard Errors are estimated by linear regression
**Clustered Std. Errors
**Inference: *** p<0.01; ** p<0.05; * p<0.1

. diff dfruit_simp, t(treatfem) p(Jan2021) cov($controls) cluster(pidp)

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 12
            Before         After    
   Control: 1              1           2
   Treated: 5              5           10
            6              6
--------------------------------------------------------
 Outcome var.   | dfrui~p | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         |
   Control      | -0.000  |         |         |
   Treated      | 0.800   |         |         |
   Diff (T-C)   | 0.800   | 0.230   | 3.48    | 0.018**
After           |         |         |         |
   Control      | 1.000   |         |         |
   Treated      | 0.400   |         |         |
   Diff (T-C)   | -0.600  | 0.281   | 2.13    | 0.086*
                |         |         |         |
Diff-in-Diff    | -1.400  | 0.281   | 4.97    | 0.004***
--------------------------------------------------------
R-square:    0.31
* Means and Standard Errors are estimated by linear regression
**Clustered Std. Errors
**Inference: *** p<0.01; ** p<0.05; * p<0.1

.


Karen Arulsamy

Join Date: Feb 2021

Posts: 71
#11

09 Dec 2022, 07:47

Hi Andrew,

I would like to share the relevant dataset with you so you have all the observations, but this is not practical with dataex and I can't seem to attach the dta or Excel file here. I'm sharing a Dropbox link that you should be able to access: https://www.dropbox.com/s/2gkx5g8eyr..._Dec9.dta?dl=0 . Please let me know if you are unable to download the file.

We can forget about the 'diff' command for now and use xtreg with the DID interaction term.

global controls_simp i.agecat_sec i.educ_ref i.race_main i.jbstat_ref i.cathhincome_ref i.mhvalue_ref_cutoffW9 i.marstat_ref i.nchild015_ref c.hhsize_ref ///
i.gor_main i.imonth_final

xtreg dfruit_simp i.treatfem##i.July2020 $controls_simp, fe vce(cluster pidp)

xtreg dfruit_simp treatfem##Jan2021 $controls_simp, fe vce(cluster pidp)

Many thanks
Karen

Andrew Musau

Join Date: Oct 2014
Posts: 10188

#12

09 Dec 2022, 10:12

I think your confusion comes from the level at which you are counting. You state that you want both regressions to include the same individuals. In other words, $N$ should be the same. This value is given by the number of groups in xtreg. The number of observations ($NT$ in panel data) is not the same because xtreg does not drop singletons. I show here how to do that manually: https://www.statalist.org/forums/for...n-observations. However, it is much easier to use reghdfe from SSC.

Code:

use "C:\Users\amus\Downloads\statalist_Dec9.dta"
reghdfe dfruit_simp i.treatfem##i.July2020 $controls_simp, absorb(pidp) vce(cluster pidp)
bys pidp: egen insample1= max(e(sample))
reghdfe dfruit_simp i.treatfem##i.Jan2021 $controls_simp, absorb(pidp) vce(cluster pidp)
bys pidp: egen insample2= max(e(sample))
reghdfe dfruit_simp i.treatfem##i.July2020 $controls_simp if insample1 & insample2, absorb(pidp) vce(cluster pidp)
reghdfe dfruit_simp i.treatfem##i.Jan2021 $controls_simp if insample1 & insample2, absorb(pidp) vce(cluster pidp)

Results:

Code:

. reghdfe dfruit_simp i.treatfem##i.July2020 $controls_simp, absorb(pidp) vce(cluster pidp)
(dropped 73 singleton observations)
(MWFE estimator converged in 1 iterations)
warning: missing F statistic; dropped variables due to collinearity or too few clusters

HDFE Linear regression                            Number of obs   =     17,416
Absorbing 1 HDFE group                            F(   3,   8707) =          .
Statistics robust to heteroskedasticity           Prob > F        =          .
                                                  R-squared       =     0.7460
                                                  Adj R-squared   =     0.4919
                                                  Within R-sq.    =     0.0031
Number of clusters (pidp)    =      8,708         Root MSE        =     0.3263

                                    (Std. Err. adjusted for 8,708 clusters in pidp)
-----------------------------------------------------------------------------------
                  |               Robust
      dfruit_simp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
       1.treatfem |   1.029406   .0063857   161.21   0.000     1.016889    1.041923
       1.July2020 |   .0107143   .0078034     1.37   0.170    -.0045823    .0260108
                  |
treatfem#July2020 |
             1 1  |  -.0401202   .0100832    -3.98   0.000    -.0598856   -.0203549
                  |
            _cons |   .1085209   .0024722    43.90   0.000     .1036749    .1133669
-----------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        pidp |      8708        8708           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

.
. bys pidp: egen insample1= max(e(sample))

.
. reghdfe dfruit_simp i.treatfem##i.Jan2021 $controls_simp, absorb(pidp) vce(cluster pidp)
(dropped 160 singleton observations)
(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =     17,242
Absorbing 1 HDFE group                            F(   3,   8620) =      17.70
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.7244
                                                  Adj R-squared   =     0.4487
                                                  Within R-sq.    =     0.0060
Number of clusters (pidp)    =      8,621         Root MSE        =     0.3417

                                   (Std. Err. adjusted for 8,621 clusters in pidp)
----------------------------------------------------------------------------------
                 |               Robust
     dfruit_simp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
      1.treatfem |   .5478851   .3536691     1.55   0.121     -.145391    1.241161
       1.Jan2021 |   .0072082   .0081683     0.88   0.378    -.0088035      .02322
                 |
treatfem#Jan2021 |
            1 1  |  -.0550933   .0105969    -5.20   0.000    -.0758658   -.0343207
                 |
           _cons |    .389282    .205633     1.89   0.058    -.0138079    .7923719
----------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        pidp |      8621        8621           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

.
. bys pidp: egen insample2= max(e(sample))

.
. reghdfe dfruit_simp i.treatfem##i.July2020 $controls_simp if insample1 & insample2, absorb(pidp) vce(cluster pidp)
(MWFE estimator converged in 1 iterations)
warning: missing F statistic; dropped variables due to collinearity or too few clusters

HDFE Linear regression                            Number of obs   =     17,188
Absorbing 1 HDFE group                            F(   3,   8593) =          .
Statistics robust to heteroskedasticity           Prob > F        =          .
                                                  R-squared       =     0.7460
                                                  Adj R-squared   =     0.4918
                                                  Within R-sq.    =     0.0030
Number of clusters (pidp)    =      8,594         Root MSE        =     0.3261

                                    (Std. Err. adjusted for 8,594 clusters in pidp)
-----------------------------------------------------------------------------------
                  |               Robust
      dfruit_simp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
       1.treatfem |   1.029023   .0064335   159.95   0.000     1.016412    1.041634
       1.July2020 |   .0102863   .0078379     1.31   0.189    -.0050779    .0256506
                  |
treatfem#July2020 |
             1 1  |  -.0393096   .0101402    -3.88   0.000    -.0591868   -.0194324
                  |
            _cons |   .1098441   .0024875    44.16   0.000     .1049681    .1147201
-----------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        pidp |      8594        8594           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

.
. reghdfe dfruit_simp i.treatfem##i.Jan2021 $controls_simp if insample1 & insample2, absorb(pidp) vce(cluster pidp)
(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =     17,188
Absorbing 1 HDFE group                            F(   3,   8593) =      17.31
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.7246
                                                  Adj R-squared   =     0.4491
                                                  Within R-sq.    =     0.0059
Number of clusters (pidp)    =      8,594         Root MSE        =     0.3415

                                   (Std. Err. adjusted for 8,594 clusters in pidp)
----------------------------------------------------------------------------------
                 |               Robust
     dfruit_simp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
      1.treatfem |    .547438   .3536694     1.55   0.122     -.145839    1.240715
       1.Jan2021 |   .0066741   .0081745     0.82   0.414    -.0093499     .022698
                 |
treatfem#Jan2021 |
            1 1  |   -.054112   .0106065    -5.10   0.000    -.0749034   -.0333207
                 |
           _cons |   .3898068   .2056208     1.90   0.058    -.0132593     .792873
----------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        pidp |      8594        8594           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
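
For anyone who prefers to stay with xtreg, the sketch below shows one way the singleton step could be done by hand. It runs on simulated data, and the names y, treat, postA, postB, used1, used2, n1, n2 and common are assumptions for illustration, not variables from the shared file. The idea is to count, per person, how many rows each model used and keep only persons with at least two rows in both.

```stata
* Simulated panel: 100 hypothetical persons, baseline wave 0 plus post waves 1 and 2
clear
set seed 2022
set obs 300
gen long id = ceil(_n/3)
bysort id: gen byte wave = _n - 1
gen byte treat = mod(id, 2)

* Each post indicator is 0 at baseline, 1 in its own wave, missing in the other wave
gen byte postA = cond(wave == 2, ., wave == 1)
gen byte postB = cond(wave == 1, ., wave == 2)

* Outcome with item nonresponse in one post wave for some persons
gen y = rnormal() + 0.5*treat*(wave > 0)
replace y = . if wave == 1 & id <= 10
replace y = . if wave == 2 & inrange(id, 6, 20)

xtset id wave

* Rows used per person in each model (xtreg keeps singletons in e(sample))
quietly xtreg y i.treat##i.postA, fe vce(cluster id)
gen byte used1 = e(sample)
bysort id (wave): egen n1 = total(used1)

quietly xtreg y i.treat##i.postB, fe vce(cluster id)
gen byte used2 = e(sample)
bysort id (wave): egen n2 = total(used2)

* Keep persons who are not singletons in either model
gen byte common = n1 >= 2 & n2 >= 2

xtreg y i.treat##i.postA if common, fe vce(cluster id)
xtreg y i.treat##i.postB if common, fe vce(cluster id)
```

By construction only persons 21 to 100 survive the restriction here, so both final models should report 80 groups and 160 observations. The same pattern would apply to the real data with pidp, July2020 and Jan2021 in place of the simulated names.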


Karen Arulsamy

Join Date: Feb 2021

Posts: 71
#13

15 Dec 2022, 02:59

Thanks so much for your help, Andrew Musau. This is exactly what I needed. Appreciate it.