(1) areg, xtreg, reghdfe or reg and (2) time-trend

Pietro Guglielmi

Join Date: Apr 2021
Posts: 13


03 May 2021, 05:13

Good morning, I am replicating the identification strategy of Lavy, Victor, and Analia Schlosser. 2011. “Mechanisms and Impacts of Gender Peer Effects at School.” American Economic Journal: Applied Economics 3 (2): 1–33.
I have a dataset on test scores taken in grade 8 by 10 cohorts (2010-2019) of around 500,000 students each, from around 5,000 schools. Therefore I have only one observation for each individual, but I observe the whole cohort of grade 8 of each school every year. So the dataset is a repeated cross-section of the population of grade 8 students in all Italian schools. The dataset looks something like this:

student_id	school_id	test_score	year	proportion_females
1	1	100	2010	0.49
2	1	103	2010	0.49
1001	2	98	2010	0.52
1002	2	100	2010	0.52
...
500,001	1	102	2011	0.50
500,002	1	101	2011	0.50
501,001	2	97	2011	0.51
501,002	2	99	2011	0.51
...
1,000,001	1	98	2012	0.48
1,000,002	1	100	2012	0.48
1,001,001	2	101	2012	0.49
1,001,002	2	97	2012	0.49

I want to exploit the supposedly exogenous year-to-year variation in the proportion of female peers within a school, relative to the school-specific time trend, to estimate the effect on test scores of having more female peers. My regression needs to include (1) school FE, (2) year FE, and (3) a school-specific time trend.

I thought of running the following regression:

Code:

reghdfe test_score proportion_females, absorb(school_id year)

But I saw that some papers using this strategy used something like

Code:

xi: reg test_score proportion_females i.school_id i.year

and some others used (1) just reg or (2) areg.

My questions are:
(0) Any general comments on this approach?
(1) Which regression command is suitable for this strategy? I know xtreg and reghdfe are used for panel regressions, and I am wondering whether my dataset can be considered a "school-panel" and therefore those commands would be ok.
(2) how to include school-specific time-trend? By adding c.year#school_id?

Thanks in advance,
Pietro


Andrew Musau

Join Date: Oct 2014

Posts: 10214
#2

04 May 2021, 03:17

reghdfe is from SSC (FAQ Advice #12).

(0) Any general comments on this approach?
(1) Which regression command is suitable for this strategy? I know xtreg and reghdfe are used for panel regressions, and I am wondering whether my dataset can be considered a "school-panel" and therefore those commands would be ok.

You must have been looking through some old code as you do not need the -xi- prefix with factor variables. The following are equivalent:

Code:

reg test_score proportion_females i.school_id i.year
reghdfe test_score proportion_females, absorb(school_id year)

regress has to invert a matrix that grows with the number of indicators, so a large set of dummies makes it slow and memory-hungry. For speed and efficiency, prefer reghdfe over regress for models with multiple fixed effects. You do not necessarily need panel data (repeated units over time) to use xtreg or reghdfe. As long as you have some level of nesting (e.g., students within schools, counties within states), you can use these estimators.
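To see the equivalence on a small example, here is a minimal sketch using Stata's Grunfeld example data rather than the school data from this thread. It assumes reghdfe (and its dependency ftools) is installed from SSC. The slope coefficients on mvalue and kstock should coincide across the three commands; only the way the indicators are handled differs.

```stata
* Sketch on the Grunfeld example data (not the poster's data).
* Assumes: ssc install ftools and ssc install reghdfe have been run.
webuse grunfeld, clear

* (a) OLS with explicit firm and year indicators
regress invest mvalue kstock i.company i.year

* (b) areg absorbs one set of indicators; the year indicators stay in the model
areg invest mvalue kstock i.year, absorb(company)

* (c) reghdfe absorbs both sets of indicators
reghdfe invest mvalue kstock, absorb(company year)
```

With only 10 firms and 20 years all three run instantly; the difference in speed shows up once the absorbed dimension has thousands of levels, as with around 5,000 schools.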

(2) how to include school-specific time-trend? By adding c.year#school_id?

Both should be indicators:

Code:

i.school#i.year
Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#3

04 May 2021, 04:46

Andrew Musau Thank you very much for the reply. You perfectly clarified my first two questions, again. I appreciate it.

On the second question: my explanation of what I want the "school-specific time-trend" to control for was unclear, sorry about that. I want the school-specific time-trend to account for the trend in each school of the proportion of females over time, and not for the trend in terms of test scores. For example school n might experience in my sample a trend of getting a bigger proportion of female pupils over time. I want the school-specific time trend to control for it, so that my key explanatory variable (proportion_females) would capture the effect on my dependent variable of the yearly exogenous deviation of the proportion of females in a school from the school’s trend.

Do you have suggestions on how I can do that? Thanks again.
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#4

04 May 2021, 08:18

I want the school-specific time-trend to account for the trend in each school of the proportion of females over time, and not for the trend in terms of test scores.

I do not get what you are asking here. The proportion of females is your independent variable of interest, and it has to vary over school and time for you to be able to identify its effect. Do you have a reference from the Lavy paper that is similar to what you want to do? I have the paper, so point me to the table and column number.

Last edited by Andrew Musau; 04 May 2021, 08:23.
Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#5

05 May 2021, 03:05

Andrew Musau thank you for your time.

Page 5-6: https://www.jstor.org/stable/pdf/41288627.pdf

Equations 1 and 2 describe the main identification strategy

where i denotes individuals, g denotes grades, s denotes schools, and t denotes time. y_igst is an achievement measure for a male/female student i in grade g, school s, and year t. α_g is a grade effect. β_s is a school effect. γ_t is a time effect. x_igst is a vector of student's covariates that includes mother's and father's years of schooling, number of siblings, immigration status, and ethnic origin, and indicators for missing values in these covariates. S_gst is a vector of characteristics of a grade g in school s and time t, and includes a quadratic function of enrollment and a set of variables for the average characteristics of the students in the grade. P_gst is the proportion of female students in grade g (which we refer to as the proportion female from here on), school s, and year t and ε_gst is the error term, which is composed of a school-specific random element that allows for any type of correlation within observations of the same school across time and an individual random element. The coefficient of interest is π, which captures the effects of having more female peers on student achievement.

For the estimates in equation (1) to have a causal interpretation, the unobserved determinant of achievement must be uncorrelated with the treatment variable. Including school fixed effects controls for the most obvious potential confounding factor - the endogenous sorting of students across schools. However, one may be concerned that there are time-varying unobserved factors that are also correlated with changes in the proportion of female students. To address this concern, we add to equation (1) a full set of school-specific linear time trends δ. In this case, identification is achieved from the deviation in the proportion of female students from its school long-term trend, and is estimated by the following equation

I am therefore trying to understand how to include in Stata δ, the school-specific linear time trend in the proportion of female students. The school-specific linear time trends are included in the results shown in Table 3. I figured that these trends need to control for the trend in the proportion of females that each school is experiencing, that is, for schools that are getting more (or fewer) female pupils over time, so that this variation (which is not exogenous) is not captured by the key independent variable (P). Hence I thought i.school#i.year would not be the right way to introduce such a school-specific time trend, as it would control for a trend in scores (y) and not in the proportion of females (P). Am I wrong?

Last edited by Pietro Guglielmi; 05 May 2021, 03:54.

Andrew Musau

Join Date: Oct 2014
Posts: 10214
#6

05 May 2021, 04:39

I am therefore trying to understand how to include in Stata δ, the school-specific linear time trend in the proportion of female students.

No, that is not what the authors state. They state that they are including school-specific linear time trends to account for time-varying unobserved factors that are also correlated with changes in the proportion of female students (you will not find the terminology "school-specific linear time trend in the proportion of female students"). If I want to include a linear time-trend in the Grunfeld dataset, I will just add time or year as a regressor.

Code:

webuse grunfeld, clear
*LINEAR TIME TREND
xtreg invest mvalue kstock year, fe

However, this time trend applies to all firms in the sample. For firm-specific linear time trends, I need to just interact firm and year.

Code:

webuse grunfeld, clear
*LINEAR FIRM-SPECIFIC TIME TRENDS
xtreg invest mvalue kstock i.company#c.year, fe

Res.:

Code:

. 
. *LINEAR TIME TREND

. 
. xtreg invest mvalue kstock year, fe

Fixed-effects (within) regression               Number of obs     =        200
Group variable: company                         Number of groups  =         10

R-sq:                                           Obs per group:
     within  = 0.7786                                         min =         20
     between = 0.8108                                         avg =       20.0
     overall = 0.8010                                         max =         20

                                                F(3,187)          =     219.16
corr(u_i, Xb)  = -0.2367                        Prob > F          =     0.0000

------------------------------------------------------------------------------
      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      mvalue |   .1107207   .0115852     9.56   0.000     .0878662    .1335751
      kstock |   .3535765   .0218494    16.18   0.000     .3104735    .3966795
        year |  -2.664218    .843852    -3.16   0.002    -4.328911   -.9995244
       _cons |   5109.172   1636.907     3.12   0.002     1879.994    8338.349
-------------+----------------------------------------------------------------
     sigma_u |   89.64192
     sigma_e |  51.552705
         rho |  .75146422   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9, 187) = 52.51                     Prob > F = 0.0000


. *LINEAR FIRM-SPECIFIC TIME TRENDS

. xtreg invest mvalue kstock i.company#c.year, fe

Fixed-effects (within) regression               Number of obs     =        200
Group variable: company                         Number of groups  =         10

R-sq:                                           Obs per group:
     within  = 0.8506                                         min =         20
     between = 0.6699                                         avg =       20.0
     overall = 0.5154                                         max =         20

                                                F(12,178)         =      84.45
corr(u_i, Xb)  = -0.9999                        Prob > F          =     0.0000

--------------------------------------------------------------------------------
        invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
        mvalue |    .109207   .0100774    10.84   0.000     .0893204    .1290935
        kstock |   .2994898   .0345494     8.67   0.000     .2313108    .3676689
               |
company#c.year |
            1  |   7.817957    3.66379     2.13   0.034     .5879051    15.04801
            2  |     7.8207   1.835886     4.26   0.000     4.197797     11.4436
            3  |  -6.247047   2.198013    -2.84   0.005    -10.58456   -1.909529
            4  |   .0642396   1.768885     0.04   0.971    -3.426444    3.554924
            5  |  -8.996451   2.010673    -4.47   0.000    -12.96427   -5.028628
            6  |  -1.689837     1.7435    -0.97   0.334    -5.130426    1.750753
            7  |  -3.778075    1.83712    -2.06   0.041    -7.403413   -.1527373
            8  |  -3.562496   1.730403    -2.06   0.041     -6.97724   -.1477525
            9  |  -3.585076   1.765422    -2.03   0.044    -7.068925   -.1012263
           10  |   .1631045   1.683186     0.10   0.923    -3.158463    3.484672
               |
         _cons |   2277.202   2025.001     1.12   0.262    -1718.897    6273.301
---------------+----------------------------------------------------------------
       sigma_u |  10603.001
       sigma_e |  43.403371
           rho |  .99998324   (fraction of variance due to u_i)
--------------------------------------------------------------------------------
F test that all u_i=0: F(9, 178) = 9.40                      Prob > F = 0.0000

.
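On the worry that the trend term "controls for a trend in y and not in P": by the Frisch-Waugh-Lovell result, putting unit-specific trends on the right-hand side is numerically the same as first removing the unit effect and unit-specific trend from the regressor and then using only the remaining deviation. A minimal sketch on the Grunfeld data, with mvalue standing in for the proportion of females (the variable name mvalue_dev is made up for this illustration):

```stata
* Sketch on the Grunfeld example data; mvalue plays the role of P here.
webuse grunfeld, clear

* Full model: firm effects plus firm-specific linear trends
regress invest mvalue i.company i.company#c.year

* Step 1: remove the firm effect and the firm-specific trend from the regressor
regress mvalue i.company i.company#c.year
predict double mvalue_dev, residuals   // deviation of mvalue from its firm trend

* Step 2: regress the outcome on that deviation only
regress invest mvalue_dev
```

The coefficient on mvalue_dev in step 2 should match the coefficient on mvalue in the full model. The standard errors in step 2 are not the right ones, because the degrees of freedom ignore the first stage, so this is only meant to show where the identifying variation comes from.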


Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#7

05 May 2021, 11:03

Andrew Musau thanks for the clarification (again)! So considering the empirical strategy the authors use (the one I included in #5), would you say the following regression does what they describe?

Code:

reghdfe test_score proportion_females i.company#c.year, absorb(school_id year)

Also, how can I estimate separate regressions for boys and girls using reghdfe? With the command regress I would have used

Code:

by gender: regress test_score proportion_females i.company#c.year i.school_id i.year

How can I do the same with reghdfe? Thanks again for your kind help!
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#8

05 May 2021, 11:41

You want to interact the school_id with year.

Code:

reghdfe test_score proportion_females i.school_id#c.year, absorb(school_id year)

Your model therefore is

$$ y_{ist} = \beta_{s} + \delta_{s}\,\text{year}_{t} + \gamma_{t} + \pi P_{st} + \epsilon_{ist} $$

as you do not include a grade effect, school covariates and grade characteristics.
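With around 5,000 schools, listing the school-specific slopes as regressors means reghdfe estimates and reports thousands of trend coefficients. A possible alternative, sketched here under the assumption that a recent reghdfe from SSC is installed and using the variable names from this thread, is to absorb the slopes together with the school effects:

```stata
* Sketch, assuming the variable names used in this thread.
* school_id##c.year absorbs a school intercept and a school-specific slope on year;
* year absorbs the common year effects.
reghdfe test_score proportion_females, absorb(school_id##c.year year)
```

The coefficient on proportion_females should be the same as in the version with i.school_id#c.year among the regressors; the individual trend coefficients are simply not displayed.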

Also, how can I estimate separate regressions for boys and girls using reghdfe? With the command regress I would have used

You can also use the -by- prefix with reghdfe.
Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#9

06 May 2021, 02:06

Andrew Musau Thank you very much for all this help! I ran this regression

Code:

by gender: reghdfe test_score proportion_females i.school_id#c.year, absorb(school_id year)

and get the following error: "reghdfe may not be combined with by r(190)"

I am using Stata 15.1 base on OS X. I tried to install the latest version of reghdfe, but I keep getting "version 5.9.0 03jun2020".

Is there any alternative way of running the regression by gender? Thanks

Andrew Musau

Join Date: Oct 2014
Posts: 10214

#10

06 May 2021, 03:26

Code:

reghdfe test_score proportion_females i.school_id#c.year if gender=="female", absorb(school_id year)
reghdfe test_score proportion_females i.school_id#c.year if gender=="male", absorb(school_id year)
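If you want both sets of results side by side, the two commands can be wrapped in a loop. This is only a sketch: it assumes gender is a string variable coded "female" and "male" as above, and the stored-estimate names m_female and m_male are made up for this example.

```stata
* Sketch: same regression by gender, storing each set of results.
* Assumes gender is a string variable with values "female" and "male".
foreach g in female male {
    reghdfe test_score proportion_females i.school_id#c.year if gender == "`g'", absorb(school_id year)
    estimates store m_`g'
}

* Compare the coefficient of interest across the two samples
estimates table m_female m_male, keep(proportion_females) b se
```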


Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#11

06 May 2021, 10:41

Andrew Musau I really appreciate it. I'll try your patience and ask some more questions; feel free to ignore me.
(1) Clustering Standard Errors: I thought it would make sense to cluster them at the school level, does it make sense to you?

(2) Interpretation of the coefficient of my independent variable: if test-scores (my dependent variable) are normalised to have a mean=200 and standard deviation=40, and the coefficient of my key independent variable is 20, and my independent variable is % of female peers in a school ----> then the interpretation of that coefficient is that a 10% increase in the proportion of female peers increases test scores by 2, i.e. by 5% of the standard deviation, correct?

(3) In a few schools (0.5%) I have significant attrition, meaning that I have data on less than 50% of the school's pupils. What should I do with those schools? Should I drop them?

Thank you so much again!!
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#12

06 May 2021, 11:25

Start a new thread if you have questions that do not involve the title of the thread.

(1) Clustering Standard Errors: I thought it would make sense to cluster them at the school level, does it make sense to you?

Yes.
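This also matches the error structure quoted in #5, which allows for any type of correlation among observations of the same school over time. A minimal sketch with the variable names from this thread:

```stata
* Sketch: same specification as in #8, with standard errors clustered by school.
reghdfe test_score proportion_females i.school_id#c.year, absorb(school_id year) vce(cluster school_id)
```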

(2) Interpretation of the coefficient of my independent variable: if test-scores (my dependent variable) are normalised to have a mean=200 and standard deviation=40, and the coefficient of my key independent variable is 20, and my independent variable is % of female peers in a school ----> then the interpretation of that coefficient is that a 10% increase in the proportion of female peers increases test scores by 2, i.e. by 5% of the standard deviation, correct?

The interpretation is exactly as in linear regression.
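Spelled out with the hypothetical numbers from the question, and assuming proportion_females is measured on a 0 to 1 scale as in the data excerpt in #1, a change of 0.10 is a 10 percentage point increase (not a 10% increase), and the arithmetic is:

```latex
\Delta \text{score} = \hat{\pi} \times \Delta P = 20 \times 0.10 = 2 \text{ points},
\qquad
\frac{2}{40} = 0.05 \text{ standard deviations}
```

If the variable were instead stored on a 0 to 100 scale, the same coefficient of 20 would imply an effect 100 times smaller per unit, so it is worth checking the scale before reporting the result.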

(3) In a few schools (0.5%) I have significant attrition, meaning that I have data on less than 50% of the school's pupils. What should I do with those schools? Should I drop them?

There are varied approaches ranging from doing nothing to multiple imputation and collecting more data. Collecting more data may not be feasible and multiple imputation may not help if your data is missing not at random (MNAR). You should dig deeper into these possibilities.
