
  • (1) areg, xtreg, reghdfe or reg and (2) time-trend

    Good morning, I am replicating the identification strategy of Lavy, Victor, and Analia Schlosser. 2011. “Mechanisms and Impacts of Gender Peer Effects at School.” American Economic Journal: Applied Economics 3 (2): 1–33.
    I have a dataset on test scores taken in grade 8 by 10 cohorts (2010-2019) of around 500,000 students each, from around 5,000 schools. Therefore I have only one observation for each individual, but I observe the whole cohort of grade 8 of each school every year. So the dataset is a repeated cross-section of the population of grade 8 students in all Italian schools. The dataset looks something like this:
    student_id  school_id  test_score  year  proportion_females
    1           1          100         2010  0.49
    2           1          103         2010  0.49
    1001        2          98          2010  0.52
    1002        2          100         2010  0.52
    ...
    500,001     1          102         2011  0.50
    500,002     1          101         2011  0.50
    501,001     2          97          2011  0.51
    501,002     2          99          2011  0.51
    ...
    1,000,001   1          98          2012  0.48
    1,000,002   1          100         2012  0.48
    1,001,001   2          101         2012  0.49
    1,001,002   2          97          2012  0.49
    I want to exploit the supposedly exogenous year-to-year variation in the proportion of female peers within a school, measured relative to the school-specific time trend, to estimate the effect of having more female peers on test scores. My regression therefore needs to include (1) school FE, (2) year FE, and (3) a school-specific time trend.

    I thought of running the following regression:

    Code:
    reghdfe test_score proportion_females, absorb(school_id year)
    But I saw that some papers using this strategy used something like
    Code:
    xi: reg test_score proportion_females i.school_id i.year
    and some others used (1) just reg or (2) areg

    My questions are:
    (0) Any general comments on this approach?
    (1) Which regression command is suitable for this strategy? I know xtreg and reghdfe are used for panel regressions, and I am wondering whether my dataset can be considered a "school-panel" and therefore those commands would be ok.
    (2) how to include school-specific time-trend? By adding c.year#school_id?

    Thanks in advance,
    Pietro

  • #2
    reghdfe is from SSC (FAQ Advice #12).

    (0) Any general comments on this approach?
    (1) Which regression command is suitable for this strategy? I know xtreg and reghdfe are used for panel regressions, and I am wondering whether my dataset can be considered a "school-panel" and therefore those commands would be ok.
    You must have been looking through some old code as you do not need the -xi- prefix with factor variables. The following are equivalent:

    Code:
    reg test_score proportion_females i.school_id i.year
    reghdfe test_score proportion_females, absorb(school_id year)

    regress has to build and invert a matrix that contains every indicator, so with thousands of school dummies it becomes slow and memory-intensive. For speed and efficiency reasons, you should prefer reghdfe over regress for models with multiple sets of fixed effects. You do not necessarily need panel data (repeated units over time) to use xtreg or reghdfe. As long as you have some level of nesting (e.g., students within schools, counties within states, etc.), you can use these estimators.
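A minimal sketch of the equivalence claimed above, using the Grunfeld data shipped with Stata rather than the school data (so `company` stands in for `school_id`). It assumes reghdfe is installed from SSC; the scalar names are only illustrative.

```stata
* Requires the community-contributed command: ssc install reghdfe
webuse grunfeld, clear

* (a) explicit indicators for both sets of fixed effects
regress invest mvalue kstock i.company i.year
scalar b_reg = _b[mvalue]

* (b) absorb the firm effects, keep the year indicators explicit
areg invest mvalue kstock i.year, absorb(company)
scalar b_areg = _b[mvalue]

* (c) absorb both sets of fixed effects
reghdfe invest mvalue kstock, absorb(company year)
scalar b_hdfe = _b[mvalue]

* The three slope estimates should agree up to numerical precision
display b_reg, b_areg, b_hdfe
```

The point estimates are the same in all three; what changes is how much work Stata does to get them, which matters with around 5,000 schools.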


    (2) how to include school-specific time-trend? By adding c.year#school_id?
    Both should be indicators:

    Code:
    i.school#i.year


    • #3
      Andrew Musau Thank you very much for the reply. You perfectly clarified my first two questions, again. I appreciate it.

      On the second question: my explanation of what I want the "school-specific time-trend" to control for was unclear, sorry about that. I want the school-specific time-trend to account for the trend in each school of the proportion of females over time, and not for the trend in terms of test scores. For example, a given school might show a trend in my sample towards a larger proportion of female pupils over time. I want the school-specific time trend to control for that, so that my key explanatory variable (proportion_females) captures the effect on my dependent variable of the yearly exogenous deviation of the proportion of females in a school from the school's trend.

      Do you have suggestions on how I can do that? Thanks again.


      • #4
        I want the school-specific time-trend to account for the trend in each school of the proportion of females over time, and not for the trend in terms of test scores.
        I do not get what you are asking here. The proportion of females is your independent variable of interest, and it has to vary over school and time for you to be able to identify its effect. Do you have a reference from the Lavy paper that is similar to what you want to do? I have the paper, so point me to the table and column number.
        Last edited by Andrew Musau; 04 May 2021, 08:23.


        • #5
          Andrew Musau thank you for your time.

          Page 5-6: https://www.jstor.org/stable/pdf/41288627.pdf

          Equations 1 and 2 describe the main identification strategy
          [Image: equation (1) from the paper, screenshot not reproduced]

          where i denotes individuals, g denotes grades, s denotes schools, and t denotes time. y_igst is an achievement measure for a male/female student i in grade g, school s, and year t. α_g is a grade effect. β_s is a school effect. γ_t is a time effect. x_igst is a vector of student's covariates that includes mother's and father's years of schooling, number of siblings, immigration status, and ethnic origin, and indicators for missing values in these covariates. S_gst is a vector of characteristics of a grade g in school s and time t, and includes a quadratic function of enrollment and a set of variables for the average characteristics of the students in the grade. P_gst is the proportion of female students in grade g (which we refer to as the proportion female from here on), school s, and year t and ε_gst is the error term, which is composed of a school-specific random element that allows for any type of correlation within observations of the same school across time and an individual random element. The coefficient of interest is π, which captures the effects of having more female peers on student achievement.

          For the estimates in equation (1) to have a causal interpretation, the unobserved determinant of achievement must be uncorrelated with the treatment variable. Including school fixed effects controls for the most obvious potential confounding factor - the endogenous sorting of students across schools. However, one may be concerned that there are time-varying unobserved factors that are also correlated with changes in the proportion of female students. To address this concern, we add to equation (1) a full set of school-specific linear time trends δ. In this case, identification is achieved from the deviation in the proportion of female students from its school long-term trend, and is estimated by the following equation
          [Image: equation (2) from the paper, screenshot not reproduced]


          I am therefore trying to understand how to include in Stata δ, the school-specific linear time trend in the proportion of female students. The school-specific linear time trends are included in the results shown in Table 3. I figured that these trends need to control for the trend in the proportion of females that each school is experiencing, that is, for schools that are getting more (or fewer) female pupils over time, so that this non-exogenous variation is not captured by the key independent variable (P). Hence I thought i.school#i.year would not be the right way to introduce such a school-specific time trend, as it would control for a trend in scores (y) and not in the proportion of females (P). Am I wrong?
          Last edited by Pietro Guglielmi; 05 May 2021, 03:54.


          • #6
            I am therefore trying to understand how to include in Stata δ, the school-specific linear time trend in the proportion of female students.
            No, that is not what the authors state. They state that they are including school-specific linear time trends to account for time-varying unobserved factors that are also correlated with changes in the proportion of female students (you will not find the terminology "school-specific linear time trend in the proportion of female students"). If I want to include a linear time-trend in the Grunfeld dataset, I will just add time or year as a regressor.

            Code:
            webuse grunfeld, clear
            *LINEAR TIME TREND
            xtreg invest mvalue kstock year, fe
            However, this time trend applies to all firms in the sample. For firm-specific linear time trends, I need to just interact firm and year.

            Code:
            webuse grunfeld, clear
            *LINEAR FIRM-SPECIFIC TIME TRENDS
            xtreg invest mvalue kstock i.company#c.year, fe
            Res.:


            Code:
            . 
            . *LINEAR TIME TREND
            
            . 
            . xtreg invest mvalue kstock year, fe
            
            Fixed-effects (within) regression               Number of obs     =        200
            Group variable: company                         Number of groups  =         10
            
            R-sq:                                           Obs per group:
                 within  = 0.7786                                         min =         20
                 between = 0.8108                                         avg =       20.0
                 overall = 0.8010                                         max =         20
            
                                                            F(3,187)          =     219.16
            corr(u_i, Xb)  = -0.2367                        Prob > F          =     0.0000
            
            ------------------------------------------------------------------------------
                  invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  mvalue |   .1107207   .0115852     9.56   0.000     .0878662    .1335751
                  kstock |   .3535765   .0218494    16.18   0.000     .3104735    .3966795
                    year |  -2.664218    .843852    -3.16   0.002    -4.328911   -.9995244
                   _cons |   5109.172   1636.907     3.12   0.002     1879.994    8338.349
            -------------+----------------------------------------------------------------
                 sigma_u |   89.64192
                 sigma_e |  51.552705
                     rho |  .75146422   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            F test that all u_i=0: F(9, 187) = 52.51                     Prob > F = 0.0000
            
            
            . *LINEAR FIRM-SPECIFIC TIME TRENDS
            
            . xtreg invest mvalue kstock i.company#c.year, fe
            
            Fixed-effects (within) regression               Number of obs     =        200
            Group variable: company                         Number of groups  =         10
            
            R-sq:                                           Obs per group:
                 within  = 0.8506                                         min =         20
                 between = 0.6699                                         avg =       20.0
                 overall = 0.5154                                         max =         20
            
                                                            F(12,178)         =      84.45
            corr(u_i, Xb)  = -0.9999                        Prob > F          =     0.0000
            
            --------------------------------------------------------------------------------
                    invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            ---------------+----------------------------------------------------------------
                    mvalue |    .109207   .0100774    10.84   0.000     .0893204    .1290935
                    kstock |   .2994898   .0345494     8.67   0.000     .2313108    .3676689
                           |
            company#c.year |
                        1  |   7.817957    3.66379     2.13   0.034     .5879051    15.04801
                        2  |     7.8207   1.835886     4.26   0.000     4.197797     11.4436
                        3  |  -6.247047   2.198013    -2.84   0.005    -10.58456   -1.909529
                        4  |   .0642396   1.768885     0.04   0.971    -3.426444    3.554924
                        5  |  -8.996451   2.010673    -4.47   0.000    -12.96427   -5.028628
                        6  |  -1.689837     1.7435    -0.97   0.334    -5.130426    1.750753
                        7  |  -3.778075    1.83712    -2.06   0.041    -7.403413   -.1527373
                        8  |  -3.562496   1.730403    -2.06   0.041     -6.97724   -.1477525
                        9  |  -3.585076   1.765422    -2.03   0.044    -7.068925   -.1012263
                       10  |   .1631045   1.683186     0.10   0.923    -3.158463    3.484672
                           |
                     _cons |   2277.202   2025.001     1.12   0.262    -1718.897    6273.301
            ---------------+----------------------------------------------------------------
                   sigma_u |  10603.001
                   sigma_e |  43.403371
                       rho |  .99998324   (fraction of variance due to u_i)
            --------------------------------------------------------------------------------
            F test that all u_i=0: F(9, 178) = 9.40                      Prob > F = 0.0000
            
            .
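A minimal sketch of why a single set of trend terms also handles the trend in the regressor, under the simplifying assumption of one regressor (kstock is dropped here purely to keep the illustration short). By the Frisch-Waugh-Lovell result, adding firm effects and firm-specific linear trends to the regression gives the same slope as first removing each firm's level and trend from the regressor and then using only that deviation. This is the "deviation from its long-term trend" wording in the paper. The names `mvalue_dev`, `b_full` and `b_fwl` are illustrative.

```stata
webuse grunfeld, clear

* Full model: firm effects plus firm-specific linear trends
regress invest mvalue i.company i.company#c.year
scalar b_full = _b[mvalue]

* Remove each firm's level and linear trend from the regressor itself
regress mvalue i.company i.company#c.year
predict double mvalue_dev, residuals

* Regress the outcome on the detrended regressor alone
regress invest mvalue_dev
scalar b_fwl = _b[mvalue_dev]

* The two slope estimates should coincide up to numerical precision
display b_full, b_fwl
```

Only the point estimate carries over; the standard errors from the last regression ignore the first-step estimation and should not be used.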


            • #7
              Andrew Musau thanks for the clarification (again)! So considering the empirical strategy the authors use (the one I included in #5), would you say the following regression does what they describe?
              Code:
              reghdfe test_score proportion_females i.company#c.year, absorb(school_id year)
              Also, how can I estimate separate regressions for boys and girls using reghdfe? With the command regress I would have used
              Code:
              by gender: regress test_score proportion_females i.company#c.year i.school_id i.year
              How can I do the same with reghdfe? Thanks again for your kind help!


              • #8
                You want to interact the school_id with year.

                Code:
                reghdfe test_score proportion_females i.school_id#c.year, absorb(school_id year)

                Your model therefore is

                $$ y_{ist}= \beta_{s}+\delta_{s}year_{st}+\gamma_{t}+\pi P_{st}+\epsilon_{ist}$$

                since you include no grade effect, student covariates, or grade characteristics.
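With around 5,000 schools, listing i.school_id#c.year as a regressor reports one slope per school. A possible alternative, sketched here on the Grunfeld data and assuming a reghdfe version that accepts heterogeneous slopes inside absorb() (the `##c.` form described in its help file), is to absorb the trends as well. The scalar names are illustrative.

```stata
* Requires the community-contributed command: ssc install reghdfe
webuse grunfeld, clear

* Trends as explicit regressors: one reported coefficient per firm
reghdfe invest mvalue kstock i.company#c.year, absorb(company year)
scalar b_explicit = _b[mvalue]

* Same model with firm effects and firm-specific slopes both absorbed
reghdfe invest mvalue kstock, absorb(i.company##c.year year)
scalar b_absorbed = _b[mvalue]

* The coefficient of interest should match across the two forms
display b_explicit, b_absorbed
```

Standard errors may differ slightly between the two forms, because the degrees-of-freedom adjustment for absorbed slopes is handled differently from that for explicit regressors, so it is worth comparing both on a subsample first.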

                Also, how can I estimate separate regressions for boys and girls using reghdfe? With the command regress I would have used
                You can also use the -by- prefix with reghdfe.


                • #9
                  Andrew Musau Thank you very much for all this help! I ran this regression

                  Code:
                   by gender: reghdfe test_score proportion_females i.school_id#c.year, absorb(school_id year)
                  and get the following error: "reghdfe may not be combined with by r(190)"

                  I am using Stata 15.1 base on OS X. I tried to install the latest version of reghdfe, but I keep getting "version 5.9.0 03jun2020".

                  Is there any alternative way of running the regression by gender? Thanks


                  • #10
                    Code:
                    reghdfe test_score proportion_females i.school_id#c.year if gender=="female", absorb(school_id year)
                    reghdfe test_score proportion_females i.school_id#c.year if gender=="male", absorb(school_id year)
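The two commands above can also be written as a loop over the levels of gender, which avoids retyping the model. The sketch below runs on simulated data, since the real dataset is not available: the 40 schools, 6 cohorts, 30 students per cohort, and the data-generating values are all made-up assumptions, and only the variable names follow the thread.

```stata
* Requires the community-contributed command: ssc install reghdfe
clear
set seed 12345

* Simulated repeated cross-section (hypothetical sizes and values)
set obs 40
generate school_id = _n
expand 6
bysort school_id: generate year = 2009 + _n
generate proportion_females = 0.5 + rnormal(0, 0.03)   // constant within school-year
expand 30
generate gender = cond(runiform() < 0.5, "female", "male")
generate test_score = 200 + 20*proportion_females + rnormal(0, 40)

* One regression per level of gender
levelsof gender, local(groups)
foreach g of local groups {
    display as text _n "Gender: `g'"
    reghdfe test_score proportion_females i.school_id#c.year if gender == "`g'", absorb(school_id year)
}
```

On the real data only the last block is needed; the loop picks up whatever string values gender takes.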


                    • #11
                      Andrew Musau I really appreciate it. I'll try your patience and ask some more questions, feel free to ignore me
                      (1) Clustering Standard Errors: I thought it would make sense to cluster them at the school level, does it make sense to you?

                      (2) Interpretation of the coefficient of my independent variable: if test-scores (my dependent variable) are normalised to have a mean=200 and standard deviation=40, and the coefficient of my key independent variable is 20, and my independent variable is % of female peers in a school ----> then the interpretation of that coefficient is that a 10% increase in the proportion of female peers increases test scores by 2, i.e. by 5% of the standard deviation, correct?

                      (3) In a few schools (0.5%) I have significant attrition, meaning that I have data on less than 50% of the school's pupils. What should I do with those schools? Should I drop them?

                      Thank you so much again!!


                      • #12
                        Start a new thread if you have questions that do not involve the title of the thread.

                        1) Clustering Standard Errors: I thought it would make sense to cluster them at the school level, does it make sense to you?
                        Yes.
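As a sketch of the syntax only, using the Grunfeld data (where `company` plays the role of `school_id`) and assuming reghdfe is installed from SSC, clustering is requested with the vce() option:

```stata
* Requires the community-contributed command: ssc install reghdfe
webuse grunfeld, clear

* Standard errors clustered on the grouping unit
reghdfe invest mvalue kstock, absorb(company year) vce(cluster company)

* Number of clusters used
display e(N_clust)
```

Grunfeld has only a handful of firms, which is too few clusters for reliable inference; with roughly 5,000 schools that concern does not arise.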

                        (2) Interpretation of the coefficient of my independent variable: if test-scores (my dependent variable) are normalised to have a mean=200 and standard deviation=40, and the coefficient of my key independent variable is 20, and my independent variable is % of female peers in a school ----> then the interpretation of that coefficient is that a 10% increase in the proportion of female peers increases test scores by 2, i.e. by 5% of the standard deviation, correct?
                        The interpretation is exactly as in linear regression.
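As a worked check of the arithmetic in the question, assuming proportion_females is stored on a 0-1 scale (as the values such as 0.49 in #1 suggest), so that "10%" means a change of 0.10, i.e. 10 percentage points:

$$ \Delta \hat{y} = \hat{\pi}\,\Delta P = 20 \times 0.10 = 2 \text{ points}, \qquad \frac{2}{40} = 0.05 \text{ SD} $$

If the variable were instead stored in percent (0-100), a coefficient of 20 would mean 20 points per percentage point, so the scale of the regressor needs to be checked before interpreting the estimate.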

                        (3) In a few schools (0.5%) I have significant attrition, meaning that I have data on less than 50% of the school's pupils. What should I do with those schools? Should I drop them?
                        There are varied approaches ranging from doing nothing to multiple imputation and collecting more data. Collecting more data may not be feasible and multiple imputation may not help if your data is missing not at random (MNAR). You should dig deeper into these possibilities.
