
  • Replicating Autor, Katz and Kearney 2008. Trouble adjusting for Composition.

    Hello!

    I'm replicating the findings of Autor, Katz and Kearney (2008), specifically Table 1 and Figures 1-3. I've hit a wall with the generation of predicted values. The table describes the regression that generates the predicted values. AKK generate a predicted value of the log of the weekly wage for each combination of year, sex, education group, and experience group. There are 43 years, two sexes, five education groups, and four experience groups, so there are 1,720 cells with predicted values.

    This is what I HOPE to accomplish.

    Run one regression with the statsby prefix and the by(sex year) option. Have statsby save the regression coefficients, so that statsby ..., by(): regress ... creates a data set with 86 observations, each with a full set of regression coefficients.

    Next, create 20 prediction variables, one for each combination of education group and experience group. Use generate to multiply the coefficient estimates, which are variables here, by numbers, which are values of the original variables. (You're just evaluating the regression at 20 different values of the X matrix.) Each of the 20 generate commands treats race and region the same way: set race to white, which means you can ignore the race effects, and define macros holding the region proportions to weight the region coefficients. Across the 20 generate commands, the only things that change are the zeros and ones associated with the education and experience groups.
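    The generate step described above might be sketched as follows. This is only an illustration under stated assumptions: the names b_ed2, b_ex1, b_rg2 to b_rg4, b_cons, region_part, and pred_ed2_ex1 are hypothetical, the region shares are placeholder numbers, region_group is assumed to take the values 1 to 4, and white is assumed to be coded race_dummy == 0. Naming the saved coefficients explicitly in statsby avoids having to work out which _stat_# variable holds which coefficient.

    ```stata
    * Hypothetical region shares (placeholders, not real data);
    * region 1 is the base level and contributes 0 to the prediction.
    local p2 = 0.25
    local p3 = 0.35
    local p4 = 0.20

    * Save a few named coefficients for each female-by-year cell.
    * Only the coefficients needed for one prediction are listed here;
    * a full version would list every education and experience level.
    statsby b_ed2=_b[2.potential_education_group]              ///
            b_ex1=_b[1.potential_experience_group]             ///
            b_rg2=_b[2.region_group] b_rg3=_b[3.region_group]  ///
            b_rg4=_b[4.region_group] b_cons=_b[_cons],         ///
            by(female year) saving(coefs, replace):            ///
        regress lrwwage i.potential_education_group            ///
            i.potential_experience_group i.region_group race_dummy

    * Switch to the coefficient data set (this clears the data in memory).
    use coefs, clear

    * The region term is identical in all 20 predictions.
    generate region_part = `p2'*b_rg2 + `p3'*b_rg3 + `p4'*b_rg4

    * Base cell: base education group, base experience group, race_dummy == 0.
    generate pred_base = b_cons + region_part

    * Education group 2, experience group 1, race_dummy == 0.
    generate pred_ed2_ex1 = b_cons + b_ed2 + b_ex1 + region_part
    ```

    The remaining prediction variables would follow the same pattern, switching on the coefficient for the relevant education and experience level and leaving the base levels out.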

    I've run the statsby regression, and my code looks like this:

    statsby _b, by(female year) saving(coefs, replace): ///
    regress lrwwage i.potential_education_group i.potential_experience_group ///
    i.region_group race_dummy

    However, for some reason three of the coefficient variables, _stat_1, _stat_6, and _stat_10, are all 0:

    . tabulate _stat_1

    _b[1b.poten |
    tial_educat |
     ion_group] |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         84      100.00      100.00
    ------------+-----------------------------------
          Total |         84      100.00

    . tabulate _stat_6

    _b[0b.poten |
    tial_experi |
    ence_group] |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         84      100.00      100.00
    ------------+-----------------------------------
          Total |         84      100.00

    . tabulate _stat_10

    _b[1b.regio |
       n_group] |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         84      100.00      100.00
    ------------+-----------------------------------
          Total |         84      100.00

    .
    I'm looking for help figuring out why that is, and for tips on formatting the generate commands that create the 20 prediction variables. Anything helps! Thank you!

  • #2
    The answer is in plain sight in the -tabulate- outputs. Notice the names of the coefficients: _b[1b.potential_education_group], _b[0b.potential_experience_group], and _b[1b.region_group]. The b that follows each level number (1b, 0b, 1b) tells you that these are the reference (omitted) categories of the corresponding categorical variables. So their coefficients are, by definition of reference category, constrained to be zero. Remember, if a categorical variable has N levels, you get N-1 non-zero coefficients for it. That is what is happening here.
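    One way to see the reference categories directly is to refit a single cell outside -statsby- and ask for the base levels to be displayed. This is a minimal sketch, assuming the variable names from post #1; the cell (female == 0, year == 1980) is an arbitrary illustration and may not exist in the actual data.

    ```stata
    * Refit one cell and display the base (reference) level of each
    * factor variable in the coefficient table.
    regress lrwwage i.potential_education_group i.potential_experience_group ///
        i.region_group race_dummy if female == 0 & year == 1980, baselevels

    * The ib#. prefix selects a different reference category; here level 2
    * of the education variable becomes the base instead of the lowest level.
    regress lrwwage ib2.potential_education_group i.potential_experience_group ///
        i.region_group race_dummy if female == 0 & year == 1980, baselevels
    ```

    Whichever level is chosen as the base, its coefficient is fixed at zero and the other coefficients are measured relative to it.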
    Last edited by Clyde Schechter; 06 Jul 2023, 19:40.

    • #3
      Thank you so much for the reply, Clyde. I appreciate your feedback! I understand now that Stata is using them as base categories. I'm now running into another issue with statsby. I've added variables to make my regression align more closely with the methods of Autor, Katz and Kearney (2008). Notably, I have a potential education group variable that is built like this:

      generate potential_education_group = .
      replace potential_education_group = 0 if inrange(grade,0,10)
      replace potential_education_group = 1 if grade == 11
      replace potential_education_group = 2 if grade == 12
      replace potential_education_group = 3 if grade == 13
      replace potential_education_group = 4 if inrange(grade,14,16)
      replace potential_education_group = 5 if grade == 18
      tabulate potential_education_group
      drop if potential_education_group == .
      tabulate potential_education_group


      potential_e |
      ducation_gr |
              oup |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |    247,716       11.78       11.78
                1 |     75,530        3.59       15.37
                2 |    807,886       38.41       53.78
                3 |    256,348       12.19       65.97
                4 |    549,981       26.15       92.12
                5 |    165,809        7.88      100.00
      ------------+-----------------------------------
            Total |  2,103,270      100.00

      But when I run my statsby regression, it drops categories 0, 1, and 3 and makes category 2 the base.

      The new statsby command looks like this:

      statsby _b, by(female year) saving(coefs, replace): ///
      regress lrwwage i.potential_education_group i.potential_experience ///
      experience#i.broad_education exp_squared#i.broad_education ///
      exp_cubed#i.broad_education exp_quart#i.broad_education i.potential_education_group ///
      w_race_dummy b_race_dummy o_race_dummy i.region_group

      Here is part of the describe output for the saved coefficients:

      Contains data from coefs.dta
      Observations: 84 statsby: regress
      Variables: 485 7 Jul 2023 13:16
      -------------------------------------------------------------------------------------------------------------------
      Variable Storage Display Value
      name type format label Variable label
      -------------------------------------------------------------------------------------------------------------------
      female byte %8.0g FEMALE Sex Dummy Variable
      year int %8.0g survey year
      _stat_1 float %9.0g _b[2b.potential_education_group]
      _stat_2 float %9.0g _b[4.potential_education_group]
      _stat_3 float %9.0g _b[5.potential_education_group]
      _stat_4 float %9.0g _b[1b.potential_experience]
      _stat_5 float %9.0g _b[2.potential_experience]
      _stat_6 float %9.0g _b[3.potential_experience]
      _stat_7 float %9.0g _b[4.potential_experience]
      ....

      Any ideas why this is the case? How could I get coefficient estimates for categories 1 and 3?
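      A diagnostic sketch that may help narrow this down, using only the variable names from the command above. It does not establish the cause; it only checks whether education groups 0, 1, and 3 still have observations once every regressor is required to be non-missing, and how the two education variables overlap. Separately, note that in Stata's factor-variable notation a variable inside an interaction is treated as categorical unless it carries the c. prefix, which may be relevant to the experience terms.

      ```stata
      * Which education groups have complete data on every variable
      * used in the regression?
      tabulate potential_education_group if !missing(lrwwage,           ///
          potential_experience, experience, exp_squared, exp_cubed,      ///
          exp_quart, broad_education, w_race_dummy, b_race_dummy,        ///
          o_race_dummy, region_group)

      * How do the detailed and broad education variables line up,
      * including any missing values?
      tabulate potential_education_group broad_education, missing
      ```

      If groups 0, 1, and 3 are absent from the first table, the regression never sees them, and the lowest remaining level would become the base.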
