  • Choosing best OLS post-estimation command

    Which method produces the best estimate of the race difference in annual household income (equivalized for household composition): margins or predict?
    I use OLS to regress logged, equivalized annual household income on age, sex, education, region, urban residence and race for a sample of black and non-black households.
    CODE reg lnehhinc age i.sex i.educ i.region i.urban i.black
    I’ve tried the following post-estimation commands:
    margins black
    This generates an estimate for non-blacks and for blacks. Both are logged and need to be exponentiated. For example, if I exponentiate the black margin, I get CODE di exp(10.31384) = $30,146
    Alternatively, I can use predict
    Code predict yhat if black==1 & e(sample)
    sum yhat, detail
    These commands generate, among other things, a mean and a median, both of which again need to be exponentiated. For the median: CODE di exp(10.35673) = $31,468;
    for the mean: CODE di exp(10.37806) =$32,147
    Ultimately, my goal is to compute the ratio of black/non-black household income. Which result should I use? Or, put differently, what are the strengths and weaknesses of the three different results? Thanks!

  • #2
    There is no such thing as "the" ratio of black/non-black household income. As you have discovered, there are several things one can do that might be called a ratio of black/non-black household income.

    That said, there are some problems with the approaches you have used. Exponentiating a calculated mean is rarely a useful thing to do. In particular, it most definitely does not give a useful estimate of the mean of the non-log-transformed variable, except under circumstances that rarely arise and that would make the calculation pointless in any case. This is because the exponential function is highly non-linear and does not map means to means. To get the black and non-black averages of non-logged household income after your regression, the command would be:
    Code:
    margins black, expression(exp(predict(xb)))
    Note that this method will give you an estimate that is fully adjusted for all of the regressors in your model. That is, it will give you estimates that describe a counterfactual situation in which both the blacks and the non-blacks had the same joint distribution of all of the regressors as is observed in the entire regression estimation sample.
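
    Since the stated goal is a ratio, one way to obtain it together with a standard error is to post the margins and pass them to -nlcom-. The following is only a sketch on the nlsw88 example data shipped with Stata, not the original poster's data; the variables black and lnwage and the choice of covariates are assumptions made for illustration.

    ```stata
    * Sketch on shipped example data; variable names are illustrative assumptions
    sysuse nlsw88, clear
    generate byte black = race == 2 if race <= 2   // 1 = black, 0 = white, missing = other
    generate lnwage = ln(wage)

    regress lnwage i.black grade i.south ttl_exp
    scalar b_black = _b[1.black]                   // keep the log-scale coefficient

    * Fully adjusted predictions on the wage scale, posted so -nlcom- can use them
    margins black, expression(exp(predict(xb))) post

    * Black/white ratio with a delta-method standard error
    nlcom (ratio: _b[1.black] / _b[0.black])
    ```

    Because this model has no interactions involving black, the ratio equals exp(b_black): the averaging over the covariates cancels. A constant variance correction inside exp() would cancel from the ratio in the same way.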

    Using -predict-, you could do:
    Code:
    predict xb, xb
    gen non_logged_income = exp(xb)
    by black, sort: summarize non_logged_income
    These estimated means would be partially adjusted for the regressors in your model. That is, to the extent that the regression itself accounts for and removes some part of the variance in household income, that adjustment is reflected in the -predict- output. BUT, because the -by black, sort: summarize...- command calculates the incomes for blacks and non-blacks separately, the estimate for blacks is calculated only from observations on black people, and that for non-blacks only from observations on non-black people, so the two means are estimated with (almost certainly) different joint distributions of the regressors. There is no standardization of both groups to the joint distribution of the entire sample of blacks and non-blacks.
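
    The same group-specific averages can also be obtained from -margins- with the over() option, which averages within each group rather than standardizing to the whole sample, and which adds standard errors. A sketch, again on the nlsw88 example data with illustrative variable names (assumptions, not the original poster's data):

    ```stata
    sysuse nlsw88, clear
    generate byte black = race == 2 if race <= 2
    generate lnwage = ln(wage)
    regress lnwage i.black grade i.south ttl_exp

    * Route 1: -predict-, then average within each group
    predict double xb if e(sample), xb
    generate double wage_hat = exp(xb)
    by black, sort: summarize wage_hat

    * Route 2: -margins- with over(); each group keeps its own covariate distribution
    margins, over(black) expression(exp(predict(xb)))
    ```

    The two routes should report the same group means; only the second gives delta-method standard errors.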

    Unlike means, it is perfectly appropriate to exponentiate a median of a log-transformed variable to get an estimate of the median of the non-log transformed variable. That's because the exponential function is monotone increasing, so it does map medians to medians.
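
    A quick check of this point on the nlsw88 example data (an illustration, not the original poster's data; lnwage is an assumed variable name): the exponentiated median of log wage should match the median wage, apart from the averaging of the two middle observations when the sample size is even, while the exponentiated mean of log wage is the geometric mean and falls below the arithmetic mean.

    ```stata
    sysuse nlsw88, clear
    generate double lnwage = ln(wage)

    * Median and mean on the log scale, exponentiated
    quietly summarize lnwage, detail
    scalar exp_med_log  = exp(r(p50))
    scalar exp_mean_log = exp(r(mean))

    * Median and mean on the original scale, for comparison
    quietly summarize wage, detail
    display "median wage: " r(p50)  "   exp(median of log wage): " exp_med_log
    display "mean wage:   " r(mean) "   exp(mean of log wage):   " exp_mean_log
    ```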

    Depending on the use to which you plan to put your ratio of these estimates, and how you wish to interpret it, you might choose any of these three approaches. And you might even use unadjusted means or medians without any regression model, again depending on how you wish to interpret your results. All of these things are meaningful and useful for different purposes.


    • #3
      If you wish to compute (averaged) expected values of the original outcome variable, you would need to add half the residual variance to the linear prediction before exponentiation (see here). Here is an example:
      Code:
      . set seed 534
      . set obs 10000
      Number of observations (_N) was 0, now 10,000.
      . 
      . gen x = rnormal()
      . gen e = rnormal()
      . gen y = exp(1 + x + e)
      . 
      . gen lny = ln(y)
      . reg lny x
      
            Source |       SS           df       MS      Number of obs   =    10,000
      -------------+----------------------------------   F(1, 9998)      =   9950.57
             Model |  10177.0632         1  10177.0632   Prob > F        =    0.0000
          Residual |  10225.5687     9,998  1.02276143   R-squared       =    0.4988
      -------------+----------------------------------   Adj R-squared   =    0.4988
             Total |   20402.632     9,999  2.04046724   Root MSE        =    1.0113
      
      ------------------------------------------------------------------------------
               lny | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
                 x |   1.009715   .0101222    99.75   0.000     .9898737    1.029557
             _cons |   .9908134   .0101134    97.97   0.000     .9709891    1.010638
      ------------------------------------------------------------------------------
      
      . predict xb, xb
      . gen mu1 = exp(xb + 0.5*`e(rmse)'^2)
      . sum mu1
      
          Variable |        Obs        Mean    Std. dev.       Min        Max
      -------------+---------------------------------------------------------
               mu1 |     10,000    7.473725    9.461671   .1045006   206.1754
      
      . margins, expression(exp(predict(xb)+0.5*`e(rmse)'^2))
      
      Predictive margins                                      Number of obs = 10,000
      Model VCE: OLS
      
      Expression: exp(predict(xb)+0.5*1.011316679546227^2)
      
      ------------------------------------------------------------------------------
                   |            Delta-method
                   |     Margin   std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             _cons |   7.473725   .1062444    70.34   0.000      7.26549     7.68196
      ------------------------------------------------------------------------------
      You could also directly fit a lognormal model using gsem:
      Code:
      . gsem (y = x), family(lognormal) link(log)
      
      Iteration 0:  Log likelihood =  -35505.73  (not concave)
      Iteration 1:  Log likelihood =  -24282.71  
      Iteration 2:  Log likelihood = -24276.566  
      Iteration 3:  Log likelihood = -24276.557  
      Iteration 4:  Log likelihood = -24276.557  
      
      Generalized structural equation model              Number of obs   =    10,000
      Response: y                                        No. of failures =    10,000
      Family:   Lognormal                                Time at risk    = 74,822.96
      Form:     Accelerated failure time
      Link:     Log                     
      Log likelihood = -24276.557
      
      ------------------------------------------------------------------------------
                   | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
      y            |
                 x |   1.009715   .0101212    99.76   0.000     .9898781    1.029552
             _cons |   .9908134   .0101124    97.98   0.000     .9709935    1.010633
      -------------+----------------------------------------------------------------
      /y           |
              logs |   .0111531   .0070711                     -.0027059    .0250122
      ------------------------------------------------------------------------------
      
      . margins
      
      Predictive margins                                      Number of obs = 10,000
      Model VCE: OIM
      
      Expression: Predicted mean (y), predict(mu outcome(y))
      
      ------------------------------------------------------------------------------
                   |            Delta-method
                   |     Margin   std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             _cons |   7.472961   .1191761    62.71   0.000      7.23938    7.706542
      ------------------------------------------------------------------------------
      The latter solution would be slightly preferable because it takes into account that the variance parameter is estimated rather than known, or 'fixed', which is what we assume in the first example. This could yield a more accurate standard error of the averaged predictions.
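
      For reference, the half-variance term comes from the mean of a lognormal variable. A short derivation, assuming the errors are normal with constant variance \sigma^2 (the assumption behind both examples above; the notation is added here for illustration):

      ```latex
      \ln y = x\beta + \varepsilon, \qquad \varepsilon \mid x \sim N(0, \sigma^2)
      \;\Longrightarrow\;
      E[y \mid x] = e^{x\beta}\, E[e^{\varepsilon} \mid x]
                  = \exp\!\left(x\beta + \tfrac{1}{2}\sigma^2\right)
      ```

      In the simulated data the intercept, slope and error standard deviation are all 1 and x is standard normal, so E[y] = exp(1 + 1/2 + 1/2) = exp(2), roughly 7.39, which is close to the two averaged predictions of about 7.47.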


      • #4
        Suzanne:
        welcome to this forum.
        As per FAQ, please share what you typed and what Stata gave you back.
        That is worth more than tons of words aimed at describing what you did (and why you're complaining about your results). Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          If you just want the ratio of expected wage of blacks over the expected wage of whites, then you don't need margins. All you need to do is use a log link function and robust standard errors. The easiest way to do so in Stata is use poisson together with the vce(robust) option (see https://blog.stata.com/2011/08/22/us...tell-a-friend/ and the books cited in the post and the comments).

          poisson on its own is intended for count data, but with the vce(robust) option it becomes a quasi-likelihood model, i.e. a model that only cares about the conditional mean, and that conditional mean is modeled using a log-link function. It is important that you do not use the log of income, but the raw income itself. The link function takes care of taking the logarithm. The problem you had is that you model the mean of log income, and it is hard to recover the mean of income from that. With the link function you model the log of mean income, which makes it trivial to recover the mean income.

          Another major advantage is that the exponentiated regression coefficient is exactly the ratio of mean wages that you were looking for. With poisson you just add the irr option, and you get those exponentiated coefficients. So no need to predict means, compute ratios, figure out how to do inference on that. You just look at the regression table, and you have the coefficient you want.
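
          The reason the coefficient is the ratio can be written out in one line. This assumes the conditional mean is log-linear with no interactions involving black; the notation is added here for illustration, with z standing for the other covariates:

          ```latex
          E[y \mid \text{black}, z] = \exp(\beta_0 + \beta_1\,\text{black} + z\gamma)
          \;\Longrightarrow\;
          \frac{E[y \mid \text{black}=1, z]}{E[y \mid \text{black}=0, z]} = \exp(\beta_1)
          ```

          The covariate term cancels, so the ratio does not depend on z.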

          Here is an example:
          Code:
          . // load and prepare example data
          . sysuse nlsw88, clear
          (NLSW, 1988 extract)
          
          . 
          . gen byte black:black_lb = race == 2 if race <= 2
          (26 missing values generated)
          
          . label define black_lb 0 "White" 1 "Black"
          
          . 
          . gen byte urb:urb_lb = c_city + smsa
          
          . label define urb_lb 0 "rural" 1 "suburb" 2 "city"
          
          . 
          . // the model
          . poisson wage i.black i.urb grade i.south ttl_exp, irr vce(robust)
          note: noncount dependent variable encountered; results correspond to an exponential-mean model rather than a poisson
                model.
          
          Iteration 0:  Log pseudolikelihood = -6743.0911  
          Iteration 1:  Log pseudolikelihood = -6743.0909  
          
          Poisson regression                                      Number of obs =  2,218
                                                                  Wald chi2(6)  = 544.68
                                                                  Prob > chi2   = 0.0000
          Log pseudolikelihood = -6743.0909                       Pseudo R2     = 0.1115
          
          ------------------------------------------------------------------------------
                       |               Robust
                  wage |        IRR   std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                 black |
                Black  |   .9088679   .0320401    -2.71   0.007     .8481908    .9738856
                       |
                   urb |
               suburb  |   1.262356   .0455085     6.46   0.000     1.176239    1.354778
                 city  |   1.251422   .0504944     5.56   0.000     1.156268    1.354408
                       |
                 grade |   1.077399   .0060877    13.19   0.000     1.065533    1.089397
                       |
                 south |
                South  |   .8913582   .0261891    -3.91   0.000     .8414784    .9441946
               ttl_exp |    1.03761   .0032063    11.95   0.000     1.031345    1.043913
                 _cons |   1.602138    .146464     5.16   0.000     1.339321    1.916527
          ------------------------------------------------------------------------------
          Note: _cons estimates baseline incidence rate.
          
          . est store model
          
          . 
          . // predict wage for central city highschool graduates from non-south with 5 years experience
          . margins black, at(urb=2 grade=12 south=0 ttl_exp=5) post
          
          Adjusted predictions                                     Number of obs = 2,218
          Model VCE: Robust
          
          Expression: Predicted number of events, predict()
          At: urb     =  2
              grade   = 12
              south   =  0
              ttl_exp =  5
          
          ------------------------------------------------------------------------------
                       |            Delta-method
                       |     Margin   std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                 black |
                White  |   5.899192   .2509011    23.51   0.000     5.407435    6.390949
                Black  |   5.361586   .2250443    23.82   0.000     4.920507    5.802665
          ------------------------------------------------------------------------------
          
          . 
          . // ratio of black versus white income
          . di _b[1.black]/_b[0.black]
          .90886789
          
          . 
          . // predict wage for rural highschool graduates from non-south with 5 years experience
          . est restore model
          (results model are active now)
          
          . margins black, at(urb=0 grade=12 south=0 ttl_exp=5) post
          
          Adjusted predictions                                     Number of obs = 2,218
          Model VCE: Robust
          
          Expression: Predicted number of events, predict()
          At: urb     =  0
              grade   = 12
              south   =  0
              ttl_exp =  5
          
          ------------------------------------------------------------------------------
                       |            Delta-method
                       |     Margin   std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                 black |
                White  |   4.713989   .2144403    21.98   0.000     4.293694    5.134284
                Black  |   4.284393   .2187626    19.58   0.000     3.855626     4.71316
          ------------------------------------------------------------------------------
          
          . 
          . // ratio of black versus white income
          . di _b[1.black]/_b[0.black]
          .90886789
          
          . 
          . // strange: that is the same number, let's try something else:
          . // predict wage for central city highschool graduates from south with 5 years experience
          . est restore model
          (results model are active now)
          
          . margins black, at(urb=2 grade=12 south=1 ttl_exp=5) post
          
          Adjusted predictions                                     Number of obs = 2,218
          Model VCE: Robust
          
          Expression: Predicted number of events, predict()
          At: urb     =  2
              grade   = 12
              south   =  1
              ttl_exp =  5
          
          ------------------------------------------------------------------------------
                       |            Delta-method
                       |     Margin   std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                 black |
                White  |   5.258293   .2490601    21.11   0.000     4.770144    5.746442
                Black  |   4.779093   .2006576    23.82   0.000     4.385812    5.172375
          ------------------------------------------------------------------------------
          
          . 
          . // ratio of black versus white income
          . di _b[1.black]/_b[0.black]
          .90886789
          
          . 
          . // So regardless of your other characteristics: the ratio of black versus white income is .90886789 
          . // let's look at the model again
          . est restore model
          (results model are active now)
          
          . poisson, irr
          
          Poisson regression                                      Number of obs =  2,218
                                                                  Wald chi2(6)  = 544.68
                                                                  Prob > chi2   = 0.0000
          Log pseudolikelihood = -6743.0909                       Pseudo R2     = 0.1115
          
          ------------------------------------------------------------------------------
                       |               Robust
                  wage |        IRR   std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                 black |
                Black  |   .9088679   .0320401    -2.71   0.007     .8481908    .9738856
                       |
                   urb |
               suburb  |   1.262356   .0455085     6.46   0.000     1.176239    1.354778
                 city  |   1.251422   .0504944     5.56   0.000     1.156268    1.354408
                       |
                 grade |   1.077399   .0060877    13.19   0.000     1.065533    1.089397
                       |
                 south |
                South  |   .8913582   .0261891    -3.91   0.000     .8414784    .9441946
               ttl_exp |    1.03761   .0032063    11.95   0.000     1.031345    1.043913
                 _cons |   1.602138    .146464     5.16   0.000     1.339321    1.916527
          ------------------------------------------------------------------------------
          Note: _cons estimates baseline incidence rate.
          
          . 
          . // hey, the "effect" of black is exactly that ratio
          . // OK, I don't need all that -margins- stuff. I can just directly report the irr for black
          The difference between poisson (maximum quasi-likelihood) and the gsem approach (maximum likelihood) suggested by Joerg Luedicke (StataCorp) is that the former is a bit more robust: it only uses information from the conditional mean. If, for example, the variance is incorrectly specified, then quasi-likelihood does not care, whereas maximum likelihood estimates are influenced by such misspecification.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------


          • #6
            Thanks to everyone for their help. While I've been following Stata List for years, this was my first post. The suggestions of Dr. Schechter (median) & Professor Buis (mean) are especially appreciated. I will doubtless be using the List again. Suzanne
