Choosing best OLS post-estimation command

Suzanne Model

Join Date: Jun 2025

Posts: 2
#1

03 Jun 2025, 16:25

Which method produces the best estimate of the race difference in annual household income (equivalized for household composition): margins or predict?
I use OLS to regress logged, equivalized annual household income on age, sex, education, region, urban residence and race for a sample of black and non-black households.

Code:

reg lnehhinc age i.sex i.educ i.region i.urban i.black

I’ve tried the following post-estimation commands:

Code:

margins black

This generates an estimate for non-blacks and for blacks. Both are on the log scale and need to be exponentiated. For example, if I exponentiate the black margin with di exp(10.31384), I get $30,146.
Alternatively, I can use predict:

Code:

predict yhat if black==1 & e(sample)
sum yhat, detail

These commands generate, among other things, a mean and a median, both of which, again, need to be exponentiated. For the median, di exp(10.35673) gives $31,468; for the mean, di exp(10.37806) gives $32,147.
Ultimately, my goal is to compute the ratio of black/non-black household income. Which result should I use? Or, put differently, what are the strengths and weaknesses of the three different results? Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#2

03 Jun 2025, 18:56

There is no such thing as "the" ratio of black/non-black household income. As you have discovered, there are several things one can do that might be called a ratio of black/non-black household income.

That said, there are some problems with the approaches you have used. Exponentiating a calculated mean is rarely a useful thing to do. In particular, it most definitely does not give a useful estimate of the mean of the non-log-transformed variable, except under circumstances that rarely arise and that would make the calculation pointless in any case. This is because the exponential function is highly non-linear and does not map means to means. To get the black and non-black averages of non-logged household income after your regression, the command would be:

Code:

margins black, expression(exp(predict(xb)))

Note that this method will give you an estimate that is fully adjusted for all of the regressors in your model. That is, it will give you estimates that describe a counterfactual situation in which both the blacks and the non-blacks had the same joint distribution of all of the regressors as is observed in the entire regression estimation sample.
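If the end goal is a ratio rather than the two levels, one possible follow-up is to post the margins and form the ratio with -nlcom-, which also gives a delta-method standard error. The following is only a sketch on the nlsw88 example data shipped with Stata, not on the original poster's data; the variables black and lnwage and the choice of covariates are assumptions made for illustration.

```stata
* Sketch on shipped example data (assumed stand-in for the poster's data)
sysuse nlsw88, clear
gen byte black = race == 2 if race <= 2   // 1 = black, 0 = white, other = missing
gen lnwage = ln(wage)                     // log outcome, as in the original post

* OLS on the log outcome
regress lnwage i.black grade i.south ttl_exp

* Average of exp(xb) for each group over the whole estimation sample;
* -post- stores the two margins so they can be referenced via _b[]
margins black, expression(exp(predict(xb))) post

* Ratio black/white with a delta-method standard error
nlcom (ratio: _b[1.black] / _b[0.black])
```

Because this model has no interactions with black, exp(xb) differs between the two counterfactual groups only by the factor exp of the black coefficient, so the ratio should coincide with that exponentiated coefficient from the regression table.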

Using -predict-, you could do:

Code:

predict xb, xb
gen non_logged_income = exp(xb)
by black, sort: summarize non_logged_income

These estimated means would be partially adjusted for the regressors in your model. That is, to the extent that the regression itself accounts for and removes some part of the variance in household income, that adjustment is reflected in the -predict- output. BUT, by separately calculating the incomes for blacks and non-blacks in the -by black, sort: summarize...- command, the estimate for blacks is calculated only from observations on black people, and that for non-blacks only from observations on non-black people, so the two means are estimated with (almost certainly) different joint distributions of the regressors. There is no standardization of both groups to the joint distribution of the entire sample of blacks and non-blacks.

Unlike means, it is perfectly appropriate to exponentiate a median of a log-transformed variable to get an estimate of the median of the non-log transformed variable. That's because the exponential function is monotone increasing, so it does map medians to medians.
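The contrast between means and medians can be checked on any skewed variable. Here is a minimal sketch using wage from the shipped nlsw88 data as an assumed stand-in for household income; lnwage is a name introduced only for this illustration.

```stata
* Sketch: exponentiating a mean of logs versus a median of logs
sysuse nlsw88, clear
gen lnwage = ln(wage)

* Summary statistics on the log scale, back-transformed
quietly summarize lnwage, detail
display "exp(mean of ln wage)   = " exp(r(mean))
display "exp(median of ln wage) = " exp(r(p50))

* The same statistics computed directly on the raw scale
quietly summarize wage, detail
display "mean of wage           = " r(mean)
display "median of wage         = " r(p50)
```

The exponentiated mean of the logs is the geometric mean, so it should come out below the arithmetic mean of wage. The two medians should agree, apart from a small discrepancy that can arise when the number of observations is even and the median is taken as the average of the two middle values.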

Depending on the use to which you plan to put your ratio of these estimates, and how you wish to interpret it, you might choose any of these three approaches. And you might even use unadjusted means or medians without any regression model, again depending on how you wish to interpret your results. All of these things are meaningful and useful for different purposes.

Joerg Luedicke (StataCorp)

StataCorp Employee

Join Date: Apr 2014
Posts: 116

04 Jun 2025, 12:58

If you wish to compute (averaged) expected values of the original outcome variable, you would need to add half the residual variance to the linear prediction before exponentiation (see here). Here is an example:

Code:

. set seed 534
. set obs 10000
Number of observations (_N) was 0, now 10,000.
. 
. gen x = rnormal()
. gen e = rnormal()
. gen y = exp(1 + x + e)
. 
. gen lny = ln(y)
. reg lny x

      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(1, 9998)      =   9950.57
       Model |  10177.0632         1  10177.0632   Prob > F        =    0.0000
    Residual |  10225.5687     9,998  1.02276143   R-squared       =    0.4988
-------------+----------------------------------   Adj R-squared   =    0.4988
       Total |   20402.632     9,999  2.04046724   Root MSE        =    1.0113

------------------------------------------------------------------------------
         lny | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   1.009715   .0101222    99.75   0.000     .9898737    1.029557
       _cons |   .9908134   .0101134    97.97   0.000     .9709891    1.010638
------------------------------------------------------------------------------

. predict xb, xb
. gen mu1 = exp(xb + 0.5*`e(rmse)'^2)
. sum mu1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mu1 |     10,000    7.473725    9.461671   .1045006   206.1754

. margins, expression(exp(predict(xb)+0.5*`e(rmse)'^2))

Predictive margins                                      Number of obs = 10,000
Model VCE: OLS

Expression: exp(predict(xb)+0.5*1.011316679546227^2)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   7.473725   .1062444    70.34   0.000      7.26549     7.68196
------------------------------------------------------------------------------
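
The half-residual-variance term used above comes from the mean of a lognormal distribution. A short derivation, assuming the errors on the log scale are normal with constant variance (the symbols below are introduced for this note):

```latex
\[
\ln y = x\beta + e,\qquad e \mid x \sim N(0,\sigma^2)
\;\Longrightarrow\;
E[y \mid x] = \exp(x\beta)\,E[\exp(e)] = \exp\!\left(x\beta + \tfrac{1}{2}\sigma^2\right).
\]
% In the simulation, x and e are independent standard normal, so
\[
E[y] = E[\exp(1 + x + e)] = \exp(1)\,\exp\!\left(\tfrac{1}{2}\right)\exp\!\left(\tfrac{1}{2}\right) = \exp(2) \approx 7.389 .
\]
```

The estimated margin of 7.47 with a standard error of about 0.11 is in line with this value. Without the correction, exp(xb) alone would target exp(E[ln y | x]) rather than E[y | x].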

You could also directly fit a lognormal model using gsem:

Code:

. gsem (y = x), family(lognormal) link(log)

Iteration 0:  Log likelihood =  -35505.73  (not concave)
Iteration 1:  Log likelihood =  -24282.71  
Iteration 2:  Log likelihood = -24276.566  
Iteration 3:  Log likelihood = -24276.557  
Iteration 4:  Log likelihood = -24276.557  

Generalized structural equation model              Number of obs   =    10,000
Response: y                                        No. of failures =    10,000
Family:   Lognormal                                Time at risk    = 74,822.96
Form:     Accelerated failure time
Link:     Log                     
Log likelihood = -24276.557

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
y            |
           x |   1.009715   .0101212    99.76   0.000     .9898781    1.029552
       _cons |   .9908134   .0101124    97.98   0.000     .9709935    1.010633
-------------+----------------------------------------------------------------
/y           |
        logs |   .0111531   .0070711                     -.0027059    .0250122
------------------------------------------------------------------------------

. margins

Predictive margins                                      Number of obs = 10,000
Model VCE: OIM

Expression: Predicted mean (y), predict(mu outcome(y))

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   7.472961   .1191761    62.71   0.000      7.23938    7.706542
------------------------------------------------------------------------------

The latter solution would be slightly preferable because it takes into account that the variance parameter is estimated rather than known, or 'fixed', which is what we assume in the first example. This could yield a more accurate standard error of the averaged predictions.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#4

05 Jun 2025, 00:00

Suzanne:
welcome to this forum.
As per FAQ, please share what you typed and what Stata gave you back.
It is worth more than tons of words aimed at describing what you did (and why you're complaining about your results). Thanks.

Kind regards,
Carlo
(Stata 19.0)

Maarten Buis

Join Date: Mar 2014
Posts: 3454

05 Jun 2025, 04:31

If you just want the ratio of expected wage of blacks over the expected wage of whites, then you don't need margins. All you need to do is use a log link function and robust standard errors. The easiest way to do so in Stata is use poisson together with the vce(robust) option (see https://blog.stata.com/2011/08/22/us...tell-a-friend/ and the books cited in the post and the comments).

poisson on its own is intended for count data, but with the vce(robust) option it becomes a quasi-likelihood model, i.e. a model that only cares about the conditional mean, and that conditional mean is modeled using a log-link function. It is important that you do not use the log of income, but the raw income itself. The link function takes care of taking the logarithm. The problem you had is that you model the mean of log income, and it is hard to recover the mean of income from that. With the link function you model the log of mean income, which makes it trivial to recover the mean income.

Another major advantage is that the exponentiated regression coefficient is exactly the ratio of mean wages that you were looking for. With poisson you just add the irr option, and you get those exponentiated coefficients. So no need to predict means, compute ratios, figure out how to do inference on that. You just look at the regression table, and you have the coefficient you want.

Here is an example:

Code:

. // load and prepare example data
. sysuse nlsw88, clear
(NLSW, 1988 extract)

. 
. gen byte black:black_lb = race == 2 if race <= 2
(26 missing values generated)

. label define black_lb 0 "White" 1 "Black"

. 
. gen byte urb:urb_lb = c_city + smsa

. label define urb_lb 0 "rural" 1 "suburb" 2 "city"

. 
. // the model
. poisson wage i.black i.urb grade i.south ttl_exp, irr vce(robust)
note: noncount dependent variable encountered; results correspond to an exponential-mean model rather than a poisson
      model.

Iteration 0:  Log pseudolikelihood = -6743.0911  
Iteration 1:  Log pseudolikelihood = -6743.0909  

Poisson regression                                      Number of obs =  2,218
                                                        Wald chi2(6)  = 544.68
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -6743.0909                       Pseudo R2     = 0.1115

------------------------------------------------------------------------------
             |               Robust
        wage |        IRR   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       black |
      Black  |   .9088679   .0320401    -2.71   0.007     .8481908    .9738856
             |
         urb |
     suburb  |   1.262356   .0455085     6.46   0.000     1.176239    1.354778
       city  |   1.251422   .0504944     5.56   0.000     1.156268    1.354408
             |
       grade |   1.077399   .0060877    13.19   0.000     1.065533    1.089397
             |
       south |
      South  |   .8913582   .0261891    -3.91   0.000     .8414784    .9441946
     ttl_exp |    1.03761   .0032063    11.95   0.000     1.031345    1.043913
       _cons |   1.602138    .146464     5.16   0.000     1.339321    1.916527
------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.

. est store model

. 
. // predict wage for central city highschool graduates from non-south with 5 years experience
. margins black, at(urb=2 grade=12 south=0 ttl_exp=5) post

Adjusted predictions                                     Number of obs = 2,218
Model VCE: Robust

Expression: Predicted number of events, predict()
At: urb     =  2
    grade   = 12
    south   =  0
    ttl_exp =  5

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       black |
      White  |   5.899192   .2509011    23.51   0.000     5.407435    6.390949
      Black  |   5.361586   .2250443    23.82   0.000     4.920507    5.802665
------------------------------------------------------------------------------

. 
. // ratio of black versus white income
. di _b[1.black]/_b[0.black]
.90886789

. 
. // predict wage for rural highschool graduates from non-south with 5 years experience
. est restore model
(results model are active now)

. margins black, at(urb=0 grade=12 south=0 ttl_exp=5) post

Adjusted predictions                                     Number of obs = 2,218
Model VCE: Robust

Expression: Predicted number of events, predict()
At: urb     =  0
    grade   = 12
    south   =  0
    ttl_exp =  5

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       black |
      White  |   4.713989   .2144403    21.98   0.000     4.293694    5.134284
      Black  |   4.284393   .2187626    19.58   0.000     3.855626     4.71316
------------------------------------------------------------------------------

. 
. // ratio of black versus white income
. di _b[1.black]/_b[0.black]
.90886789

. 
. // strange: that is the same number, lets try something else:
. // predict wage for central city highschool graduates from south with 5 years experience
. est restore model
(results model are active now)

. margins black, at(urb=2 grade=12 south=1 ttl_exp=5) post

Adjusted predictions                                     Number of obs = 2,218
Model VCE: Robust

Expression: Predicted number of events, predict()
At: urb     =  2
    grade   = 12
    south   =  1
    ttl_exp =  5

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       black |
      White  |   5.258293   .2490601    21.11   0.000     4.770144    5.746442
      Black  |   4.779093   .2006576    23.82   0.000     4.385812    5.172375
------------------------------------------------------------------------------

. 
. // ratio of black versus white income
. di _b[1.black]/_b[0.black]
.90886789

. 
. // So regardless of your other characteristics: the ratio of black versus white income is .90886789 
. // lets look at the model again
. est restore model
(results model are active now)

. poisson, irr

Poisson regression                                      Number of obs =  2,218
                                                        Wald chi2(6)  = 544.68
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -6743.0909                       Pseudo R2     = 0.1115

------------------------------------------------------------------------------
             |               Robust
        wage |        IRR   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       black |
      Black  |   .9088679   .0320401    -2.71   0.007     .8481908    .9738856
             |
         urb |
     suburb  |   1.262356   .0455085     6.46   0.000     1.176239    1.354778
       city  |   1.251422   .0504944     5.56   0.000     1.156268    1.354408
             |
       grade |   1.077399   .0060877    13.19   0.000     1.065533    1.089397
             |
       south |
      South  |   .8913582   .0261891    -3.91   0.000     .8414784    .9441946
     ttl_exp |    1.03761   .0032063    11.95   0.000     1.031345    1.043913
       _cons |   1.602138    .146464     5.16   0.000     1.339321    1.916527
------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.

. 
. // he, the "effect" of black is exactly that ratio
. // OK, I don't need all that -margins- stuff. I can just directly report the irr for black

The difference between poisson (maximum quasi-likelihood) and the gsem approach (maximum likelihood) suggested by Joerg Luedicke (StataCorp) is that the former is a bit more robust: it only uses information from the conditional mean. If, for example, the variance is incorrectly specified, then quasi-likelihood does not care, whereas maximum likelihood estimates are influenced by such misspecification.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------


Suzanne Model

Join Date: Jun 2025

Posts: 2
#6

06 Jun 2025, 18:06

Thanks to everyone for their help. While I've been following Stata List for years, this was my first post. The suggestions of Dr. Schechter (median) & Professor Buis (mean) are especially appreciated. I will doubtless be using the List again. Suzanne