  • How to create a dummy that indicates original zeroes before ln-transformation?

    Hi,

    I am using Stata 13. I have a technical question about the transformation of data, related to a discussion on Stack Exchange (see link below). The problem revolves around the treatment of highly skewed positive data with zeros. Merely ln-transforming a variable [var] results in missings for [var]=0. I decided to generate a dummy that takes the value 1 if the untransformed variable is 0, and 0 otherwise.

    Here is an example of what I do:

    Code:
    clear
    
    sysuse nlsw88
    
    gen ln_tenure=ln(tenure)
    
    gen null_tenure = 0
        replace null_tenure=1 if tenure==0
    
    reg wage grade ln_tenure null_tenure
    A histogram of tenure shows it is positively skewed. I generate [ln_tenure] and the bespoke dummy [null_tenure]. However, [null_tenure] gets omitted. How can I avoid the dummy being omitted?

    Best
    /R
    If I have highly skewed positive data I often take logs. But what should I do with highly skewed non-negative data that include zeros? I have seen two transformations used: log(x+1) which has the...

  • #2
    As I understand it, the idea is to fudge the zeros to ones, thus making the transformation applicable even when otherwise ln(0) would merely yield missing values. The intent of the indicator (you say dummy) is to estimate from the data the offset for the zeros. Also, and crucially, we keep track of which observations were fudged.

    Code:
      
    clear
    sysuse nlsw88
    gen ln_tenure = ln(cond(tenure == 0, 1, tenure))  
    gen null_tenure = tenure == 0
    reg wage grade ln_tenure null_tenure
    With your code, the zeros on tenure become missing on transformation, so those observations drop out of the estimation sample; in the data actually used, the indicator is a constant and is necessarily dropped from the estimation.
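    A quick check makes that concrete (a minimal sketch, not part of the original post; it assumes the variables from the first post's code have been created):

    Code:
    * ln_tenure is missing exactly where tenure == 0, so those
    * observations are excluded from the regression sample
    count if tenure == 0

    * within the estimation sample the indicator is all zeros,
    * hence collinear with the constant and omitted
    tabulate null_tenure if !missing(ln_tenure)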

    UPDATE:

    Note that symmetrizing the marginal distributions of the predictors is in no sense needed or even intrinsically desirable for regression. Otherwise every indicator predictor that wasn't split 50:50 would be problematic, and very few are.

    It's the effect of the predictors on the response that's crucial. That can mean wanting to pull outliers in, not the same issue at all.

    Although I think this is a clever idea, it brings a cost in its train, namely that every problematic transformation has to carry an extra indicator variable.
    Last edited by Nick Cox; 26 May 2015, 05:09.



    • #3
      Great, thanks a lot! The code works perfectly.
      Given the "costs" you mention, I certainly agree; in particular, interpretation becomes more difficult. I will see whether a Box-Cox transformation is a better solution.

      /R



      • #4
        Box-Cox, despite its wonderful name, is vastly oversold in my view.

        But let's back-track. First, although it's an excellent principle to ask questions in terms of mutually accessible datasets, I am guessing that your real dataset is something else. And we can't comment on what makes sense for your real problem.

        But let's focus on the example you started with. It's a great sandbox to play in and illustrates several principles.

        To jump to a conclusion: If any variable benefits from transformation here, it's wage!

        The naive untransformed regression would be a bad idea here:

        Code:
        sysuse nlsw88
        regress wage grade tenure
        favplots
        Here my bias is to use favplots (SSC) rather than avplots. The former cuts down on decimal places, etc. The result shows that we are not capturing the structure at all well.
        [Figure: favplots_1.png — added-variable plots for the untransformed regression]
        Let's look at some descriptive statistics, using moments (SSC) for a concise summary. Naturally we can and should look at graphs too.


        Code:
        . moments wage grade tenure
        
        ------------------------------------------------------------------------
                       n = 2229 |       mean          SD    skewness    kurtosis
        ------------------------+-----------------------------------------------
                    hourly wage |      7.794       5.767       3.091      15.792
        current grade completed |     13.101       2.524       0.044       3.615
             job tenure (years) |      5.971       5.507       1.048       3.177
        ------------------------------------------------------------------------
        But as every economist knows, or should know, wages are usually best considered on logarithmic scale.

        Code:
        . gen ln_wage = ln(wage)
        
        . regress ln_wage grade tenure
        
              Source |       SS           df       MS      Number of obs   =     2,229
        -------------+----------------------------------   F(2, 2226)      =    346.08
               Model |  173.625859         2  86.8129293   Prob > F        =    0.0000
            Residual |  558.381943     2,226  .250845437   R-squared       =    0.2372
        -------------+----------------------------------   Adj R-squared   =    0.2365
               Total |  732.007802     2,228  .328549283   Root MSE        =    .50084
        
        ------------------------------------------------------------------------------
             ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
               grade |   .0876606   .0042355    20.70   0.000     .0793547    .0959665
              tenure |   .0263594   .0019414    13.58   0.000     .0225523    .0301665
               _cons |   .5669802   .0562862    10.07   0.000     .4566012    .6773592
        ------------------------------------------------------------------------------
        
        . favplots
        [Figure: favplots_2.png — added-variable plots for the regression of ln_wage]
        There is still some irregularity left, worth exploring, but I don't see that the message is emphatically to transform tenure! In practice, there would be other predictors used in many versions of this problem.
        Last edited by Nick Cox; 26 May 2015, 07:37.



        • #5
          Fair point(s), and to the point as well! Thank you very much. Moreover, you are right: my data are unfortunately proprietary, so I may not disclose them.

          A follow-up question: in the case above, wouldn't a different model make more sense than a transformation? That is, a model that accounts for the truncated/censored data structure (e.g. tobit)? Especially with a variable such as wage, one could think of negative income, so a latent variable capturing these cases could do the trick.

          Such as:
          Code:
          tobit wage tenure, ll(0)
          margins, dydx(*)
          margins, predict(ystar(0,.))



          • #6
            Yes indeed; elsewhere I have (often!) written on the need to respect the bounds of response variables.

            But I emphatically would not use tobit here. I did experiment with glm and poisson with this dataset earlier, but cut that out of the post as digressing in a different direction.

            Despite what might be guessed, glm, link(log) and poisson give essentially the same predictions with these data, with grade and tenure as predictors.
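            That comparison can be sketched as follows (a minimal sketch, not the code actually used in the experiment; the predicted-value variable names are invented for illustration):

            Code:
            sysuse nlsw88, clear

            * Gaussian GLM with a log link
            glm wage grade tenure, family(gaussian) link(log)
            predict p_glm

            * Poisson regression, here used purely as a GLM with a log link
            poisson wage grade tenure
            predict p_poisson

            * the two sets of predictions should track each other closely
            correlate p_glm p_poisson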

            See (e.g.) http://blog.stata.com/2011/08/22/use...tell-a-friend/ for one version of the main argument.



            • #7
              That is surprising and very interesting. Thank you very much!
