  • How to create a dummy that indicates original zeroes before ln-transformation?

    Hi,

    I am using Stata 13. I have a technical question about the transformation of data, related to a discussion on Stack Exchange (see link below). The problem revolves around the treatment of highly skewed positive data with zeros. Merely ln-transforming a variable [var] results in missings for [var]=0. I decided to generate a dummy that takes the value 1 if the untransformed variable is 0, and 0 otherwise.

    Here is an example of what I do:

    Code:
    clear
    
    sysuse nlsw88
    
    gen ln_tenure=ln(tenure)
    
    gen null_tenure = 0
        replace null_tenure=1 if tenure==0
    
    reg wage grade ln_tenure null_tenure
    A histogram of tenure shows it is positively skewed. I generate [ln_tenure] and the bespoke dummy [null_tenure]. However, [null_tenure] gets omitted. How can I avoid the dummy being omitted?

    Best
    /R
    If I have highly skewed positive data I often take logs. But what should I do with highly skewed non-negative data that include zeros? I have seen two transformations used: log(x+1) which has the...

  • #2
    As I understand it, the idea is to fudge the zeros to ones, thus making the transformation applicable even when otherwise ln(0) would merely yield missing values. The intent of the indicator (you say dummy) is to estimate from the data the offset for the zeros. Also, and crucially, we keep track of which observations were fudged.

    Code:
      
    clear
    sysuse nlsw88
    gen ln_tenure = ln(cond(tenure == 0, 1, tenure))  
    gen null_tenure = tenure == 0
    reg wage grade ln_tenure null_tenure
    With your code, the zeros on tenure become missing on transformation, so those observations drop out of the estimation sample; in the data actually used, the indicator is a constant and is necessarily dropped from the estimation.
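    A quick check makes that concrete (a minimal sketch, not part of the original post; it assumes the variables from the first post's code have been created):

    Code:
    * ln_tenure is missing exactly where tenure == 0, so those
    * observations are excluded from the regression sample
    count if tenure == 0

    * within the estimation sample the indicator is all zeros,
    * hence collinear with the constant and omitted
    tabulate null_tenure if !missing(ln_tenure)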

    UPDATE:

    Note that symmetrizing the marginal distributions of the predictors is in no sense needed or even intrinsically desirable for regression. Otherwise every indicator predictor that wasn't split 50:50 would be problematic, and very few are.

    It's the effect of the predictors on the response that's crucial. That can mean wanting to pull outliers in, not the same issue at all.

    Although I think this is a clever idea, it brings a cost in its train, namely that every problematic transformation has to carry an extra indicator variable.
    Last edited by Nick Cox; 26 May 2015, 05:09.



    • #3
      Great, thanks a lot! The code works perfectly.
      Given the "costs" you mention, I certainly agree; in particular, interpretation becomes more difficult. I will see whether a Box-Cox transformation is a better solution.

      /R



      • #4
        Box-Cox, despite its wonderful name, is vastly oversold in my view.

        But let's back-track. First, although it's an excellent principle to ask questions in terms of mutually accessible datasets, I am guessing that your real dataset is something else. And we can't comment on what makes sense for your real problem.

        But let's focus on the example you started with. It's a great sandbox to play in and illustrates several principles.

        To jump to a conclusion: If any variable benefits from transformation here, it's wage!

        The naive untransformed regression would be a bad idea here:

        Code:
        sysuse nlsw88
        regress wage grade tenure
        favplots
        Here my bias is to use favplots (SSC) rather than avplots. The former cuts down on decimal places, etc. The result shows that we are not capturing the structure at all well.
        [Figure: favplots_1.png — added-variable plots for the untransformed regression]
        Let's look at some descriptive statistics, using moments (SSC) for a concise summary. Naturally we can and should look at graphs too.


        Code:
        . moments wage grade tenure
        
        ------------------------------------------------------------------------
                       n = 2229 |       mean          SD    skewness    kurtosis
        ------------------------+-----------------------------------------------
                    hourly wage |      7.794       5.767       3.091      15.792
        current grade completed |     13.101       2.524       0.044       3.615
             job tenure (years) |      5.971       5.507       1.048       3.177
        ------------------------------------------------------------------------
        But as every economist knows, or should know, wages are usually best considered on logarithmic scale.

        Code:
        . gen ln_wage = ln(wage)
        
        . regress ln_wage grade tenure
        
              Source |       SS           df       MS      Number of obs   =     2,229
        -------------+----------------------------------   F(2, 2226)      =    346.08
               Model |  173.625859         2  86.8129293   Prob > F        =    0.0000
            Residual |  558.381943     2,226  .250845437   R-squared       =    0.2372
        -------------+----------------------------------   Adj R-squared   =    0.2365
               Total |  732.007802     2,228  .328549283   Root MSE        =    .50084
        
        ------------------------------------------------------------------------------
             ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
               grade |   .0876606   .0042355    20.70   0.000     .0793547    .0959665
              tenure |   .0263594   .0019414    13.58   0.000     .0225523    .0301665
               _cons |   .5669802   .0562862    10.07   0.000     .4566012    .6773592
        ------------------------------------------------------------------------------
        
        . favplots
        [Figure: favplots_2.png — added-variable plots for the regression of ln_wage]
        There is still some irregularity left, worth exploring, but I don't see that the message is emphatically to transform tenure! In practice, there would be other predictors used in many versions of this problem.
        Last edited by Nick Cox; 26 May 2015, 07:37.



        • #5
          Fair point(s), and to the point as well! Thank you very much. Moreover, you are right: my data are unfortunately proprietary, so I may not disclose them.

          A follow-up question: in the case above, wouldn't a different model make more sense than a transformation? That is, a model that accounts for the truncated/censored data structure (e.g. tobit)? Especially with a variable such as wage, one could think of negative income, so a latent variable capturing these cases could do the trick.

          Such as:
          Code:
          tobit wage tenure, ll(0)
          margins, dydx(*)
          margins, predict(ystar(0,.))



          • #6
            Yes indeed; elsewhere I have (often!) written on the need to respect the bounds of response variables.

            But I emphatically would not use tobit here. I did experiment with glm and poisson with this dataset earlier, but cut that out of the post as digressing in a different direction.

            Despite what might be guessed, glm, link(log) and poisson give essentially the same predictions with these data, with grade and tenure as predictors.
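            That comparison can be sketched as follows (a minimal sketch, not the code actually used in the experiment; the predicted-value variable names are invented for illustration):

            Code:
            sysuse nlsw88, clear

            * Gaussian GLM with a log link
            glm wage grade tenure, family(gaussian) link(log)
            predict p_glm

            * Poisson regression, here used purely as a GLM with a log link
            poisson wage grade tenure
            predict p_poisson

            * the two sets of predictions should track each other closely
            correlate p_glm p_poisson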

            See (e.g.) http://blog.stata.com/2011/08/22/use...tell-a-friend/ for one version of the main argument.



            • #7
              That is surprising and very interesting. Thank you very much!
