
  • mlogit - log transformation with zero values and interpretation

    Hi,

    I am still working on my master's thesis and have run into a problem with my dataset. I am measuring whether the share of workers in a firm's workforce engaged in R&D activities increases the probability of choosing one of four strategies (DI, FO, DIFO, FI) over domestic outsourcing (DO). The difficulty is that the data are heavily skewed, so I want to log transform the variable, but that turns many values into missing because log(0) is not defined. So if I just do the log transformation and ignore the missing values, will the mlogit be a comparison only between the firms that have positive values for R&D workers? I know one can always debate whether the 0 values are really 0 or rather missing, but a large part of the dataset also reports the variable as missing, so I believe the 0 values are real observations.

    I have looked at the discussion here: Statalist, and I adopted the option suggested by Maarten.

    The results:
    Without Maarten's option:
    Code:
    . mlogit sourcingmode tfp2008 lnrdemployees, robust base(1)
    
    Iteration 0:   log pseudolikelihood = -5694.5181 
    Iteration 1:   log pseudolikelihood = -5585.3787 
    Iteration 2:   log pseudolikelihood =  -5583.703 
    Iteration 3:   log pseudolikelihood = -5583.7025 
    Iteration 4:   log pseudolikelihood = -5583.7025 
    
    Multinomial logistic regression                 Number of obs     =      4,150
                                                    Wald chi2(8)      =     170.71
                                                    Prob > chi2       =     0.0000
    Log pseudolikelihood = -5583.7025               Pseudo R2         =     0.0195
    
    -------------------------------------------------------------------------------
                  |               Robust
     sourcingmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
    DO            |  (base outcome)
    --------------+----------------------------------------------------------------
    DI            |
          tfp2008 |   .4398071   .1203532     3.65   0.000     .2039192     .675695
    lnrdemployees |  -.3125721   .0642648    -4.86   0.000    -.4385289   -.1866154
            _cons |  -2.192719   .1846315   -11.88   0.000     -2.55459   -1.830848
    --------------+----------------------------------------------------------------
    FO            |
          tfp2008 |   .5604576   .0778736     7.20   0.000     .4078282     .713087
    lnrdemployees |  -.1284441   .0383343    -3.35   0.001    -.2035779   -.0533102
            _cons |  -.1979827   .1044565    -1.90   0.058    -.4027136    .0067482
    --------------+----------------------------------------------------------------
    DIFO          |
          tfp2008 |   .8262717   .1361701     6.07   0.000     .5593831     1.09316
    lnrdemployees |   -.455896   .0603923    -7.55   0.000    -.5742626   -.3375293
            _cons |  -2.495834   .1786481   -13.97   0.000    -2.845978   -2.145691
    --------------+----------------------------------------------------------------
    FI            |
          tfp2008 |   1.171344   .1438492     8.14   0.000     .8894053    1.453284
    lnrdemployees |  -.2154562   .0678397    -3.18   0.001    -.3484196   -.0824928
            _cons |  -2.062353   .1901989   -10.84   0.000    -2.435136    -1.68957
    -------------------------------------------------------------------------------
    With Maarten's suggestion:
    Code:
    . mlogit sourcingmode tfp2008 RDintensity NoRD, robust base(1)
    
    Iteration 0:   log pseudolikelihood = -8807.9663 
    Iteration 1:   log pseudolikelihood = -8526.8314 
    Iteration 2:   log pseudolikelihood = -8516.6735 
    Iteration 3:   log pseudolikelihood = -8516.5925 
    Iteration 4:   log pseudolikelihood = -8516.5925 
    
    Multinomial logistic regression                 Number of obs     =      6,767
                                                    Wald chi2(12)     =     457.07
                                                    Prob > chi2       =     0.0000
    Log pseudolikelihood = -8516.5925               Pseudo R2         =     0.0331
    
    ------------------------------------------------------------------------------
                 |               Robust
    sourcingmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    DO           |  (base outcome)
    -------------+----------------------------------------------------------------
    DI           |
         tfp2008 |   .5273473   .0997652     5.29   0.000     .3318111    .7228835
     RDintensity |  -.3095094    .064229    -4.82   0.000    -.4353959    -.183623
            NoRD |  -1.893336   .3003138    -6.30   0.000    -2.481941   -1.304732
           _cons |  -2.179907    .184301   -11.83   0.000     -2.54113   -1.818684
    -------------+----------------------------------------------------------------
    FO           |
         tfp2008 |     .59379    .061399     9.67   0.000     .4734501    .7141298
     RDintensity |   -.127827   .0383913    -3.33   0.001    -.2030726   -.0525813
            NoRD |  -1.155353   .1855371    -6.23   0.000    -1.518999    -.791707
           _cons |  -.1932086   .1044491    -1.85   0.064     -.397925    .0115079
    -------------+----------------------------------------------------------------
    DIFO         |
         tfp2008 |   .8839903   .1133623     7.80   0.000     .6618042    1.106176
     RDintensity |  -.4543958   .0602361    -7.54   0.000    -.5724564   -.3363351
            NoRD |  -2.951971   .2829529   -10.43   0.000    -3.506548   -2.397393
           _cons |  -2.491253   .1782818   -13.97   0.000    -2.840679   -2.141827
    -------------+----------------------------------------------------------------
    FI           |
         tfp2008 |   1.135145    .130703     8.68   0.000     .8789721    1.391319
     RDintensity |  -.2159861   .0678936    -3.18   0.001    -.3490551   -.0829172
            NoRD |  -2.544993   .3380066    -7.53   0.000    -3.207474   -1.882512
           _cons |   -2.05091   .1901253   -10.79   0.000    -2.423549   -1.678271
    ------------------------------------------------------------------------------
    RDintensity is ln(rdemployees), and NoRD is a dummy equal to 1 if the firm has no R&D workers. As the observation counts show (4,150 versus 6,767), I lose a lot of observations when the zero values are excluded.
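
    For readers who want to reproduce the setup, the two variables could be built roughly as follows. This is only a sketch under assumptions: it takes rdemployees to be the raw, untransformed R&D variable, and it assumes that the dummy approach sets RDintensity to 0 for firms with no R&D workers, which the post does not state explicitly. Firms with a missing rdemployees stay missing in both variables.

    ```stata
    * Sketch only: rdemployees is assumed to be the untransformed R&D variable.
    * Dummy = 1 for firms reporting zero R&D workers; missing stays missing.
    generate byte NoRD = (rdemployees == 0) if !missing(rdemployees)

    * Log of the positive values; ln() of 0 or missing would be missing anyway.
    generate double RDintensity = ln(rdemployees) if rdemployees > 0 & !missing(rdemployees)

    * Assumption: give the zero-R&D firms a value of 0 so they stay in the sample.
    replace RDintensity = 0 if NoRD == 1

    mlogit sourcingmode tfp2008 RDintensity NoRD, robust base(1)
    ```

    The value assigned to RDintensity for the zero-R&D firms does not change the RDintensity coefficient, but it does determine which firm the NoRD coefficient is compared against.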

    Assuming that Maarten's option is a good one, I was wondering whether I have the interpretation right if I read the results as follows:

    Firms in all other strategies are significantly more likely than DO firms to have some workers engaged in R&D. However, among firms with a positive number of R&D workers, the intensity (share of R&D workers in total employment) is higher for DO than for the other categories.
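
    To make the comparison behind each coefficient explicit, the model can be written out for each strategy j relative to DO. This is a sketch that assumes RDintensity was set to 0 whenever NoRD = 1; the Greek symbols are notation introduced here, not taken from the output.

    ```latex
    \ln\frac{P(y=j)}{P(y=\mathrm{DO})}
      = \alpha_j + \beta_j\,\mathrm{tfp2008}
        + \gamma_j\,\mathrm{RDintensity} + \delta_j\,\mathrm{NoRD}

    % Firms without R&D workers (NoRD = 1, RDintensity = 0 by assumption):
    \ln\frac{P(y=j)}{P(y=\mathrm{DO})}
      = \alpha_j + \beta_j\,\mathrm{tfp2008} + \delta_j

    % Firms with R&D workers (NoRD = 0):
    \ln\frac{P(y=j)}{P(y=\mathrm{DO})}
      = \alpha_j + \beta_j\,\mathrm{tfp2008}
        + \gamma_j \ln(\mathrm{rdemployees})
    ```

    Under that assumption, \delta_j compares a firm with no R&D workers to a firm with the same tfp2008 and rdemployees = 1 (where the log is 0), so its size depends on the units of rdemployees. For DI, for example, exp(-1.893) is roughly 0.15, the ratio of the DI-versus-DO odds between those two reference firms.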


    As a side note, this is not a causal study; it only looks at relationships between observed variables across firms sorted into different categories.


    Regards,
    Jørgen

  • #2
    Jorgen: In #1 you state "The difficulty comes because the data is heavily skewed...". There is no a priori reason why skewness of a RHS variable is problematic for mlogit estimation (or other estimation for that matter). I would suggest thinking through why you believe skewness is problematic before going to the step of transformation. If for your theoretical model zero is an interesting value for RDintensity, then zero is an interesting value for RDintensity.



    • #3
      I guess I transformed it because earlier studies on the subject, which inspired my approach, have done so. But those studies are interested in the relationship between RDintensity and log(Productivity) and use OLS, and I guess log transforming the intensity then gives results that are easier to interpret. I am only interested in the relative probabilities of being in each category, so I do not need this benefit. After reading up a bit, I think I will just include the variable without log transforming it. Thanks.
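
      A minimal sketch of what the untransformed specification might look like, assuming rdemployees holds the raw R&D variable and that outcome value 2 corresponds to DI (neither is confirmed in the thread). The rrr option reports relative-risk ratios instead of coefficients, which may suit an interest in relative probabilities.

      ```stata
      * Sketch only: rdemployees is assumed to be the untransformed R&D variable.
      * Zero-R&D firms now enter the model directly, so no dummy is needed.
      mlogit sourcingmode tfp2008 rdemployees, robust base(1) rrr

      * Average marginal effect of rdemployees on the probability of one outcome.
      * outcome(2) is an assumption about how sourcingmode is coded.
      margins, dydx(rdemployees) predict(outcome(2))
      ```

      The margins call can be repeated with the other outcome values to see the effect on each strategy's probability.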
