mlogit - log transformation with zero values and interpretation

Jorgen Steen

Join Date: Apr 2017
Posts: 6

01 May 2017, 02:59

Hi,

I am still working on my master's thesis, and I have come across a challenge in my dataset. For this particular problem, I am measuring whether the share of workers in a firm's workforce engaged in R&D activities increases the probability of choosing one of four strategies (DI, FO, DIFO, FI) over domestic outsourcing (DO). The difficulty is that the data are heavily skewed, so I want to log transform the variable, but that turns a lot of values into missing, because log(0) is not defined. So essentially, if I just do the log transformation and ignore the missing values, will the mlogit be a comparison only between the firms that have positive values for R&D workers? I know it is always debatable whether the 0 values are really 0 or rather missing, but a large part of the dataset reports the variable as missing, so I believe that the 0 values are real observations.

I have looked at the discussion here: Statalist, and I adopted the option suggested by Maarten.

The results:
Without Maarten's option:

Code:

. mlogit sourcingmode tfp2008 lnrdemployees, robust base(1)

Iteration 0:   log pseudolikelihood = -5694.5181 
Iteration 1:   log pseudolikelihood = -5585.3787 
Iteration 2:   log pseudolikelihood =  -5583.703 
Iteration 3:   log pseudolikelihood = -5583.7025 
Iteration 4:   log pseudolikelihood = -5583.7025 

Multinomial logistic regression                 Number of obs     =      4,150
                                                Wald chi2(8)      =     170.71
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -5583.7025               Pseudo R2         =     0.0195

-------------------------------------------------------------------------------
              |               Robust
 sourcingmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
DO            |  (base outcome)
--------------+----------------------------------------------------------------
DI            |
      tfp2008 |   .4398071   .1203532     3.65   0.000     .2039192     .675695
lnrdemployees |  -.3125721   .0642648    -4.86   0.000    -.4385289   -.1866154
        _cons |  -2.192719   .1846315   -11.88   0.000     -2.55459   -1.830848
--------------+----------------------------------------------------------------
FO            |
      tfp2008 |   .5604576   .0778736     7.20   0.000     .4078282     .713087
lnrdemployees |  -.1284441   .0383343    -3.35   0.001    -.2035779   -.0533102
        _cons |  -.1979827   .1044565    -1.90   0.058    -.4027136    .0067482
--------------+----------------------------------------------------------------
DIFO          |
      tfp2008 |   .8262717   .1361701     6.07   0.000     .5593831     1.09316
lnrdemployees |   -.455896   .0603923    -7.55   0.000    -.5742626   -.3375293
        _cons |  -2.495834   .1786481   -13.97   0.000    -2.845978   -2.145691
--------------+----------------------------------------------------------------
FI            |
      tfp2008 |   1.171344   .1438492     8.14   0.000     .8894053    1.453284
lnrdemployees |  -.2154562   .0678397    -3.18   0.001    -.3484196   -.0824928
        _cons |  -2.062353   .1901989   -10.84   0.000    -2.435136    -1.68957
-------------------------------------------------------------------------------

With Maarten's suggestion:

Code:

. mlogit sourcingmode tfp2008 RDintensity NoRD, robust base(1)

Iteration 0:   log pseudolikelihood = -8807.9663 
Iteration 1:   log pseudolikelihood = -8526.8314 
Iteration 2:   log pseudolikelihood = -8516.6735 
Iteration 3:   log pseudolikelihood = -8516.5925 
Iteration 4:   log pseudolikelihood = -8516.5925 

Multinomial logistic regression                 Number of obs     =      6,767
                                                Wald chi2(12)     =     457.07
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -8516.5925               Pseudo R2         =     0.0331

------------------------------------------------------------------------------
             |               Robust
sourcingmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
DO           |  (base outcome)
-------------+----------------------------------------------------------------
DI           |
     tfp2008 |   .5273473   .0997652     5.29   0.000     .3318111    .7228835
 RDintensity |  -.3095094    .064229    -4.82   0.000    -.4353959    -.183623
        NoRD |  -1.893336   .3003138    -6.30   0.000    -2.481941   -1.304732
       _cons |  -2.179907    .184301   -11.83   0.000     -2.54113   -1.818684
-------------+----------------------------------------------------------------
FO           |
     tfp2008 |     .59379    .061399     9.67   0.000     .4734501    .7141298
 RDintensity |   -.127827   .0383913    -3.33   0.001    -.2030726   -.0525813
        NoRD |  -1.155353   .1855371    -6.23   0.000    -1.518999    -.791707
       _cons |  -.1932086   .1044491    -1.85   0.064     -.397925    .0115079
-------------+----------------------------------------------------------------
DIFO         |
     tfp2008 |   .8839903   .1133623     7.80   0.000     .6618042    1.106176
 RDintensity |  -.4543958   .0602361    -7.54   0.000    -.5724564   -.3363351
        NoRD |  -2.951971   .2829529   -10.43   0.000    -3.506548   -2.397393
       _cons |  -2.491253   .1782818   -13.97   0.000    -2.840679   -2.141827
-------------+----------------------------------------------------------------
FI           |
     tfp2008 |   1.135145    .130703     8.68   0.000     .8789721    1.391319
 RDintensity |  -.2159861   .0678936    -3.18   0.001    -.3490551   -.0829172
        NoRD |  -2.544993   .3380066    -7.53   0.000    -3.207474   -1.882512
       _cons |   -2.05091   .1901253   -10.79   0.000    -2.423549   -1.678271
------------------------------------------------------------------------------

RDintensity is ln(rdemployees) and NoRD is a dummy equal to 1 if the firm has no R&D workers. As is obvious, I lose a lot of observations when the zero values are not included (4,150 instead of 6,767).
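For reference, the variable construction described above could be sketched as follows. This is only an illustration under assumptions: it takes rdemployees to be the raw, non-negative R&D measure, assumes the derived variables do not exist yet, and assumes the log term was set to 0 for zero-R&D firms (the post does not state the exact coding, only that those firms are kept in the estimation sample).

```stata
* Sketch only: rdemployees is assumed to be the raw R&D measure
* ln(0) evaluates to missing, so zero-R&D firms drop out of a model using this
generate lnrdemployees = ln(rdemployees)

* Dummy-variable approach: flag the zeros (stays missing if the raw value is missing)
generate byte NoRD = (rdemployees == 0) if !missing(rdemployees)

* Log term for positive values; assumed coding: 0 for firms with no R&D workers
generate RDintensity = ln(rdemployees) if rdemployees > 0 & !missing(rdemployees)
replace RDintensity = 0 if rdemployees == 0

* Same specification as in the second output above
mlogit sourcingmode tfp2008 RDintensity NoRD, robust base(1)
```

Under this coding, the RDintensity coefficients are identified only from firms with positive values, which would be consistent with their being almost identical to the lnrdemployees coefficients in the first model. The NoRD coefficients then compare zero-R&D firms with firms whose log term equals 0, that is, whose untransformed value equals 1.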

Assuming that Maarten's option is a good one, I was wondering whether the following interpretation is right:

Firms in all other strategies are significantly more likely than DO firms to have some workers engaged in R&D. However, among the firms that have a positive number of R&D workers, the intensity (share of R&D workers in total employment) is higher for DO than for the other categories.

Just as a side note, this is not a causal study; it just looks at relationships between observed variables across firms sorted into different categories.

Regards,
Jørgen

Tags: log, mlogit

John Mullahy

Join Date: Dec 2016

Posts: 746
#2

01 May 2017, 05:32

Jorgen: In #1 you state "The difficulty comes because the data is heavily skewed...". There is no a priori reason why skewness of a RHS variable is problematic for mlogit estimation (or other estimation for that matter). I would suggest thinking through why you believe skewness is problematic before going to the step of transformation. If for your theoretical model zero is an interesting value for RDintensity, then zero is an interesting value for RDintensity.
Jorgen Steen

Join Date: Apr 2017

Posts: 6
#3

01 May 2017, 06:16

I guess I transformed it because earlier studies on the subject have done so. But those studies are interested in the relationship between RDintensity and log(Productivity) and use OLS, and then I guess log transforming the intensity gives results that are easier to interpret. I am only interested in the relative probabilities of being in each category, so I do not need this benefit. After reading up a bit, I think I will just include the variable without log transforming it. Thanks.
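A minimal sketch of what the untransformed specification might look like, assuming rdemployees is the name of the raw R&D variable (the thread does not show this model being run, so no output is given):

```stata
* Sketch only: rdemployees is assumed to be the untransformed R&D variable
* Zeros stay in the estimation sample because no log is taken
mlogit sourcingmode tfp2008 rdemployees, robust base(1)

* Replay the results as relative-risk ratios rather than log-odds coefficients
mlogit, rrr

* Average marginal effect of the R&D variable on the probability of outcome 1 (DO);
* repeat with the other outcome values for the remaining categories
margins, dydx(rdemployees) predict(outcome(1))
```

Marginal effects on the predicted probabilities are often easier to discuss than the coefficients themselves, since each mlogit coefficient is relative to the base outcome only.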