Hi,
I am still working on my master's thesis, and I have come across a challenge in my dataset. I am measuring whether the share of a firm's workforce engaged in R&D activities increases the probability of choosing one of four strategies (DI, FO, DIFO, FI) over domestic outsourcing (DO). The difficulty is that the data are heavily skewed, so I want to log-transform the variable, but that turns many values into missing because log(0) is undefined. So, if I just do the log transformation and ignore the missing values, will the mlogit be a comparison only between the firms that have positive values for R&D workers? I know it is always debatable whether the 0 values are really 0 or rather missing, but a large part of the dataset also reports the variable as missing, so I believe that the 0 values are real observations.
I have looked at the discussion here: Statalist and I adopted the option suggested by Maarten.
The results:
Without Maarten's option:
Code:
. mlogit sourcingmode tfp2008 lnrdemployees, robust base(1)

Iteration 0:   log pseudolikelihood = -5694.5181
Iteration 1:   log pseudolikelihood = -5585.3787
Iteration 2:   log pseudolikelihood =  -5583.703
Iteration 3:   log pseudolikelihood = -5583.7025
Iteration 4:   log pseudolikelihood = -5583.7025

Multinomial logistic regression                 Number of obs     =      4,150
                                                Wald chi2(8)      =     170.71
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -5583.7025               Pseudo R2         =     0.0195

-------------------------------------------------------------------------------
              |               Robust
 sourcingmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
DO            |  (base outcome)
--------------+----------------------------------------------------------------
DI            |
      tfp2008 |   .4398071   .1203532     3.65   0.000     .2039192     .675695
lnrdemployees |  -.3125721   .0642648    -4.86   0.000    -.4385289   -.1866154
        _cons |  -2.192719   .1846315   -11.88   0.000     -2.55459   -1.830848
--------------+----------------------------------------------------------------
FO            |
      tfp2008 |   .5604576   .0778736     7.20   0.000     .4078282     .713087
lnrdemployees |  -.1284441   .0383343    -3.35   0.001    -.2035779   -.0533102
        _cons |  -.1979827   .1044565    -1.90   0.058    -.4027136    .0067482
--------------+----------------------------------------------------------------
DIFO          |
      tfp2008 |   .8262717   .1361701     6.07   0.000     .5593831     1.09316
lnrdemployees |   -.455896   .0603923    -7.55   0.000    -.5742626   -.3375293
        _cons |  -2.495834   .1786481   -13.97   0.000    -2.845978   -2.145691
--------------+----------------------------------------------------------------
FI            |
      tfp2008 |   1.171344   .1438492     8.14   0.000     .8894053    1.453284
lnrdemployees |  -.2154562   .0678397    -3.18   0.001    -.3484196   -.0824928
        _cons |  -2.062353   .1901989   -10.84   0.000    -2.435136    -1.68957
-------------------------------------------------------------------------------
With Maarten's suggestion:
Code:
. mlogit sourcingmode tfp2008 RDintensity NoRD, robust base(1)

Iteration 0:   log pseudolikelihood = -8807.9663
Iteration 1:   log pseudolikelihood = -8526.8314
Iteration 2:   log pseudolikelihood = -8516.6735
Iteration 3:   log pseudolikelihood = -8516.5925
Iteration 4:   log pseudolikelihood = -8516.5925

Multinomial logistic regression                 Number of obs     =      6,767
                                                Wald chi2(12)     =     457.07
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -8516.5925               Pseudo R2         =     0.0331

------------------------------------------------------------------------------
             |               Robust
sourcingmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
DO           |  (base outcome)
-------------+----------------------------------------------------------------
DI           |
     tfp2008 |   .5273473   .0997652     5.29   0.000     .3318111    .7228835
 RDintensity |  -.3095094    .064229    -4.82   0.000    -.4353959    -.183623
        NoRD |  -1.893336   .3003138    -6.30   0.000    -2.481941   -1.304732
       _cons |  -2.179907    .184301   -11.83   0.000     -2.54113   -1.818684
-------------+----------------------------------------------------------------
FO           |
     tfp2008 |     .59379    .061399     9.67   0.000     .4734501    .7141298
 RDintensity |   -.127827   .0383913    -3.33   0.001    -.2030726   -.0525813
        NoRD |  -1.155353   .1855371    -6.23   0.000    -1.518999    -.791707
       _cons |  -.1932086   .1044491    -1.85   0.064     -.397925    .0115079
-------------+----------------------------------------------------------------
DIFO         |
     tfp2008 |   .8839903   .1133623     7.80   0.000     .6618042    1.106176
 RDintensity |  -.4543958   .0602361    -7.54   0.000    -.5724564   -.3363351
        NoRD |  -2.951971   .2829529   -10.43   0.000    -3.506548   -2.397393
       _cons |  -2.491253   .1782818   -13.97   0.000    -2.840679   -2.141827
-------------+----------------------------------------------------------------
FI           |
     tfp2008 |   1.135145    .130703     8.68   0.000     .8789721    1.391319
 RDintensity |  -.2159861   .0678936    -3.18   0.001    -.3490551   -.0829172
        NoRD |  -2.544993   .3380066    -7.53   0.000    -3.207474   -1.882512
       _cons |   -2.05091   .1901253   -10.79   0.000    -2.423549   -1.678271
------------------------------------------------------------------------------
RDintensity is ln(rdemployees), and NoRD is a dummy equal to 1 if the firm has no R&D workers. As is obvious, I lose a lot of observations when the zero values are not included.
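For reference, here is a minimal sketch of how the two regressors in the second model could be built in Stata. It rests on assumptions that are not stated in the post: the raw R&D measure is held in a hypothetical variable called rdshare, and RDintensity is set to 0 (rather than left missing) for firms with no R&D workers, which is what the jump from 4,150 to 6,767 observations suggests. The toy data only make the snippet runnable on its own.

```stata
* Toy data for illustration only; rdshare is a hypothetical variable name
clear
input double rdshare
0
.05
.2
.
end

* Dummy for "no R&D workers"; stays missing where the raw measure is missing
generate byte NoRD = (rdshare == 0) if !missing(rdshare)

* Log of the raw measure for positive values, 0 for the zero-R&D firms
* (assumption: this is how the zeros were kept in the estimation sample)
generate double RDintensity = cond(rdshare > 0, ln(rdshare), 0) if !missing(rdshare)

list rdshare NoRD RDintensity

* On the real data the model is then fit as in the post:
* mlogit sourcingmode tfp2008 RDintensity NoRD, robust base(1)
```

With this coding, firms with a genuinely missing R&D measure are still dropped, while the zero-R&D firms stay in the sample and are picked up by NoRD.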
Assuming that Maarten's option is a good one, I was wondering whether the following interpretation is right:
Firms in all other strategies are significantly more likely than DO firms to have some workers engaged in R&D. However, among firms with a positive number of R&D workers, the intensity (share of R&D workers over total employment) is higher for DO firms than for the other categories.
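To make the magnitudes behind this reading concrete, the coefficients can be turned into relative-risk ratios. The sketch below only does arithmetic on two numbers copied from the DI equation of the second output; the scalar names are made up for illustration. One caveat worth keeping in mind: because RDintensity is a log, the NoRD coefficient compares a zero-R&D firm with a firm whose RDintensity equals 0, that is, a firm whose untransformed R&D measure equals 1.

```stata
* Coefficients copied from the second mlogit output (DI equation, base DO)
scalar b_nord = -1.893336
scalar b_rd   = -.3095094

* Relative-risk ratio for NoRD = 1 versus a firm with RDintensity = 0
scalar rrr_nord = exp(b_nord)
display "RRR, NoRD (DI vs DO): " rrr_nord

* Factor change in the DI-vs-DO relative risk when the untransformed
* R&D measure doubles, i.e. when RDintensity rises by ln(2)
scalar rrr_double = exp(b_rd * ln(2))
display "RRR, doubling the R&D measure (DI vs DO): " rrr_double
```

The two ratios come out at roughly 0.15 and 0.81. The same kind of numbers can be obtained for every equation at once by adding the rrr option to the mlogit command.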
Just as a side note, this is not a causal study, it just looks at relationships between observed variables across firms sorted into different categories.
Regards,
Jørgen