Coding my continuous variable to natural logarithm

Bence Boross

Join Date: Mar 2020

Posts: 23
#1

11 Apr 2020, 07:25

Dear All,

I am working on research to identify the factors that influence growth ambitions. My continuous dependent variable, growth aspirations, measures the number of jobs created in five years. This variable is right-skewed (12.3), so I would like to take its natural logarithm to make it more normally distributed. However, most of the observations take the value 0, which results in missing data if I transform it. Previous authors calculate entrepreneurs’ growth aspirations as the difference between (the natural logarithms of) the entrepreneurs’ expected number of employees in the next 5 years and the actual number of employees, exclusive of owners, at the firm’s inception. But if I take log(Growthaspirations - actual number of employees), it also results in missing data, as the difference takes negative values. I have 90 thousand observations, and 60-70% would be lost if I took logs. I am using the same database as those authors, and their papers were published in highly regarded journals, so I am sure they did the right thing.

Can anyone help me to solve this issue?

Thank you so much!
Tags: data, log, logit
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

11 Apr 2020, 07:47

Perhaps you should write to the authors of those papers to ask for their advice. That suggestion is intended seriously, not flippantly. On the face of it most of your data is inapplicable. Drop the data or change your goals.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

11 Apr 2020, 07:50

Well, you haven't said why you want to "normalize" this variable--there is seldom a good reason to do this in the first place. The only really good reason to transform (logarithmically or otherwise) a variable in a regression model is to properly specify a non-linear relationship. The distributions of predictor variables are of no importance at all. And as far as the outcome variable is concerned, when people worry about the normality of that they are usually mistaking it for the normality of the regression residuals, which, in turn, can be relevant, but only in small samples.

But assuming you need to linearize a relationship here, look at other transformations that might be helpful in this context and have a logarithm-like shape, such as cube root, or asinh(). I should add that, if you have a big spike of zeroes in your distribution, there is no transformation whatsoever that will eliminate that.
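As a minimal sketch of those two alternatives, assuming the outcome is stored in a variable called GrowthA (the name used in the data example later in this thread; the four values below are made up for illustration, and the new variable names are arbitrary):

```stata
clear
input double GrowthA
0
13
-5
80
end

* asinh(x) = ln(x + sqrt(x^2 + 1)); defined at zero and for negative values
generate double asinh_GA = asinh(GrowthA)

* signed cube root, so that negative values are handled as well
generate double cbrt_GA = sign(GrowthA) * abs(GrowthA)^(1/3)

list
```

Both transformations map 0 to 0, so a spike of zeroes in GrowthA remains a spike of zeroes afterwards.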
Bence Boross

Join Date: Mar 2020

Posts: 23
#4

11 Apr 2020, 08:21

Dear Nick Cox and Clyde Schechter,

I really appreciate your fast reply. Indeed, I need to linearize the relationship.

I am not doing any pioneering work; I am only investigating a well-established field from a slightly different perspective. So, I guess, the goal of my paper is fine. I was just wondering whether all the previous authors dropped nearly half of their observations just to take the natural logarithm of the dependent variable. I am using the same database that they used, so I thought they might have done something different from what I did. I am not a statistician, whereas their articles appear in, for example, the Academy of Management Journal.

I have 90k observations in total, of which 56k take the value 0 for growth aspirations. So is there nothing to be done about it?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

11 Apr 2020, 08:28

I can’t add to my previous comments. Sorry.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

11 Apr 2020, 09:35

"Previous authors calculate entrepreneurs’ growth aspirations as the difference between (the natural logarithms of) the entrepreneurs’ expected number of employees in the next 5 years and the actual number of employees, exclusive of owners, at the firm’s inception."

I take this to mean

Code:

log(expected number of employees in the next 5 years) - log(actual number of employees, exclusive of owners, at the firm’s inception)

Note that they are taking the difference between two logs, neither of which will have zero values. (Unless they had zero employees at inception or expect to have zero employees after 5 years, both of which seem unlikely unless they are classifying all their employees as "contractors" ... .)

You on the other hand are taking the log of the difference between two values, which can be zero or negative.

So previous authors have apparently done something different than you.
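To make the distinction concrete, here is a small sketch with a hypothetical firm that has 2 employees at inception and expects 2 employees after 5 years (the scalar names E0 and E5 and both values are illustrative assumptions, not taken from the data):

```stata
* hypothetical firm: 2 employees at inception, 2 expected in 5 years
scalar E0 = 2
scalar E5 = 2

* difference of logs: defined whenever both counts are positive
display ln(E5) - ln(E0)

* log of the difference: the difference is 0, and ln(0) is missing in Stata
display ln(E5 - E0)
```

The first expression is an ordinary number for this firm; the second is missing, which is how the log of the difference discards every firm that expects no change.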

Bence Boross

Join Date: Mar 2020
Posts: 23
#7

12 Apr 2020, 04:29

Thank you so much for your reply.

These data come from the Global Entrepreneurship Monitor, a survey that focuses on early-stage entrepreneurs. As a result, it is common to have firms with 0 growth aspirations and 0 employees (exclusive of owners). So if I perform the log transformation, I lose most of my observations.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double GrowthA float Ln_FirmS
 0  2
13  2
 7  7
 0  0
 0 16
 0  1
 0  1
 2  0
 0  3
 1  1
 0  0
 2  1
 5 10
 0  3
 3  3
80 20
15 35
 3  3
 3  0
 0  1
 0 30
-1  3
 0 15
 0  2
 7  3
10 10
 3  2
 0  1
 0  4
 0  3
 2  0
 0  2
 2  1
-1  4
 4  4
 0  2
 1  2
 0  2
10  0
 5 10
-5 10
 0  4
 2  2
 0 30
 0  3
 0  3
15  5
 2  2
 3  3
 0  2
 1  2
-1  4
 1  2
 0  4
 0  3
 0  5
 5  2
 0  2
 0  1
 2  1
 1  6
 0  1
 4  1
 0  4
-5 35
-1  1
 0  1
20 50
10 10
-1  1
 0  8
-1  1
-1  1
 0  1
 0  2
 0  2
 2  2
 1  1
-1  3
 4  2
 0  0
 5  1
 0  3
-1  2
 3  2
 2  2
-4  5
 0  5
 0 20
 0  3
 0  1
 2  4
 0  2
 0  0
 3  7
 0  6
14 46
 3  2
-1  4
 1  1
end


William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

12 Apr 2020, 06:29

You do not appear to have understood what I wrote in post #6, or else you have not attempted to see how it applies to your data. I will try one more time with more detail.

There are two basic numbers
The actual number of employees at some time; I will call this E0

The expected number of employees 5 years later; I will call this E5

There are three measures of growth aspirations
E5 - E0: The expected change in the number of employees over the 5 year period; I will call this GAe

E5/E0: The expected change in the number of employees over the 5 year period expressed as a ratio; I will call this GAr

log(E5/E0) = log(E5) - log(E0): The logarithm of this ratio, a common approach to modeling growth; I will call this LGAr

Your variable Ln_FirmS appears to be E0.

Your variable GrowthA appears to be GAe. It is clearly not E0: a firm cannot have negative employment.

Previous authors have used either GAe or LGAr.

With your data
GAe = GrowthA

LGAr = log(E5) - log(E0) = log(E0+GAe) - log(E0) = log(Ln_FirmS + GrowthA) - log(Ln_FirmS)

They do not take log(GrowthA).
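A minimal sketch of that calculation, using six rows copied from the -dataex- listing in post #7 and assuming, as above, that Ln_FirmS holds E0 and GrowthA holds GAe (the names E5 and LGAr for the new variables are my own):

```stata
clear
input double GrowthA float Ln_FirmS
 0  2
13  2
 7  7
 0  0
-1  3
 2  0
end

* E0 is Ln_FirmS and GAe is GrowthA, so E5 = E0 + GAe
generate double E5 = Ln_FirmS + GrowthA

* log growth ratio; ln() of zero or of a negative number is missing in Stata,
* so firms with E0 == 0 or E5 <= 0 end up with missing LGAr
generate double LGAr = ln(E5) - ln(Ln_FirmS)

count if missing(LGAr)
list
```

A firm with GrowthA equal to 0 and a positive Ln_FirmS gets LGAr equal to 0 rather than a missing value. Only the two rows here with zero employees at inception come out missing.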
Bence Boross

Join Date: Mar 2020

Posts: 23
#9

24 Apr 2020, 12:30

Dear William,

Thank you so much for taking the time to explain it to me. Now I can transform my data!

Kind regards,
Bence
