  • Coding my continuous variable to natural logarithm

    Dear All,

    I am working on research to identify the factors that influence growth ambitions. I have a continuous dependent variable named growth aspirations, which measures the number of jobs created in five years. This variable is right-skewed (skewness 12.3), so I would like to take the natural logarithm to make it more normally distributed. However, most observations take the value 0, which results in missing data if I transform it. Previous authors calculate entrepreneurs’ growth aspirations as the difference between (the natural logarithms of) the entrepreneurs expected number of employees in the next 5 years and the actual number of employees, exclusive of owners, at the firm’s inception. But if I take log(Growthaspirations - actual number of employees), it also results in missing data, as the difference can take a negative value. I have 90 thousand observations, and 60-70% would be lost if I transformed the variable to logs. I am using the same database as those authors, and their papers were published in highly recognized journals, so I am sure they did the right thing.

    Can anyone help me to solve this issue?

    Thank you so much!


  • #2
    Perhaps you should write to the authors of those papers to ask for their advice. That suggestion is intended seriously, not flippantly. On the face of it most of your data is inapplicable. Drop the data or change your goals.


    • #3
      Well, you haven't said why you want to "normalize" this variable--there is seldom a good reason to do this in the first place. The only really good reason to transform (logarithmically or otherwise) a variable in a regression model is to properly specify a non-linear relationship. The distributions of predictor variables are of no importance at all. And as far as the outcome variable is concerned, when people worry about the normality of that they are usually mistaking it for the normality of the regression residuals, which, in turn, can be relevant, but only in small samples.

      But assuming you need to linearize a relationship here, look at other transformations that might be helpful in this context and have a logarithm-like shape, such as cube root, or asinh(). I should add that, if you have a big spike of zeroes in your distribution, there is no transformation whatsoever that will eliminate that.
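
      As a rough sketch of those two alternatives in Stata, assuming the outcome is stored in a variable named GrowthA (the name used in the data example later in this thread) and using the hypothetical new variable names GA_asinh and GA_cuberoot:

      ```stata
      * Sketch only: GrowthA is assumed to be the untransformed outcome
      * asinh() is defined for zero and negative values, unlike ln()
      generate double GA_asinh = asinh(GrowthA)

      * Cube root written with sign() and abs() so negative values are kept;
      * raising a negative number to the power 1/3 directly returns missing
      generate double GA_cuberoot = sign(GrowthA) * abs(GrowthA)^(1/3)

      * Both transformations map 0 to 0, so the spike of zeroes is still there
      count if GrowthA == 0
      count if GA_asinh == 0
      ```

      Neither transformation drops observations, but the two counts at the end should agree, which is the point about the spike of zeroes.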


      • #4
        Dear Nick Cox and Clyde Schechter,

        I really appreciate your fast reply. Indeed, I need to linearize the relationship.

        I am not doing any pioneering work; I am only investigating a well-established field from a slightly different perspective. So, I guess, the goal of my paper is fine. I was just wondering whether all the previous authors dropped nearly half of their observations just to take the natural logarithm of the dependent variable. I am using the same database that they used, so I thought they must have done something different from what I did. I am not a statistician, whereas their articles appear in the Academy of Management Journal, for example.

        I have 90k observations in total, of which 56k take the value 0 for growth aspirations. So is there nothing to be done about it?


        • #5
          I can’t add to my previous comments. Sorry.


          • #6
            Previous authors calculate entrepreneurs’ growth aspirations as the difference between (the natural logarithms of) the entrepreneurs expected number of employees in the next 5 years and the actual number of employees, exclusive of owners, at the firm’s inception.
            I take this to mean
            Code:
            log(expected number of employees in the next 5 years) - log(actual number of employees, exclusive of owners, at the firm’s inception)
            Note that they are taking the difference between two logs, neither of which will have zero values. (Unless they had zero employees at inception or expect to have zero employees after 5 years, both of which seem unlikely unless they are classifying all their employees as "contractors" ... .)

            You on the other hand are taking the log of the difference between two values, which can be zero or negative.

            So previous authors have apparently done something different than you.
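
            To make the distinction concrete, here is a minimal sketch in Stata with made-up employee counts. The numbers are illustrative assumptions, not values from the survey.

            ```stata
            * Difference of logs: defined whenever both counts are positive
            * 2 employees now, 5 expected: roughly 0.916
            display ln(5) - ln(2)
            * 2 employees now, 2 expected: exactly 0, and not missing
            display ln(2) - ln(2)

            * Log of the difference: missing when the difference is zero or negative
            * 2 now, 2 expected: ln(0) is missing (.)
            display ln(2 - 2)
            * 2 now, 1 expected: ln(-1) is missing (.)
            display ln(1 - 2)
            ```

            A firm that expects no growth gets 0 under the first approach and a missing value under the second.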


            • #7
              Thank you so much for your reply.

              These data were derived from the Global Entrepreneurship Monitor. This survey focuses on early-stage entrepreneurs. As a result, it is common to have firms with 0 growth aspirations and 0 employees (exclusive of owners). So if I perform the log transformation, I lose most of my observations.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double GrowthA float Ln_FirmS
               0  2
              13  2
               7  7
               0  0
               0 16
               0  1
               0  1
               2  0
               0  3
               1  1
               0  0
               2  1
               5 10
               0  3
               3  3
              80 20
              15 35
               3  3
               3  0
               0  1
               0 30
              -1  3
               0 15
               0  2
               7  3
              10 10
               3  2
               0  1
               0  4
               0  3
               2  0
               0  2
               2  1
              -1  4
               4  4
               0  2
               1  2
               0  2
              10  0
               5 10
              -5 10
               0  4
               2  2
               0 30
               0  3
               0  3
              15  5
               2  2
               3  3
               0  2
               1  2
              -1  4
               1  2
               0  4
               0  3
               0  5
               5  2
               0  2
               0  1
               2  1
               1  6
               0  1
               4  1
               0  4
              -5 35
              -1  1
               0  1
              20 50
              10 10
              -1  1
               0  8
              -1  1
              -1  1
               0  1
               0  2
               0  2
               2  2
               1  1
              -1  3
               4  2
               0  0
               5  1
               0  3
              -1  2
               3  2
               2  2
              -4  5
               0  5
               0 20
               0  3
               0  1
               2  4
               0  2
               0  0
               3  7
               0  6
              14 46
               3  2
              -1  4
               1  1
              end


              • #8
                You do not appear to have understood what I wrote in post #6, or else you have not attempted to see how it applies to your data. I will try one more time with more detail.

                There are two basic numbers
                • The actual number of employees at some time; I will call this E0
                • The expected number of employees 5 years later; I will call this E5
                There are three measures of growth aspirations
                • E5 - E0: The expected change in the number of employees over the 5 year period; I will call this GAe
                • E5/E0: The expected change in the number of employees over the 5 year period expressed as a ratio: I will call this GAr
                • log(E5/E0) = log(E5) - log(E0): The logarithm of this ratio, a common approach to modeling growth; I will call this LGAr
                Your variable Ln_FirmS appears to be E0.

                Your variable GrowthA appears to be GAe. It is clearly not E5: a firm cannot have negative employment.

                Previous authors have used either GAe or LGAr.

                With your data
                • GAe = GrowthA
                • LGAr = log(E5) - log(E0) = log(E0+GAe) - log(E0) = log(Ln_FirmS + GrowthA) - log(Ln_FirmS)
                They do not take log(GrowthA).
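
                A minimal sketch of that calculation in Stata, assuming (as inferred above) that Ln_FirmS holds E0 and GrowthA holds GAe. The new variable names E5 and LGAr are only illustrative.

                ```stata
                * Sketch only: assumes Ln_FirmS is E0 and GrowthA is GAe = E5 - E0
                generate double E5 = Ln_FirmS + GrowthA

                * Difference of logs; restricted to cases where both counts are positive,
                * because ln() of zero or a negative number is missing anyway
                generate double LGAr = ln(E5) - ln(Ln_FirmS) ///
                    if E5 > 0 & Ln_FirmS > 0 & !missing(E5, Ln_FirmS)

                * Check how many observations are still lost, and inspect the rest
                count if missing(LGAr)
                summarize LGAr, detail
                ```

                Observations with GrowthA equal to 0 and a positive Ln_FirmS get LGAr = 0 rather than a missing value. Firms with 0 employees at inception, or 0 expected employees, are still missing under this measure, so the count is worth checking before modeling.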



                • #9
                  Dear William,

                  Thank you so much for taking the time to explain it to me. Now, I could transform my data!

                  Kind regards,
                  Bence
