  • Using log for skewed data

    I am a Stata novice, so please bear with me. Any help would be greatly appreciated.

    I have a collection of research articles, the number of citations each paper has received, when it was released, the number of authors who have worked on the paper, and where these authors are from.

    I am regressing the number of citations (CIT) received on the number of authors (AUT) and on the quality of the universities and research facilities involved. To measure quality, I have selected the top 5 universities and the top 5 non-academic research affiliates in the subject, and I have a dummy variable for each group indicating whether a top 5 university or a top 5 affiliate was involved (UNID and AFFD).

    I then have to control for the skewness of this data, as papers tend to receive most of their citations within the first 4-5 years. To do this I take the natural log (ln) of the citations and authors, and use the i. prefix to control for year.

    So the full command I use is: reg lnCIT lnAUT UNID AFFD i.year, robust

    When I do this I get p-values of near zero and an R-squared of about 0.38, which seem like quite good results.

    We decided to go deeper, so I changed the dummy variables to counts of how many top 5 universities and top 5 affiliates worked on each paper (UNI and AFF).

    Using the same form of regression: reg lnCIT lnAUT UNI AFF i.year, robust

    This gave me slightly higher p-values, still less than 0.05, and about the same R-squared.

    I then figured it would be best to take the log of both these new variables, as they are no longer dummies.

    reg lnCIT lnAUT lnUNI lnAFF i.year, robust

    This then gave me an R-squared of almost 0.5, but all the p-values shot up to well above 0.05.

    Can anyone help me interpret this, or point out anything I've missed? From everything I've learnt in econometrics this should not be the case, but it is fair to say this is not my strongest subject.
    Many thanks in advance.


  • #2
    Without having seen your data, it is difficult to answer your question. I would not use the log-transformation for your independent variables without a good reason. The skewness of the data does not matter for your independent variables.
    The issue is that the coefficient in a regression of a logged dependent variable on a logged independent variable is the elasticity between the two variables in levels. In your case, the last specification estimates the percentage change in the number of citations associated with a 1% change in the number of authors. The same logic applies to your other logged independent variables.
    Whether these elasticities are meaningful depends on your research question.
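    Written out, and assuming the last specification is the usual linear model with year effects, the elasticity reading looks as follows. The symbols \beta_k, \gamma_t and \varepsilon_i are notation introduced here for illustration; they do not come from the thread.

    ```latex
    \ln(\mathrm{CIT}_i) = \beta_0 + \beta_1 \ln(\mathrm{AUT}_i) + \beta_2 \ln(\mathrm{UNI}_i)
                        + \beta_3 \ln(\mathrm{AFF}_i) + \gamma_{t(i)} + \varepsilon_i

    \beta_1 = \frac{\partial \ln(\mathrm{CIT})}{\partial \ln(\mathrm{AUT})}
            = \frac{\partial \mathrm{CIT} / \mathrm{CIT}}{\partial \mathrm{AUT} / \mathrm{AUT}}
    ```

    By contrast, when a regressor such as UNI enters in levels, its coefficient \beta_2 is a semi-elasticity: one additional top 5 university is associated with a change in citations of roughly 100 \beta_2 percent (exactly 100 (e^{\beta_2} - 1) percent).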
    When in doubt, I would try to find out what the bibliometrics literature has to say about this modelling exercise.
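    One thing that may be worth checking across the three specifications is the estimation sample. In Stata, ln(0) evaluates to missing, and regress drops any observation with a missing value in a model variable. So if some papers have UNI == 0 or AFF == 0, the model with lnUNI and lnAFF would be fitted on fewer papers than the other two. The sketch below uses made-up toy values (not your data) purely to show the mechanics; the variable names follow the thread.

    ```stata
    * Toy illustration with made-up values; variable names follow the thread.
    clear
    set obs 6
    generate UNI = cond(_n <= 4, 0, _n - 4)   // counts: 0 0 0 0 1 2
    generate lnUNI = ln(UNI)                  // ln(0) is missing in Stata
    count if missing(lnUNI)                   // papers that would drop out
    ```

    On the real data, comparing e(N) after each regress call, or running count if UNI == 0 | AFF == 0, would show whether the three models use the same papers.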
