'Simple' correlation regression

Thomas Meurs

Join Date: Jun 2015

Posts: 27
#1

'Simple' correlation regression

03 Jun 2015, 07:15

Hi all,

I'm quite new to Stata and I already run into trouble.

I want to run a fairly simple regression which tests the correlation between 2 variables.

My dataset is obtained from compustat and covers all firms in compustat from 01/01/02 up to 31/12/14. I want to check whether 'revenue growth' is correlated with 'epmloyee growth' as in:

Rev Growth at t = B0 + B1Employee growth at t. (I require B1)

Both employee- and revenue growth are computed as (Revenue at t - Revenue at t-1)/Revenue at t-1.

My dataset contains a lot of firms which all have a unique identifier and multiple year data. How do I do this? I'm not even sure how to compute the growth for each individual firm in one command.

P.s. I want to import an Excel file into stata which has about 1.1 million observations and Stata already freezes, is this due to my computer (5+ years old) or is this considered a 'large' sample for Stata understandings?
Tags: None
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

03 Jun 2015, 07:27

Hello Thomas,

Welcome to the Statalist!

With regards to the "fairly simple correlation", perhaps you may reach what you want by typing:

Code:

. by firm, sort: pwcorr rev_growth emp_growth, obs sig */ Or maybe you wish this: . by firm year, sort: pwcorr rev_growth emp_growth, obs sig

Best,

Marcos

Best regards,

Marcos
Comment
Thomas Meurs

Join Date: Jun 2015

Posts: 27
#3

03 Jun 2015, 16:02

Allright, I might want to specify some things.

I'm trying to replicate a previous paper, in which they state:
'We test whether Employee growth is typically correlated with Revenue growth by performing the following regression on all firms in Compustat during our sample period (1993 - 2002):
Rev Growth at t = B0 + B1Employee growth at t
In nontabulated results, the coefficient on Employee growth is positive (0.597) and highly significant (p<0.01), The R² is high (0.28), suggesting a strong correlation between employee growth and revenue growth reported by companies.

What I did:
Download revenue + employee data for all firms in compustat for 1993 - 2002. Calculated revenue and employee growth by (I did not take into account missing data):

Code:

bys gvkey: gen rev_g=(revt[_n]-revt[_n-1])/revt[_n-1]

Where gvkey is a unique firm identifier and revt is total revenue.

So I am not even sure if this is a correlation or a OLS regression. However, when I use 'reg, corr or pwcorr' I never get anything close to 0.597 (and it does not show R², probably just a setting, I know..)
Is it neccesary to remove missing data or is it ignored when running a regression? (both missing data and a '.' for the first years of rev_g/emp_g)

Last edited by Thomas Meurs; 03 Jun 2015, 16:09.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30162
#4

03 Jun 2015, 16:24

Well, for one thing, it sounds like you have panel data, so the use of -regress-, -corr-, or -pwcorr- would not be appropriate. You probably want to -xtset gvkey year- and then look into -xtreg-, probably with fixed effects.

Also, I don't know what the paper was referring to when they mentioned "the" R2, because panel data regression would give three R2 values, one within, one between, and one overall.

You do not have to explicitly remove missing data before doing any kind of regression model in Stata. Stata automatically excludes such observations from the calculations (sometimes to the dismay of users who aren't expecting that).
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17730
#5

04 Jun 2015, 04:42

Thomas:
why don't you e-mail the Authors of the paper and ask them what they did in their reserch?

Kind regards,
Carlo
(Stata 19.0)
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

04 Jun 2015, 07:54

It seems it is not a "simple correlation regression" query at all. Also, I gather the interest is mostly towards the coefficient of determination.

Besides, Clyde and Carlo already provided great suggestions.

That said, here I share some suggestions from Multilevel and Longitudinal Model Using Stata (Rabe-Hesketh and Skrondal, StataPress, 2012), basically, when it comes to estimate R-squared from different - xtreg - models.

One can compare the null model with the "full" model, as well as with other models. In order to calculate the R-squared, first, we square the values of /sigma_u and /sigma_e of both models. Then, sum up the values in each model, respectively. To get the fraction that would "explain the variance by the covariates" (in the case presented in the query, that could be just one xvar), there is a quite "simple" formula according to the results we already got: (null model - "other model")/null model.

Hopefully that helps,

Best,

Marcos

Best regards,

Marcos
Comment
Thomas Meurs

Join Date: Jun 2015

Posts: 27
#7

04 Jun 2015, 09:20

That's exactly what I did. One of the auditors replied and he said he used the program 'SAS' (which I'm not familiar with) and ran a 'Proc Corr' procedure. He explicitly mentioned it is a simple correlation. I googled it and it appears to be a Pearson correlation coefficient (which is similar to the pwcorrelation function in Stata right?)
In perspective of the paper, which looks whether non-financial measures (e.g. employee growth) are related to cases of fraud (overstatement of revenue) I think I'm just interested if employee growth is correlated to revenue growth (if anyone is interested, link is posted below). So isn't it an option I just check the correlation between each revenue growth/employee growth pair?

I added the .dta file link (which should be accessible for 48h) if anyone cares to try.

The paper: http://onlinelibrary.wiley.com/doi/1...09.00349.x/pdf

Wharton Research Data Services

https://wrds-sol2.wharton.upenn.edu

Wharton Research Data Services taps the most comprehensive sources of financial, accounting, economic, management, marketing, banking, and insurance data.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30162
#8

04 Jun 2015, 09:51

Perhaps your inability to replicate the analysis arises from your remark

(I did not take into account missing data.)

For example, if a firm has missing records in 2001 and 2000, the calculated growth rates from 1999 to 2002 will be seriously inflated, and this may distort the calculation of correlations. Why don't you try it again as:

Code:

xtset gvkey year gen rev_growth = D1.rev_t/L1.rev_t gen employee_growth = D1.employee/L1.employee corr rev_growth employee_growth regress rev_growth employee_growth

and see if that works out better.

That said, I still don't think that taking a correlation or performing a regression on pooled data from many firms like this is an appropriate analysis (unless, without telling us in the article they first demonstrated that other firm-level effects were ignorable in their data), even if they did manage to get it published. Another possible interpretation of what they say in the article on page 1150 (p. 16 of the pdf at your link), is that they did a separate regression for each firm and are reporting the average of the coefficients and R2 terms. If that's what they did, the replication would look something like this:

Code:

statsby _b e(r2), saving(firm_by_firm_results, replace) by(gvkey): /// regress rev_growth employee_growth use firm_by_firm_results, clear rename _eq2_stat_1 r2 summ _b_employee_growth r2

One other question: they wrote this article in 2006 and it was accepted in 2009. Is it possible that the data at Compustat have been revised since then? Can you replicate their tables 1, 2, and 3, for example?

Last edited by Clyde Schechter; 04 Jun 2015, 10:01.
Comment
Joshua D Merfeld

Join Date: Jun 2015

Posts: 86
#9

04 Jun 2015, 10:45

Hi Thomas,

Looking at your code, it jumps out at me that you may not be calculating revenue change correctly. You have only sorted on the unique identifier for business, but not on wave. As such, depending on how your dataset was created, although you are calculating differences in revenue between one observation and the next, the observations may not be in the correct temporal order. I would suggest trying:

Code:

sort gvkey year bys gvkey: gen rev_g=(revt[_n]-revt[_n-1])/revt[_n-1]

You are quite correct that you only want to calculate changes within unique ids, though, so be sure not to use year in the -bys: gen- command.

Of course, depending on how your data were already sorted, this may not be an issue, but hopefully this gets you one step closer!

Hope this helps.
Josh
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30162
#10

04 Jun 2015, 11:00

Joshua is correct: if your data was not sorted by year within gvkey, then your calculation of rev_g was off the mark. I wish I had noticed that!

However, his code can be simplified:

Code:

bysort gvkey (year): gen rev_g = (revt - revt[_n-1])/revt[_n-1]

The -bysort gvkey (year)- prefix, with year in parentheses, tells Stata to sort on gvkey and year within gvkey, but to define the by-groups using only gvkey. And it is not necessary to explicitly say revt[_n], because revt without any subscripts is understood to mean the current observation.

All of that said, if there are missing years within any of the panels, the above code will still be wrong because it will calculate the growth from whatever observation happens to be preceding, even if it is several years earlier. The code in #8, which relies on the D and L operators automatically handles that situation correctly.
Comment
Joshua D Merfeld

Join Date: Jun 2015

Posts: 86
#11

04 Jun 2015, 11:12

Clyde,

Thanks for that code snippet. I was not aware of the power of parentheses in a bys command!
Comment
Thomas Meurs

Join Date: Jun 2015

Posts: 27
#12

04 Jun 2015, 13:20

Clyde Schechter As I read it, they took every firm between 1993 and 2002 that Compustat had data on, so even if Compustat updated some of its data I doubt they've changed so much that the correlation went from 0.597 to 0.0398. However, If they took only the firms in their sample (which is not really what they state) this might give different results since I don't have their sample and I've been testing on the entire population

In my latest runs I first removed all observations which did not have data for both revenue and employee numbers or if it was 0 (which makes no sense)

Code:

//removing missing observations revt + emp drop if emp >= . drop if revt >= . //remove 0 rev/emp drop if revt == 0 drop if emp == 0

I do get different results depending on how I calculate my growth rates.
When using:

Code:

xtset gvkey fyear gen rev_g = D1.revt/L1.revt gen emp_g = D1.emp/L1.emp corr rev_g emp_g regress rev_g emp_g

I get a correlation of 0.0398 and the regression coefficient of 0.1894329

When using:

Code:

bysort gvkey (fyear): gen rev_g = (revt - revt[_n-1])/revt[_n-1] bysort gvkey (fyear): gen emp_g = (emp - emp[_n-1])/emp[_n-1] corr rev_g emp_g regress rev_g emp_g

I get a correlation of 0.1409 and the regression coefficient is 1.92423

Where the latter is more close to their result..

I didn't understand the following part (still new to stata). It says: /// command not found.

Code:

statsby _b e(r2), saving(firm_by_firm_results, replace) by(gvkey): /// regress rev_growth employee_growth use firm_by_firm_results, clear rename _eq2_stat_1 r2 summ _b_employee_growth r2

I think I'll just leave this for now and ask my thesis supervisor what he thinks. Thank so far anyways
Comment
Thomas Meurs

Join Date: Jun 2015

Posts: 27
#13

04 Jun 2015, 13:35

Update:
While I was being a critical thinker I was wondering a possible other explanation. When I only do this (no removal of missing observations or anything):

Code:

//Revenue// gen rev_g=(revt[_n]-revt[_n-1])/revt[_n-1] label var rev_g "Revenue Growth" //Employee// gen emp_g=(emp[_n]-emp[_n-1])/emp[_n-1] label var emp_g "Employee Growth" corr rev_g emp_g pwcorr rev_g emp_g, obs sig star(0.01)

I get a correlation of 0.6324* (the pwcorr thing is because I know no other way of determining the significance of a correlation.

Anyone who believes 3 authors whom got their paper published in an A-rated accounting journal 'forgot' to calculate revenues for the firm specific observations and calculated them in one run?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30162
#14

04 Jun 2015, 13:48

I didn't understand the following part (still new to stata). It says: /// command not found.

When I post code on this Forum, I usually assume that it will be pasted into a do-file, perhaps edited, and then run from there. The /// notation in a do-file tells Stata that the command continues on the next line. Without it, a line break ends the command. But if you tried to run this by entering it in the Command window line-by-line, Stata will not recognize /// as a continuation symbol. Instead it thinks it's just a regular part of the line you entered. In this case, what follows the colon in a -statsby- command is supposed to be a command or name of a program. So Stata would think you are asking it to execute the command -///- repeatedly for each group. And, of course, there is no such command.

So if you want to run my code from the Command window, you need to eliminate the /// and put the -regress...- from the next line immediately after the colon without a line-break between. But really, it's better to copy the code into the do-editor and run it from there. Since you are doing this for a thesis, you will need a good audit trail of everything you have done, and having your code in do-files helps provide that. What you put in the Command window is lost forever once you close your Stata session.

With regard to the different results using the D and L operators vs. using _n-1 subscripting: if there were no missing years the two codes would produce identical results. The fact that there is a difference means that there must be some missing years in the data. (This may have been true in the original Compustat data set, or also may have been introduced by your code dropping observations with missing or zero values for emp and revt.) Moreover, given that there are missing years, the version with the D and L operators will be correct and the version using _n-1 subscripting will be incorrect.

Your thought on sample vs population may well be on target. I'll be curious to learn what your thesis supervisor has to say about all of this. Good luck!
Comment

Announcement

'Simple' correlation regression

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment