Small correlations between DV and IVs

Monica Muller

Join Date: Jul 2014

Posts: 226
#1

06 Jun 2017, 02:34

Hi,

I have a continuous dependent variable, with several dummy and continuous variables as independent and control variables. When I run the regression, the results are all significant and in the direction of my hypotheses. But the correlations between the DV and the IVs are in the range of 0.04 to 0.08, all significant.

Is it a serious problem? How can I address it?

Thanks
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

06 Jun 2017, 02:40

Monica:
your chances of getting helpful replies are conditional on posting what you typed and what Stata gave you back (as per the FAQ). Thanks.

Kind regards,
Carlo
(Stata 19.0)
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

06 Jun 2017, 03:03

The best way to get helpful replies, as Carlo pointed out, is to share your data and commands.

That said, and assuming you performed a linear regression, I wonder whether you also got a tiny R-squared. If so, it may well be an issue related to the model itself.

Additionally, you may wish to start with a model that has fewer variables, each with a 'strong' rationale, and check what happens.

Best regards,

Marcos
Comment
Monica Muller

Join Date: Jul 2014

Posts: 226
#4

06 Jun 2017, 09:11

Thanks Carlo and Marcos,

I am attaching the regression outputs, once with control variables and once without. All the x variables are standardized. As you said, R-squared is extremely small when the control variables are not included. I have developed theory and hypotheses around these x variables, and I have written the theoretical and method parts of the paper. The results are significant and all in the correct direction, but the effect sizes are tiny. I have spent 2 years on this project developing the theory and cleaning the data, but the results are very weird. Does it mean the whole thing is useless?

Last edited by Monica Muller; 06 Jun 2017, 09:15.
Comment
Monica Muller

Join Date: Jul 2014

Posts: 226
#5

06 Jun 2017, 09:14

without control

with control:
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#6

06 Jun 2017, 10:15

Monica:
I'm not clear whether you have a cross-sectional or a panel dataset (as it would seem from the -yr**-).
Assuming that -regress- was the way to go, did you perform a thorough postestimation session (-estat ovtest-; -estat hettest-; -estat vif-)?
I find it strange that you did not use -fvvarlist- notation for -yr**-.
I would also suggest testing -yr**- jointly via -testparm-.
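
The checks mentioned above could be run along the following lines. This is only a sketch on Stata's shipped auto dataset, which stands in for the poster's data (an assumption, since that data is not shown in the thread): -price-, -mpg- and -weight- are placeholders, and -i.rep78- plays the role of the -yr**- dummies written in factor-variable notation.

```stata
* Stand-in data (assumption: the thread's dataset is not available)
sysuse auto, clear

* Factor-variable notation builds the category dummies on the fly
regress price mpg weight i.rep78

* Ramsey RESET check of functional form
estat ovtest

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest

* Variance inflation factors
estat vif

* Joint test that all the rep78 dummies are zero
testparm i.rep78
```

With the real data, -i.rep78- would be replaced by the hiring-year variable in -i.- notation, assuming a single categorical year variable exists.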

Kind regards,
Carlo
(Stata 19.0)
Comment
Monica Muller

Join Date: Jul 2014

Posts: 226
#7

06 Jun 2017, 10:29

Hi Carlo,

The year variables are the dummies for the year the employees got hired. But the performance variable is only one observation per person and it's their last performance. Tenure controls for the number of years they were in the organization. Some people left at different times. In other words, all the variables are time invariant. Does it justify using regress?
Comment
Monica Muller

Join Date: Jul 2014

Posts: 226
#8

06 Jun 2017, 10:33

I also just ran the three tests on the full model:
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#9

06 Jun 2017, 10:37

Monica:
thanks for further clarifications.
Yes, I would still go with -regress-, but I would use -fvvarlist- notation and would search for possible turning points for -age- and -tenure-.
I suspected, as is the case, that your regression model suffers from omitted variable bias (that is, some predictors, such as -age- and -tenure-, may have a non-linear relationship with the DV).
Besides, if your DV is positively skewed, try logging it. Logging may improve the goodness of fit of your model and may reduce both omitted variable bias and heteroskedasticity (omitted variable bias is ranked first on my misspecification list, though).
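
A squared term and its turning point can be obtained with factor-variable notation roughly as follows. This is a sketch under assumptions: the auto dataset is a stand-in, and -mpg- merely plays the role that -age- or -tenure- would play in the poster's model.

```stata
* Stand-in data (assumption: not the poster's dataset)
sysuse auto, clear

* c.mpg##c.mpg expands to mpg plus mpg squared
regress price c.mpg##c.mpg weight

* Turning point of the quadratic in mpg: -b/(2a),
* where b is the linear and a the squared coefficient
display -_b[mpg]/(2*_b[c.mpg#c.mpg])

* Same quantity with a delta-method standard error
nlcom -_b[mpg]/(2*_b[c.mpg#c.mpg])
```

A turning point that falls outside the observed range of the variable suggests the fitted relationship is monotonic over the data actually used.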

Last edited by Carlo Lazzaro; 06 Jun 2017, 10:40.

Kind regards,
Carlo
(Stata 19.0)
Comment
Monica Muller

Join Date: Jul 2014

Posts: 226
#10

06 Jun 2017, 11:31

Hi Carlo,

Thanks so much for taking the time and answering my questions. I really appreciate it.
I followed your advice, and it sounds like age and tenure do have a nonlinear relationship with the DV, but even with that, R-squared didn't change much. Also, the effect sizes of my main predictors are still small. The DV is almost normally distributed.

Here is the new output:

Performance Distribution:
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#11

06 Jun 2017, 11:53

Monica:
- what about heteroskedasticity and omitted variable bias with this new model specification? If they are still apparent, your estimates are unavoidably biased.
- what if you log the DV?
- despite age and squared age being statistically significant, there's something strange about the turning point, since the formula -b/(2a) (with b the coefficient on age and a the coefficient on squared age) gives back a negative number, which is obviously outside any range for age:

Code:

. di -(-.1170154)/(2*-.0557921)
-1.0486736

Conversely, tenure and squared tenure look reasonable:

Code:

. di -(.2667565)/(2*-.0488669)
2.7294191

Kind regards,
Carlo
(Stata 19.0)
Comment
Monica Muller

Join Date: Jul 2014

Posts: 226
#12

06 Jun 2017, 13:37

Thanks again, Carlo. Unfortunately, the p-values for hettest and ovtest are still 0. I also tried it with the log of the DV, but the results are the same. The age variable is standardized, so I guess that's why the turning point is negative. Sounds like I am out of options, right?
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#13

06 Jun 2017, 14:27

Monica:
what if you center -age- and -tenure- around their mean without standardizing age?
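
Centering without standardizing, and reporting the turning point back in raw units, might look like the sketch below. The auto dataset and -mpg- are stand-ins (assumptions) for the poster's data and -age-; the names mpg_mean, mpg_c and tp are purely illustrative.

```stata
* Stand-in data (assumption: not the poster's dataset)
sysuse auto, clear

* Center around the mean but keep the original units (no division by the SD)
summarize mpg, meanonly
scalar mpg_mean = r(mean)
generate mpg_c = mpg - mpg_mean

regress price c.mpg_c##c.mpg_c weight

* Turning point on the centered scale, then shifted back to raw units
scalar tp = mpg_mean + (-_b[mpg_c]/(2*_b[c.mpg_c#c.mpg_c]))
display "turning point (raw mpg units): " tp
```

The same logic explains a negative turning point for a standardized variable: it is measured in standard deviations from the mean, so it converts to raw units as mean + z*SD.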
Did you consider all the predictors reported in the literature of your research field for this kind of research?
As per your ANOVA-like output table, it seems that the residual variation is still substantial. I suspect that some predictor left in the residuals is correlated with both the DV and one or more independent variables (endogeneity).
For instance, it may be that belonging to a given industry correlates with both tenure and DV (I assume that industry is actually not included among the set of your anonymized predictors).
Another issue I would investigate relates to the risk of reverse causality (endogeneity again): are you sure that performance cannot explain variation in tenure, for instance?
That said, please consider that I'm not a labor economist and take a look at the literature in your research field to look for other suggestions.
Finally, it may also be that the R-squared values of linear regression models in your research field are simply low on average.

Kind regards,
Carlo
(Stata 19.0)
Comment
Monica Muller

Join Date: Jul 2014

Posts: 226
#14

07 Jun 2017, 18:48

Carlo, thank you so much for all the help. I will try your other suggestions. Thanks.
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2156
#15

08 Jun 2017, 10:53

I have a somewhat different take.

1. The hettest is unnecessary. Use robust standard errors. Unless you have a good reason to want to know whether there's heteroskedasticity I wouldn't even bother.
2. perf takes negative and positive values, so taking the log is not an option.
3. There's a case for ovtest, but that's a pure test for functional form. I've written on this. It's pretty clear that putting in nonlinear functions of fitted values, which are just functions of the covariates, is testing functional form. Of course your functional form is not perfect -- it rarely is -- and with lots of observations you'll detect minor deviations that are unlikely to be important. I agree with Carlo that putting in squares can be a good idea. You might try some interactions, too. But OLS still estimates the best approximation to the conditional mean function, and the marginal effects you get from a flexible linear model are probably similar to those you would get if you could really estimate the mean.
4. I rarely look at VIFs, and I never do if the variables I'm interested in have precisely estimated coefficients. If you'd gotten a VIF of 11, say, for x1, what would you do? Nothing. Any correlation between x1 and the other independent variables is properly captured by the standard errors. Your standard errors give you lots of significant variables.
5. An R-squared of .18 is not bad for cross-sectional regressions. It means lots of the variation in perf is unexplained. That is not surprising. It doesn't mean that the coefficients on your x variables are somehow biased. Correlation between x and the error term is a completely different matter.
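
Point 5 can be illustrated with a small simulation. This is a sketch with made-up data, not the poster's: the slope of 0.1, the seed and the sample size are arbitrary assumptions, and x is uncorrelated with the error by construction.

```stata
* Simulated data (assumption: illustrative values only)
clear
set seed 12345
set obs 100000

* Standardized predictor and an error term independent of it
generate x = rnormal()
generate y = 0.1*x + rnormal()

regress y x

* The slope should land close to the true 0.1 even though R-squared is near 0.01
display "slope = " _b[x] "   R-squared = " e(r2)
```

Here the correlation between y and x is only about 0.1, yet the coefficient is estimated precisely and without bias; a small R-squared and a biased coefficient are separate issues.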

Your regression looks pretty good even though I don't know x1, x2, and so on. Use the "robust" option to obtain heteroskedasticity-robust standard errors. Maybe try some interactions between tenure and some of the x variables, and maybe age. Or interactions among the xj. Be sure to use "margins" to obtain average marginal effects.
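
A minimal sketch of that recipe, assuming the auto dataset as a stand-in for the unseen data: -mpg- and -weight- take the place of tenure and one of the x variables, and -foreign- is just an example dummy.

```stata
* Stand-in data (assumption: not the poster's dataset)
sysuse auto, clear

* Interaction of two continuous predictors, with heteroskedasticity-robust SEs
regress price c.mpg##c.weight i.foreign, vce(robust)

* Average marginal effects, averaging over the interaction
margins, dydx(mpg weight)
```

Because of the interaction, the raw coefficient on -mpg- is the effect only where -weight- equals zero; the -margins- output averages the effect over the sample instead.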
Comment
