  • Checking assumptions for multiple regression - right approach?!

    Hi everyone!

    I am currently struggling with my dataset and the multiple regression I would like to run, as there are certain assumptions that have to be met beforehand (listed below).
    1. Assumption: You should have independence of observations (i.e., independence of residuals), which you can check in Stata using the Durbin-Watson statistic.
    2. Assumption: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively. You can check for linearity in Stata using scatterplots and partial regression plots.
    3. Assumption: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. You can check for homoscedasticity in Stata by plotting the studentized residuals against the unstandardized predicted values.
    4. Assumption: Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other. You can check this assumption in Stata through an inspection of correlation coefficients and Tolerance/VIF values.
    5. Assumption: There should be no significant outliers, high leverage points or highly influential points, which represent observations in your data set that are in some way unusual. These can have a very negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variables. You can check for outliers, leverage points and influential points using Stata.
    6. Assumption: The residuals (errors) should be approximately normally distributed, which you can check in Stata using a histogram (with a superimposed normal curve) and Normal P-P Plot, or a Normal Q-Q Plot of the studentized residuals.
    My dependent variable is continuous, as is my only control variable, and both were logarithmized. All other independent variables are dummy variables (7 of them that I want to test), which are set to 1 if an observation meets the criterion they were created for.
    The problem I'm facing is that I have worked with SPSS before and have no clue how to check these assumptions in Stata. And currently I'm kind of devastated, as I'm not even sure this is the right way to test my data statistically.
    For example, I was looking at the scatterplots between the DV and each of the IVs - but as they are all dummy variables, the plots looked weird.
    I did some research, and it is said that Stata has all the tools to check these assumptions, but I don't know where to start or how to check them all without making a mistake.

    I just added a picture of the variables. There are two more dummy variables - but I guess the approach will be the same for them.

    I would be glad for any advice, as it is hard to find any good source on dummy variables and regression.

  • #2
    If you run -help regress postestimation- and, after reading that, -help regress postestimation plots-, you will find pretty much everything you need there.
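
    To give a flavor of what is there, a minimal sketch along the following lines covers most of the checks you listed (y, x1, x2, and x3 are placeholders for your own variables):

        regress y x1 x2 x3
        rvfplot                               // residuals vs. fitted values: look for fanning out (heteroscedasticity)
        estat hettest                         // Breusch-Pagan/Cook-Weisberg test for heteroscedasticity
        estat vif                             // variance inflation factors (but see my comment below)
        avplots                               // added-variable (partial regression) plots for linearity
        predict cooksd if e(sample), cooksd   // Cook's distance for influential observations
        lvr2plot                              // leverage vs. squared residuals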

    A couple of comments about the approach.

    Assumption: Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other. You can check this assumption in Stata through an inspection of correlation coefficients and Tolerance/VIF values.
    I know this is still widely taught, and easy to do in Stata. But it really is a waste of time. If the confidence intervals around all the coefficients that you are interested in are sufficiently narrow that the estimated coefficients are precise enough to be useful, then there is no multicollinearity problem. If there is a multicollinearity problem, there is no way to solve it with the existing data anyway. Do read Goldberger's textbook of econometrics: there is a well written chapter about multicollinearity in which he entertainingly and exhaustively demolishes the whole idea.

    Assumption: The residuals (errors) should be approximately normally distributed, which you can check in Stata using a histogram (with a superimposed normal curve) and Normal P-P Plot, or a Normal Q-Q Plot of the studentized residuals.
    Again, easy enough to do in Stata but probably not worth doing. Unless your sample is small, the central limit theorem will kick in and assure that the distributions of the t-statistics you rely on for inference are well approximated by the standard normal distribution, so all of your tests and p-values will be correct. Normality of residuals is only needed in small samples.
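
    If you nevertheless want to look at the residuals, a minimal sketch would be (again with placeholder variable names):

        regress y x1 x2 x3
        predict rstu if e(sample), rstudent   // studentized residuals
        histogram rstu, normal                // histogram with a superimposed normal curve
        qnorm rstu                            // normal quantile (Q-Q) plot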

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
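
    For example, something along these lines posts an excerpt of the first 20 observations (replace the variable list with your own):

        ssc install dataex        // only needed if -dataex- is not already installed
        dataex y x1 x2 x3 in 1/20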





    • #3
      I mostly agree with Clyde, except I will add that multicollinearity may occur because of error on the researcher's part. For example, including variables that were computed from each other, or including 10 measures of what is basically the same thing when a single scale would have been better. See pp. 3-4 of

      https://www3.nd.edu/~rwilliam/stats2/l11.pdf

      I'll also add that things like x and x^2 tend to be highly correlated, and you shouldn't worry about things like that.
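
      For instance, in a small illustrative sketch (x and y are placeholders), the large VIFs are expected and harmless:

          gen xsq = x^2        // squared term, strongly correlated with x by construction
          corr x xsq
          regress y x xsq
          estat vif            // large VIFs here do not signal a problem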
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Thanks a lot for the advice!
        Is there any possibility to use factor variables with dataex? When I try to use it, this error comes up:

        . dataex ln_Amount_Raised ln_Tokens_For_Sale i.d_PreICO i.d_Restricted_Areas i.d_Bonus i.Whitelist i.KYC i.White_Paper n_Overall_Rating
        factor-variable and time-series operators not allowed
        r(101)


        I don't know whether I made a mistake when choosing the type of variable in my regression and whether that is why I can't use dataex. All dummies (the ones with an 'i.' above) were put into the regression model as factor variables with the specification 'main effect' and the base set to 'default'. 'Amount_Raised' and 'Tokens_For_Sale' (both logarithmized) and, in this case, 'Overall_Rating' were entered as continuous variables.
        Is there already a mistake there?



        • #5
          As I tried to check the assumptions, I ran two different models (screenshots added, as I can't use dataex for some reason).
          For both, I first wanted to check the homoscedasticity of the residuals using the Breusch-Pagan test.
          If I add my variable called "overall_rating", my R-squared doubles, but the test fails.
          Any idea why that is and what to do now? I would like to test that variable…

          Thanks a lot for any advice.




          • #6
            Expressions such as i.varname are not variable names: they are factor-variable notation that Stata, in the context of some commands, recognizes to create "virtual" variables for the particular command's use. These "virtual" variables are not actual variables and do not appear anywhere in the dataset. -dataex- is intended to generate code that will enable the re-creation of (part of) your actual data set. So -dataex- has no reason to recognize factor variable notation. Just take the i. prefixes out of your -dataex- command and it will run just fine.
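
            For instance, your command from #4 should run fine once the prefixes are dropped:

                dataex ln_Amount_Raised ln_Tokens_For_Sale d_PreICO d_Restricted_Areas d_Bonus Whitelist KYC White_Paper n_Overall_Rating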

            As for the results of -estat hettest-, I am personally not fond of these statistics as a basis for making modeling decisions. But I am particularly not fond of making a fetish of the 0.05 threshold. I would say that your second regression is, on the basis of what is shown, just as problematic as the first: 0.0588 is scarcely different from 0.05. I would first explore graphically what the heteroscedasticity actually looks like (-rvfplot-, -rvpplot-). The plots may suggest that the problem can be resolved by some suitable transformation of a variable. If those don't lead to an improvement in your model, a simple solution to heteroscedasticity is to use the -vce(robust)- option in the -regress- command.
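
            For instance (with y, x1, and x2 standing in for your own variables), the sequence might look like this:

                regress y x1 x2
                rvfplot, yline(0)                // residuals vs. fitted values
                rvpplot x1, yline(0)             // residuals vs. a particular predictor
                regress y x1 x2, vce(robust)     // if no transformation helps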



            • #7
              Okay, I just left out the i. prefixes and everything still stays the same. If needed, I will use the -dataex- command.
              Do I need to plot every variable with the -rvpplot- command? Or is there any hint as to which one to check first?
              I already did some of them, but how do I interpret them?

              Sorry for asking so much, but I really want to have meaningful results, and if there is any violation of the assumptions, it's going to disturb that, right?
              Attached Files



              • #8
                If those don't lead to an improvement in your model, a simple solution to heteroscedasticity is to use the -vce(robust)- option in the -regress- command.
                How can I tell when to use it? If it is the easy way out - can I simply use it, or should I try a transformation first?


                To add to my last post:
                I haven't interpreted any graphs yet, but to me the plot of fitted values vs. residuals looks like a cloud, which would suggest there is no heteroscedasticity.
                Can anyone confirm this?



                • #9
                  I just tried the -robust- option, and it seems that there is not a big difference. Stata does not let me carry out any test for heteroscedasticity, but I guess that's because of the -robust- option.

                  In the meantime, I also checked assumption No. 6 in the following way:
                  1. Ran the regression.
                  2. predict Residuen, resid
                  3. sktest Residuen
                  4. swilk Residuen

                  Both tests showed, for both the normal and the robust regression, that the residuals are not approximately normally distributed (p = 0.000).
                  Anything I can do there?

                  I'm kinda screwed, as I need this to be done and handed in by Monday… any help appreciated.



                  • #10
                    For the regression with -robust-, the normality of residuals is completely irrelevant. No need to test for it; part of what the robust estimator is robust to is the distributional assumptions about the residuals.

                    As for the regression with the usual OLS variance estimator, your sample size of 621 is sufficiently large that you needn't concern yourself with normality of residuals. The central limit theorem will give you assurance that inferences based on these standard errors are still correct.

                    So the short answer is that there is nothing to do; there is no problem. It isn't broke. Don't try to fix it.

