
  • Solution to the small sample size?

    Hi All,

    I have tried searching for a similar case but haven't found one yet.
    I need to run a regression for a university project, and due to the subject I'm analyzing I have a relatively small sample size of 70 observations but many independent variables I would like to test.
    Do you think it would be a good idea to split my independent variables into groups and test the dependent variable against one group at a time?
    They could be split by category, since each of them belongs to a different one (they are financial factors like profitability, liquidity, and so on).


    Do you think I risk heteroskedasticity and misspecification by proceeding in this way?

    Thank you in advance for your help!

  • #2
    When your analysis shows that certain variables influence the dv, and you then omit them and estimate another model explaining the same dv, your first results demonstrate that the second model suffers from omitted variable bias. The only time this is legitimate is when the rhs variables are uncorrelated (which usually occurs only in experimental data). So, worse than misspecification or heteroskedasticity, you're proving your own results are biased and inconsistent.

    There are few good solutions to your problem. One strategy would be to try to condense the pile of rhs variables into a smaller set of factors. This could take the form of either exploratory factor analysis followed by generating predicted scores and regressing on the scores, or structural equation modelling (confirmatory factor analysis).
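    For instance, a minimal sketch of the first strategy in Stata (the variable names here are placeholders, not a recommendation of which variables to use):
    Code:
    * condense correlated rhs variables into a few factors, then regress on the scores
    factor x1 x2 x3 x4 x5, pcf factors(2)
    predict f1 f2
    regress y f1 f2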



    Comment


    • #3
      Ala:
      welcome to this forum.
      As an aside to Phil's helpful advice, please note that there are structural limits in regression: for instance you cannot have more predictors than observations.
      That said, as probably many of your predictors are controls, you should first try to give a fair and true view of the data generating process by selecting the predictors that make sense, including on theoretical grounds. Help can come from the literature in your research field.
      Eventually, please consider a comprehensive set of postestimation tests to check whether your model is correctly specified, suffers from heteroskedasticity or, worse, endogeneity. If you are planning to run an OLS, you should consider the -regress postestimation- suite of commands.
      Kind regards,
      Carlo
      (Stata 19.0)

      Comment


      • #4
        I fail to understand what is meant by "many independent variables". At least to me, the outcome was not clear enough.

        Additionally, Phil and Carlo already provided insightful replies.

        That said, a Bayesian approach or a Lasso strategy may perhaps be considered as well.
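        For instance, a minimal lasso sketch (requires Stata 16 or later; y and x1-x33 are placeholders for your own variables):
        Code:
        * let lasso select among the many candidate predictors
        lasso linear y x1-x33
        lassocoef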
        Best regards,

        Marcos

        Comment


        • #5
          Originally posted by Marcos Almeida View Post
          I fail to understand what is meant by "many independent variables". At least to me, the outcome was not clear enough.

          Additionally, Phil and Carlo already provided insightful replies.

          That said, a Bayesian approach or a Lasso strategy may perhaps be considered as well.

          Marcos, basically I'm doing an analysis of characteristics of certain companies from a certain field, which really decreases my sample size, as the field itself is not so big. However, I have gathered data on many characteristics (33 different variables), and I wanted to use more than 7 of them to explain the dependent variable. However, due to the small sample size, I cannot do this. I hope it is clearer now!


          Originally posted by Phil Bromiley View Post
          When your analysis shows that certain variables influence the dv, and you then omit them and estimate another model explaining the same dv, your first results demonstrate that the second model suffers from omitted variable bias. The only time this is legitimate is when the rhs variables are uncorrelated (which usually occurs only in experimental data). So, worse than misspecification or heteroskedasticity, you're proving your own results are biased and inconsistent.

          There are few good solutions to your problem. One strategy would be to try to condense the pile of rhs variables into a smaller set of factors. This could take the form of either exploratory factor analysis followed by generating predicted scores and regressing on the scores, or structural equation modelling (confirmatory factor analysis).


          Phil, thank you so much again for your reply on this. You are right: after running the -corr- command, all of the variables are somehow correlated. Since I know I need to avoid a high VIF, do you think a good start for eliminating some of the independent variables would be to drop the ones that are too strongly correlated? I'm running a linear regression, and the -factor- option in Stata appears to be unavailable: when typing factor I get the message that the last estimates are not found.


          As for the tests, please correct me if I'm wrong, but for an example regression I tried to run with only seven variables, I conducted the following tests, and my understanding is that my model "passed" them, since the p-value is greater than 0.05 and the VIF is lower than 5.

          . hettest

          Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
          Ho: Constant variance
          Variables: fitted values of publiclyquoted

          chi2(1) = 3.38
          Prob > chi2 = 0.0661

          . ovtest

          Ramsey RESET test using powers of the fitted values of publiclyquoted
          Ho: model has no omitted variables
          F(3, 58) = 1.51
          Prob > F = 0.2205



          . vif

              Variable |      VIF     1/VIF
          -------------+----------------------
                   roa |     3.15    0.317272
                  roce |     3.06    0.327323
          solvencyra~o |     1.14    0.881019
              turnover |     1.07    0.934517
          nrofcurren~m |     1.02    0.978746
          -------------+----------------------
              Mean VIF |     1.89

          .
          Thank you all so much for your help again; it is really important for me to know that I should not proceed in this way. Regression is only a small part of my paper, but it needs to be included, and since I'm not so proficient in Stata and econometrics in general, I really appreciate you explaining this to me! Until now I knew how to read the results, but doing a regression on my own is new to me.
          Last edited by Ala Pokora; 03 Feb 2020, 13:50.

          Comment


          • #6
            Ala:
            The results of all your postestimation tests do not show any problem.
            That said, I fail to get your difficulties in dealing with -fvvarlist- notation.
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7
              Originally posted by Carlo Lazzaro View Post
              Ala:
              The results of all your postestimation tests do not show any problem.
              That said, I fail to get your difficulties in dealing with -fvvarlist- notation.
              Dear Carlo, sorry, but since I'm just a beginner in Stata I want to be sure I have understood you correctly. By -fvvarlist- do you mean I need to proceed as per Phil's advice and try to condense the pile of rhs variables into a smaller set of factors? If so, I will try, but I'm afraid I may make some errors and then risk obtaining a false result.

              Also, I wanted to ask an additional question: in my sample I have a 0/1 dummy variable that describes companies of a certain type (0 = no, 1 = yes). When I want to regress only for companies of type 1 (yes), do I then need to remove some of the variables, since my sample size decreases?

              Thank you so much for all of your help and understanding!

              Comment


              • #8
                Originally posted by Ala Pokora View Post

                Dear Carlo, sorry, but since I'm just a beginner in Stata I want to be sure I have understood you correctly. By -fvvarlist- do you mean I need to proceed as per Phil's advice and try to condense the pile of rhs variables into a smaller set of factors? If so, I will try, but I'm afraid I may make some errors and then risk obtaining a false result.

                Also, I wanted to ask an additional question: in my sample I have a 0/1 dummy variable that describes companies of a certain type (0 = no, 1 = yes). When I want to regress only for companies of type 1 (yes), do I then need to remove some of the variables, since my sample size decreases?

                Thank you so much for all of your help and understanding!
                Another additional question: if, after adding the dummy variable, my model fails to pass the hettest, but after using the "robust" fix the p and t values still indicate the same variables, am I safe to conclude that my relevant variables are correct? Please note my sample is only 70 observations, and it becomes 60 with the dummy variable.

                Comment


                • #9
                  Ala:
                  1) not quite. I advised you to rely on -fvvarlist- notation instead of creating categorical variables and/or interactions yourself;
                  2) if you want to run the regression on a subsample of your original dataset you can use the -if- qualifier condition, as in the following toy-example (that uses also -fvvarlist- notation):
                  Code:
                  . sysuse auto.dta
                  (1978 Automobile Data)
                  
                  . regress price i.foreign if rep78==3
                  
                        Source |       SS           df       MS      Number of obs   =        30
                  -------------+----------------------------------   F(1, 28)        =      0.68
                         Model |  8539378.85         1  8539378.85   Prob > F        =    0.4167
                      Residual |   351832337        28  12565440.6   R-squared       =    0.0237
                  -------------+----------------------------------   Adj R-squared   =   -0.0112
                         Total |   360371715        29  12426610.9   Root MSE        =    3544.8
                  
                  ------------------------------------------------------------------------------
                         price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                       foreign |
                      Foreign  |  -1778.407   2157.282    -0.82   0.417      -6197.4    2640.585
                         _cons |   6607.074   682.1926     9.69   0.000     5209.666    8004.482
                  ------------------------------------------------------------------------------
                  
                  .
                  Obviously, your sample size will decrease (and unavoidably so) if you run your regression on a subsample of the original dataset.

                  3) I'm not sure I got your last question (#8) right: if the -estat hettest- outcome warns you about heteroskedasticity, you should go -robust-; that does not rule out heteroskedasticity, but takes it into account in calculating standard errors. As such, the point estimates remain unchanged with or without the -robust- option.
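                  A minimal sketch, reusing some variable names from #5:
                  Code:
                  regress publiclyquoted roa roce turnover, vce(robust)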

                  As an aside, I'm really sympathetic with beginners, because I think I'm still one of them as far as some Stata commands are concerned. However, I would recommend two good habits:
                  1) take a comprehensive look at the FAQ: doing that, you'll discover that the most fruitful way to receive helpful replies is to post what you typed and what Stata gave you back (via CODE delimiters, please). This approach outperforms spending tons of words trying to report what's going on;
                  2) as William Lisowski oftentimes wisely reminds us, take a comprehensive look at the Stata .pdf manuals (at least for the commands you're most interested in).
                  Kind regards,
                  Carlo
                  (Stata 19.0)

                  Comment


                  • #10
                    Dear Carlo,

                    thank you so much for your response. I now understand what you meant by -fvvarlist-. Do you think that, in order to tackle this problem, I could also simply gain more observations by including more than one year of data? For example, instead of only considering the values for 2018, I could also add values for the same variables from 2017? Or would this make my observations longitudinal and further complicate the process?

                    Comment


                    • #11
                      Ala:
                      please note that if you add another wave of data (2017) to the existing dataset, provided it was elicited from the same sample units, you will end up with a panel dataset (see -xtreg-).
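                      For instance, a sketch of the panel setup (companyid is a placeholder for your firm identifier, and x1 x2 for your predictors):
                      Code:
                      xtset companyid year
                      xtreg y x1 x2, fe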
                      Kind regards,
                      Carlo
                      (Stata 19.0)

                      Comment


                      • #12
                        Got it! I will check this option, even though I have very little time, and having never used this kind of regression I'm afraid I may mess it up or interpret it erroneously.

                        I wanted to ask about another problem linked with the sample size: my distribution of residuals is non-normal and no longer centered around 0.
                        Graph.gph

                        Does my OLS regression still hold?
                        Attached Files
                        Last edited by Ala Pokora; 08 Feb 2020, 06:02.

                        Comment


                        • #13
                          This histogram only shows the fitted values. There is important advice in #9.
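                          For instance, after -regress- the residuals themselves can be obtained and plotted against a normal density (resid is just a placeholder name):
                          Code:
                          predict resid, residuals
                          histogram resid, normal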
                          Best regards,

                          Marcos

                          Comment
