What is the minimum number of observations per response needed for a regression?

Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#1

09 Jan 2020, 23:56

I was running a regression and obtained a very counterintuitive result (a very large, statistically significant coefficient with the opposite sign to the expected one). When I looked for possible causes, I realized that I had just 4 cases in one of the possible alternatives. My N=1200 and, following Cohen (1988), given my R2 and my number of regressors, my N should be about 900 for a power and significance of .95. However, this situation with 4 cases alerted me to an additional criterion that I should keep in mind. What is the minimum number of observations needed in each of the possible answers (for example, in a dummy variable)?
Tags: econometrics, regression
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

10 Jan 2020, 04:26

Santiago:
I find what you reported difficult to follow.
Do you mean that in one of your variables you have only 4 observations with observed data?
Do you mean that you have 4 observations that exert remarkable/unexpected/undue leverage on the regression results?
Please, see the FAQ on how to post more effectively, including what you typed and what Stata gave you back within CODE delimiters. Thanks.

Kind regards,
Carlo
(Stata 19.0)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

10 Jan 2020, 04:41

Just an additional note:

In the FAQ, we read:

Don't say "I ran a regression and then ...", say "I ran regress and then ...".

This is to say that I'm in doubt about the type of regression used in the specific case. On second thoughts, maybe you meant group categories when mentioning "cases".

Please clarify the query. That said, as Carlo pointed out, the best approach is to provide command and output.

Best regards,

Marcos
Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#4

10 Jan 2020, 08:08

Hi Carlo and Marcos,

I apologize for my confusing post. What I wanted to say is the following:

Trying to run a regression on a dataset with N=1200, I encountered an independent dummy variable (unemployment) for which just 4 subjects fall into one of the categories (unemployed). That is, I have a very skewed distribution. I think it would be wrong to make a statement with just 4 cases on one side. In this context, my question is: how can I determine the number of observations needed per category of an independent categorical variable in order to obtain correct results?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

10 Jan 2020, 09:04

I think this is the wrong way round. There is never going to be a threshold above which you are fine and below which you might not be.

Your data are what they are. You can try some sensitivity analysis, e.g.

1. Omitting the dummy from your model and seeing whether parameter estimates, goodness of fit and predictions change dramatically.

2. Simulations using a small probability for the rare case.

3. Bootstrapping (although note that given bootstrap samples with none of the rare case your unstated regression command will presumably throw out the dummy whenever it is a constant).

Naturally, "get more data then" is easy to say and perhaps impossible to do.

Note that skewness isn't really the name of the problem. Most indicators are skewed but that does not much inhibit their use.
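
The bootstrap caveat in point 3 can be quantified with a short sketch. It is written in Python rather than Stata purely for illustration, and it assumes the N = 1200 with 4 rare cases described in this thread; the function names are hypothetical. It computes the chance that a bootstrap resample contains none of the rare cases (so that the dummy is constant) and checks it by simulation.

```python
import random

def prob_no_rare_case(n, k):
    """Exact probability that a bootstrap resample of size n (drawn with
    replacement from n observations, k of them rare) contains no rare case."""
    return (1 - k / n) ** n

def simulate_no_rare_case(n, k, reps, seed=12345):
    """Monte Carlo check: share of resamples in which the dummy is constant."""
    rng = random.Random(seed)
    rare = set(range(k))  # label the first k observations as the rare cases
    empty = 0
    for _ in range(reps):
        # Draw n observation indices with replacement; stop at the first rare one.
        if not any(rng.randrange(n) in rare for _ in range(n)):
            empty += 1
    return empty / reps

p_exact = prob_no_rare_case(1200, 4)
p_sim = simulate_no_rare_case(1200, 4, reps=2000)
print(f"exact: {p_exact:.4f}, simulated: {p_sim:.4f}")
```

The exact probability is close to exp(-4), roughly 0.018, so under these assumptions only a small share of resamples would drop the dummy entirely. Many more resamples will contain just one or two of the rare cases, which is where the instability of the estimate would show up.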
Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#6

10 Jan 2020, 10:02

Hi Nick,

Thank you for your answer.

Including or excluding this dummy doesn't change the other estimates or indicators of the model very much. However, it's part of my research question, so I can't exclude it. In this context, would sensitivity tests still be useful? If so, why?

And yes, getting more data is impossible, since my database is from 2014 and the relevant conditions have changed since then.
daniel klein

Join Date: Mar 2014

Posts: 3859
#7

10 Jan 2020, 10:32

Originally posted by Santiago Valdivieso

I think it would be wrong to make a statement with just 4 cases on one side.

Theoretically, the respective standard error should reflect the uncertainty associated with the estimate based on 4 cases. Note, however, that the information in your data is even more sparse than you might think. If you have other predictors in the model (which you probably have), then you would need to think about possible combinations of all these predictors. For example, how many of the four unemployed will be males with a migration background and a higher education degree? The answer will often be: zero. Most regression-type models extrapolate way beyond the observed data. This is both the strength and weakness of parametric models. By assuming that the model accurately reflects the data-generating process, you can make statements about combinations of predictors that you have not observed.
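
How the standard error reflects the 4 cases can be seen in the simplest setting. The sketch below is in Python rather than Stata and assumes a regression with the dummy as the only predictor and a common residual standard deviation; in that case the coefficient is the difference in group means and its standard error is sigma * sqrt(1/n1 + 1/n0). The function name and the balanced 600/600 comparison are hypothetical, chosen only for contrast.

```python
import math

def se_dummy_coefficient(sigma, n1, n0):
    """Standard error of the coefficient on a 0/1 dummy in a regression with
    no other predictors, assuming a common residual standard deviation sigma.
    The coefficient is the difference in group means."""
    return sigma * math.sqrt(1 / n1 + 1 / n0)

sigma = 1.0  # assumed residual SD, so results are in units of sigma
se_rare = se_dummy_coefficient(sigma, n1=4, n0=1196)
se_balanced = se_dummy_coefficient(sigma, n1=600, n0=600)
print(f"4 vs 1196:   SE = {se_rare:.3f} sigma")
print(f"600 vs 600:  SE = {se_balanced:.3f} sigma")
print(f"ratio: {se_rare / se_balanced:.1f}")
```

Under these assumptions the standard error with 4 cases is about half a residual standard deviation, several times larger than in the balanced case. A coefficient of roughly one residual standard deviation or more could still come out significant, but it would rest entirely on those four observations.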

Best
Daniel
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#8

10 Jan 2020, 11:10

Santiago:
substantively speaking, having 4 unemployed out of 1200 observations would probably deal a severe blow to mainstream economic theory as far as the structural unemployment rate is concerned.
That said, I would be curious about the other levels of the categorical variable: are the remaining individuals all employed? Are some of them retired? Are some of them housekeepers and hence not counted as employed in the market? Have some of them passed away?

Kind regards,
Carlo
(Stata 19.0)
Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#9

10 Jan 2020, 11:25

.

Last edited by Santiago Valdivieso; 10 Jan 2020, 11:44.
Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#10

10 Jan 2020, 11:32

Sorry, I clicked "enter" without wanting to. I repeat:

Originally posted by daniel klein

Theoretically, the respective standard error should reflect the uncertainty associated with the estimate based on 4 cases.

Daniel, thanks for your answer. Yes, as I understand it, the standard error should do that. Yet I get a coefficient that is significant at the 95% confidence level. My question, then, is: what criterion (if there is one) should I use, with respect to the number of observations per response category, to decide whether it's appropriate to include an independent variable?

Last edited by Santiago Valdivieso; 10 Jan 2020, 11:43.
Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#11

10 Jan 2020, 11:42

Originally posted by Carlo Lazzaro

Santiago:
substantively speaking, having 4 unemployed out of 1200 observations would probably deal a severe blow to mainstream economic theory as far as the structural unemployment rate is concerned.
That said, I would be curious about the other levels of the categorical variable: are the remaining individuals all employed? Are some of them retired? Are some of them housekeepers and hence not counted as employed in the market? Have some of them passed away?

Carlo Lazzaro I understand it can seem weird, but it's actually not in context. The population I'm studying is the household heads of a province of Ecuador. In this country (my country) the unemployment rate is just 4.6%, while underemployment and other forms of poor-quality employment account for about 50%. Moreover, while unemployment is more prevalent among younger people and women, the mean age of my study population is about 40 and it is composed mostly of men. Answering your question: the remaining subjects are either employed or underemployed. Retired people, the deceased, and others are not included in the measure.

Last edited by Santiago Valdivieso; 10 Jan 2020, 12:41.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#12

11 Jan 2020, 02:43

Santiago:
thanks for clarifying.
Could it make sense to group the unemployed and the poorly employed together in a single level of the categorical variable?

Kind regards,
Carlo
(Stata 19.0)
Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#13

11 Jan 2020, 13:55

Carlo Lazzaro Yes, but my problem goes beyond unemployment. There are other variables in which I also have low frequencies in some response categories. For example, I have an independent variable with the following characteristics:

Dummy = 1, Freq = 1170
Dummy = 0, Freq = 30

Or categorical variables with a distribution like the following:

Option A = 1150
Option B = 20
Option C = 15
Option D = 15

So I was looking for a specific rule for including or excluding these types of variables. Or, at least, to know whether it's correct to take the results that come from them as valid.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#14

12 Jan 2020, 04:25

Santiago:
the usual advice is to group together the levels of the categorical variable that show low numbers of observations.
Obviously, different research fields may well have different customary rules.
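
Grouping sparse levels could be sketched as follows, in Python rather than Stata and using the Option A to D frequencies posted above. The helper name, the "Other" label and the cutoff of 30 are all hypothetical; as noted earlier in the thread, no universal threshold exists, so the cutoff here only marks which levels get pooled.

```python
def collapse_rare_levels(counts, min_count, other_label="Other"):
    """Merge every level with fewer than min_count observations into one
    pooled level; levels at or above the cutoff are kept unchanged."""
    collapsed = {}
    for level, n in counts.items():
        key = level if n >= min_count else other_label
        collapsed[key] = collapsed.get(key, 0) + n
    return collapsed

counts = {"Option A": 1150, "Option B": 20, "Option C": 15, "Option D": 15}
collapsed = collapse_rare_levels(counts, min_count=30)  # 30 is an arbitrary cutoff
print(collapsed)  # {'Option A': 1150, 'Other': 50}
```

Pooling by frequency alone is a mechanical choice. Whether the pooled levels belong together substantively is a separate question that the counts cannot answer.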

Kind regards,
Carlo
(Stata 19.0)
Santiago Valdivieso

Join Date: Dec 2019

Posts: 37
#15

12 Jan 2020, 21:17

Thank you, Carlo Lazzaro, you're a good man.