I ran a regression and obtained a very counterintuitive result: a very large, statistically significant coefficient in the opposite direction to the expected one. When I looked for possible causes, I realized that one category of a variable had just 4 cases. My N = 1200 and, following Cohen (1988), given my R² and my number of regressors, an N of about 900 should be enough for a power and significance of .95. However, this 4-case situation alerted me to an additional criterion that I should keep in mind. What is the minimum number of observations needed in each category of a variable (for example, for each value of a dummy variable)?

Santiago:
I find what you reported difficult to follow.
Do you mean that one of your variables has only 4 observations with observed data?
Do you mean that you have 4 observations that exert remarkable/unexpected/undue leverage on the regression results?
Please see the FAQ on how to post more effectively, including what you typed and what Stata gave you back within CODE delimiters. Thanks.
Kind regards,
Carlo
(Stata 16.0 SE)

Just an additional note:
In the FAQ, we read:
Don't say "I ran a regression and then ...", say "I ran regress and then ...".
Please clarify the query. That said, as Carlo pointed out, the best approach is to provide the command and its output.
Best regards,
Marcos

Hi Carlo and Marcos,
I apologize for my confusing post. What I wanted to say is the following:
Trying to run a regression on a dataset with N = 1200, I found that an independent dummy variable (unemployment) has just 4 subjects in one of its categories (unemployed). That is, I have a very skewed distribution. I think it would be wrong to make a statement based on just 4 cases in one category. In this context, my question is: how can I determine the number of observations needed per category of an independent categorical variable in order to obtain correct results?

I think this is the wrong way round. There is never going to be a threshold above which you are fine and below which you might not be.
Your data are what they are. You can try some sensitivity analysis, e.g.
1. Omitting the dummy from your model and seeing whether parameter estimates, goodness of fit and predictions change dramatically.
2. Simulations using a small probability for the rare case.
3. Bootstrapping (although note that in bootstrap samples containing none of the rare cases, your unstated regression command will presumably drop the dummy because it is then a constant).
Naturally, "get more data then" is easy to say and perhaps impossible to do.
Note that skewness isn't really the name of the problem. Most indicators are skewed but that does not much inhibit their use.
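
A minimal sketch of points 2 and 3 above, written in Python rather than Stata and under strong simplifying assumptions: the model has the dummy as its only regressor (so the coefficient is just a difference in group means), errors are standard normal, and the true effect is zero. The function names `dummy_coefficient`, `simulate` and `bootstrap`, and all default values, are hypothetical and chosen only for illustration.

```python
import random
import statistics


def dummy_coefficient(y, d):
    """Slope from regressing y on one 0/1 dummy: the difference in group means.

    Returns None when the dummy is constant in the sample, i.e. the
    coefficient is not identified and the variable would be dropped.
    """
    ones = [yi for yi, di in zip(y, d) if di == 1]
    zeros = [yi for yi, di in zip(y, d) if di == 0]
    if not ones or not zeros:
        return None
    return statistics.fmean(ones) - statistics.fmean(zeros)


def simulate(n=1200, p_rare=4 / 1200, true_effect=0.0, reps=200, seed=1):
    """Point 2: draw fresh samples with a small probability of the rare case."""
    rng = random.Random(seed)
    estimates, dropped = [], 0
    for _ in range(reps):
        d = [1 if rng.random() < p_rare else 0 for _ in range(n)]
        y = [true_effect * di + rng.gauss(0.0, 1.0) for di in d]
        b = dummy_coefficient(y, d)
        if b is None:
            dropped += 1  # no rare case was drawn in this sample
        else:
            estimates.append(b)
    return estimates, dropped


def bootstrap(y, d, reps=200, seed=2):
    """Point 3: resample observations with replacement and re-estimate."""
    rng = random.Random(seed)
    n = len(y)
    estimates, dropped = [], 0
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        b = dummy_coefficient([y[i] for i in idx], [d[i] for i in idx])
        if b is None:
            dropped += 1  # resample contains none of the rare cases
        else:
            estimates.append(b)
    return estimates, dropped


if __name__ == "__main__":
    sims, sim_dropped = simulate()
    print("simulation: sd of estimates =", round(statistics.stdev(sims), 3),
          "| samples without a rare case =", sim_dropped)

    # Hypothetical data: exactly 4 rare cases out of 1200, no true effect.
    rng = random.Random(3)
    d = [1] * 4 + [0] * 1196
    y = [rng.gauss(0.0, 1.0) for _ in d]
    boots, boot_dropped = bootstrap(y, d)
    print("bootstrap: sd of estimates =", round(statistics.stdev(boots), 3),
          "| resamples without a rare case =", boot_dropped)
```

With 4 rare cases in 1200 observations, the chance that a bootstrap resample contains none of them is about (1 - 4/1200)^1200, roughly 2%, which is why the sketch counts those resamples separately instead of silently discarding them. A real analysis would use the actual data and the full model with all regressors.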

Hi Nick,
Thank you for your answer.
Including or excluding this dummy doesn't change the other estimates or indicators of the model very much. However, it's part of my research question, so I can't exclude it. In this context, would sensitivity tests still be useful? If so, why?
And yes, getting more data is impossible, since my dataset is from 2014 and the relevant conditions have changed since then.

Originally posted by Santiago Valdivieso
I think it would be wrong to make a statement with just 4 cases in one side.
Theoretically, the respective standard error should reflect the uncertainty associated with the estimate based on 4 cases.
Best
Daniel

Santiago:
Substantively speaking, having 4 unemployed out of 1200 observations is probably a serious blow to mainstream economic theory as far as the structural unemployment rate is concerned.
That said, I would be curious about the other levels of the categorical variable: are the remaining individuals all employed? Are some of them retired? Are some of them housekeepers and hence not counted as employed in the market? Have some of them passed away?
Kind regards,
Carlo
(Stata 16.0 SE)

Sorry, I pressed "enter" without meaning to. I repeat:
Originally posted by daniel klein
Theoretically, the respective standard error should reflect the uncertainty associated with the estimate based on 4 cases.
Daniel, thanks for your answer. Yes, as I understand it, the standard error should do that. Yet I get a coefficient that is significant at the 95% confidence level. My question, then, is: what criterion (if there is one) should I use, with respect to the number of observations per response category, to decide whether it is appropriate to include an independent variable?
Last edited by Santiago Valdivieso; 10 Jan 2020, 12:43.
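
A back-of-the-envelope illustration of how the standard error reflects the 4 cases, under assumptions that are not from the thread: the dummy is the only regressor, errors are homoskedastic with standard deviation σ, and n₁ and n₀ denote the numbers of observations with the dummy equal to 1 and 0. In that simplified case the coefficient is a difference in group means:

```latex
\[
\hat\beta = \bar y_1 - \bar y_0, \qquad
\operatorname{SE}(\hat\beta) = \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_0}}
\]
\[
n_1 = 4,\; n_0 = 1196: \qquad
\operatorname{SE}(\hat\beta) = \sigma \sqrt{\tfrac{1}{4} + \tfrac{1}{1196}} \approx 0.50\,\sigma
\]
```

Almost all of that standard error comes from the 1/4 term, so the 1196 other observations barely help. Under these assumptions, a coefficient significant at the 5% level has to exceed about 1.96 × 0.50σ ≈ 0.98σ in absolute value, so a significant estimate based on 4 cases will necessarily look large.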

Originally posted by Carlo Lazzaro
Santiago:
Substantively speaking, having 4 unemployed out of 1200 observations is probably a serious blow to mainstream economic theory as far as the structural unemployment rate is concerned.
That said, I would be curious about the other levels of the categorical variable: are the remaining individuals all employed? Are some of them retired? Are some of them housekeepers and hence not counted as employed in the market? Have some of them passed away?
Last edited by Santiago Valdivieso; 10 Jan 2020, 13:41.

Carlo Lazzaro, yes, but my problem goes beyond unemployment. There are other variables in which some response categories also have low frequencies. For example, I have an independent variable with the following distribution:
Dummy = 1: Freq = 1170
Dummy = 0: Freq = 30
Or categorical variables with a distribution like the following:
Option A = 1150
Option B = 20
Option C = 15
Option D = 15
So I was looking for a specific rule for including or excluding this type of variable. Or, at least, for a way to know whether it is correct to treat the results that come from them as valid.