
  • categorical variable sample sizes

    Hi,

    I am running a regression using the following code:
    Code:
    reghdfe incvote i.X controls, absorb(i.country i.region)
    where incvote is a binary variable and X is categorical with values 0, 1, and 2, whose frequencies are 1700, 5100, and 1900, respectively. Given the stark difference in group sizes (group 1 versus groups 0 and 2), should I run these on separate samples, (0 and 1) and (0 and 2), rather than jointly?

    Thanks in advance.


  • #2
    No. The number of observations in each category here is not a problem at all. In fact, the regression coefficients you get will be the same whether you do this as a single combined regression or as separate samples. However, the standard errors and the test statistics and confidence intervals associated with them will be different because they will be calculated from different samples of the data. The single regression approach, using a larger sample, will generally be more powerful unless the homoscedasticity assumption is substantially violated.
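
    For concreteness, here is a sketch of the two approaches in Stata, reusing the variable names from #1, so that the X coefficients and their standard errors can be compared directly:
    Code:
    * joint regression on the full sample
    reghdfe incvote i.X controls, absorb(i.country i.region)

    * separate-sample versions: restrict to the pair of categories being compared
    reghdfe incvote i.X controls if inlist(X, 0, 1), absorb(i.country i.region)
    reghdfe incvote i.X controls if inlist(X, 0, 2), absorb(i.country i.region)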



    • #3
      Dear Professor Clyde Schechter,

      I have a follow-up question quite similar to the one in #1. Does the number of observations matter in this case? Say I have a data set of 100 observations in which 90% report being in good health and only 10% report poor health. Are there any problems with running a logit regression with that health variable as the dependent variable? If there are, could you please explain them to me?



      • #4
        In a small sample, it is different. The 90%/10% doesn't matter. But the fact that 10% is only 10 does matter. In a sample of 10 events there simply isn't very much information about what things are associated with events. Moreover, specifically with logistic regression, the coefficient estimates are biased upward when the number of events is small. (To take an extreme, ridiculous case, when there are 0 events, the coefficient is negative infinity, so an infinite bias!) If you are going to do a logistic model with only 10 outcomes, I would recommend using penalized maximum likelihood estimation. You can do that with the -firthlogit- command by Joseph Coveney, available from SSC.

        Note: When I say only 10 outcomes, it doesn't matter if it's 10 zeroes or 10 ones. When the split is such that one of the values occurs only a small number of times, logistic regression becomes problematic.

        Added: The obvious next question is how small is too small? I don't have a clear answer to that and I don't believe there is any consensus on it. Personally, I would be skeptical of any logistic regression with fewer than 25 outcomes, but that's just an arbitrary number, and I would continue to have qualms when the number is between 25 and 50. After that, I'm generally comfortable, at least if the model doesn't involve too many predictor variables (which raises the spectre of overfitting, a separate problem). Other people have different opinions about this.
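
        If it helps, a minimal sketch of the penalized approach, using a hypothetical binary outcome poorhealth and illustrative predictors (the syntax parallels -logit-):
        Code:
        * install -firthlogit- (Joseph Coveney) from SSC
        ssc install firthlogit

        * penalized (Firth) maximum likelihood logistic regression
        firthlogit poorhealth age i.female income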



        • #5
          Thank you. I also have a follow-up on the initial question. Running the single specification, I obtain coefficients on categories 1 and 2 that are significant at the 10% and 5% levels, respectively. A test of the equality of these two coefficients yields an F-statistic of 0.19 with p = 0.66. I interpret this result as saying there is no meaningful difference between the effect of 1 versus 0 and the effect of 2 versus 0, but I am confused about how to reconcile it with the individual coefficient p-values.
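
          Presumably the test in question looks something like the following sketch (after the joint regression from #1):
          Code:
          * joint regression, then test equality of the two category effects
          reghdfe incvote i.X controls, absorb(i.country i.region)
          test 1.X = 2.X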



          • #6
            This is a common problem that arises with significance testing. As a logical predicate, the notion of statistical significance is incoherent. That is one of the reasons that the American Statistical Association has recommended that the concept of statistical significance be abandoned. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr. I concur with their recommendation. (By the way, it can go in the other direction as well: the joint test can be "significant" while neither individual test is.)

            There is no totally satisfactory way to interpret these results. Mathematically, it is explicable: in the plane of the 1 and 2 coefficients, the region of non-rejection for the individual tests is a rectangle with sides parallel to the axes, whereas the region of non-rejection for the joint test is an ellipse whose axes are oblique to them. So it is entirely possible for a point to lie outside that rectangle yet fall inside the ellipse, as you have encountered here. But that just tells you that these tests aren't telling you anything really useful. You need to carefully revisit your research hypotheses to see whether your goal is to test a joint hypothesis or whether your goal is to estimate the 1 and 2 effects relative to 0.



            • #7
              Originally posted by Clyde Schechter:
              In a small sample, it is different. The 90%/10% doesn't matter. But the fact that 10% is only 10 does matter. In a sample of 10 events there simply isn't very much information about what things are associated with events. Moreover, specifically with logistic regression, the coefficient estimates are biased upward when the number of events is small. (To take an extreme, ridiculous case, when there are 0 events, the coefficient is negative infinity, so an infinite bias!) If you are going to do a logistic model with only 10 outcomes, I would recommend using penalized maximum likelihood estimation. You can do that with the -firthlogit- command by Joseph Coveney, available from SSC.

              Note: When I say only 10 outcomes, it doesn't matter if it's 10 zeroes or 10 ones. When the split is such that one of the values occurs only a small number of times, logistic regression becomes problematic.

              Added: The obvious next question is how small is too small? I don't have a clear answer to that and I don't believe there is any consensus on it. Personally, I would be skeptical of any logistic regression with fewer than 25 outcomes, but that's just an arbitrary number, and I would continue to have qualms when the number is between 25 and 50. After that, I'm generally comfortable, at least if the model doesn't involve too many predictor variables (which raises the spectre of overfitting, a separate problem). Other people have different opinions about this.
              Thank you for your detailed explanation as always.



              • #8
                Yes, this is very helpful. Thank you.
