The way Stata chooses which category to omit is not based on the number of observations it represents. In most circumstances it omits the category with the smallest numerical value. When it is forced to remove additional categories, it tends to start from the other end. For example, in a model that includes i.year, normally the first year is omitted; if a second year also has to be omitted, it will typically be the last year. I don't know the details of how Stata makes these choices. The important things to remember are:
1. It doesn't matter which ones get omitted. The meanings of the coefficients that remain change, but the model's predictions are not affected by the choice.
2. Whenever you are working with a set of indicator ("dummy") variables, or for that matter, any set of collinear variables, the coefficients of those variables do not mean what they appear to mean in the regression output: they represent "effects" of the corresponding levels only relative to whatever has been omitted. So it is a dicey business looking at these coefficients in any case: interpreting them correctly requires care and some algebra.
3. The outputs of -predict- and -margins-, which are the real results of these models anyway, will be the same (except perhaps for very minor rounding errors) no matter which of the collinear variables is dropped. These are the results you should be looking at in any case.
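As a quick check of point 3, you can fit the same model with two different base categories and compare the -margins- output. Here is a sketch using the nlsw88 dataset that ships with Stata (the choice of wage and race is just for illustration):

```stata
sysuse nlsw88, clear
regress wage i.race     // race == 1 omitted by default
margins race            // adjusted predictions for each race

regress wage ib3.race   // force race == 3 to be the omitted base
margins race            // same margins as before, up to rounding
```

The coefficients in the two regression tables differ, but the two -margins- tables agree.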
You do have some degree of control over which categories get omitted: see the explanation of the ib*. operators in -help fvvarlist-. However, Stata sometimes overrides the choices you make, particularly when interactions are involved, so control is not complete. But, as already noted, it doesn't matter anyway.
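For concreteness, the ib*. syntax looks like this (y and group are placeholder names here, not from a real dataset):

```stata
regress y ib2.group       // omit group == 2 instead of the default (lowest) value
regress y ib(last).group  // omit the largest value of group
regress y ib(freq).group  // omit the most frequent value of group
```

Again, whichever base you pick, -predict- and -margins- give the same answers.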