I'm experimenting with how to specify an unordered categorical variable in a model, and I need guidance on whether different specifications produce statistically equivalent results. More specifically, I first want to calculate the mean of the dependent variable conditional on a categorical independent variable, creating a new continuous variable. Then I want to formally test whether using the categorical variable as a straightforward control is equivalent to controlling for the conditional mean, calculated in the first step, as a continuous variable. Intuition tells me that the models should be equivalent, as both will produce predicted results that are deviations from the mean effect of the categorical variable. But I want to formally test this intuition and am looking for guidance on the best way to do that.
The reason I'm experimenting with these different specifications is that I have a categorical variable that is clearly an important predictor of an outcome variable, but it has so many distinct values that I lose too many degrees of freedom when I include it in the regression equation as a categorical variable.
My first thought was to predict the fitted values and residuals from each regression and run a two-sample t-test for equivalence of means on each, but there may be a better way to do this formally. (Since I'm essentially transforming the same variables in this model, I'm not certain whether the two specifications should be considered nested or non-nested. Partly because of this, I don't think I can obtain the correct test statistics for my purposes using _testparm_ or _lrtest_, but I would be happy to be corrected.)
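To make the nested versus non-nested question concrete, here is how I would write the two specifications. This is only a sketch in my own notation, not something I have verified: $y_i$ is the outcome, $x_i$ the continuous covariate, $D_{ig}$ an indicator for observation $i$ falling in category $g$ (of $G$ categories), and $\bar{y}_g$ the sample mean of the outcome in category $g$.

```latex
% Model 1: categorical control (category 1 as the base level)
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \sum_{g=2}^{G} \gamma_g D_{ig} + \varepsilon_i

% Model 2: conditional mean as a continuous control
y_i = \alpha_0 + \beta_1 x_i + \beta_2 x_i^2 + \delta\, \bar{y}_{g(i)} + u_i

% The conditional mean is a linear combination of the category indicators
\bar{y}_{g(i)} = \sum_{g=1}^{G} \bar{y}_g D_{ig}
              = \bar{y}_1 + \sum_{g=2}^{G} \left(\bar{y}_g - \bar{y}_1\right) D_{ig}

% So Model 2 is Model 1 with the restrictions
\beta_0 = \alpha_0 + \delta\, \bar{y}_1, \qquad
\gamma_g = \delta \left(\bar{y}_g - \bar{y}_1\right), \quad g = 2, \dots, G
```

If this is right, Model 2 replaces the $G-1$ free category coefficients with a single parameter $\delta$, which looks like $G-2$ restrictions on Model 1. What I am unsure about is whether the usual nested-model tests still apply, since the restrictions themselves are built from sample means of the dependent variable.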
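The code below shows what I have in mind (Model 1 uses region as a categorical control; Model 2 uses the mean birthrate by region as a continuous control).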
I understand that in Model 2, when I regress on the mean birthrate by region, the coefficient on mean_brate shows the marginal effect averaged over all regions rather than a separate effect for each region. I am also no longer strictly comparing effects within regions, as I am when I control for region as a categorical variable in Model 1.
Code:
```stata
use http://www.stata-press.com/data/r14/census3

/*generate a continuous variable consisting of the mean birthrate by region*/
bysort region : egen mean_brate = mean(brate)

/*Model 1: regress birthrate on median age and region as a categorical variable*/
reg brate c.medage##c.medage i.region

/*Fitted values and residuals*/
predict p1, xb
predict r1, resid

/*Model 2: regress birthrate on median age and mean birthrate by region*/
reg brate c.medage##c.medage c.mean_brate
predict p2, xb
predict r2, resid

/*two-sample t-tests for equivalence of means on the fitted values and residuals*/
ttest p1 == p2
ttest r1 == r2
```
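As a quick sanity check on how the two controls relate, something like the following could be run after the code above. It is a minimal sketch that assumes census3 is still in memory and that mean_brate was created as shown; it is not part of the comparison itself.

```stata
* Assumes census3 is loaded and mean_brate exists (see the code above).
* mean_brate should be constant within region, so the within-region
* standard deviation should be essentially zero in every row.
tabstat mean_brate, by(region) statistics(mean sd n)

* Include both controls at once. Because mean_brate is a linear
* combination of the region indicators, Stata should flag one of the
* terms as omitted because of collinearity.
regress brate c.medage##c.medage i.region c.mean_brate
```

If one term is dropped for collinearity, that would suggest mean_brate carries no information beyond the region indicators, and that Model 2 only constrains how that information enters the regression.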
In all, I therefore have two questions:
1. Is there a better way to formally test the equivalence of Model 1 and Model 2 than running t-tests on the fitted values and residuals?
2. Are there additional implications to Model 2 that I didn't list and should be aware of?
Thank you very much in advance for the help!