  • Formally test equivalence of different model specifications

    I'm experimenting with different ways of specifying an unordered categorical variable in a model, and I need guidance on whether the specifications produce statistically equivalent results. More specifically, I want to first calculate the mean of the dependent variable conditional on a categorical independent variable, creating a new continuous variable. Then I want to formally test whether including the categorical variable as a straightforward control is equivalent to controlling for the conditional mean calculated in the first step as a continuous variable. Intuition tells me that the models should be equivalent, since both will produce predicted values that are deviations from the mean effect of the categorical variable. But I want to formally test this intuition and am looking for guidance on the best way to do that.

    The reason I'm experimenting with these different specifications is that I have a categorical variable that is clearly an important predictor of an outcome variable but it has so many distinct values that I lose too many degrees of freedom when I include it in the regression equation as a categorical variable.

    My first thought was to predict the fitted values and residuals from each regression and run a two-sample t-test for equality of means on each, but there may be a better way to do this formally. (Since I'm essentially transforming the same variables in this model, I'm not certain whether the two specifications should be considered nested or non-nested. Partly because of this, I don't think I can obtain the correct test statistics for my purposes using _testparm_ or _lrtest_, but I would be happy to be corrected.)

    Below is some code showing what I have in mind:

    Code:
    use http://www.stata-press.com/data/r14/census3 
    
    /*generate a continuous variable consisting of the mean birthrate by region*/
    bysort region : egen mean_brate = mean(brate) 
    
    /*Model 1: regress birthrate on median age and region as a categorical variable*/
    reg brate c.medage##c.medage i.region 
    
    /*Fitted values and residuals*/
    predict p1, xb
    predict r1, resid
    
    /*Model 2: regress birthrate on median age and mean birthrate by region*/
    reg brate c.medage##c.medage c.mean_brate
    
    predict p2, xb
    predict r2, resid
    
    /*t-tests for equality of means on the fitted values and residuals
      (note: ttest var1 == var2 is a paired test unless the unpaired option is added)*/
    ttest p1 == p2
    ttest r1 == r2
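
    A note on the two t-tests above: in any OLS regression that includes a constant, the residuals average to zero and the fitted values average to the sample mean of the dependent variable. Both models therefore have the same mean fitted value and the same mean residual by construction, so these tests cannot tell the specifications apart. A minimal sketch of a more informative check, assuming the same census3 data and variable names as above (pdiff is just an illustrative name), looks at the observation-level differences between the two sets of fitted values:

    ```stata
    use http://www.stata-press.com/data/r14/census3, clear
    bysort region : egen mean_brate = mean(brate)

    * Model 1: region as a categorical control
    quietly regress brate c.medage##c.medage i.region
    predict double p1, xb

    * Model 2: regional mean birthrate as a continuous control
    quietly regress brate c.medage##c.medage c.mean_brate
    predict double p2, xb

    * The means of p1 and p2 coincide by construction
    summarize p1 p2

    * Observation-level disagreement between the two sets of fitted values
    generate double pdiff = p1 - p2
    summarize pdiff, detail
    correlate p1 p2
    ```

    The spread of pdiff, rather than its mean, shows how far the two specifications disagree for individual observations.
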
    I understand that in Model 2, when I regress on the mean birthrate by region, the coefficient on mean_brate shows the marginal effect averaged over all regions rather than the independent effect, which may vary across regions. I am also no longer strictly comparing effects within regions, as I am when I control for region as a categorical variable in Model 1.

    In all, I therefore have two questions:

    1. Is there a better way to formally test the equivalence of Model 1 and Model 2 than running t-tests on the fitted values and residuals?
    2. Are there additional implications to Model 2 that I didn't list and should be aware of?
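
    On the nested versus non-nested question raised earlier: because mean_brate is constant within each region, it is an exact linear combination of the region indicators and the constant. Model 2 is therefore Model 1 with the region effects restricted to be proportional to the regional means, which suggests the two models can be treated as nested. The following is only a sketch of the usual restricted-versus-unrestricted F test built from the residual sums of squares; the scalar names are illustrative, and because mean_brate is itself computed from the dependent variable, treating the statistic as exactly F-distributed is an assumption rather than an established result.

    ```stata
    use http://www.stata-press.com/data/r14/census3, clear
    bysort region : egen mean_brate = mean(brate)

    * Unrestricted model: region as a categorical control
    quietly regress brate c.medage##c.medage i.region
    scalar rss_u = e(rss)
    scalar df_u  = e(df_r)

    * Restricted model: regional mean birthrate as a continuous control
    quietly regress brate c.medage##c.medage c.mean_brate
    scalar rss_r = e(rss)
    scalar df_res = e(df_r)

    * Number of restrictions = difference in residual degrees of freedom
    scalar q = df_res - df_u

    * Classical F statistic for the restrictions and its upper-tail p-value
    scalar Fstat = ((rss_r - rss_u)/q) / (rss_u/df_u)
    display "F(" q ", " df_u ") = " Fstat "   p = " Ftail(q, df_u, Fstat)
    ```

    A large p-value would mean the data do not reject the restriction that Model 2 imposes; it would not by itself establish that the two models are equivalent.
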

    Thank you very much in advance for the help!

  • #2
    You can take a look at adjusted R-squared.
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Thanks for the quick response, Carlo. The adjusted R-squared values are very close but not identical (not that I expect them to be exactly the same). But this raises the question of how close the adjusted R-squared values from two models need to be before we can declare the models equivalent. What would be the appropriate procedure to formally test whether the adjusted R-squared values are statistically significantly different? (Or perhaps I should just ask: what is the sampling distribution of the adjusted R-squared?)


      • #4
        Edonavot:
        Probably the most straightforward approach is to use the AIC or BIC criteria (the lower their values, the better the regression model):
        Code:
        . sysuse auto.dta
        (1978 Automobile Data)
        
        . regress price mpg
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(1, 72)        =     20.26
               Model |   139449474         1   139449474   Prob > F        =    0.0000
            Residual |   495615923        72  6883554.48   R-squared       =    0.2196
        -------------+----------------------------------   Adj R-squared   =    0.2087
               Total |   635065396        73  8699525.97   Root MSE        =    2623.7
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
               _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
        ------------------------------------------------------------------------------
        
        . estat ic
        
        Akaike's information criterion and Bayesian information criterion
        
        -----------------------------------------------------------------------------
               Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
        -------------+---------------------------------------------------------------
                   . |         74 -695.7129  -686.5396       2    1377.079   1381.687
        -----------------------------------------------------------------------------
                       Note: N=Obs used in calculating BIC; see [R] BIC note.
        
        . regress price mpg weight
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(2, 71)        =     14.74
               Model |   186321280         2  93160639.9   Prob > F        =    0.0000
            Residual |   448744116        71  6320339.67   R-squared       =    0.2934
        -------------+----------------------------------   Adj R-squared   =    0.2735
               Total |   635065396        73  8699525.97   Root MSE        =      2514
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 mpg |  -49.51222   86.15604    -0.57   0.567    -221.3025     122.278
              weight |   1.746559   .6413538     2.72   0.008      .467736    3.025382
               _cons |   1946.069    3597.05     0.54   0.590    -5226.245    9118.382
        ------------------------------------------------------------------------------
        
        . estat ic
        
        Akaike's information criterion and Bayesian information criterion
        
        -----------------------------------------------------------------------------
               Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
        -------------+---------------------------------------------------------------
                   . |         74 -695.7129  -682.8637       3    1371.727    1378.64
        -----------------------------------------------------------------------------
                       Note: N=Obs used in calculating BIC; see [R] BIC note.
        
        .
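
        Applied to the two models from the first post, the same comparison could be sketched as follows. This assumes the census3 data and variable names used there; m1 and m2 are arbitrary labels for the stored estimates. Note that AIC and BIC rank the models but do not provide a formal test of equivalence.

        ```stata
        use http://www.stata-press.com/data/r14/census3, clear
        bysort region : egen mean_brate = mean(brate)

        * Model 1: region as a categorical control
        quietly regress brate c.medage##c.medage i.region
        estimates store m1

        * Model 2: regional mean birthrate as a continuous control
        quietly regress brate c.medage##c.medage c.mean_brate
        estimates store m2

        * AIC and BIC for both models in a single table
        estimates stats m1 m2
        ```
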
        Kind regards,
        Carlo
        (Stata 19.0)
