
  • not getting the i. and o. operators

    Hi – I am struggling to understand how the i. and o. operators work, so I created a small example below. The first three examples of -regress- make sense to me, but not the last two. Why is region==2 being omitted in those cases? Code and log below. -- Paul

    ------------------------------------------------------------------------------------------
          name:  <unnamed>
           log:  /Users/paulrathouz/Desktop/StataTest/indicatorTest.log
      log type:  text
     opened on:  20 May 2025, 14:20:13

    . // Test the i. and o. operators
    . // Time-stamp: <2025-05-20 14:19:29 paulrathouz>
    .
    . sysuse census
    (1980 Census data by state)

    . des

    Contains data from /Applications/Stata/ado/base/c/census.dta
     Observations:            50                  1980 Census data by state
        Variables:            13                  6 Apr 2022 15:43
    ------------------------------------------------------------------------------------------
    Variable      Storage   Display    Value
        name         type      format    label      Variable label
    ------------------------------------------------------------------------------------------
    state           str14   %-14s                 State
    state2          str2    %-2s                  Two-letter state abbreviation
    region          int     %-8.0g     cenreg     Census region
    pop             long    %12.0gc               Population
    poplt5          long    %12.0gc               Pop, < 5 year
    pop5_17         long    %12.0gc               Pop, 5 to 17 years
    pop18p          long    %12.0gc               Pop, 18 and older
    pop65p          long    %12.0gc               Pop, 65 and older
    popurban        long    %12.0gc               Urban population
    medage          float   %9.2f                 Median age
    death           long    %12.0gc               Number of deaths
    marriage        long    %12.0gc               Number of marriages
    divorce         long    %12.0gc               Number of divorces
    ------------------------------------------------------------------------------------------
    Sorted by:

    . codebook region

    ------------------------------------------------------------------------------------------
    region                                                                       Census region
    ------------------------------------------------------------------------------------------

                      Type: Numeric (int)
                     Label: cenreg

                     Range: [1,4]                        Units: 1
             Unique values: 4                        Missing .: 0/50

                Tabulation: Freq.   Numeric  Label
                                9         1  NE
                               12         2  N Cntrl
                               16         3  South
                               13         4  West

    . regress medage i.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(3, 46)        =      7.56
           Model |  46.3961903         3  15.4653968   Prob > F        =    0.0003
        Residual |  94.1237947        46  2.04616945   R-squared       =    0.3302
    -------------+----------------------------------   Adj R-squared   =    0.2865
           Total |  140.519985        49   2.8677548   Root MSE        =    1.4304

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
        N Cntrl  |  -1.708333   .6307664    -2.71   0.009       -2.978   -.4386663
          South  |  -1.614583   .5960182    -2.71   0.009    -2.814306   -.4148606
           West  |  -2.948718    .620282    -4.75   0.000    -4.197281   -1.700155
                 |
           _cons |   31.23333   .4768146    65.50   0.000     30.27356    32.19311
    ------------------------------------------------------------------------------

    . regress medage i1.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(1, 48)        =     13.85
           Model |  31.4712118         1  31.4712118   Prob > F        =    0.0005
        Residual |  109.048773        48  2.27184944   R-squared       =    0.2240
    -------------+----------------------------------   Adj R-squared   =    0.2078
           Total |  140.519985        49   2.8677548   Root MSE        =    1.5073

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
             NE  |    2.06504   .5548321     3.72   0.001     .9494757    3.180605
           _cons |   29.16829   .2353953   123.91   0.000       28.695    29.64159
    ------------------------------------------------------------------------------

    . regress medage o1.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(3, 46)        =      7.56
           Model |  46.3961903         3  15.4653968   Prob > F        =    0.0003
        Residual |  94.1237947        46  2.04616945   R-squared       =    0.3302
    -------------+----------------------------------   Adj R-squared   =    0.2865
           Total |  140.519985        49   2.8677548   Root MSE        =    1.4304

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
        N Cntrl  |  -1.708333   .6307664    -2.71   0.009       -2.978   -.4386663
          South  |  -1.614583   .5960182    -2.71   0.009    -2.814306   -.4148606
           West  |  -2.948718    .620282    -4.75   0.000    -4.197281   -1.700155
                 |
           _cons |   31.23333   .4768146    65.50   0.000     30.27356    32.19311
    ------------------------------------------------------------------------------

    . regress medage o2.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(2, 47)        =      6.76
           Model |  31.3872636         2  15.6936318   Prob > F        =    0.0026
        Residual |  109.132721        47   2.3219728   R-squared       =    0.2234
    -------------+----------------------------------   Adj R-squared   =    0.1903
           Total |  140.519985        49   2.8677548   Root MSE        =    1.5238

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
        N Cntrl  |          0  (omitted)
          South  |  -.6383927   .5056614    -1.26   0.213    -1.655652    .3788668
           West  |  -1.972527   .5377578    -3.67   0.001    -3.054356   -.8906981
                 |
           _cons |   30.25714   .3325209    90.99   0.000      29.5882    30.92609
    ------------------------------------------------------------------------------

    .
    . regress medage i(2 3 4).region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(2, 47)        =      6.76
           Model |  31.3872636         2  15.6936318   Prob > F        =    0.0026
        Residual |  109.132721        47   2.3219728   R-squared       =    0.2234
    -------------+----------------------------------   Adj R-squared   =    0.1903
           Total |  140.519985        49   2.8677548   Root MSE        =    1.5238

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
          South  |  -.6383927   .5056614    -1.26   0.213    -1.655652    .3788668
           West  |  -1.972527   .5377578    -3.67   0.001    -3.054356   -.8906981
                 |
           _cons |   30.25714   .3325209    90.99   0.000      29.5882    30.92609
    ------------------------------------------------------------------------------

    .
    . log close
          name:  <unnamed>
           log:  /Users/paulrathouz/Desktop/StataTest/indicatorTest.log
      log type:  text
     closed on:  20 May 2025, 14:20:13
    ------------------------------------------------------------------------------------------



  • #2
    Yes, this is a common confusion. I've been bitten by it myself on many occasions. I think the manual section on factor variables is not at all clear about how the o. operator works.

    In the context of a regression model, where one of the regressors has to be omitted to identify the model, the o. operator is interpreted as requesting the removal of an additional level. So, you will see that in your o1.region example, 1.region is omitted as the base category (as it would usually be), and then you have requested omission of 1 in addition. But since 1 is already gone, there is nothing more to do. So you get the results with all of the levels except 1 (NE).

    In the o2.region example, 1 is still omitted as the base category, and 2 is omitted in addition. The omission of 1 is not remarked upon in the output, because it basically "comes with the regression." Your second category is then omitted in addition, with explicit marking of that fact in the output. All that remains in that model are levels 3 and 4 of region.

    If you want to omit 2.region and make it the base category, you should specify b2.region. In that case 2 will be omitted as the base category, and 1, 3, and 4 will be retained.
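    In code (a minimal sketch using the same census data as in #1; the comment states my expectation, not verified output):

    Code:
    * Sketch: b2. makes level 2 (N Cntrl) the base; levels 1, 3, and 4 get indicators
    sysuse census, clear
    regress medage b2.region   // same overall fit as i.region, just reparameterized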

    Added: To further clarify, when you are using factor variable operators in a context where there is not an automatic omission of a base level, then o. behaves the way you would expect: it leaves in everything except the levels specified in the o. prefix. For example, you can see this if you run -summarize o3.region-. You will get summary statistics for levels 1, 2, and 4. In this case, 1 is not omitted because there is no omission of any base category in the operation of the -summarize- command.
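    As a sketch:

    Code:
    * Sketch: -summarize- has no automatic base level, so o3. drops only level 3
    sysuse census, clear
    summarize o3.region   // per the above: summary statistics for levels 1, 2, and 4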
    Last edited by Clyde Schechter; 20 May 2025, 13:54.



    • #3
      Clyde -- This is very helpful. A few points / follow-ups:

      0. It sounds like the way to get the more predictable behavior is to use the b. operator instead.
      1. I guess when the manual says, "When omitted levels are specified with the o. operator, the i. operator is implied, ...", this is where the lowest category is dropped, correct?
      2. I think the way to think about this, for example, with i(2 3 4).region, is that Stata first drops the lowest category, then makes a 3-level variable, and then applies the i. operator anew to it. I can see this with this specification:

      . regress medage i(1 3 4).region

            Source |       SS           df       MS      Number of obs   =        50
      -------------+----------------------------------   F(2, 47)        =      6.76
             Model |  31.3872636         2  15.6936318   Prob > F        =    0.0026
          Residual |  109.132721        47   2.3219728   R-squared       =    0.2234
      -------------+----------------------------------   Adj R-squared   =    0.1903
             Total |  140.519985        49   2.8677548   Root MSE        =    1.5238

      ------------------------------------------------------------------------------
            medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
            region |
            South  |  -.6383927   .5056614    -1.26   0.213    -1.655652    .3788668
             West  |  -1.972527   .5377578    -3.67   0.001    -3.054356   -.8906981
                   |
             _cons |   30.25714   .3325209    90.99   0.000      29.5882    30.92609
      ------------------------------------------------------------------------------



      • #4
        It sounds like the way to get the more predictable behavior is to use the b. operator instead.
        Two problems with this. First, if you want to omit more than one level, you can't do that with b. Second, I wouldn't call the behavior of o. unpredictable: it's perfectly predictable once you understand it. The problem is that it's counter-intuitive and not well explained in the documentation.
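        For instance, extrapolating from the log in #1 (a sketch, untested; the comment is my expectation):

        Code:
        * Sketch: omit levels 2 and 3 in addition to the automatic base (level 1)
        sysuse census, clear
        regress medage o(2 3).region   // by the logic above, only West keeps an indicator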

        1. I guess when the manual says, "When omitted levels are specified with the o. operator, the i. operator is implied, ...", this is where the lowest category is dropped, correct?
        I suppose that is a possible interpretation of what's in the manual. It's funny: I generally regard Stata's manuals as really high quality, a model for others to follow. But I find the explanation of factor variable notation to be a glaring exception to that rule. If that phrase does mean what you say it does, then it does seem to explain the behavior of o., but I think that's a pretty obscure way for them to say it.

        2. I think the way to think about this, for example, with i(2 3 4).region, is that Stata first drops the lowest category, then makes a 3-level variable, and then applies the i. operator anew to it. I can see this with this specification:
        Yes, precisely so.



        • #5
          Clyde’s explanation is, as usual, very clear. That said, there is really very little reason, if ever, to use the o. notation. All you should need is i./b. notation for well-constructed categorical variables. It’s natural to specify the model with GLM-style factor coding, and I would prefer to see whether other levels are dropped from the model for reasons such as lack of data or collinearity.

          The rare time I use the o. notation is when I’m trying to do something very specific with interaction variables that I could otherwise construct by hand but would be cumbersome.



          • #6
            Hi Leonardo -- I agree with this now! And, thanks to you and Clyde for the crisp explanations, and to Clyde for the 30k posts!

            One thing I still am unclear on: Sometimes Stata reports a coefficient as "set to 0" and gives a row in the output labeled "omitted". Other times, it just does not include the reference category. Mathematically, these are the same, but I wonder if something is going on under the hood.

            Also, does one ever combine the i. and b. operators? E.g., -i3.b2.cat-, where -cat- is some categorical variable? -- P



            • #7
              Originally posted by Paul Rathouz
              Hi Leonardo -- I agree with this now! And, thanks to you and Clyde for the crisp explanations, and to Clyde for the 30k posts!

              One thing I still am unclear on: Sometimes Stata reports a coefficient as "set to 0" and gives a row in the output labeled "omitted". Other times, it just does not include the reference category. Mathematically, these are the same, but I wonder if something is going on under the hood.

              Also, does one ever combine the i. and b. operators? E.g., -i3.b2.cat-, where -cat- is some categorical variable? -- P
              Omitted is generally what Stata displays when the variable level would normally be expected to be included in the model but for some reason couldn't be (e.g., collinearity). If something erroneous is detected, it will also print such a message in the output. When setting the baseline, no message is displayed, because under the GLM-style parameter coding the default and expected behaviour is to omit one reference category. However, the coefficient is still included in the underlying coefficient vector, variance-covariance matrix, etc., and can be requested in the output as well.
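              A small sketch of that last point, using the -baselevels- display option (which I believe applies here), on the census data from #1:

              Code:
              * Sketch: show the base category row explicitly in the output
              sysuse census, clear
              regress medage i.region, baselevels   // 1.region displayed as the (base) row
              matrix list e(b)                      // base coefficient carried as 1b.region = 0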

              On the second question, you can combine the notation as, say, ib#., but it's redundant, so I don't bother with it personally. I will usually use one or the other, noting that i. saves one character of typing (in most cases) if you already know the lowest level should be the reference category.
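              For example (a sketch; both lines should fit the identical model):

              Code:
              * Sketch: combined vs. plain base notation
              sysuse census, clear
              regress medage ib2.region   // i. with base level 2 in one operator
              regress medage b2.region    // the shorter form Clyde suggested in #2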



              • #8
                Here is a somewhat esoteric example of when o. is useful.

                We sometimes want to know if an ordinal independent variable can be treated as continuous. One way to do that is to include both continuous and categorical versions of the variable in the model, and then test whether the categorical version significantly improves the fit over just using the continuous version. Since you are including two versions of the same variable, you need to have two omitted categories rather than one. For example,

                Code:
                webuse nhanes2f, clear
                * Wald test of whether continuous version alone is enough
                quietly logit diabetes c.health o(1 2).health, nolog
                testparm i.health
                Code:
                . testparm i.health
                
                 ( 1)  [diabetes]3.health = 0
                 ( 2)  [diabetes]4.health = 0
                 ( 3)  [diabetes]5.health = 0
                
                           chi2(  3) =    1.56
                         Prob > chi2 =    0.6689
                The results suggest that treating health as continuous is ok. You can also do LR tests and they lead to the same conclusion.

                For more discussion, see my paper "Ordinal Independent Variables" at

                https://methods.sagepub.com/Foundati...dent-variables

                Or, if your library foolishly does not pay for all this great Sage online material, an earlier writeup is at

                https://www3.nd.edu/~rwilliam/xsoc73...ndependent.pdf
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam



                • #9
                  Belated response to #6.

                  Also, does one ever combine the i. and b. operators? E.g., -i3.b2.cat-, where -cat- is some categorical variable?
                  This is syntactically legal. But I would avoid using constructions like this because it is easy to misconstrue what it actually does.

                  One might think that this is equivalent to asking for an estimate of E(Y | cat = 3) - E(Y | cat = 2). But one would be wrong:
                  Code:
                  . tabstat price, by(rep78)
                  
                  Summary for variables: price
                  Group variable: rep78 (Repair record 1978)
                  
                     rep78 |      Mean
                  ---------+----------
                         1 |    4564.5
                         2 |  5967.625
                         3 |  6429.233
                         4 |    6071.5
                         5 |      5913
                  ---------+----------
                     Total |  6146.043
                  --------------------
                  
                  . display 6429.233 - 5967.625
                  461.608
                  
                  .
                  . regress price i.rep78   // MODEL 1
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(4, 64)        =      0.24
                         Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
                      Residual |   568436416        64     8881819   R-squared       =    0.0145
                  -------------+----------------------------------   Adj R-squared   =   -0.0471
                         Total |   576796959        68  8482308.22   Root MSE        =    2980.2
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                         rep78 |
                            2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
                            3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
                            4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
                            5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
                               |
                         _cons |     4564.5   2107.347     2.17   0.034     354.5913    8774.409
                  ------------------------------------------------------------------------------
                  
                  . lincom 3.rep78 - 2.rep78
                  
                   ( 1)  - 2.rep78 + 3.rep78 = 0
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                           (1) |   461.6083    1185.87     0.39   0.698     -1907.44    2830.656
                  ------------------------------------------------------------------------------
                  
                  .
                  .
                  . regress price ib2.rep78  // MODEL 2
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(4, 64)        =      0.24
                         Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
                      Residual |   568436416        64     8881819   R-squared       =    0.0145
                  -------------+----------------------------------   Adj R-squared   =   -0.0471
                         Total |   576796959        68  8482308.22   Root MSE        =    2980.2
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                         rep78 |
                            1  |  -1403.125   2356.085    -0.60   0.554    -6109.946    3303.696
                            3  |   461.6083    1185.87     0.39   0.698     -1907.44    2830.656
                            4  |    103.875   1266.358     0.08   0.935    -2425.965    2633.715
                            5  |    -54.625   1384.798    -0.04   0.969    -2821.077    2711.827
                               |
                         _cons |   5967.625   1053.673     5.66   0.000     3862.671    8072.579
                  ------------------------------------------------------------------------------
                  
                  . lincom 3.rep78 - 2.rep78
                  
                   ( 1)  - 2b.rep78 + 3.rep78 = 0
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                           (1) |   461.6083    1185.87     0.39   0.698     -1907.44    2830.656
                  ------------------------------------------------------------------------------
                  
                  .
                  . regress price i3.b2.rep78 // MODEL 3
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(1, 67)        =      0.50
                         Model |  4256583.14         1  4256583.14   Prob > F        =    0.4828
                      Residual |   572540376        67  8545378.74   R-squared       =    0.0074
                  -------------+----------------------------------   Adj R-squared   =   -0.0074
                         Total |   576796959        68  8482308.22   Root MSE        =    2923.2
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                       3.rep78 |   501.0282   709.9002     0.71   0.483    -915.9384    1917.995
                         _cons |   5928.205   468.0943    12.66   0.000     4993.885    6862.525
                  ------------------------------------------------------------------------------
                  So what is that 501.0282? It's not the difference between expected price at rep78 = 3 and rep78 = 2. Presumably it is the difference in expected price at rep78 = 3 and something else. What is that something else?

                  Another incorrect interpretation that looks attractive is that it is equivalent to regressing price on rep78 restricting to rep78 = 3 (i3) and rep78 = 2 (b2). But, again, that's wrong:
                  Code:
                  . regress price ib2.rep78 if inlist(rep78, 2, 3)
                  
                        Source |       SS           df       MS      Number of obs   =        38
                  -------------+----------------------------------   F(1, 36)        =      0.11
                         Model |  1345782.65         1  1345782.65   Prob > F        =    0.7447
                      Residual |   450054279        36  12501507.8   R-squared       =    0.0030
                  -------------+----------------------------------   Adj R-squared   =   -0.0247
                         Total |   451400062        37  12200001.7   Root MSE        =    3535.7
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                       3.rep78 |   461.6083   1406.913     0.33   0.745    -2391.744    3314.961
                         _cons |   5967.625   1250.075     4.77   0.000     3432.355    8502.895
                  ------------------------------------------------------------------------------
                  Notice that this one gives a coefficient, 461.6083, that is the correct expected difference in price at rep78 = 3 and rep78 = 2, but by restricting the sample to just those two levels, the standard error is wrong. (N.B. N has dropped from 69 to 38.)


                  The key is to remember that these factor variable operators manipulate the construction of the indicator ("dummy") variables used in the model: they do not change the estimation sample or anything like that.

                  So to interpret i3.b2.rep78, we recognize that the entire range of values of rep78 will be included in the estimation sample. Level 2 will be the base category, and level 3 will be represented by its own indicator. What happens to levels 1, 4, and 5? Because the notation i3.b2. restricts the indicators to one for level 3 and a base level of 2, there are no indicators for levels 1, 4, and 5, even though levels 1, 4, and 5 are included in the estimation sample. So levels 1, 4, and 5 must have 0 values for both the i3 indicator and for the (omitted as baseline) b2 indicator. This is equivalent to recoding levels 1, 4, and 5 as if they were also part of the baseline. And this is in fact what happens:
                  Code:
                  . recode rep78 (1 4 5 = 2), gen(rep78_recode)
                  (31 differences between rep78 and rep78_recode)
                  
                  . regress price i.rep78_recode // MODEL 4
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(1, 67)        =      0.50
                         Model |  4256583.14         1  4256583.14   Prob > F        =    0.4828
                      Residual |   572540376        67  8545378.74   R-squared       =    0.0074
                  -------------+----------------------------------   Adj R-squared   =   -0.0074
                         Total |   576796959        68  8482308.22   Root MSE        =    2923.2
                  
                  --------------------------------------------------------------------------------
                           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  ---------------+----------------------------------------------------------------
                  3.rep78_recode |   501.0282   709.9002     0.71   0.483    -915.9384    1917.995
                           _cons |   5928.205   468.0943    12.66   0.000     4993.885    6862.525
                  --------------------------------------------------------------------------------
                  This reproduces the results of MODEL 3 exactly.

                  So, yes, you can do this. You can puzzle out the meaning of i3.b2. with this line of reasoning. If, working in the opposite direction, you wanted to use everything but level 3 as the baseline and retain all levels of rep78 in the estimation sample, you could puzzle out that i3.b2.rep78 will do that. But I suspect that if you did this and then came back and reviewed your log file 6 months later, when the reviewers of your manuscript are requesting revisions, you would stare at that and wonder what you were thinking. So my advice is: don't go there.



                  • #10
                    Thank you both. I do find Clyde's reasoning compelling. I would note, on Richard's post, that there is an even easier way to do what he wants: the o2. prefix both drops the first category and sets the second one to be the reference. See this comparison and how it gives the same answer:

                    Code:
                    . webuse nhanes2f, clear
                    
                    . * Wald test of whether continuous version alone is enough
                    . quietly logit diabetes c.health o(1 2).health, nolog
                    
                    . testparm i.health
                    
                     ( 1)  [diabetes]3.health = 0
                     ( 2)  [diabetes]4.health = 0
                     ( 3)  [diabetes]5.health = 0
                    
                               chi2(  3) =    1.56
                             Prob > chi2 =    0.6689
                    
                    . 
                    . quietly logit diabetes c.health o2.health, nolog
                    
                    . testparm i.health
                    
                     ( 1)  [diabetes]3.health = 0
                     ( 2)  [diabetes]4.health = 0
                     ( 3)  [diabetes]5.health = 0
                    
                               chi2(  3) =    1.56
                             Prob > chi2 =    0.6689



                    • #11
                      Thanks Paul. It is a shame that you weren’t one of the reviewers for my paper!
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam



                      • #12
                        I had never looked at the o. prefix before, and the examples in #8 and #10 piqued my curiosity. Here are some related examples I generated while tinkering.

                        Code:
                        // Variations on example in #8
                        webuse nhanes2f, clear
                        * Wald test of whether continuous version alone is enough
                        forvalues x = 2/5 {
                          quietly logit diabetes c.health o(1 `x').health
                          testparm i.health
                        }
                        // Result from -testparm- is the same in all cases.
                        // But result is also the same if you just use i.health:
                        quietly logit diabetes c.health i.health
                        testparm i.health
                        
                        * Likelihood ratio test of whether continuous version alone is enough
                        // Model 1: Treat health as continuous
                        quietly logit diabetes c.health
                        estimates store m1
                        // Model 2: Add health as categorical  
                        quietly logit diabetes c.health i.health
                        estimates store m2
                        lrtest m1 m2
                        Output:
                        Code:
                        . // Variations on example in #8
                        . webuse nhanes2f, clear
                        
                        . * Wald test of whether continuous version alone is enough
                        . forvalues x = 2/5 {
                          2.   quietly logit diabetes c.health o(1 `x').health
                          3.   testparm i.health
                          4. }
                        
                         ( 1)  [diabetes]3.health = 0
                         ( 2)  [diabetes]4.health = 0
                         ( 3)  [diabetes]5.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]4.health = 0
                         ( 3)  [diabetes]5.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]3.health = 0
                         ( 3)  [diabetes]5.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]3.health = 0
                         ( 3)  [diabetes]4.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                        . // Result from -testparm- is the same in all cases.
                        . // But result is also the same if you just use i.health:
                        . quietly logit diabetes c.health i.health
                        
                        . testparm i.health
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]3.health = 0
                         ( 3)  [diabetes]4.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                        .
                        . * Likelihood ratio test of whether continuous version alone is enough
                        . // Model 1: Treat health as continuous
                        . quietly logit diabetes c.health
                        
                        . estimates store m1
                        
                        . // Model 2: Add health as categorical  
                        . quietly logit diabetes c.health i.health
                        
                        . estimates store m2
                        
                        . lrtest m1 m2
                        
                        Likelihood-ratio test
                        Assumption: m1 nested within m2
                        
                         LR chi2(3) =   1.60
                        Prob > chi2 = 0.6599
                        
                        .
                        end of do-file

                        EDIT: And if I had consulted Richard's notes (https://www3.nd.edu/~rwilliam/xsoc73...ndependent.pdf) before posting, I would have seen that they contain similar examples! 🙄

                        Last edited by Bruce Weaver; 23 May 2025, 13:58.
                        --
                        Bruce Weaver
                        Email: [email protected]
                        Version: Stata/MP 19.5 (Windows)



                        • #13
                          Too bad Bruce didn't review my paper either. ;-)

                          I think my approach makes it a little more obvious and intuitive what Stata is doing. But also notice a potential problem with the alternatives Bruce and Paul have tossed out, which are a little simpler than my original coding.

                          Code:
                          webuse nhanes2f, clear
                          logit diabetes c.health i.health, nolog
                          testparm i.health
                          logit diabetes i.health c.health, nolog
                          testparm i.health
                          Code:
                          . logit diabetes c.health i.health, nolog
                          note: 5.health omitted because of collinearity.
                          
                          Logistic regression                                     Number of obs = 10,335
                                                                                  LR chi2(4)    = 429.74
                                                                                  Prob > chi2   = 0.0000
                          Log likelihood = -1784.1984                             Pseudo R2     = 0.1075
                          
                          ------------------------------------------------------------------------------
                              diabetes | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                                health |  -.7791143    .056556   -13.78   0.000    -.8899619   -.6682666
                                       |
                                health |
                                 Fair  |   .0297756   .1207477     0.25   0.805    -.2068855    .2664367
                              Average  |  -.0089766   .1437693    -0.06   0.950    -.2907592     .272806
                                 Good  |  -.2166689   .2164641    -1.00   0.317    -.6409308     .207593
                            Excellent  |          0  (omitted)
                                       |
                                 _cons |  -.7024903   .1297495    -5.41   0.000    -.9567947   -.4481859
                          ------------------------------------------------------------------------------
                          
                          . testparm i.health
                          
                           ( 1)  [diabetes]2.health = 0
                           ( 2)  [diabetes]3.health = 0
                           ( 3)  [diabetes]4.health = 0
                          
                                     chi2(  3) =    1.56
                                   Prob > chi2 =    0.6689
                          
                          . logit diabetes i.health c.health, nolog
                          note: health omitted because of collinearity.
                          
                          Logistic regression                                     Number of obs = 10,335
                                                                                  LR chi2(4)    = 429.74
                                                                                  Prob > chi2   = 0.0000
                          Log likelihood = -1784.1984                             Pseudo R2     = 0.1075
                          
                          ------------------------------------------------------------------------------
                              diabetes | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                                health |
                                 Fair  |  -.7493387   .1262017    -5.94   0.000    -.9966895   -.5019878
                              Average  |  -1.567205   .1302544   -12.03   0.000    -1.822499   -1.311911
                                 Good  |  -2.554012   .1780615   -14.34   0.000    -2.903006   -2.205018
                            Excellent  |  -3.116457   .2262238   -13.78   0.000    -3.559848   -2.673067
                                       |
                                health |          0  (omitted)
                                 _cons |  -1.481605   .0953463   -15.54   0.000     -1.66848   -1.294729
                          ------------------------------------------------------------------------------
                          
                          . testparm i.health
                          
                           ( 1)  [diabetes]2.health = 0
                           ( 2)  [diabetes]3.health = 0
                           ( 3)  [diabetes]4.health = 0
                           ( 4)  [diabetes]5.health = 0
                          
                                     chi2(  4) =  368.90
                                   Prob > chi2 =    0.0000
                          All I did was reverse the positions of c.health and i.health. Yet the testparm results were radically different. Why? Because Stata has to omit something. In the first model it dropped a category of health, and in the second it dropped c.health.

                          The same thing happens if you use Paul's o2-only approach: reverse the health variables, and c.health is dropped instead of 1.health.
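                          A sketch of that reversal, for anyone who wants to verify (untested; the expectation in the comment follows from the description above):

                          Code:
                          webuse nhanes2f, clear
                          * Paul's o2-only approach with the terms reversed
                          quietly logit diabetes o2.health c.health, nolog
                          testparm i.health   // expect c.health, not 1.health, to be the dropped term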

                          But, if you use my original approach,

                          Code:
                          logit diabetes c.health o(1 2).health, nolog
                          testparm i.health
                          logit diabetes o(1 2).health c.health , nolog
                          testparm i.health
                          you get the same results regardless of whether the categorical or continuous version comes first. This is because I am explicitly controlling what gets omitted, rather than letting Stata make the call.

                          I suspect I wasn't that brilliant and foresighted when I wrote the paper, but if I wasn't, I think I lucked into the best and safest way to do it.

                          As I said at the beginning, this is a sort of esoteric example! But I suppose the moral is: if for some reason it really, really matters which categories or variables get omitted, you may want to use o. so you control the choice rather than leaving it up to Stata.



                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam



                          • #14
                            Another sidelight: as Bruce shows, in this case you could use LR tests instead. But if, say, the data were svyset, you would have to use the Wald test approach.
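                            A sketch of what that would look like (the design variables psuid, stratid, and finalwgt are my assumption for nhanes2f; untested):

                            Code:
                            webuse nhanes2f, clear
                            svyset psuid [pweight=finalwgt], strata(stratid)   // assumed design variables
                            quietly svy: logit diabetes c.health o(1 2).health
                            testparm i.health   // the Wald test still works; -lrtest- is not valid after -svy-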
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 19.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam



                            • #15
                              Roger all of that, Richard!
