
  • not getting the i. and o. operators

    Hi – I am struggling to understand how the i. and o. operators work, so I created a small example below. The first three examples of -regress- make sense to me, but not the last two. Why is region==2 being omitted in those cases? Code and log below. -- Paul

    ------------------------------------------------------------------------------------------
          name:  <unnamed>
           log:  /Users/paulrathouz/Desktop/StataTest/indicatorTest.log
      log type:  text
     opened on:  20 May 2025, 14:20:13

    . // Test the i. and o. operators
    . // Time-stamp: <2025-05-20 14:19:29 paulrathouz>
    .
    . sysuse census
    (1980 Census data by state)

    . des

    Contains data from /Applications/Stata/ado/base/c/census.dta
     Observations:            50                  1980 Census data by state
        Variables:            13                  6 Apr 2022 15:43
    ------------------------------------------------------------------------------------------
    Variable      Storage   Display    Value
        name         type      format    label      Variable label
    ------------------------------------------------------------------------------------------
    state           str14   %-14s                 State
    state2          str2    %-2s                  Two-letter state abbreviation
    region          int     %-8.0g     cenreg     Census region
    pop             long    %12.0gc               Population
    poplt5          long    %12.0gc               Pop, < 5 year
    pop5_17         long    %12.0gc               Pop, 5 to 17 years
    pop18p          long    %12.0gc               Pop, 18 and older
    pop65p          long    %12.0gc               Pop, 65 and older
    popurban        long    %12.0gc               Urban population
    medage          float   %9.2f                 Median age
    death           long    %12.0gc               Number of deaths
    marriage        long    %12.0gc               Number of marriages
    divorce         long    %12.0gc               Number of divorces
    ------------------------------------------------------------------------------------------
    Sorted by:

    . codebook region

    ------------------------------------------------------------------------------------------
    region                                                                       Census region
    ------------------------------------------------------------------------------------------

                      Type: Numeric (int)
                     Label: cenreg

                     Range: [1,4]                        Units: 1
             Unique values: 4                        Missing .: 0/50

                Tabulation: Freq.   Numeric  Label
                                9         1  NE
                               12         2  N Cntrl
                               16         3  South
                               13         4  West

    . regress medage i.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(3, 46)        =      7.56
           Model |  46.3961903         3  15.4653968   Prob > F        =    0.0003
        Residual |  94.1237947        46  2.04616945   R-squared       =    0.3302
    -------------+----------------------------------   Adj R-squared   =    0.2865
           Total |  140.519985        49   2.8677548   Root MSE        =    1.4304

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
        N Cntrl  |  -1.708333   .6307664    -2.71   0.009       -2.978   -.4386663
          South  |  -1.614583   .5960182    -2.71   0.009    -2.814306   -.4148606
           West  |  -2.948718    .620282    -4.75   0.000    -4.197281   -1.700155
                 |
           _cons |   31.23333   .4768146    65.50   0.000     30.27356    32.19311
    ------------------------------------------------------------------------------

    . regress medage i1.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(1, 48)        =     13.85
           Model |  31.4712118         1  31.4712118   Prob > F        =    0.0005
        Residual |  109.048773        48  2.27184944   R-squared       =    0.2240
    -------------+----------------------------------   Adj R-squared   =    0.2078
           Total |  140.519985        49   2.8677548   Root MSE        =    1.5073

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
             NE  |    2.06504   .5548321     3.72   0.001     .9494757    3.180605
           _cons |   29.16829   .2353953   123.91   0.000       28.695    29.64159
    ------------------------------------------------------------------------------

    . regress medage o1.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(3, 46)        =      7.56
           Model |  46.3961903         3  15.4653968   Prob > F        =    0.0003
        Residual |  94.1237947        46  2.04616945   R-squared       =    0.3302
    -------------+----------------------------------   Adj R-squared   =    0.2865
           Total |  140.519985        49   2.8677548   Root MSE        =    1.4304

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
        N Cntrl  |  -1.708333   .6307664    -2.71   0.009       -2.978   -.4386663
          South  |  -1.614583   .5960182    -2.71   0.009    -2.814306   -.4148606
           West  |  -2.948718    .620282    -4.75   0.000    -4.197281   -1.700155
                 |
           _cons |   31.23333   .4768146    65.50   0.000     30.27356    32.19311
    ------------------------------------------------------------------------------

    . regress medage o2.region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(2, 47)        =      6.76
           Model |  31.3872636         2  15.6936318   Prob > F        =    0.0026
        Residual |  109.132721        47   2.3219728   R-squared       =    0.2234
    -------------+----------------------------------   Adj R-squared   =    0.1903
           Total |  140.519985        49   2.8677548   Root MSE        =    1.5238

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
        N Cntrl  |          0  (omitted)
          South  |  -.6383927   .5056614    -1.26   0.213    -1.655652    .3788668
           West  |  -1.972527   .5377578    -3.67   0.001    -3.054356   -.8906981
                 |
           _cons |   30.25714   .3325209    90.99   0.000      29.5882    30.92609
    ------------------------------------------------------------------------------

    .
    . regress medage i(2 3 4).region

          Source |       SS           df       MS      Number of obs   =        50
    -------------+----------------------------------   F(2, 47)        =      6.76
           Model |  31.3872636         2  15.6936318   Prob > F        =    0.0026
        Residual |  109.132721        47   2.3219728   R-squared       =    0.2234
    -------------+----------------------------------   Adj R-squared   =    0.1903
           Total |  140.519985        49   2.8677548   Root MSE        =    1.5238

    ------------------------------------------------------------------------------
          medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          region |
          South  |  -.6383927   .5056614    -1.26   0.213    -1.655652    .3788668
           West  |  -1.972527   .5377578    -3.67   0.001    -3.054356   -.8906981
                 |
           _cons |   30.25714   .3325209    90.99   0.000      29.5882    30.92609
    ------------------------------------------------------------------------------

    .
    . log close
          name:  <unnamed>
           log:  /Users/paulrathouz/Desktop/StataTest/indicatorTest.log
      log type:  text
     closed on:  20 May 2025, 14:20:13
    ------------------------------------------------------------------------------------------



  • #2
    Yes, this is a common confusion. I've been bitten by it myself on many occasions. I think the manual section on factor variables is not at all clear about how the o. operator works.

    In the context of a regression model, where one of the regressors has to be omitted to identify the model, the o. operator is interpreted as requesting the removal of an additional level. So, you will see that in your o1.region example, 1.region is omitted as the base category (as it would usually be), and then you have requested omission of 1 in addition. But since 1 is already gone, there is nothing more to do. So you get the results with all of the levels except 1 (NE).

    In the o2.region example, 1 is still omitted as the base category, and 2 is omitted in addition. The omission of 1 is not remarked upon in the output, because it basically "comes with the regression." Your second category is then omitted in addition, with explicit marking of that fact in the output. All that remains in that model are levels 3 and 4 of region.

    If you want to omit 2.region and make it the base category, you should specify b2.region. In that case 2 will be omitted as the base category, and 1, 3, and 4 will be retained.
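    In code (a minimal sketch using the same census data as in #1; the comment states my expectation, not verified output):

    Code:
    * Sketch: b2. makes level 2 (N Cntrl) the base; levels 1, 3, and 4 get indicators
    sysuse census, clear
    regress medage b2.region   // same overall fit as i.region, just reparameterized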

    Added: To further clarify, when you are using factor variable operators in a context where there is not an automatic omission of a base level, then o. behaves the way you would expect: it leaves in everything except the levels specified in the o. prefix. For example, you can see this if you run -summarize o3.region-. You will get summary statistics for levels 1, 2, and 4. In this case, 1 is not omitted because there is no omission of any base category in the operation of the -summarize- command.
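    As a sketch:

    Code:
    * Sketch: -summarize- has no automatic base level, so o3. drops only level 3
    sysuse census, clear
    summarize o3.region   // per the above: summary statistics for levels 1, 2, and 4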
    Last edited by Clyde Schechter; 20 May 2025, 13:54.



    • #3
      Clyde -- This is very helpful. A few points / follow-ups:

      0. It sounds like the way to get the more predictable behavior is to use the b. operator instead.
      1. I guess when the manual says, "When omitted levels are specified with the o. operator, the i. operator is implied, ...", this is where the lowest category is dropped, correct?
      2. I think the way to think about this, for example, with i(2 3 4).region, is that Stata first drops the lowest category, then makes a 3-level variable, and then applies the i. operator anew to it. I can see this with this specification:

      . regress medage i(1 3 4).region

            Source |       SS           df       MS      Number of obs   =        50
      -------------+----------------------------------   F(2, 47)        =      6.76
             Model |  31.3872636         2  15.6936318   Prob > F        =    0.0026
          Residual |  109.132721        47   2.3219728   R-squared       =    0.2234
      -------------+----------------------------------   Adj R-squared   =    0.1903
             Total |  140.519985        49   2.8677548   Root MSE        =    1.5238

      ------------------------------------------------------------------------------
            medage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
            region |
            South  |  -.6383927   .5056614    -1.26   0.213    -1.655652    .3788668
             West  |  -1.972527   .5377578    -3.67   0.001    -3.054356   -.8906981
                   |
             _cons |   30.25714   .3325209    90.99   0.000      29.5882    30.92609
      ------------------------------------------------------------------------------



      • #4
        It sounds like the way to get the more predictable behavior is to use the b. operator instead.
        Two problems with this. First, if you want to omit more than one level, you can't do that with b. Second, I wouldn't call the behavior of o. unpredictable: it's perfectly predictable once you understand it. The problem is that it's counter-intuitive and not well explained in the documentation.
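        For instance, extrapolating from the log in #1 (a sketch, untested; the comment is my expectation):

        Code:
        * Sketch: omit levels 2 and 3 in addition to the automatic base (level 1)
        sysuse census, clear
        regress medage o(2 3).region   // by the logic above, only West keeps an indicator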

        1. I guess when the manual says, "When omitted levels are specified with the o. operator, the i. operator is implied, ...", this is where the lowest category is dropped, correct?
        I suppose that is a possible interpretation of what's in the manual. It's funny: I generally regard Stata's manuals as really high quality, a model for others to follow. But I find the explanation of factor variable notation to be a glaring exception to that rule. If that phrase does mean what you say it does, then it does seem to explain the behavior of o., but I think that's a pretty obscure way for them to say it.

        2. I think the way to think about this, for example, with i(2 3 4).region, is that Stata first drops the lowest category, then makes a 3-level variable, and then applies the i. operator anew to it. I can see this with this specification:
        Yes, precisely so.



        • #5
          Clyde’s explanation is, as usual, very clear. That said, there is really very little reason, if ever, to use the o. notation. All you should need is i./b. notation for well-constructed categorical variables. It’s natural to specify the model with GLM-style factor coding, and I would prefer to see whether other levels are dropped from the model for reasons such as lack of data or collinearity.

          The rare time I use the o. notation is when I’m trying to do something very specific with interaction variables that I could otherwise construct by hand but would be cumbersome.



          • #6
            Hi Leonardo -- I agree with this now! And, thanks to you and Clyde for the crisp explanations, and to Clyde for the 30k posts!

            One thing I still am unclear on: Sometimes Stata reports a coefficient as "set to 0" and gives a row in the output labeled "omitted". Other times, it just does not include the reference category. Mathematically, these are the same, but I wonder if something is going on under the hood.

            Also, does one ever combine the i. and b. operators? E.g., -i3.b2.cat-, where -cat- is some categorical variable? -- P



            • #7
              Originally posted by Paul Rathouz
              Hi Leonardo -- I agree with this now! And, thanks to you and Clyde for the crisp explanations, and to Clyde for the 30k posts!

              One thing I still am unclear on: Sometimes Stata reports a coefficient as "set to 0" and gives a row in the output labeled "omitted". Other times, it just does not include the reference category. Mathematically, these are the same, but I wonder if something is going on under the hood.

              Also, does one ever combine the i. and b. operators? E.g., -i3.b2.cat-, where -cat- is some categorical variable? -- P
              Omitted is generally what Stata displays when the variable level would normally be expected to be included in the model but for some reason couldn't be (e.g., collinearity). If something erroneous is detected, it will also print such a message in the output. When setting the baseline, no message is displayed, because under the GLM-style parameter coding the default and expected behaviour is to omit one reference category. However, the coefficient is still included in the underlying coefficient vector, variance-covariance matrix, etc., and can be requested in the output as well.
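              A small sketch of that last point, using the -baselevels- display option (which I believe applies here), on the census data from #1:

              Code:
              * Sketch: show the base category row explicitly in the output
              sysuse census, clear
              regress medage i.region, baselevels   // 1.region displayed as the (base) row
              matrix list e(b)                      // base coefficient carried as 1b.region = 0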

              On the second question, you can combine the notation as, say, ib#., but it's redundant, so I don't bother with it personally. I will usually use one or the other, noting that i. saves one character of typing (in most cases) if you already know the lowest level should be the reference category.
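              For example (a sketch; both lines should fit the identical model):

              Code:
              * Sketch: combined vs. plain base notation
              sysuse census, clear
              regress medage ib2.region   // i. with base level 2 in one operator
              regress medage b2.region    // the shorter form Clyde suggested in #2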



              • #8
                Here is a somewhat esoteric example of when o. is useful.

                We sometimes want to know if an ordinal independent variable can be treated as continuous. One way to do that is to include both continuous and categorical versions of the variable in the model, and then test whether the categorical version significantly improves the fit over just using the continuous version. Since you are including two versions of the same variable, you need to have two omitted categories rather than one. For example,

                Code:
                webuse nhanes2f, clear
                * Wald test of whether continuous version alone is enough
                quietly logit diabetes c.health o(1 2).health, nolog
                testparm i.health
                Code:
                . testparm i.health
                
                 ( 1)  [diabetes]3.health = 0
                 ( 2)  [diabetes]4.health = 0
                 ( 3)  [diabetes]5.health = 0
                
                           chi2(  3) =    1.56
                         Prob > chi2 =    0.6689
                The results suggest that treating health as continuous is ok. You can also do LR tests and they lead to the same conclusion.

                For more discussion, see my paper "Ordinal Independent Variables" at

                https://methods.sagepub.com/Foundati...dent-variables

                Or, if your library foolishly does not pay for all this great Sage online material, an earlier writeup is at

                https://www3.nd.edu/~rwilliam/xsoc73...ndependent.pdf
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam



                • #9
                  Belated response to #6.

                  Also, does one ever combine the i. and b. operators? E.g., -i3.b2.cat-, where -cat- is some categorical variable?
                  This is syntactically legal. But I would avoid using constructions like this because it is easy to misconstrue what it actually does.

                  One might think that this is equivalent to asking for an estimate of E(Y | cat = 3) - E(Y | cat = 2). But one would be wrong:
                  Code:
                  . tabstat price, by(rep78)
                  
                  Summary for variables: price
                  Group variable: rep78 (Repair record 1978)
                  
                     rep78 |      Mean
                  ---------+----------
                         1 |    4564.5
                         2 |  5967.625
                         3 |  6429.233
                         4 |    6071.5
                         5 |      5913
                  ---------+----------
                     Total |  6146.043
                  --------------------
                  
                  . display 6429.233 - 5967.625
                  461.608
                  
                  .
                  . regress price i.rep78   // MODEL 1
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(4, 64)        =      0.24
                         Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
                      Residual |   568436416        64     8881819   R-squared       =    0.0145
                  -------------+----------------------------------   Adj R-squared   =   -0.0471
                         Total |   576796959        68  8482308.22   Root MSE        =    2980.2
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                         rep78 |
                            2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
                            3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
                            4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
                            5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
                               |
                         _cons |     4564.5   2107.347     2.17   0.034     354.5913    8774.409
                  ------------------------------------------------------------------------------
                  
                  . lincom 3.rep78 - 2.rep78
                  
                   ( 1)  - 2.rep78 + 3.rep78 = 0
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                           (1) |   461.6083    1185.87     0.39   0.698     -1907.44    2830.656
                  ------------------------------------------------------------------------------
                  
                  .
                  .
                  . regress price ib2.rep78  // MODEL 2
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(4, 64)        =      0.24
                         Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
                      Residual |   568436416        64     8881819   R-squared       =    0.0145
                  -------------+----------------------------------   Adj R-squared   =   -0.0471
                         Total |   576796959        68  8482308.22   Root MSE        =    2980.2
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                         rep78 |
                            1  |  -1403.125   2356.085    -0.60   0.554    -6109.946    3303.696
                            3  |   461.6083    1185.87     0.39   0.698     -1907.44    2830.656
                            4  |    103.875   1266.358     0.08   0.935    -2425.965    2633.715
                            5  |    -54.625   1384.798    -0.04   0.969    -2821.077    2711.827
                               |
                         _cons |   5967.625   1053.673     5.66   0.000     3862.671    8072.579
                  ------------------------------------------------------------------------------
                  
                  . lincom 3.rep78 - 2.rep78
                  
                   ( 1)  - 2b.rep78 + 3.rep78 = 0
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                           (1) |   461.6083    1185.87     0.39   0.698     -1907.44    2830.656
                  ------------------------------------------------------------------------------
                  
                  .
                  . regress price i3.b2.rep78 // MODEL 3
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(1, 67)        =      0.50
                         Model |  4256583.14         1  4256583.14   Prob > F        =    0.4828
                      Residual |   572540376        67  8545378.74   R-squared       =    0.0074
                  -------------+----------------------------------   Adj R-squared   =   -0.0074
                         Total |   576796959        68  8482308.22   Root MSE        =    2923.2
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                       3.rep78 |   501.0282   709.9002     0.71   0.483    -915.9384    1917.995
                         _cons |   5928.205   468.0943    12.66   0.000     4993.885    6862.525
                  ------------------------------------------------------------------------------
                  So what is that 501.0282? It's not the difference between expected price at rep78 = 3 and rep78 = 2. Presumably it is the difference in expected price at rep78 = 3 and something else. What is that something else?

                  Another incorrect interpretation that looks attractive is that it is equivalent to regressing price on rep78 restricting to rep78 = 3 (i3) and rep78 = 2 (b2). But, again, that's wrong:
                  Code:
                  . regress price ib2.rep78 if inlist(rep78, 2, 3)
                  
                        Source |       SS           df       MS      Number of obs   =        38
                  -------------+----------------------------------   F(1, 36)        =      0.11
                         Model |  1345782.65         1  1345782.65   Prob > F        =    0.7447
                      Residual |   450054279        36  12501507.8   R-squared       =    0.0030
                  -------------+----------------------------------   Adj R-squared   =   -0.0247
                         Total |   451400062        37  12200001.7   Root MSE        =    3535.7
                  
                  ------------------------------------------------------------------------------
                         price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                       3.rep78 |   461.6083   1406.913     0.33   0.745    -2391.744    3314.961
                         _cons |   5967.625   1250.075     4.77   0.000     3432.355    8502.895
                  ------------------------------------------------------------------------------
                  Notice that this one gives a coefficient, 461.6083, that is the correct expected difference in price at rep78 = 3 and rep78 = 2, but by restricting the sample to just those two levels, the standard error is wrong. (N.B. N has dropped from 69 to 38.)


                  The key is to remember that these factor variable operators manipulate the construction of the indicator ("dummy") variables used in the model: they do not change the estimation sample or anything like that.

                  So to interpret i3.b2.rep78, we recognize that the entire range of values of rep78 will be included in the estimation sample. Level 2 will be the base category, and level 3 will be represented by its own indicator. What happens to levels 1, 4, and 5? Because the notation i3.b2. restricts the indicators to one for level 3 and a base level of 2, there are no indicators for levels 1, 4, and 5, even though levels 1, 4, and 5 are included in the estimation sample. So levels 1, 4, and 5 must have 0 values for both the i3 indicator and for the (omitted as baseline) b2 indicator. This is equivalent to recoding levels 1, 4, and 5 as if they were also part of the baseline. And this is in fact what happens:
                  Code:
                  . recode rep78 (1 4 5 = 2), gen(rep78_recode)
                  (31 differences between rep78 and rep78_recode)
                  
                  . regress price i.rep78_recode // MODEL 4
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(1, 67)        =      0.50
                         Model |  4256583.14         1  4256583.14   Prob > F        =    0.4828
                      Residual |   572540376        67  8545378.74   R-squared       =    0.0074
                  -------------+----------------------------------   Adj R-squared   =   -0.0074
                         Total |   576796959        68  8482308.22   Root MSE        =    2923.2
                  
                  --------------------------------------------------------------------------------
                           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  ---------------+----------------------------------------------------------------
                  3.rep78_recode |   501.0282   709.9002     0.71   0.483    -915.9384    1917.995
                           _cons |   5928.205   468.0943    12.66   0.000     4993.885    6862.525
                  --------------------------------------------------------------------------------
                  This reproduces the results of MODEL 3 exactly.

                  So, yes, you can do this. You can puzzle out the meaning of i3.b2. with this line of reasoning. If, working in the opposite direction, you wanted to use everything but level 3 as the baseline and retain all levels of rep78 in the estimation sample, you could puzzle out that i3.b2.rep78 will do that. But I suspect that if you did this and then came back and reviewed your log file 6 months later, when the reviewers of your manuscript are requesting revisions, you would stare at that and wonder what you were thinking. So my advice is: don't go there.



                  • #10
                    Thank you both. I do find Clyde's reasoning compelling. I would note, on Richard's post, that there is an even easier way to do what he wants: the o2. prefix both drops the first category and sets the second one to be the reference. See this comparison and how it gives the same answer:

                    Code:
                    . webuse nhanes2f, clear
                    
                    . * Wald test of whether continuous version alone is enough
                    . quietly logit diabetes c.health o(1 2).health, nolog
                    
                    . testparm i.health
                    
                     ( 1)  [diabetes]3.health = 0
                     ( 2)  [diabetes]4.health = 0
                     ( 3)  [diabetes]5.health = 0
                    
                               chi2(  3) =    1.56
                             Prob > chi2 =    0.6689
                    
                    . 
                    . quietly logit diabetes c.health o2.health, nolog
                    
                    . testparm i.health
                    
                     ( 1)  [diabetes]3.health = 0
                     ( 2)  [diabetes]4.health = 0
                     ( 3)  [diabetes]5.health = 0
                    
                               chi2(  3) =    1.56
                             Prob > chi2 =    0.6689



                    • #11
                      Thanks Paul. It is a shame that you weren’t one of the reviewers for my paper!
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam



                      • #12
                        I had never looked at the o. prefix before, and the examples in #8 and #10 piqued my curiosity. Here are some related examples I generated while tinkering.

                        Code:
                        // Variations on example in #8
                        webuse nhanes2f, clear
                        * Wald test of whether continuous version alone is enough
                        forvalues x = 2/5 {
                          quietly logit diabetes c.health o(1 `x').health
                          testparm i.health
                        }
                        // Result from -testparm- is the same in all cases.
                        // But result is also the same if you just use i.health:
                        quietly logit diabetes c.health i.health
                        testparm i.health
                        
                        * Likelihood ratio test of whether continuous version alone is enough
                        // Model 1: Treat health as continuous
                        quietly logit diabetes c.health
                        estimates store m1
                        // Model 2: Add health as categorical  
                        quietly logit diabetes c.health i.health
                        estimates store m2
                        lrtest m1 m2
                        Output:
                        Code:
                        . // Variations on example in #8
                        . webuse nhanes2f, clear
                        
                        . * Wald test of whether continuous version alone is enough
                        . forvalues x = 2/5 {
                          2.   quietly logit diabetes c.health o(1 `x').health
                          3.   testparm i.health
                          4. }
                        
                         ( 1)  [diabetes]3.health = 0
                         ( 2)  [diabetes]4.health = 0
                         ( 3)  [diabetes]5.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]4.health = 0
                         ( 3)  [diabetes]5.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]3.health = 0
                         ( 3)  [diabetes]5.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]3.health = 0
                         ( 3)  [diabetes]4.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                        . // Result from -testparm- is the same in all cases.
                        . // But result is also the same if you just use i.health:
                        . quietly logit diabetes c.health i.health
                        
                        . testparm i.health
                        
                         ( 1)  [diabetes]2.health = 0
                         ( 2)  [diabetes]3.health = 0
                         ( 3)  [diabetes]4.health = 0
                        
                                   chi2(  3) =    1.56
                                 Prob > chi2 =    0.6689
                        
                        .
                        . * Likelihood ratio test of whether continuous version alone is enough
                        . // Model 1: Treat health as continuous
                        . quietly logit diabetes c.health
                        
                        . estimates store m1
                        
                        . // Model 2: Add health as categorical  
                        . quietly logit diabetes c.health i.health
                        
                        . estimates store m2
                        
                        . lrtest m1 m2
                        
                        Likelihood-ratio test
                        Assumption: m1 nested within m2
                        
                         LR chi2(3) =   1.60
                        Prob > chi2 = 0.6599
                        
                        .
                        end of do-file

                        EDIT: And if I had consulted Richard's notes (https://www3.nd.edu/~rwilliam/xsoc73...ndependent.pdf) before posting, I would have seen that they contain similar examples! 🙄

                        Last edited by Bruce Weaver; 23 May 2025, 13:58.
                        --
                        Bruce Weaver
                        Email: [email protected]
                        Version: Stata/MP 19.5 (Windows)



                        • #13
                          Too bad Bruce didn't review my paper either. ;-)

                          I think my approach makes it a little more obvious and intuitive what Stata is doing. But also notice a potential problem with the alternatives Bruce and Paul have tossed out, which are a little simpler than my original coding.

                          Code:
                          webuse nhanes2f, clear
                          logit diabetes c.health i.health, nolog
                          testparm i.health
                          logit diabetes i.health c.health, nolog
                          testparm i.health
                          Code:
                          . logit diabetes c.health i.health, nolog
                          note: 5.health omitted because of collinearity.
                          
                          Logistic regression                                     Number of obs = 10,335
                                                                                  LR chi2(4)    = 429.74
                                                                                  Prob > chi2   = 0.0000
                          Log likelihood = -1784.1984                             Pseudo R2     = 0.1075
                          
                          ------------------------------------------------------------------------------
                              diabetes | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                                health |  -.7791143    .056556   -13.78   0.000    -.8899619   -.6682666
                                       |
                                health |
                                 Fair  |   .0297756   .1207477     0.25   0.805    -.2068855    .2664367
                              Average  |  -.0089766   .1437693    -0.06   0.950    -.2907592     .272806
                                 Good  |  -.2166689   .2164641    -1.00   0.317    -.6409308     .207593
                            Excellent  |          0  (omitted)
                                       |
                                 _cons |  -.7024903   .1297495    -5.41   0.000    -.9567947   -.4481859
                          ------------------------------------------------------------------------------
                          
                          . testparm i.health
                          
                           ( 1)  [diabetes]2.health = 0
                           ( 2)  [diabetes]3.health = 0
                           ( 3)  [diabetes]4.health = 0
                          
                                     chi2(  3) =    1.56
                                   Prob > chi2 =    0.6689
                          
                          . logit diabetes i.health c.health, nolog
                          note: health omitted because of collinearity.
                          
                          Logistic regression                                     Number of obs = 10,335
                                                                                  LR chi2(4)    = 429.74
                                                                                  Prob > chi2   = 0.0000
                          Log likelihood = -1784.1984                             Pseudo R2     = 0.1075
                          
                          ------------------------------------------------------------------------------
                              diabetes | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                                health |
                                 Fair  |  -.7493387   .1262017    -5.94   0.000    -.9966895   -.5019878
                              Average  |  -1.567205   .1302544   -12.03   0.000    -1.822499   -1.311911
                                 Good  |  -2.554012   .1780615   -14.34   0.000    -2.903006   -2.205018
                            Excellent  |  -3.116457   .2262238   -13.78   0.000    -3.559848   -2.673067
                                       |
                                health |          0  (omitted)
                                 _cons |  -1.481605   .0953463   -15.54   0.000     -1.66848   -1.294729
                          ------------------------------------------------------------------------------
                          
                          . testparm i.health
                          
                           ( 1)  [diabetes]2.health = 0
                           ( 2)  [diabetes]3.health = 0
                           ( 3)  [diabetes]4.health = 0
                           ( 4)  [diabetes]5.health = 0
                          
                                     chi2(  4) =  368.90
                                   Prob > chi2 =    0.0000
                          All I did was reverse the positions of c.health and i.health. Yet the testparm results were radically different. Why? Because Stata has to omit something. In the first model it dropped a category of health, and in the second it dropped c.health.

                          The same thing happens if you use Paul's o2-only approach: reverse the health variables, and c.health is dropped instead of 1.health.
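                          A sketch of that reversal, for anyone who wants to verify (untested; the expectation in the comment follows from the description above):

                          Code:
                          webuse nhanes2f, clear
                          * Paul's o2-only approach with the terms reversed
                          quietly logit diabetes o2.health c.health, nolog
                          testparm i.health   // expect c.health, not 1.health, to be the dropped term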

                          But, if you use my original approach,

                          Code:
                          logit diabetes c.health o(1 2).health, nolog
                          testparm i.health
                          logit diabetes o(1 2).health c.health , nolog
                          testparm i.health
                          you get the same results regardless of whether the categorical or continuous version comes first. This is because I am explicitly controlling what gets omitted, rather than letting Stata make the call.

                          I suspect I wasn't that brilliant and foresighted when I wrote the paper, but if I wasn't, I think I lucked into the best and safest way to do it.

                          As I said at the beginning, this is a sort of esoteric example! But I suppose the moral is: if for some reason it really, really matters which categories or variables get omitted, you may want to use o. so you control the choice rather than leaving it up to Stata.



                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam



                          • #14
                            Another sidelight: as Bruce shows, in this case you could use LR tests instead. But if, say, the data were svyset, you would have to use the Wald test approach.
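                            A sketch of what that would look like (the design variables psuid, stratid, and finalwgt are my assumption for nhanes2f; untested):

                            Code:
                            webuse nhanes2f, clear
                            svyset psuid [pweight=finalwgt], strata(stratid)   // assumed design variables
                            quietly svy: logit diabetes c.health o(1 2).health
                            testparm i.health   // the Wald test still works; -lrtest- is not valid after -svy-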
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 19.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam



                            • #15
                              Roger all of that, Richard!
