Problem with factor variables syntax

Richard Williams

Join Date: Apr 2014

Posts: 5008
#16

20 Mar 2016, 07:12

Duplicating a variable may be fine if you only want the coefficients. But I am guessing it will cause you grief if you also want the marginal effects.
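
A minimal sketch of the kind of problem this could cause, assuming the duplicate is a copy of the factor variable (rep78copy is a made-up name, and the auto data are used only for illustration). margins knows only the variables named in the model, so it moves rep78 while the copy stays at its observed values:

Code:

sysuse auto, clear
* hypothetical duplicate of the factor variable
generate rep78copy = rep78
* all rep78 main effects plus a single interaction built from the copy
regress price i.rep78 c.mpg c.mpg#1.rep78copy
* margins sets rep78 to each level in turn but leaves rep78copy alone,
* so the interaction term does not follow the level being evaluated
margins rep78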

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam

Richard Williams

Join Date: Apr 2014
Posts: 5008

#17

20 Mar 2016, 07:54

Also, my factor variable has a lot of levels, so specifying a constraint for each level that should not be interacted is extremely cumbersome.

If I want to impose constraints, I find it very useful to first run the command with the coefl option. Then it is easy to name the parameters I want to constrain. You can just copy and paste the parameter names. For example,

Code:

webuse nhanes2f, clear
logit diabetes i.race weight i.race#c.weight, coefl nolog
constraint 1 _b[3.race#c.weight] = 0
logit diabetes i.race weight i.race#c.weight, constraints(1) nolog

Code:

. webuse nhanes2f, clear

. logit diabetes i.race weight i.race#c.weight, coefl nolog

Logistic regression                             Number of obs     =     10,335
                                                LR chi2(5)        =      65.47
                                                Prob > chi2       =     0.0000
Log likelihood = -1966.3317                     Pseudo R2         =     0.0164

-------------------------------------------------------------------------------
     diabetes |      Coef.  Legend
--------------+----------------------------------------------------------------
         race |
       Black  |   .0257155  _b[2.race]
       Other  |   .3318753  _b[3.race]
              |
       weight |   .0169948  _b[weight]
              |
race#c.weight |
       Black  |   .0064931  _b[2.race#c.weight]
       Other  |  -.0026229  _b[3.race#c.weight]
              |
        _cons |  -4.313413  _b[_cons]
-------------------------------------------------------------------------------

. constraint 1 _b[3.race#c.weight] = 0

. logit diabetes i.race weight i.race#c.weight, constraints(1) nolog

Logistic regression                             Number of obs     =     10,335
                                                Wald chi2(4)      =      73.49
Log likelihood = -1966.3384                     Prob > chi2       =     0.0000

 ( 1)  [diabetes]3.race#c.weight = 0
-------------------------------------------------------------------------------
     diabetes |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
         race |
       Black  |   .0219021   .5441304     0.04   0.968    -1.044574    1.088378
       Other  |    .157999   .3465437     0.46   0.648    -.5212141    .8372122
              |
       weight |   .0169442   .0030932     5.48   0.000     .0108816    .0230068
              |
race#c.weight |
       Black  |   .0065437   .0066256     0.99   0.323    -.0064423    .0195297
       Other  |          0  (omitted)
              |
        _cons |    -4.3096   .2389319   -18.04   0.000    -4.777898   -3.841302
-------------------------------------------------------------------------------

.
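
With many levels, the constraints can be generated in a loop rather than typed one by one. A minimal sketch, assuming level 1 is the base and level 2 is the one interaction to leave unconstrained (both are assumptions for illustration; the locals keep, i, and clist are made-up names):

Code:

webuse nhanes2f, clear
* collect the levels of the factor variable
levelsof race, local(levs)
* level whose interaction stays free (an assumption)
local keep 2
local i = 0
local clist
foreach l of local levs {
    * the base level (1) needs no constraint; skip the level we keep
    if `l' != `keep' & `l' != 1 {
        local ++i
        constraint `i' _b[`l'.race#c.weight] = 0
        local clist `clist' `i'
    }
}
logit diabetes i.race weight i.race#c.weight, constraints(`clist') nolog

race has only three levels here, so the loop defines a single constraint, but the same pattern should carry over to a factor variable with many levels.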


Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#18

22 Mar 2016, 13:40

Tech Support has informed me that a Stata developer will comment here shortly.
Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 700
#19

22 Mar 2016, 14:27

Level values of a factor variable are a property of said variable
throughout a varlist specification. This means that levels
specified on a factor variable in one term within a varlist will
propagate to the other terms containing that factor variable.

Let's start with a simple model with a factor variable used in a single
main-effects term.

In the following example, the i. operator on rep78
tells regress to treat rep78 as a factor variable
and to find all its levels in the estimation sample.

Code:

sysuse auto
regress price mpg i.rep78

Stata only searched for the levels of rep78 because no levels
were explicitly specified.

You are able to restrict which levels to use in a model by explicitly
specifying them. You can specify the levels as part of the i.
operator

Code:

regress price mpg i(1 3 5).rep78

or by spelling out each indicator variable explicitly

Code:

regress price mpg 1.rep78 3.rep78 5.rep78

By default, the lowest of the levels specified is used as the base level.
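
To see which level ended up as the base, the allbaselevels display option (abbreviated allbase in the examples further down) reports base levels in the coefficient table, and the ib#. operator sets the base explicitly. A small sketch on the same auto data, with level 3 chosen arbitrarily:

Code:

sysuse auto, clear
* show the base level in the coefficient table
regress price mpg i(1 3 5).rep78, allbaselevels
* use level 3 as the base instead of the default
regress price mpg ib3.rep78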

You can even specify all levels and pick which ones to "omit" by using
the o. operator. Again, this can be done by specifying the
levels as part of the o. operator

Code:

regress price mpg i(1 3 5)o(2 4).rep78

or by spelling out each indicator variable

Code:

regress price mpg 1.rep78 2o.rep78 3.rep78 4o.rep78 5.rep78

The challenge here is to understand what it means when a factor variable
participates in more than one term within a varlist.

Remember, level values of a factor variable are a property of said
variable throughout a varlist specification.

Here is Ariel's original test case:

Code:

regress price c.mpg##1.rep78 i.rep78

The first regressor term is

Code:

c.mpg##1.rep78

which expands to

Code:

mpg 1.rep78 c.mpg#1.rep78

Since a level for rep78 was specified, the next regressor term

Code:

i.rep78

expands to

Code:

1.rep78

Duplicate elements of factor variable terms reduce down, so Ariel's
original test case translates into

Code:

regress price mpg 1.rep78 c.mpg#1.rep78

This is not what Ariel wanted. Joseph Coveney then pointed out that the
following did not do what was expected either.

Code:

regress price 2b.rep78 3.rep78 4.rep78 5.rep78 c.mpg##1.rep78

This translates to

Code:

regress price 1.rep78           ///
        2b.rep78                ///
        3.rep78                 ///
        4.rep78                 ///
        5.rep78                 ///
        c.mpg##1.rep78          ///
        c.mpg##2b.rep78         ///
        c.mpg##3.rep78          ///
        c.mpg##4.rep78          ///
        c.mpg##5.rep78
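
To check how Stata expands a varlist like the ones above before relying on it, one option is fvexpand, which expands factor-variable terms without fitting a model; another is the coeflegend option, which lists the names of the coefficients actually estimated. This is only a sketch: whether fvexpand collapses duplicate terms exactly as the estimation command does is an assumption to confirm on your own data.

Code:

sysuse auto, clear
* expand the factor-variable terms without fitting a model
fvexpand c.mpg##1.rep78 i.rep78
display "`r(varlist)'"
* or fit the model and list the coefficient names
regress price c.mpg##1.rep78 i.rep78, coeflegend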

Based on the discussion, it appears that Ariel wants the following:

Code:

regress price mpg               ///
        1.rep78                 ///
        2b.rep78                ///
        3.rep78                 ///
        4.rep78                 ///
        5.rep78                 ///
        c.mpg#1.rep78           ///
        co.mpg#2o.rep78         ///
        co.mpg#3o.rep78         ///
        co.mpg#4o.rep78         ///
        co.mpg#5o.rep78         ///
        , allbase

A shorter syntax for this is

Code:

regress price co.mpg##b(2)o(2/5).rep78, allbase

This can be generalized with a few lines of Stata code:

Code:

local case 1
levelsof rep78, local(levs)
local levs : list levs - case
gettoken base : levs
regress price co.mpg##i(`case')b(`base')o(`levs').rep78, allbase
Richard Williams

Join Date: Apr 2014

Posts: 5008
#20

22 Mar 2016, 14:57

Thanks Jeff. This should definitely be in a FAQ or in the manual.

Still, I wonder why it works this way. You say

Remember, level values of a factor variable are a property of said variable throughout a varlist specification.

Why? If I say

Code:

regress price i.rep78 c.mpg c.mpg#1.rep78

why can't I just get the one interaction term I want along with all the terms for rep78?

The code you came up with works, but it is far from intuitive, at least to me. Is there some reason that the code I prefer could actually cause serious problems, at least in some situations?
