Do I still have to add 'i.' in front of variables that I want to use as dummies, but already encoded in 0 and 1 prior to import?

Park Hwanggyu

Join Date: Dec 2022

Posts: 16
#1


12 Dec 2022, 02:10

Hello, I am new to Stata and have been struggling to find information on this.

I want to fit an ordered logit model. Before importing the dataset, I gathered survey results and recoded the values of all the independent variables (x1 to x5) from strings into the integers 0 and 1 (only one variable has three levels: 0, 1, and 2).
These variables are not ordinal, just categorical.
I also have a dependent variable (y) coded from 1 to 7, which is ordinal.

In this case, do I still have to write something like 'i.x1'? As far as I have learned, that prefix creates dummy variables automatically from variables in non-numeric form... Am I right?


Carlo Lazzaro

Join Date: Apr 2014
Posts: 17704

12 Dec 2022, 07:39

Park:
you can follow either approach, as the following toy example shows:

Code:

. set obs 10
Number of observations (_N) was 2, now 10.

. g A=runiform()

. g B=0 in 1/5
(5 missing values generated)

. replace B=1 if B==.
(5 real changes made)

. reg A B

      Source |       SS           df       MS      Number of obs   =        10
-------------+----------------------------------   F(1, 8)         =      0.28
       Model |  .027723204         1  .027723204   Prob > F        =    0.6092
    Residual |  .783901264         8  .097987658   R-squared       =    0.0342
-------------+----------------------------------   Adj R-squared   =   -0.0866
       Total |  .811624468         9  .090180496   Root MSE        =    .31303

------------------------------------------------------------------------------
           A | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           B |   .1053057   .1979774     0.53   0.609    -.3512311    .5618424
       _cons |   .3299788   .1399912     2.36   0.046     .0071585     .652799
------------------------------------------------------------------------------

. reg A i.B

      Source |       SS           df       MS      Number of obs   =        10
-------------+----------------------------------   F(1, 8)         =      0.28
       Model |  .027723204         1  .027723204   Prob > F        =    0.6092
    Residual |  .783901264         8  .097987658   R-squared       =    0.0342
-------------+----------------------------------   Adj R-squared   =   -0.0866
       Total |  .811624468         9  .090180496   Root MSE        =    .31303

------------------------------------------------------------------------------
           A | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         1.B |   .1053057   .1979774     0.53   0.609    -.3512311    .5618424
       _cons |   .3299788   .1399912     2.36   0.046     .0071585     .652799
------------------------------------------------------------------------------

.

That said, my preference goes to -fvvarlist- notation, even when it is not strictly necessary (as in the toy example above), just to avoid losing familiarity with this wonderful feature.

Kind regards,
Carlo
(Stata 19.0)
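
Because the question in #1 concerns an ordered logit rather than a linear regression, the same comparison can be sketched with -ologit-. This is only an illustration: it assumes the fullauto example dataset used in Stata's documentation, not Park's survey data, with foreign standing in for a 0/1 predictor and rep77 for an ordinal outcome.

```stata
* Illustrative only: fullauto is a Stata example dataset, not the poster's data
webuse fullauto, clear

* foreign is coded 0/1 and entered as a plain numeric variable
ologit rep77 foreign length mpg
estimates store plain

* the same 0/1 variable entered with factor-variable notation
ologit rep77 i.foreign length mpg
estimates store factor

* the rows are labelled foreign and 1.foreign, but the estimates should match
estimates table plain factor, se
```

For a 0/1 predictor the two specifications should report the same coefficient and standard error; only the row label changes.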


Andrew Musau

Join Date: Oct 2014
Posts: 10188

12 Dec 2022, 09:16

Also, if you intend to use -margins-, you need factor-variable notation to indicate categorical variables. If your reluctance stems from the extra typing, note that the i. operator can be applied to several variables at once (see the following syntax example):

Code:

webuse lbw, clear
probit low age i.(race smoke ht ui)

Results:

Code:

. probit low age i.(race smoke ht ui)

Iteration 0:   log likelihood =   -117.336  
Iteration 1:   log likelihood = -105.05402  
Iteration 2:   log likelihood =  -104.9827  
Iteration 3:   log likelihood = -104.98269  

Probit regression                               Number of obs     =        189
                                                LR chi2(6)        =      24.71
                                                Prob > chi2       =     0.0004
Log likelihood = -104.98269                     Pseudo R2         =     0.1053

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0196885   .0206365    -0.95   0.340    -.0601353    .0207582
             |
        race |
      black  |    .595869   .3050848     1.95   0.051    -.0020863    1.193824
      other  |   .6256342   .2455134     2.55   0.011     .1444369    1.106832
             |
       smoke |
     smoker  |   .6572914   .2257025     2.91   0.004     .2149225     1.09966
        1.ht |   .8292397   .3888207     2.13   0.033     .0671651    1.591314
        1.ui |   .5950939   .2692187     2.21   0.027      .067435    1.122753
       _cons |  -.7957992   .5371271    -1.48   0.138    -1.848549    .2569506
------------------------------------------------------------------------------
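
As a hedged follow-up to the point about -margins-, the sketch below continues with the same lbw example. The choice of smoke as the variable of interest is an assumption made for illustration.

```stata
* Same example dataset and model as above
webuse lbw, clear
probit low age i.(race smoke ht ui)

* average predicted probability of low birthweight at each level of smoke
margins smoke

* average marginal effect of smoking, computed as a discrete change from 0 to 1
margins, dydx(smoke)
```

If smoke had been entered without the i. operator, -margins smoke- would be expected to fail, because -margins- would not know that smoke is categorical.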


Park Hwanggyu

Join Date: Dec 2022

Posts: 16
#4

17 Dec 2022, 10:38

Carlo Lazzaro Andrew Musau Thank you for your answers. The reason I am reluctant to use the 'i.' operator is that it produces the message: omitted because of collinearity.
This message does not appear when I leave it out... But yes, I want to get the margins, so the operator is necessary.
I posted about this in detail at the link below, and I would very much appreciate your help.

https://www.statalist.org/forums/for...f-collinearity
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#5

17 Dec 2022, 10:58

For the variables that are only 0/1, the regression itself will give identical results whether you use the i. operator in the command or not. Nor will there be any difference in terms of what, if anything, gets omitted due to collinearity.

But for the variable coded 0/1/2, it makes a big difference, and you need to decide which way is appropriate. If you leave out the i., Stata will treat that variable (accident_rsomething) as a continuous variable. This may or may not be appropriate: it is a modeling issue. Suffice it to say that i.accident_rate and accident_rate are not equivalent and will give different regression results here.

It is also possible that by treating it as a discrete variable (i.e. using the i. operator) you may introduce a collinearity with some other variable(s). But if the variable is best modeled as discrete, then avoiding that collinearity would not be a valid reason to treat it as continuous. Instead you would need to identify where the collinearity is coming from and figure out some other way to break it.

There isn't enough information in your post to give more specific advice on how one might go about that. To do that would require usable sample data (i.e. use the -dataex- command), and the exact command and results you are getting. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it.

-dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
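
The difference between the two treatments of a three-level variable can be sketched as follows. This is an illustration only: it assumes the lbw example dataset used earlier in the thread, with race (coded 1/2/3) standing in for the poster's 0/1/2 variable.

```stata
* Illustrative only: race (1/2/3) stands in for the poster's 0/1/2 variable
webuse lbw, clear

* without i., race enters as a single linear term
probit low race
estimates store linear

* with i., race enters as two indicators relative to the base level
probit low i.race
estimates store categorical

* compare the two sets of estimates
estimates table linear categorical, se

* likelihood-ratio test of the linear restriction (linear is nested in categorical)
lrtest linear categorical
```

For a purely nominal variable the linear term has no meaningful interpretation, so the test is shown only to make the point that the two specifications are different models.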
Park Hwanggyu

Join Date: Dec 2022

Posts: 16
#6

17 Dec 2022, 11:07

Clyde Schechter Thank you. I added more information on here :
https://www.statalist.org/forums/for...f-collinearity
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#7

17 Dec 2022, 11:22

I'm sorry, but the link in #6 seems to be broken.
Park Hwanggyu

Join Date: Dec 2022

Posts: 16
#8

17 Dec 2022, 11:33

Clyde Schechter https://www.statalist.org/forums/for...f-collinearity Please click this link, or search for the question titled "Reason of error occurrence : omitted because of collinearity?"

And I agree with your view that accident rate (which I renamed x5 there) is the cause. When I changed the command to

'oprobit choice accident_rate i.(fare travel_time automation inner_noise)',
it didn't show the message.

But the problem remains: I coded this variable to use it as categorical, yet it is treated as continuous in this specification... Or does it not matter, because I already coded it in the discrete form 0, 1, 2?

Thank you for your kind answer.

Last edited by Park Hwanggyu; 17 Dec 2022, 11:40.
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#9

17 Dec 2022, 12:04

OK, I see the -oprobit- command and outcomes there. From that I can infer that the collinearity must arise with some or all of x1, x2, x3, or x4, as there are no other variables involved.

"But the problem remains: I coded this variable to use it as categorical, yet it is treated as continuous in this specification... Or does it not matter, because I already coded it in the discrete form 0, 1, 2?"

A variable named accident_rate, assuming it really is a numerical rate of occurrence of accident events, is usually best treated as a continuous variable. The question in my mind is why you wanted to discretize it into a 0/1/2 variable. What do the 0, 1, and 2 categories represent? Why did you want to do that? It is probably wrong to treat the discretized version as if it were continuous, but it is probably also wrong to discretize the variable in the first place. If you explain what you were trying to accomplish by doing that, we might be able to resolve the modeling question better.
Park Hwanggyu

Join Date: Dec 2022

Posts: 16
#10

18 Dec 2022, 06:06

Clyde Schechter "Rate" is not literally a rate here. It is just the name of the variable, which has 3 levels that I presented to respondents at the survey stage (of course, each level does correspond to a different accident rate).
Were you asking whether I myself discretized data that were originally continuous? If so, no. And since my intention from the start was to use it as categorical, I thought it should be treated that way in the software, too.
Thank you.

Last edited by Park Hwanggyu; 18 Dec 2022, 06:10.
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#11

18 Dec 2022, 11:23

OK, so it sounds like it really is a discrete variable and should be treated as such. So the next step is to figure out where the collinearity is coming from. For that you will have to show example data. The screenshot in #1 is not usable for the purpose (there is no way to import data from a screenshot into Stata) and, even if it were, there aren't enough observations to see what is going on.

The helpful way to show example data is with the -dataex- command. Please post back using it. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it.

-dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
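
One way such a collinearity can arise, and how it might be traced and then shared, is sketched below. Everything here is hypothetical: the data are simulated, the variable names y, x1, x2, and x5 are placeholders, and the overlap between x2 and level 2 of x5 is built in on purpose. It is not a diagnosis of the poster's actual data.

```stata
* Simulated data, NOT the poster's survey: x2 is built to coincide with level 2 of x5
clear
set seed 12345
set obs 90
generate x1 = mod(_n, 2)
generate x5 = mod(_n, 3)
generate x2 = (x5 == 2)
generate y  = ceil(7*runiform())

* one indicator should be reported as omitted because of collinearity
oprobit y i.x1 i.x2 i.x5

* treating x5 as continuous hides the overlap rather than resolving it
oprobit y i.x1 i.x2 x5

* the cross-tabulation reveals the overlap: x2 == 1 only when x5 == 2
tabulate x5 x2

* post a reproducible extract of the data on the forum
dataex y x1 x2 x5 in 1/20
```

In real survey data the overlap may involve combinations of several variables, so cross-tabulating each pair of predictors is only a starting point.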