Dummy Variables Vs Factor Variables

christiana

Join Date: Jun 2014

Posts: 46
#1

30 Jun 2014, 11:24

Hi,

I am a bit confused about the difference between dummy variables and factor variables, and whether they are the same thing. For example, is generating a dummy variable with 'gen dummy=0' followed by 'replace dummy=1 if var1<3' the same as keeping all the categories of a variable and just putting the prefix i. in front of it in the regression? Similarly, is this also equivalent to doing 'tabulate dummy, gen(m)'?
FLuca

Join Date: May 2014

Posts: 35
#2

30 Jun 2014, 12:12

I think it is the same.
However, generating the dummy variables yourself ('gen dummy=0' and then 'replace dummy=1 if var1<3') is a more flexible solution.
By contrast, "xi: reg i.VAR" and "tabulate VAR, gen(IVAR)" will create a dummy for each value of VAR.
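
To see what each approach actually creates, here is a minimal sketch using the auto dataset that ships with Stata. The names low_rep and rep_ are made up for illustration, and rep78 stands in for var1.

```stata
sysuse auto, clear

* One hand-made indicator: collapses rep78 into "below 3" versus "3 or above".
* The if qualifier leaves low_rep missing wherever rep78 is missing.
generate byte low_rep = rep78 < 3 if !missing(rep78)

* One permanent indicator per observed category of rep78 (rep_1 ... rep_5).
tabulate rep78, generate(rep_)

* Factor-variable notation: the indicators exist only while the command runs.
regress mpg i.rep78
```

The first approach gives a single two-group contrast, while the other two keep every category of rep78, so the three are not interchangeable in general.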
Richard Williams

Join Date: Apr 2014

Posts: 4987
#3

30 Jun 2014, 12:33

Besides convenience, there are many advantages to using factor variables rather than computing variables yourself. For some highlights, see

http://www3.nd.edu/~rwilliam/stats/Margins01.pdf

For more on what you can do with factor variables, type -help fvvarlist-. Or better yet, just read section 11.4.3 of the User's Guide.

Incidentally, Christiana's code would cause dummy to get coded 0 if var1 was missing, which might or might not be what she wants. Further, her code basically collapses var1; factor variable coding would create dummies for the different integer values of var1.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

30 Jun 2014, 12:38

A dummy (indicator) variable can be defined as one with values 0 and 1; at some point you need to create that variable by entering data or using generate. Stata commands don't know in advance that any such variable is an indicator variable; there is no flag, tag, or other piece of information in Stata, beyond the values themselves, indicating that status.

The idea of a factor variable is that you flag to Stata in a command that a given variable is to be treated as one or more indicator variables on the fly.

Thus with

Code:

sysuse auto
regress mpg foreign i.rep78

rep78 is flagged (by using i.) as a factor variable and it will in this example be treated as defined by four indicator variables. Precisely how that is done is tunable with further syntax. When the modelling is done, those indicator variables don't survive as permanent additions to the dataset. (You can, separately from this procedure, create indicator variables from that categorical variable, but that is different.)

In this example,

Code:

regress mpg i.foreign i.rep78

would be entirely legal but no different in effect.

So, an indicator variable could be flagged as a factor variable, with in this example no different effect. A multicategory variable could be flagged as a factor, and it would be treated as a bundle of indicator variables for a modelling purpose.

The ideas of factor variable and indicator variable are thus on different levels, and only coincide insofar as a single indicator variable may be tagged as a factor variable.
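
The point that the indicators do not survive as additions to the dataset can be checked directly. The following is a minimal sketch, assuming it is run as a do-file so that the local macro k_before (a name chosen here for illustration) persists between lines.

```stata
sysuse auto, clear

* Record the number of variables before estimation.
local k_before = c(k)

* rep78 is expanded into indicators on the fly, with one level as the base.
regress mpg foreign i.rep78

* The dataset has gained no variables from the factor-variable expansion.
assert c(k) == `k_before'

* Same model, with the base level of rep78 shown explicitly in the output.
regress mpg foreign i.rep78, baselevels
```

If permanent indicator variables are wanted, they have to be created separately, for example with tabulate and its generate() option.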
Dick Campbell

Join Date: Apr 2014

Posts: 279
#5

30 Jun 2014, 12:49

You need to be careful about missing data when you generate your own dummies. In the example you cite, dummy will = 0 if var1 >= 3 which means that dummy will = 0 if var1 = . Factor variables generate a set of dummy variables but missing data are properly taken into account. Generating dummies via the tabulate command also handles missing data correctly in that cases which are missing on the tabulated variable generate missing data codes on the set of dummies. It may appear that creating your own dummies is more flexible because you can code them to reflect contrasts of interest, but the "i" notation allows that as well. For example:

Code:

sysuse auto
reg mpg weight ib5.rep78

results in the fifth category of rep78 being used as the reference category rather than the first which is the default. The moral is that creating your own dummy variables may be useful in some situations, but you need to be careful.
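
The missing-data point can be illustrated with a short sketch on the auto dataset, where rep78 has some missing values. Here rep78 stands in for var1, and dummy2 is a name invented for this example.

```stata
sysuse auto, clear

* How many observations have rep78 missing?
count if missing(rep78)

* Two-step coding from the original question: missing rep78 ends up as 0,
* because rep78 < 3 is false when rep78 is missing.
generate dummy = 0
replace dummy = 1 if rep78 < 3
tabulate dummy if missing(rep78)

* One-step coding that keeps missing values missing.
generate byte dummy2 = rep78 < 3 if !missing(rep78)
tabulate dummy2, missing
```

With dummy2, observations that are missing on rep78 drop out of a regression instead of being silently counted in the 0 group.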

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Richard Williams

Join Date: Apr 2014

Posts: 4987
#6

30 Jun 2014, 12:50

Nick's two regress commands will produce the same results. However, the difference between the two commands is very important when you use post-estimation commands like margins. For example, try running

margins foreign

after running each regress command. After the first, you will get an error; after the second, it will run fine. These sorts of things become even more important as the model gets a little more complicated, e.g. when the independent variable has more than 2 categories, or when you have interaction terms or squared terms. If you never plan to run a post-estimation command, it may not matter whether you use factor variables or generate the terms yourself, but the use or non-use of factor variables can make a big difference in the accuracy of post-estimation results.
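
A sketch of the comparison, using the auto dataset, is given below. The squared-term part is an added illustration of the same idea, and weight2 is a name made up for it.

```stata
sysuse auto, clear

* foreign entered as a plain variable: margins cannot treat it as categorical.
regress mpg foreign i.rep78
capture noisily margins foreign
display "return code: " _rc    // nonzero, since foreign was not flagged with i.

* foreign flagged with i.: same coefficients, and margins now works.
regress mpg i.foreign i.rep78
margins foreign

* Squared term built by hand: margins sees weight and weight2 as unrelated,
* so the marginal effect ignores the squared term.
generate weight2 = weight^2
regress mpg weight weight2
margins, dydx(weight)

* Squared term via factor-variable notation: margins accounts for both terms.
regress mpg c.weight##c.weight
margins, dydx(weight)
```

The two quadratic models have identical coefficients, but only the second tells margins that the squared term changes whenever weight changes.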

Richard Williams

Join Date: Apr 2014

Posts: 4987
#7

30 Jun 2014, 13:14

Incidentally, it may seem silly that you should use the i. notation with a variable that is already coded 0/1. However, as Nick pointed out once, you don't know whether a variable coded 0/1 really does have only 2 values or whether those just happened to be the only two values that were observed in the sample.

christiana

Join Date: Jun 2014

Posts: 46
#8

30 Jun 2014, 14:34

Many thanks to everyone who replied. I now clearly understand the distinction, and I will be using the factor variable approach for all the reasons you mentioned.
Guest
#9

29 Jun 2017, 21:50

Hello! Can you please explain this comment? Specifically, I am wondering whether it is better to use or to not use factor variables to ensure accuracy of post-estimation results?

Originally posted by Richard Williams:

Nick's two regress commands will produce the same results. However, the difference between the two commands is very important when you use post-estimation commands like margins. [...] If you never plan to run a post-estimation command it may not matter if you use factor variables or generate the terms yourself, but the use or non-use of factor variables can make a big difference in the accuracy of post-estimation results.

Thank you!

(PS: I posted earlier about the problem that I am currently personally dealing with related to these issues: https://www.statalist.org/forums/for...es?view=stream)