  • Dummy Variables Vs Factor Variables

    Hi,

    I am a bit confused about the difference between dummy and factor variables, and whether they are the same thing. For example, is generating a dummy variable with 'gen dummy=0' followed by 'replace dummy=1 if var1<3' the same as keeping all the categories of a variable and just putting the prefix i. in front of it in the regression? Similarly, is this also equivalent to running 'tabulate dummy, gen(m)'?

  • #2
    I think it is the same.
    However, generating the dummy variables yourself ('gen dummy=0' and then 'replace dummy=1 if var1<3') is the more flexible solution,
    whereas "xi: reg i.VAR" and "tabulate VAR, gen(IVAR)" will create a dummy for each value of VAR.
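
To see what "a dummy for each value of VAR" means in practice, here is a minimal sketch. It assumes Stata's bundled auto dataset, with rep78 standing in for VAR and rep as an arbitrary stub name; neither comes from the posts above.

```stata
* Sketch only: rep78 stands in for VAR, and the stub "rep" is arbitrary.
sysuse auto, clear
local k0 = c(k)                      // number of variables before

* tabulate, generate() adds one permanent 0/1 variable per observed level
tabulate rep78, generate(rep)
describe rep1-rep5

* i. builds the indicators on the fly; nothing is added to the dataset
regress mpg i.rep78
display "variables added so far: " c(k) - `k0'
```

Only the tabulate step changes the dataset; the variable count is the same before and after the regress call.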

    • #3
      Besides convenience, there are many advantages to using factor variables rather than computing variables yourself. For some highlights, see

      http://www3.nd.edu/~rwilliam/stats/Margins01.pdf

      For more on what you can do with factor variables, type -help fvvarlist-. Or better yet, just read section 11.4.3 of the User's Guide.

      Incidentally, Christiana's code would cause dummy to get coded 0 if var1 was missing, which might or might not be what she wants. Further, her code basically collapses var1 into two groups; factor variable coding would create dummies for the different integer values of var1.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      WWW: https://www3.nd.edu/~rwilliam
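
The missing-value point in #3 can be checked directly. This is a sketch under the assumption that rep78 from Stata's bundled auto dataset stands in for var1 (it is missing for 5 of the 74 cars); dummy2 is a hypothetical name for a missing-safe version.

```stata
* Sketch only: rep78 plays the role of var1 from the question.
sysuse auto, clear

* The code from the question: missing values count as larger than any
* number, so rep78 < 3 is false for them and they stay in the 0 group
gen byte dummy = 0
replace dummy = 1 if rep78 < 3
tabulate dummy if missing(rep78)

* One way to keep missing values missing (dummy2 is a hypothetical name)
gen byte dummy2 = rep78 < 3 if !missing(rep78)
tabulate dummy dummy2, missing
```

The two versions agree wherever rep78 is observed and differ only in the missing cases.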


      • #4
        A dummy (indicator) variable we can define as having values 0 and 1 and at some point you need to create that variable by entering data or using generate. Stata commands don't know in advance that any such variable is an indicator variable; there is no flag or tag or Stata piece of information, other than the values themselves, indicating that status.

        The idea of a factor variable is that you flag to Stata in a command that a given variable is to be treated as one or more indicator variables on the fly.

        Thus with

        Code:

        sysuse auto
        regress mpg foreign i.rep78

        rep78 is flagged (by using i.) as a factor variable and it will in this example be treated as defined by four indicator variables. Precisely how that is done is tunable with further syntax. When the modelling is done, those indicator variables don't survive as permanent additions to the dataset. (You can, separately from this procedure, create indicator variables from that categorical variable, but that is different.)

        In this example,

        Code:

        regress mpg i.foreign i.rep78

        would be entirely legal but no different in effect.

        So, an indicator variable could be flagged as a factor variable, with in this example no different effect. A multicategory variable could be flagged as a factor, and it would be treated as a bundle of indicator variables for a modelling purpose.

        The ideas of factor variable and indicator variable are thus on different levels, and only coincide insofar as a single indicator variable may be tagged as a factor variable.
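
Both claims in #4 (same estimates with or without i. on a 0/1 variable, and no indicator variables left behind) can be verified with a short sketch. The scalar names b_plain and b_fv are hypothetical, chosen for illustration.

```stata
* Sketch only: compares the two regress commands from the post.
sysuse auto, clear
local k0 = c(k)                      // number of variables before

regress mpg foreign i.rep78
scalar b_plain = _b[foreign]         // coefficient on the plain 0/1 variable

regress mpg i.foreign i.rep78
scalar b_fv = _b[1.foreign]          // same coefficient, factor-variable name

display scalar(b_plain) "  " scalar(b_fv)
display "variables added: " c(k) - `k0'
```

The only visible difference is how the coefficient is named: foreign in the first model, 1.foreign in the second.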









        • #5
          You need to be careful about missing data when you generate your own dummies. In the example you cite, dummy will = 0 if var1 >= 3, which means that dummy will also = 0 if var1 = . because Stata treats missing values as larger than any number. Factor variables generate a set of dummy variables, but missing data are properly taken into account. Generating dummies via the tabulate command also handles missing data correctly, in that cases which are missing on the tabulated variable get missing values on the whole set of dummies. It may appear that creating your own dummies is more flexible because you can code them to reflect contrasts of interest, but the "i." notation allows that as well. For example:

          Code:

          sysuse auto
          reg mpg weight ib5.rep78

          results in the fifth category of rep78 being used as the reference category rather than the first which is the default. The moral is that creating your own dummy variables may be useful in some situations, but you need to be careful.
          Richard T. Campbell
          Emeritus Professor of Biostatistics and Sociology
          University of Illinois at Chicago
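
The base level in #5 can be chosen in other ways too. The sketch below uses the ib(last) and ib(freq) operators documented under -help fvvarlist-; the scalar names b5 and blast are hypothetical.

```stata
* Sketch only: alternative ways to pick the reference category.
sysuse auto, clear

regress mpg weight ib5.rep78         // base = level 5, as in the post
scalar b5 = _b[1.rep78]

regress mpg weight ib(last).rep78    // base = highest level, here also 5
scalar blast = _b[1.rep78]

regress mpg weight ib(freq).rep78    // base = most frequent level
```

Because 5 is the highest level of rep78, the first two commands fit the same parameterization.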


          • #6
            Nick's two regress commands will produce the same results. However, the difference between the two commands is very important when you use post-estimation commands like margins. For example, try running

            margins foreign

            after running each regress command. After the first, you will get an error; after the second, it will run fine. These sorts of things become even more important as the model gets a little more complicated, e.g. when the independent variable has more than 2 categories, or you have interaction terms, or have squared terms. If you never plan to run a post-estimation command it may not matter if you use factor variables or generate the terms yourself, but the use or non-use of factor variables can make a big difference in the accuracy of post-estimation results.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
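
The margins comparison in #6 can be run as a short sketch. The local macro rc1 is a hypothetical name, and the last model is an added illustration of the squared-term case, not something taken from the thread.

```stata
* Sketch only: margins after each of the two regress commands.
sysuse auto, clear

regress mpg foreign i.rep78
capture noisily margins foreign      // error: foreign was not entered with i.
local rc1 = _rc
display "return code after first model: `rc1'"

regress mpg i.foreign i.rep78
margins foreign                      // adjusted predictions for each level

* Squared term written in factor-variable notation, so margins knows that
* weight enters the model twice
regress mpg i.foreign c.weight##c.weight
margins, dydx(weight)
```

Had the squared term been generated by hand as a separate variable, margins would have no way of knowing that it moves together with weight.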

            • #7
              Incidentally, it may seem silly that you should use the i. notation with a variable that is already coded 0/1. However, as Nick pointed out once, you don't know whether a variable coded 0/1 really does have only 2 values or whether those just happened to be the only two values that were observed in the sample.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology

              • #8
                Many thanks to everyone who replied. I now clearly understand the distinction, and I will be using the factor variable approach for all the reasons you mentioned.


                • #9
                  Hello! Can you please explain this comment? Specifically, I am wondering whether it is better to use or to not use factor variables to ensure accuracy of post-estimation results?

                  Originally posted by Richard Williams:
                  Nick's two regress commands will produce the same results. However, the difference between the two commands is very important when you use post-estimation commands like margins. [...] If you never plan to run a post-estimation command it may not matter if you use factor variables or generate the terms yourself, but the use or non-use of factor variables can make a big difference in the accuracy of post-estimation results.
                  Thank you!

                  (PS: I posted earlier about the problem that I am currently personally dealing with related to these issues: https://www.statalist.org/forums/for...es?view=stream)
