
  • Convert categorical variable to dummy variables in a large dataset

    Dear everyone,

    I have a dataset of 350 categorical variables, e.g., "How satisfied are you with your life?" with answers from 1 (Unhappy) to 10 (Very happy).
    I want to run a regression on them, but my professor advised me to convert them into dummy variables first.
    However, I only know one command that converts a single categorical variable into dummy variables: tabulate varname, gen(newvarname). How can I convert such a large number of variables into dummies at once?
    I also tried a foreach loop, but it did not work.

    Could you please advise me?
    Thank you very much!

  • #2
    1) You don't tell us if there is a naming convention for your variables. You'll either need to type out the 350 variables or somehow find a pattern that you can use to identify the variables in a macro.
    2) Do you really want to create a new variable for each category of your 350 variables? Or do you want to create something like values 1 to 5=0 (Unhappy), and values 6-10=1 (Happy)?
    3) Do you want the coding to be the same for all variables (1-5=0 for all variables) (6-10=1 for all variables)?

    Stata/MP 14.1 (64-bit x86-64)
    Revision 19 May 2016
    Win 8.1



    • #3
      Carole J. Wilson: Thank you very much for your response!
      1. My variables are named V1, V2, ... V265.
      2. I think that I have to create a new variable for each category, because some variables are not ordinal (e.g., race). The happiness variable is the dependent variable.

      One more question: if I want to do feature selection with lassopack, do I need to convert these variables to dummies first, or can I leave them unchanged?
      Thank you very much!



      • #4
        In modern Stata some commands such as -regress- and other estimation commands accept what is called "factor notation" or "factor variables".

        This means that most of the time you do not need to manually create the dummies, but you just use factor variable notation, and Stata creates the dummies for you on the fly.

        Type "factor variables" in the help, or type -help fvvarlist-, and you will see how they are used.
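
        As a minimal sketch of what "on the fly" means here, the example below uses the auto dataset that ships with Stata (an assumption for illustration; it is not the poster's data). The i. prefix on rep78 should give the same coefficients as creating the indicators by hand and leaving one out as the base category. The stub name rep is arbitrary.

        ```stata
        * Example data shipped with Stata; stands in for the real dataset
        sysuse auto, clear

        * Factor notation: i.rep78 is expanded into indicators at estimation time,
        * with the lowest level (1) as the omitted base category
        regress price mpg i.rep78

        * Manual equivalent: create rep1-rep5, then leave out rep1 as the base
        tabulate rep78, generate(rep)
        regress price mpg rep2 rep3 rep4 rep5
        ```

        No new variables are stored in the dataset by the first regression, which is the main practical advantage when there are hundreds of categorical variables.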



        • #5
          As Joro mentions, using factor notation, your regression will look something like this (I list OLS regression, but it would work for multinomial logit as well):

          Code:
          regress happiness age sex i.income_level i.educ_level i.marital_status i.religion



          • #6
            Joro Kolev, David Benson: Thank you very much for your help!
            Before running the regression, I want to run lasso2 (from lassopack) to select the variables. I was advised to follow these steps:
            1. Convert the categorical variables to dummy variables.
            2. Standardize these dummy variables.
            3. Run lasso2.
            Can I also use factor notation for the standardization in step 2? Can I use a loop (foreach) to do the standardization? This is too complicated for me.
            Thank you very much!



            • #7
              I've never used lasso2 or lassopack, so I'm not going to be of much help. I also don't know what it means to standardize a dummy variable.

              Also, it's not clear that standardizing really helps anything. See posts here, here, here, and here.

              However, assuming you really need to do these things:
              Code:
              * Create indicator (dummy) variables for each category of a categorical variable
              tabulate income_level, gen(inc)
              
              * Loop to standardize variables.
              * You can put the variables in a local macro and loop over the macro (Option 1),
              * or list them directly in the loop (Option 2). Use one option, not both:
              * running both fails because the std_ variables already exist.
              
              * Option 1: loop over a local macro
              local my_vars "income_level educ_level marital_status religion"
              
              foreach v in `my_vars' {
                   egen std_`v' = std(`v')
              }
              
              * Option 2: loop over a varlist (equivalent to Option 1)
              * foreach var of varlist income_level educ_level marital_status religion {
              *      egen std_`var' = std(`var')
              * }
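
              The tabulate line above handles a single variable. Here is a minimal sketch of extending it to many variables and then standardizing the generated indicators, assuming variables named V1, V2, ... as described in #3. Toy data with V1-V3 stands in for the real file, and the V1_ style stubs and the z_ prefix are arbitrary choices for illustration.

              ```stata
              * Toy data: three categorical variables with values 1-4 (assumption, not the real data)
              clear
              set obs 100
              set seed 12345
              forvalues j = 1/3 {
                  generate V`j' = ceil(4 * runiform())
              }

              * One set of indicators per categorical variable: V1_1, V1_2, ..., V3_4
              foreach v of varlist V1-V3 {
                  tabulate `v', generate(`v'_)
              }

              * Standardize every generated indicator to mean 0 and sd 1
              foreach d of varlist V*_* {
                  egen z_`d' = std(`d')
              }
              ```

              With the real data the first loop would run over V1-V265 (excluding the dependent variable), which creates a very large number of new variables, so it is worth checking whether that is really needed before running it.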



              • #8
                Originally posted by Carole J. Wilson
                1) You don't tell us if there is a naming convention for your variables. You'll either need to type out the 350 variables or somehow find a pattern that you can use to identify the variables in a macro.
                2) Do you really want to create a new variable for each category of your 350 variables? Or do you want to create something like values 1 to 5=0 (Unhappy), and values 6-10=1 (Happy)?
                3) Do you want the coding to be the same for all variables (1-5=0 for all variables) (6-10=1 for all variables)?

                How would you go about doing 2)? I want to create a dummy where values 1 to 2 of my happy variable are coded 0 (unhappy) and values 3 to 4 are coded 1 (happy). I want to do a similar thing for my number-of-children variable, where a participant with up to 3 children is coded 0 (few children) and a participant with 4 or more children is coded 1 (many children).


                happy
                tabulation:  Freq.   Numeric  Label
                             6,807         1  More than usual
                            41,444         2  Same as usual
                             6,422         3  Less so
                               991         4  Much less


                child
                tabulation:  Freq.   Numeric  Label
                            37,862         0
                             7,419         1
                             7,514         2
                             2,413         3
                               357         4
                                70         5
                                26         6
                                 3         7


                Hope this makes sense, thanks
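
                A minimal sketch of the recoding asked about here, not a definitive answer. It assumes the cutoffs are 1-2 versus 3-4 for happy and 0-3 versus 4 or more for child. The names happy_bin and manychild are hypothetical, and the toy data only stands in for the real file. Under the labels shown, values 3 and 4 are "Less so" and "Much less", so it may be worth double-checking which end should be coded 1.

                ```stata
                * Toy data standing in for the real variables (assumption for illustration)
                clear
                input happy child
                1 0
                2 3
                3 4
                4 7
                . .
                end

                * 1-2 -> 0, 3-4 -> 1; the if qualifier keeps missing values missing
                generate byte happy_bin = inrange(happy, 3, 4) if !missing(happy)

                * 0-3 children -> 0, 4 or more -> 1; missing stays missing
                generate byte manychild = (child >= 4) if !missing(child)

                * Cross-check the new coding against the original values
                tabulate happy happy_bin, missing
                tabulate child manychild, missing
                ```

                The if !missing() qualifier matters because Stata treats missing values as larger than any number, so child >= 4 alone would code missing observations as 1.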

