Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a new categorical variable from multiple binary variables

    Hello,

    I could use some help combining 9 variables into one new variable.

    So, I have 9 variables vprob_financial, vprob_familial.... coded no=0 and yes = 1.

    I am hoping to create a new categorical variable that will combine all these binary variables, so I can get a proportion for 0=no and 1=financial 2=familial 3=marital...ect.

    Things I have tried so far with no success:

    I recoded the original variables vrpob_financial is now e2financial no=0 yes=1, vprob_social is now no=0 and yes=2 ect so that all the "YES" values have a different numeric coding.

    Then I tried to gen a new variable

    gen prob = 0
    foreach var of varlist e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other {
    replace prob=1 if `var'==1
    replace prob=2 if `var'==2
    replace prob=3 if `var'==3
    replace prob=4 if `var'==4
    replace prob=5 if `var'==5
    replace prob=6 if `var'==6
    replace prob=7 if `var'==7
    replace prob=8 if `var'==8
    replace prob=9 if `var'==9
    }

    But this leads to a lot of missing values.

    -----

    So I tried stacking the data instead

    stack e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other, into(e2) clear wide

    Which achieves what I need, but while it retains the e2varaibles (clear wide), it clears all the variables I don't want to stack including demographic data and other varaibles.

    So how do I either create a new variable that combines all these binary variables to only give me a categorical variable with 10 levels (0=no 1=financial....) that I can ultimately use in a linear regression.

    Alternatively, how do I use the stack command and make sure none of my other variables get deleted?

    Thanks for the help in advance!

  • #2
    Best to give us an explicit data example! See FAQ Advice #12.;

    It sounds as if you want something more like

    Code:
    gen prob = 0
    
    tokenize "e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other" 
    
    forval j = 1/9 
        replace prob = `j' if ``j''== `j'
    }

    Comment


    • #3
      If the variables are mutually exclusive, i.e. if the respondents were only allowed to check one single option, this simple command should give you the same result:
      Code:
      egen prob = rowtotal(e2*)

      Comment


      • #4
        I've attached a sample of the original data and the data after creating the new categories.

        And, the variables are not mutually exclusive.

        Also, I have never seen the tokenize command so I just looked it up and it says it for string variables. The variables are numerically coded, for that's how I recoded them.

        Here is a sample of code I used to generate the new variables with numeric codes that are not overlapping.

        // Financial

        gen e2financial=vprob_financial
        recode e2financial 0=0 1=1
        label define vprob_financiallbl 0 "No" 1 "Financial"
        label value e2financial vprob_financiallbl
        tab e2financial, nolabel

        // Social

        gen e2social=vprob_social
        recode e2social 0=0 1=2
        label define vprob_sociallbl 0 "No" 2 "Social"
        label value e2social vprob_sociallbl
        tab e2social, nolabel



        Attached Files

        Comment


        • #5
          tokenize is for splitting a string into strings; contrary to your inference, or perhaps it's an assertion, splitting a string containing names of numeric variables is an entirely valid operation.

          If your variables are not always mutually exclusive, then you need rules for exactly how to combine those variables and I can't see that you have said anything on that score. Giving us an example in which at most one variable is positive doesn't help there. Conversely, the technique already given is entirely adequate for that special case.

          Please do read FAQ Advice #12 before going any further. Not only does it explain how to give a data example, it explains why image attachments are not as helpful.
          Last edited by Nick Cox; 02 Aug 2016, 05:14.

          Comment


          • #6
            Lakshmi,

            This is not as difficult as it seems, but we're still struggling to figure out what you want to do. You have 9 0/1 variables. I doubt if you need to recode them - you probably can just as easily work with your original 0/1 variables. In your example, they are not mutually exclusive.

            So what exactly are you trying to do with them? Are you looking for some kind of sum of these 0/1 variables (which is what Mathias did), or are you just creating an indicator variable that =1 if e2financial=1 and =2 if e2social =1, etc.? This second operation seems to only make sense if the original variables are mutually exclusive or if the sequence of the variables has some particular meaning so that a 1 on e2other means you want to ignore all the other variables. What do you want the variable to be if e2financial=1 and e2social =1? If they are mutually exclusive you could simply write:

            g prob=1 if e2financial==1
            replace prob=2 if e2social==1
            replace prob=3 if e2familal==1
            etc.

            What Nick wants is a nice clear data example that shows what you have and what you want in a format that we can work with easily. Include no extraneous stuff (like your labels).



            Comment


            • #7
              Hello,

              Sorry for being unclear, but I am very new to STATA and cannot provide the dataset because of patient identifiers. So thank you for your patience and help in advance.

              @Mathias, I originally tried that before I came to this forum, as it wasn't accomplishing what I set out to do. The variables are not mutually exclusive, so that command din't work for me.

              So here is a made up sample of the data in its original form:
              vpob_financial vprob_social vprob_family vprob_marital vprob_physical vprob_ill vprob_mental vprob_dontknow vprob_other
              No No No No No Yes Yes No No
              No Yes Yes No No Yes Yes No Yes
              No No No No No No No Yes No
              Yes Yes No Yes No Yes No No Yes
              No No Yes No No No No No No
              Yes Yes No No Yes No Yes Yes No
              Yes No No No No Yes No No Yes
              Where "No" corresponds to 0 and "Yes" corresponds to 1 for each variable.

              Code:
              . tab vprob_financial, nolabel
              
                  X has a |
                financial |
                  problem |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        0 |      1,928       59.88       59.88
                        1 |      1,292       40.12      100.00
              ------------+-----------------------------------
                    Total |      3,220      100.00
              So in the original questionnaire, all these are components form one question. But since they are not mutually exclusive, they were broken apart and were turned into 9 variables at the data collection/entry phase. What I need to do is create a new variable that will give me a proportion of all the "NO" and a proportion for all the "Yes" for financial "Yes" for Social "Yes" for marital...

              For example, the stack command I used (with the new variables I created) achieves this, but deletes all the other data which I need in order to run a logistic regression down the line with this newly created variable.

              Code:
              stack e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other, into(e2) clear wide
              
               tab e2
              
                       e2 |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        0 |     23,052       79.54       79.54
                        1 |      1,292        4.46       84.00
                        2 |        156        0.54       84.54
                        3 |      1,094        3.78       88.32
                        4 |        161        0.56       88.87
                        5 |        595        2.05       90.92
                        6 |        417        1.44       92.36
                        7 |      1,740        6.00       98.37
                        8 |         87        0.30       98.67
                        9 |        386        1.33      100.00
              ------------+-----------------------------------
                    Total |     28,980      100.00
              I basically need to create a new variable which will give me an output like e2 which I can use for further analysis.

              @Nick, I tried using the following code you provided, but I must be doing something wrong as I still haven't been able to use it correctly.
              gen prob = 0 tokenize "e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other" forval j = 1/9 replace prob = `j' if ``j''== `j' }
              I hope I have explained my problem better and the dummby dataset is clear.




              Comment


              • #8
                I am going to keep asking you, please, to read FAQ Advice #12 until you show signs that you have absorbed all that it says! (The whole of the FAQ Advice is worth a skim, all the way to #18.) Being new to Stata is fine, but precisely why you need to study the advice, to get good answers to your questions more quickly, and indeed to give an incentive to people to answer by posing clear questions.

                FAQ Advice #12 does address the possibility that your real data are confidential and advises what you have done, provide a fake dataset showing the same problem. Thanks for that, but the same applies, you should use dataex (SSC) or something equivalent to allow people to enter such data easily into their Stata.

                For the benefit of anybody else, as well as to underline the point, here is an example of the kind of engineering needed to turn your table into something that can be made relevant to the current problem, and I don't include the copy and paste and editing.

                Code:
                clear
                input str3 (financial social family    marital physical ill mental dontknow other)
                No    No    No    No    No    Yes    Yes    No    No
                No    Yes    Yes    No    No    Yes    Yes    No    Yes
                No    No    No    No    No    No    No    Yes    No
                Yes    Yes    No    Yes    No    Yes    No    No    Yes
                No    No    Yes    No    No    No    No    No    No
                Yes    Yes    No    No    Yes    No    Yes    Yes    No
                Yes    No    No    No    No    Yes    No    No    Yes
                end
                label def YesNo 1 Yes 0 No
                
                foreach v in financial social family marital physical    ill mental dontknow other {
                   encode `v', gen(vprob_`v') label(YesNo)
                }
                
                list vprob*, nola
                
                     +----------------------------------------------------------------------------+
                  1. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
                     |        0 |        0 |        0 |        0 |        0 |        1 |        1 |
                     |----------------------------------------------------------------------------|
                     |              vprob_~w               |               vprob_~r               |
                     |                     0               |                      0               |
                     +----------------------------------------------------------------------------+
                
                     +----------------------------------------------------------------------------+
                  2. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
                     |        0 |        1 |        1 |        0 |        0 |        1 |        1 |
                     |----------------------------------------------------------------------------|
                     |              vprob_~w               |               vprob_~r               |
                     |                     0               |                      1               |
                     +----------------------------------------------------------------------------+
                
                     +----------------------------------------------------------------------------+
                  3. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
                     |        0 |        0 |        0 |        0 |        0 |        0 |        0 |
                     |----------------------------------------------------------------------------|
                     |              vprob_~w               |               vprob_~r               |
                     |                     1               |                      0               |
                     +----------------------------------------------------------------------------+
                
                     +----------------------------------------------------------------------------+
                  4. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
                     |        1 |        1 |        0 |        1 |        0 |        1 |        0 |
                     |----------------------------------------------------------------------------|
                     |              vprob_~w               |               vprob_~r               |
                     |                     0               |                      1               |
                     +----------------------------------------------------------------------------+
                
                     +----------------------------------------------------------------------------+
                  5. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
                     |        0 |        0 |        1 |        0 |        0 |        0 |        0 |
                     |----------------------------------------------------------------------------|
                     |              vprob_~w               |               vprob_~r               |
                     |                     0               |                      0               |
                     +----------------------------------------------------------------------------+
                
                     +----------------------------------------------------------------------------+
                  6. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
                     |        1 |        1 |        0 |        0 |        1 |        0 |        1 |
                     |----------------------------------------------------------------------------|
                     |              vprob_~w               |               vprob_~r               |
                     |                     1               |                      0               |
                     +----------------------------------------------------------------------------+
                
                     +----------------------------------------------------------------------------+
                  7. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
                     |        1 |        0 |        0 |        0 |        0 |        1 |        0 |
                     |----------------------------------------------------------------------------|
                     |              vprob_~w               |               vprob_~r               |
                     |                     0               |                      1               |
                     +----------------------------------------------------------------------------+
                Now in terms of your questions: it remains unclear what variable you are thinking about that preserves the information in the original variables, as having one problem naturally doesn't rule out others.

                You could concatenate the variables:

                Code:
                egen vprob = concat(vprob_*)
                tab vprob
                
                      vprob |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                  000000010 |          1       14.29       14.29
                  000001100 |          1       14.29       28.57
                  001000000 |          1       14.29       42.86
                  011001101 |          1       14.29       57.14
                  100001001 |          1       14.29       71.43
                  110010110 |          1       14.29       85.71
                  110101001 |          1       14.29      100.00
                ------------+-----------------------------------
                      Total |          7      100.00
                There are in principle 2^9 = 512 distinct possibilities for 9 indicators.

                On the other hand, if you just want to know the proportions of Yes and No for each indicator, that doesn't need any new variable at all. The means of indicators are just the proportions Yes.

                Code:
                . tabstat vprob_*, c(s)
                
                    variable |      mean
                -------------+----------
                vprob_fina~l |  .4285714
                vprob_social |  .4285714
                vprob_family |  .2857143
                vprob_mari~l |  .1428571
                vprob_phys~l |  .1428571
                   vprob_ill |  .5714286
                vprob_mental |  .4285714
                vprob_dont~w |  .2857143
                 vprob_other |  .4285714
                ------------------------
                To see more, you could e.g. install tabm

                Code:
                * installs tabm and other commands
                . ssc inst tab_chi
                
                . tabm vprob_*, row
                
                +----------------+
                | Key |
                |----------------|
                | frequency |
                | row percentage |
                +----------------+
                
                | values
                variable | No Yes | Total
                ----------------+----------------------+----------
                vprob_financial | 4 3 | 7
                | 57.14 42.86 | 100.00
                ----------------+----------------------+----------
                vprob_social | 4 3 | 7
                | 57.14 42.86 | 100.00
                ----------------+----------------------+----------
                vprob_family | 5 2 | 7
                | 71.43 28.57 | 100.00
                ----------------+----------------------+----------
                vprob_marital | 6 1 | 7
                | 85.71 14.29 | 100.00
                ----------------+----------------------+----------
                vprob_physical | 6 1 | 7
                | 85.71 14.29 | 100.00
                ----------------+----------------------+----------
                vprob_ill | 3 4 | 7
                | 42.86 57.14 | 100.00
                ----------------+----------------------+----------
                vprob_mental | 4 3 | 7
                | 57.14 42.86 | 100.00
                ----------------+----------------------+----------
                vprob_dontknow | 5 2 | 7
                | 71.43 28.57 | 100.00
                ----------------+----------------------+----------
                vprob_other | 4 3 | 7
                | 57.14 42.86 | 100.00
                ----------------+----------------------+----------
                Total | 41 22 | 63
                | 65.08 34.92 | 100.00
                On applying the code I suggested earlier:

                There is a typo in that code:

                Code:
                gen prob = 0
                
                tokenize "e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other"
                
                forval j = 1/9 {
                replace prob = `j' if ``j''== `j'
                }
                I missed out a brace. Sorry about that. My fault, but if you had supplied a data example, I would have tested the code and found the problem! However, the code was based on the idea that the indicators were mutually exclusive, and does not help your real problem.

                .
                Last edited by Nick Cox; 03 Aug 2016, 03:01.

                Comment


                • #9
                  I have had to change my initial data plan for a few reasons and will not need to do this anymore. But I will likely have to use the tokenize command soon and did not know about it before, so thank you for all your time and help. I sincerely appreciate it.

                  Comment

                  Working...
                  X