Creating a new categorical variable from multiple binary variables

Lakshmi Gopalakrishnan

Join Date: Aug 2016

Posts: 4
#1

Creating a new categorical variable from multiple binary variables

02 Aug 2016, 03:32

Hello,

I could use some help combining 9 variables into one new variable.

So, I have 9 variables vprob_financial, vprob_familial.... coded no=0 and yes = 1.

I am hoping to create a new categorical variable that will combine all these binary variables, so I can get a proportion for 0=no and 1=financial 2=familial 3=marital...ect.

Things I have tried so far with no success:

I recoded the original variables vrpob_financial is now e2financial no=0 yes=1, vprob_social is now no=0 and yes=2 ect so that all the "YES" values have a different numeric coding.

Then I tried to gen a new variable

gen prob = 0
foreach var of varlist e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other {
replace prob=1 if `var'==1
replace prob=2 if `var'==2
replace prob=3 if `var'==3
replace prob=4 if `var'==4
replace prob=5 if `var'==5
replace prob=6 if `var'==6
replace prob=7 if `var'==7
replace prob=8 if `var'==8
replace prob=9 if `var'==9
}

But this leads to a lot of missing values.

-----

So I tried stacking the data instead

stack e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other, into(e2) clear wide

Which achieves what I need, but while it retains the e2varaibles (clear wide), it clears all the variables I don't want to stack including demographic data and other varaibles.

So how do I either create a new variable that combines all these binary variables to only give me a categorical variable with 10 levels (0=no 1=financial....) that I can ultimately use in a linear regression.

Alternatively, how do I use the stack command and make sure none of my other variables get deleted?

Thanks for the help in advance!
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35754
#2

02 Aug 2016, 04:01

Best to give us an explicit data example! See FAQ Advice #12.;

It sounds as if you want something more like

Code:

gen prob = 0 tokenize "e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other" forval j = 1/9 replace prob = `j' if ``j''== `j' }
1 like
Comment
Mathias Pedersen Heinze

Join Date: Jun 2015

Posts: 78
#3

02 Aug 2016, 04:12

If the variables are mutually exclusive, i.e. if the respondents were only allowed to check one single option, this simple command should give you the same result:

Code:

egen prob = rowtotal(e2*)
Comment
Lakshmi Gopalakrishnan

Join Date: Aug 2016

Posts: 4
#4

02 Aug 2016, 04:49

I've attached a sample of the original data and the data after creating the new categories.

And, the variables are not mutually exclusive.

Also, I have never seen the tokenize command so I just looked it up and it says it for string variables. The variables are numerically coded, for that's how I recoded them.

Here is a sample of code I used to generate the new variables with numeric codes that are not overlapping.

// Financial

gen e2financial=vprob_financial
recode e2financial 0=0 1=1
label define vprob_financiallbl 0 "No" 1 "Financial"
label value e2financial vprob_financiallbl
tab e2financial, nolabel

// Social

gen e2social=vprob_social
recode e2social 0=0 1=2
label define vprob_sociallbl 0 "No" 2 "Social"
label value e2social vprob_sociallbl
tab e2social, nolabel

Attached Files
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#5

02 Aug 2016, 05:10

tokenize is for splitting a string into strings; contrary to your inference, or perhaps it's an assertion, splitting a string containing names of numeric variables is an entirely valid operation.

If your variables are not always mutually exclusive, then you need rules for exactly how to combine those variables and I can't see that you have said anything on that score. Giving us an example in which at most one variable is positive doesn't help there. Conversely, the technique already given is entirely adequate for that special case.

Please do read FAQ Advice #12 before going any further. Not only does it explain how to give a data example, it explains why image attachments are not as helpful.

Last edited by Nick Cox; 02 Aug 2016, 05:14.
Comment
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#6

02 Aug 2016, 13:36

Lakshmi,

This is not as difficult as it seems, but we're still struggling to figure out what you want to do. You have 9 0/1 variables. I doubt if you need to recode them - you probably can just as easily work with your original 0/1 variables. In your example, they are not mutually exclusive.

So what exactly are you trying to do with them? Are you looking for some kind of sum of these 0/1 variables (which is what Mathias did), or are you just creating an indicator variable that =1 if e2financial=1 and =2 if e2social =1, etc.? This second operation seems to only make sense if the original variables are mutually exclusive or if the sequence of the variables has some particular meaning so that a 1 on e2other means you want to ignore all the other variables. What do you want the variable to be if e2financial=1 and e2social =1? If they are mutually exclusive you could simply write:

g prob=1 if e2financial==1
replace prob=2 if e2social==1
replace prob=3 if e2familal==1
etc.

What Nick wants is a nice clear data example that shows what you have and what you want in a format that we can work with easily. Include no extraneous stuff (like your labels).
1 like
Comment

Lakshmi Gopalakrishnan

Join Date: Aug 2016
Posts: 4

03 Aug 2016, 01:28

Hello,

Sorry for being unclear, but I am very new to STATA and cannot provide the dataset because of patient identifiers. So thank you for your patience and help in advance.

@Mathias, I originally tried that before I came to this forum, as it wasn't accomplishing what I set out to do. The variables are not mutually exclusive, so that command din't work for me.

So here is a made up sample of the data in its original form:

vpob_financial	vprob_social	vprob_family	vprob_marital	vprob_physical	vprob_ill	vprob_mental	vprob_dontknow	vprob_other
No	No	No	No	No	Yes	Yes	No	No
No	Yes	Yes	No	No	Yes	Yes	No	Yes
No	No	No	No	No	No	No	Yes	No
Yes	Yes	No	Yes	No	Yes	No	No	Yes
No	No	Yes	No	No	No	No	No	No
Yes	Yes	No	No	Yes	No	Yes	Yes	No
Yes	No	No	No	No	Yes	No	No	Yes

Where "No" corresponds to 0 and "Yes" corresponds to 1 for each variable.

Code:

. tab vprob_financial, nolabel

    X has a |
  financial |
    problem |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,928       59.88       59.88
          1 |      1,292       40.12      100.00
------------+-----------------------------------
      Total |      3,220      100.00

So in the original questionnaire, all these are components form one question. But since they are not mutually exclusive, they were broken apart and were turned into 9 variables at the data collection/entry phase. What I need to do is create a new variable that will give me a proportion of all the "NO" and a proportion for all the "Yes" for financial "Yes" for Social "Yes" for marital...

For example, the stack command I used (with the new variables I created) achieves this, but deletes all the other data which I need in order to run a logistic regression down the line with this newly created variable.

Code:

stack e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other, into(e2) clear wide

 tab e2

         e2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     23,052       79.54       79.54
          1 |      1,292        4.46       84.00
          2 |        156        0.54       84.54
          3 |      1,094        3.78       88.32
          4 |        161        0.56       88.87
          5 |        595        2.05       90.92
          6 |        417        1.44       92.36
          7 |      1,740        6.00       98.37
          8 |         87        0.30       98.67
          9 |        386        1.33      100.00
------------+-----------------------------------
      Total |     28,980      100.00

I basically need to create a new variable which will give me an output like e2 which I can use for further analysis.

@Nick, I tried using the following code you provided, but I must be doing something wrong as I still haven't been able to use it correctly.
gen prob = 0 tokenize "e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other" forval j = 1/9 replace prob = `j' if ``j''== `j' }
I hope I have explained my problem better and the dummby dataset is clear.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35754

03 Aug 2016, 02:53

I am going to keep asking you, please, to read FAQ Advice #12 until you show signs that you have absorbed all that it says! (The whole of the FAQ Advice is worth a skim, all the way to #18.) Being new to Stata is fine, but precisely why you need to study the advice, to get good answers to your questions more quickly, and indeed to give an incentive to people to answer by posing clear questions.

FAQ Advice #12 does address the possibility that your real data are confidential and advises what you have done, provide a fake dataset showing the same problem. Thanks for that, but the same applies, you should use dataex (SSC) or something equivalent to allow people to enter such data easily into their Stata.

For the benefit of anybody else, as well as to underline the point, here is an example of the kind of engineering needed to turn your table into something that can be made relevant to the current problem, and I don't include the copy and paste and editing.

Code:

clear
input str3 (financial social family    marital physical ill mental dontknow other)
No    No    No    No    No    Yes    Yes    No    No
No    Yes    Yes    No    No    Yes    Yes    No    Yes
No    No    No    No    No    No    No    Yes    No
Yes    Yes    No    Yes    No    Yes    No    No    Yes
No    No    Yes    No    No    No    No    No    No
Yes    Yes    No    No    Yes    No    Yes    Yes    No
Yes    No    No    No    No    Yes    No    No    Yes
end
label def YesNo 1 Yes 0 No

foreach v in financial social family marital physical    ill mental dontknow other {
   encode `v', gen(vprob_`v') label(YesNo)
}

list vprob*, nola

     +----------------------------------------------------------------------------+
  1. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
     |        0 |        0 |        0 |        0 |        0 |        1 |        1 |
     |----------------------------------------------------------------------------|
     |              vprob_~w               |               vprob_~r               |
     |                     0               |                      0               |
     +----------------------------------------------------------------------------+

     +----------------------------------------------------------------------------+
  2. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
     |        0 |        1 |        1 |        0 |        0 |        1 |        1 |
     |----------------------------------------------------------------------------|
     |              vprob_~w               |               vprob_~r               |
     |                     0               |                      1               |
     +----------------------------------------------------------------------------+

     +----------------------------------------------------------------------------+
  3. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
     |        0 |        0 |        0 |        0 |        0 |        0 |        0 |
     |----------------------------------------------------------------------------|
     |              vprob_~w               |               vprob_~r               |
     |                     1               |                      0               |
     +----------------------------------------------------------------------------+

     +----------------------------------------------------------------------------+
  4. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
     |        1 |        1 |        0 |        1 |        0 |        1 |        0 |
     |----------------------------------------------------------------------------|
     |              vprob_~w               |               vprob_~r               |
     |                     0               |                      1               |
     +----------------------------------------------------------------------------+

     +----------------------------------------------------------------------------+
  5. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
     |        0 |        0 |        1 |        0 |        0 |        0 |        0 |
     |----------------------------------------------------------------------------|
     |              vprob_~w               |               vprob_~r               |
     |                     0               |                      0               |
     +----------------------------------------------------------------------------+

     +----------------------------------------------------------------------------+
  6. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
     |        1 |        1 |        0 |        0 |        1 |        0 |        1 |
     |----------------------------------------------------------------------------|
     |              vprob_~w               |               vprob_~r               |
     |                     1               |                      0               |
     +----------------------------------------------------------------------------+

     +----------------------------------------------------------------------------+
  7. | vp~ncial | vp~ocial | vprob_~y | vpr~ital | vpro~cal | vprob~ll | vpr~ntal |
     |        1 |        0 |        0 |        0 |        0 |        1 |        0 |
     |----------------------------------------------------------------------------|
     |              vprob_~w               |               vprob_~r               |
     |                     0               |                      1               |
     +----------------------------------------------------------------------------+

Now in terms of your questions: it remains unclear what variable you are thinking about that preserves the information in the original variables, as having one problem naturally doesn't rule out others.

You could concatenate the variables:

Code:

egen vprob = concat(vprob_*)
tab vprob

      vprob |      Freq.     Percent        Cum.
------------+-----------------------------------
  000000010 |          1       14.29       14.29
  000001100 |          1       14.29       28.57
  001000000 |          1       14.29       42.86
  011001101 |          1       14.29       57.14
  100001001 |          1       14.29       71.43
  110010110 |          1       14.29       85.71
  110101001 |          1       14.29      100.00
------------+-----------------------------------
      Total |          7      100.00

There are in principle 2^9 = 512 distinct possibilities for 9 indicators.

On the other hand, if you just want to know the proportions of Yes and No for each indicator, that doesn't need any new variable at all. The means of indicators are just the proportions Yes.

Code:

. tabstat vprob_*, c(s)

    variable |      mean
-------------+----------
vprob_fina~l |  .4285714
vprob_social |  .4285714
vprob_family |  .2857143
vprob_mari~l |  .1428571
vprob_phys~l |  .1428571
   vprob_ill |  .5714286
vprob_mental |  .4285714
vprob_dont~w |  .2857143
 vprob_other |  .4285714
------------------------

To see more, you could e.g. install tabm

Code:

* installs tabm and other commands
. ssc inst tab_chi

. tabm vprob_*, row

+----------------+
| Key |
|----------------|
| frequency |
| row percentage |
+----------------+

| values
variable | No Yes | Total
----------------+----------------------+----------
vprob_financial | 4 3 | 7
| 57.14 42.86 | 100.00
----------------+----------------------+----------
vprob_social | 4 3 | 7
| 57.14 42.86 | 100.00
----------------+----------------------+----------
vprob_family | 5 2 | 7
| 71.43 28.57 | 100.00
----------------+----------------------+----------
vprob_marital | 6 1 | 7
| 85.71 14.29 | 100.00
----------------+----------------------+----------
vprob_physical | 6 1 | 7
| 85.71 14.29 | 100.00
----------------+----------------------+----------
vprob_ill | 3 4 | 7
| 42.86 57.14 | 100.00
----------------+----------------------+----------
vprob_mental | 4 3 | 7
| 57.14 42.86 | 100.00
----------------+----------------------+----------
vprob_dontknow | 5 2 | 7
| 71.43 28.57 | 100.00
----------------+----------------------+----------
vprob_other | 4 3 | 7
| 57.14 42.86 | 100.00
----------------+----------------------+----------
Total | 41 22 | 63
| 65.08 34.92 | 100.00

On applying the code I suggested earlier:

There is a typo in that code:

Code:

gen prob = 0

tokenize "e2financial e2social e2familial e2marital e2physical e2illspirit e2mental e2dontknow e2other"

forval j = 1/9 {
replace prob = `j' if ``j''== `j'
}

I missed out a brace. Sorry about that. My fault, but if you had supplied a data example, I would have tested the code and found the problem! However, the code was based on the idea that the indicators were mutually exclusive, and does not help your real problem.

.

Last edited by Nick Cox; 03 Aug 2016, 03:01.

Comment

Lakshmi Gopalakrishnan

Join Date: Aug 2016

Posts: 4
#9

08 Aug 2016, 22:21

I have had to change my initial data plan for a few reasons and will not need to do this anymore. But I will likely have to use the tokenize command soon and did not know about it before, so thank you for all your time and help. I sincerely appreciate it.
Comment

Announcement