Coding an overall categorical variable from several mutually nonexclusive variables

Mads Moring

Join Date: Apr 2017

Posts: 44
#1

Coding an overall categorical variable from several mutually nonexclusive variables

05 May 2017, 08:55

I'm currently working with a dataset of 233 observations and 300 variables. This data originates from a survey with a great deal of multiple choice questions, which is the reason why there are so relatively many variables. Each possible answer for each multiple choice questions appears as a binary variable in my dataset saying whether that answer was chosen or not. I have provided an example that I asses demonstrates the problem in a fair, although much smaller scaled, manner.

My respondents have, in a multiple choice question, been asked what sodas they've consumed over the last year. For each soda they crossed off, they've been asked to which degree, on an ordinal scale from 1-5, they liked it and similarly how often they buy that soda.

The data, therefore, looks something like this.

Code:

input byte(Cola Fanta Cola_Citrus) float(Like_Cola Like_Fanta Like_ColaCitrus Buy_Cola Buy_Fanta Buy_ColaCitrus) 1 1 1 1 2 3 1 3 1 . . 1 . . 4 . . 3 1 . . 2 1 . 5 . . 1 1 . 3 3 . 2 3 . 1 . . 4 5 . 1 . . . . 1 . . 2 . . 1 . 1 . . . . . 1 . . 1 1 . . 1 . 1 1 . . . . . . . . . 1 1 . 5 4 . 5 3 . end label values Cola Cola label def Cola 1 "Cola", modify label values Fanta Fanta label def Fanta 1 ".", modify label values Cola_Citrus Cola_Citrus label def Cola_Citrus 1 "Cola Citrus", modify label values Like_Cola ordinal label values Like_Fanta ordinal label values Like_ColaCitrus ordinal label values Buy_Cola ordinal label values Buy_Fanta ordinal label values Buy_ColaCitrus ordinal label def ordinal 1 "Often", modify label def ordinal 2 "Somehow Often", modify label def ordinal 3 "Neither", modify label def ordinal 4 "Rarely", modify label def ordinal 5 "Never", modify

I'm aware that I can use the mrtab command in order to gain an overview of the answers, but I desire a new variable, as I wish to make a twoway table with other variables such as the "Like_var".

Thus, I wish to make a overall "category" for sodas. This could be "Cola" and "Orange sodas". But the Cola Citrus would have to figure ind both (!). Now, I can code individual binary variables on whether you drink Cola or Orange.

Code:

gen Cola2 =. recode Cola2 (.=1) if Cola==1 | Cola_Citrus ==1 gen Orange=. recode vurdering (.=1) if Fanta==1 | Cola_Citrus==1 *Attempting to make overall category variable gen Category_Var =. recode Category_Var (.=1) if Cola2==1 recode kategorier (.=2) if Orange==1 tab Category_var

Upon tabulating the new category variable, and within the process of making it, there is no error message, but the second value of the variable, "orange", does not have all the observations that it should. I assume that each respondent can only appear once in each variable. And as you see several respondents enjoy sodas from both categories. Therefore, my question is:

"How may I code a variable that allow for each respondent to figure more than once, or alternatively, how would you go about analyzing the question of whether there are difference between the categories in regards to how often they are consumed?"

This is even more complicated by the fact that I am not allowed do use regression, but should stick to two, or perhaps, threeway tables.

Additionally, how may I produce one collected "how often do you buy soda"-variable, when this too dispersed over three variables? (I also would want to analyse the relationship between how good a soda is rated and how often it is bought).

Hopeful regards
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

05 May 2017, 09:01

http://www.stata-journal.com/sjpdf.h...iclenum=dm0034 is a generic discussion.
Comment
Mads Moring

Join Date: Apr 2017

Posts: 44
#3

05 May 2017, 09:12

Originally posted by Nick Cox View Post

http://www.stata-journal.com/sjpdf.h...iclenum=dm0034 is a generic discussion.

I'll have a look - thanks a lot. Would you mind if posted follow up questions in this thread?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

05 May 2017, 09:39

Feel free to post or not.
Comment
Mads Moring

Join Date: Apr 2017

Posts: 44
#5

08 May 2017, 05:55

Originally posted by Nick Cox View Post

Feel free to post or not.

So I've read the article you posted, Mr. Cox. I'm not sure that what the egen "var" = group (), label function does, is what I desire.

Code:

gen Cola2 =. (10 missing values generated) recode Cola2 (.=1) if Cola==1 | Cola_Citrus ==1 (Cola2: 8 changes made) gen Orange=. (10 missing values generated) recode Orange (.=1) if Fanta==1 | Cola_Citrus==1 (Orange: 7 changes made) egen Category = group(Cola2 Orange), label (4 missing values generated)

What I get when I tabulate the "category" var is this:
Group (Cola2 Orange freq perceent cum

1 1 6 100 100

Total 6 100

What I'm trying to achieve is more in the line of this:
Category Freq Percent

Cola2 obs %

Orange obs %

Total obs %

And, hopefully be able to cross tabulate the category variable with another variable of the same sort.
Additionally, Is there a way to prevent that it creates so many missing variables? I dont need the respondents to having answer all variables.

Last edited by Mads Moring; 08 May 2017, 06:08.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

08 May 2017, 06:11

Your original post #1 suggests some problems here.

1. Your variables such as Cola and Fanta are coded 1 and missing. Perhaps that should be 1 and 0. Stata will ignore missing values to the extent possible.

2. Your code includes the line

Code:

label def Fanta 1 ".", modify

and it's hard to see any reason for that or for treating Fanta differently.
Comment
Mads Moring

Join Date: Apr 2017

Posts: 44
#7

08 May 2017, 07:09

Originally posted by Nick Cox View Post

Your original post #1 suggests some problems here.

1. Your variables such as Cola and Fanta are coded 1 and missing. Perhaps that should be 1 and 0. Stata will ignore missing values to the extent possible.

2. Your code includes the line

Code:

label def Fanta 1 ".", modify

and it's hard to see any reason for that or for treating Fanta differently.

The Fanta 1 "." is indeed a mistake. It was my first time creating data in the editor (and using the dataex software). Thanks! I've now changed all the variables, so that what was former coded as missing is now coded as 0.

And I'm now very close to having the result that I intended.

Code:

recode Cola (.=0) recode Fanta (.=0) recode Cola_Citrus (.=0) recode Buy_Cola (.=0) recode Buy_Fanta (.=0) recode Buy_ColaCitrus (.=0) recode Like_Cola (.=0) recode Like_Fanta (.=0) recode Like_ColaCitrus (.=0) * gen Cola2 = 0 recode Cola2 (0=1) if Cola==1 | Cola_Citrus ==1 gen Orange= 0 recode Orange (0=1) if Fanta==1 | Cola_Citrus==1 egen Category = group(Cola2 Orange), label tab Category

This gives me something like the following:
Categoy freq pecent valid cum

00

01

10

11

Code:

gen Category2 = 0 recode Category2 (0=1) if Category == 2 | Category == 4 gen Category3 = 0 recode Category3 (0=1) if Category == 3 | Category == 4 egen Last_Category = group (Category2 Category3), label tab Last_Category

Does not give the desired outcome as explained in post#5
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

08 May 2017, 07:20

Please give a complete script starting from your data example. I didn't understand #5, so I don't promise to answer, but people won't want to read the entire thread to understand.
Comment

Mads Moring

Join Date: Apr 2017
Posts: 44

08 May 2017, 07:51

Originally posted by Nick Cox View Post

Please give a complete script starting from your data example. I didn't understand #5, so I don't promise to answer, but people won't want to read the entire thread to understand.

What I was trying to say in #5 was, that I wish to have a table, that does not state the freq. of respondents that have 1 in Orange, the freq. of respondents who have 1 ind Cola, and lastly the ones who have 1 in both. I want the ones who have 1 in both to appear in / count towards the freq. in both Orange and Cola.

Code:

input byte(Cola Fanta Cola_Citrus) float(Like_Cola Like_Fanta Like_ColaCitrus Buy_Cola Buy_Fanta Buy_ColaCitrus)
1 1 1 1 2 3 1 3 1
. . 1 . . 4 . . 3
1 . . 2 1 . 5 . .
1 1 . 3 3 . 2 3 .
1 . . 4 5 . 1 . .
. . 1 . . 2 . . 1
. 1 . . . . . 1 .
. 1 1 . . 1 . 1 1
. . . . . . . . .
1 1 . 5 4 . 5 3 .
end
label values Cola Cola
label def Cola 1 "Cola", modify
label values Fanta Fanta
label def Fanta 1 "Fanta", modify
label values Cola_Citrus Cola_Citrus
label def Cola_Citrus 1 "Cola Citrus", modify
label values Like_Cola ordinal
label values Like_Fanta ordinal
label values Like_ColaCitrus ordinal
label values Buy_Cola ordinal
label values Buy_Fanta ordinal
label values Buy_ColaCitrus ordinal
label def ordinal 1 "Often", modify
label def ordinal 2 "Somehow Often", modify
label def ordinal 3 "Neither", modify
label def ordinal 4 "Rarely", modify
label def ordinal 5 "Never", modify

recode Cola (.=0)
recode Fanta (.=0)
recode Cola_Citrus (.=0)
recode Buy_Cola (.=0)
recode Buy_Fanta (.=0)
recode Buy_ColaCitrus (.=0)
recode Like_Cola (.=0)
recode Like_Fanta (.=0)
recode Like_ColaCitrus (.=0)

*
gen Cola2 = 0


recode  Cola2 (0=1) if Cola==1 | Cola_Citrus ==1


gen Orange= 0


recode Orange (0=1) if Fanta==1 | Cola_Citrus==1

egen Category = group(Cola2 Orange), label
tab Category
fre Category


gen Category2 = 0
recode Category2 (0=1) if Category == 2 |  Category == 4
gen Category3 = 0
recode Category3 (0=1) if Category == 3 |  Category == 4
egen Last_Category = group (Category2 Category3), label
tab Last_Category

The table at the end is therefore not fare away from the table that is wanted, but its not quite there either. I want it down to two possible results of the last_category var

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

#10

08 May 2017, 08:20

Sorry, still missing what you want that's different. This is cleaner and I think equivalent code.

Code:

clear 

input byte(Cola Fanta Cola_Citrus) float(Like_Cola Like_Fanta Like_ColaCitrus Buy_Cola Buy_Fanta Buy_ColaCitrus)
1 1 1 1 2 3 1 3 1
. . 1 . . 4 . . 3
1 . . 2 1 . 5 . .
1 1 . 3 3 . 2 3 .
1 . . 4 5 . 1 . .
. . 1 . . 2 . . 1
. 1 . . . . . 1 .
. 1 1 . . 1 . 1 1
. . . . . . . . .
1 1 . 5 4 . 5 3 .
end
label values Cola Cola
label def Cola 1 "Cola", modify
label values Fanta Fanta
label def Fanta 1 "Fanta", modify
label values Cola_Citrus Cola_Citrus
label def Cola_Citrus 1 "Cola Citrus", modify
label values Like_Cola ordinal
label values Like_Fanta ordinal
label values Like_ColaCitrus ordinal
label values Buy_Cola ordinal
label values Buy_Fanta ordinal
label values Buy_ColaCitrus ordinal
label def ordinal 1 "Often", modify
label def ordinal 2 "Somehow Often", modify
label def ordinal 3 "Neither", modify
label def ordinal 4 "Rarely", modify
label def ordinal 5 "Never", modify

* some differences in what follows 
mvencode *, mv(0) 

gen Cola2 = inlist(1, Cola, Cola_Citrus) 
gen Orange= inlist(1, Fanta, Cola_Citrus) 

egen Category = group(Cola2 Orange), label
tab Category

gen Category2 = inlist(Category, 2, 4) 
gen Category3 = inlist(Category, 3, 4) 

egen Last_Category = group(Category2 Category3), label
tab Last_Category

Comment

Mads Moring

Join Date: Apr 2017

Posts: 44
#11

10 May 2017, 02:25

Originally posted by Nick Cox View Post

Sorry, still missing what you want that's different. This is cleaner and I think equivalent code.
[/CODE]

I'll attempt to explain myself again, and thank you so much for you patience, Mr. Cox.

The table that the last command "tab Last_Category" (post#10) produces has four outcomes {00;01;10;11}. If I'm not mistaken "00" shows the freq. of people who are in neither category. The outcome "01" shows the freq. of people who are not in the first category, but the second and so forth. The outcome "11" then shows how many people are in both (?).

The table summarizes to 10, which is the total number of observations. Thus, these outcomes seems to be mutually exclusive. When looking in the data, we see that several respondents should be in both categories, and they are now in the "11" outcome. I do not mind counting every respondent several times in the table I desire, I actually want the categories not to be mutually exclusive. So what I would like is that the 6 respondents who are in the "11" outcome appears in both the "01" and the "10" outcomes. This would give Last_Category variable two outcomes "01"(Cola2) and "10" (Orange) with 8 and 7 in freq. respectively.

My question then, is how I get to that last table from the one we have now?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#12

10 May 2017, 03:38

(On Statalist, there is no need to use titles such as Mr unless people insist on them, and I've never seen that in 23 years here. It's all too likely that people will get the title wrong in any case. For myself "Nick" is fine.)

Your categories have to be mutually exclusive with tabulate. No way round that. Otherwise I fear I still don't understand. Sorry.
Comment

Mads Moring

Join Date: Apr 2017
Posts: 44

#13

12 May 2017, 06:30

Originally posted by Nick Cox View Post

Your categories have to be mutually exclusive with tabulate. No way round that. Otherwise I fear I still don't understand. Sorry.

I've found a way that I think works, even though it is not aesthetically pleasing.

Code:

clear all
input byte(Cola Fanta Cola_Citrus) float(Like_Cola Like_Fanta Like_ColaCitrus Buy_Cola Buy_Fanta Buy_ColaCitrus)
1 1 1 1 2 3 1 3 1
. . 1 . . 4 . . 3
1 . . 2 1 . 5 . .
1 1 . 3 3 . 2 3 .
1 . . 4 5 . 1 . .
. . 1 . . 2 . . 1
. 1 . . . . . 1 .
. 1 1 . . 1 . 1 1
. . . . . . . . .
1 1 . 5 4 . 5 3 .
end
label values Cola Cola
label def Cola 1 "Cola", modify
label values Fanta Fanta
label def Fanta 1 "Fanta", modify
label values Cola_Citrus Cola_Citrus
label def Cola_Citrus 1 "Cola Citrus", modify
label values Like_Cola ordinal
label values Like_Fanta ordinal
label values Like_ColaCitrus ordinal
label values Buy_Cola ordinal
label values Buy_Fanta ordinal
label values Buy_ColaCitrus ordinal
label def ordinal 1 "Often", modify
label def ordinal 2 "Somehow Often", modify
label def ordinal 3 "Neither", modify
label def ordinal 4 "Rarely", modify
label def ordinal 5 "Never", modify


cd "PUT IN YOUR OWN LOCATION"
save forsog0.dta, replace

gen soda1=.
replace soda1=1 if Fanta==1
replace soda1=2 if Cola_Citrus==1 & soda1==.
replace soda1=3 if Cola==1 & soda1==.

gen soda2=.
replace soda2=2 if Cola_Citrus==1 & soda1<2
replace soda2=3 if Cola==1 & soda1<3

gen soda3=.
replace soda3=3 if Cola==1 & soda1<3 & soda2<3

lab def SODA 1 "Fanta" 2 "Cola_Citrus" 3 "Cola"
lab val soda1 soda2 soda3 SODA

*

gen Category_Orange=1 if soda1==1| soda1==2
replace Category_Orange=2 if soda2==2

gen Category_Cola=1 if soda1==2 | soda1==3
replace Category_Cola=2 if soda2==2 | soda2==3
replace Category_Cola=3 if soda3==3


*Saving Category_Orange
keep Category_Orange

cd "PUT IN YOUR OWN LOCATION"

save forsog1.dta, replace

*Saving Category_Cola
use forsog0.dta, clear
gen soda1=.
replace soda1=1 if Fanta==1
replace soda1=2 if Cola_Citrus==1 & soda1==.
replace soda1=3 if Cola==1 & soda1==.

gen soda2=.
replace soda2=2 if Cola_Citrus==1 & soda1<2
replace soda2=3 if Cola==1 & soda1<3

gen soda3=.
replace soda3=3 if Cola==1 & soda1<3 & soda2<3

lab def SODA 1 "Fanta" 2 "Cola_Citrus" 3 "Cola"
lab val soda1 soda2 soda3 SODA

gen Category_Orange=1 if soda1==1| soda1==2
replace Category_Orange=2 if soda2==2

gen Category_Cola=1 if soda1==2 | soda1==3
replace Category_Cola=2 if soda2==2 | soda2==3
replace Category_Cola=3 if soda3==3

keep Category_Cola

cd "PUT IN YOUR OWN LOCATION"

save forsog2.dta, replace

*
use "PUT IN YOUR OWN LOCATION \forsog1.dta" , clear

append using "PUT IN YOUR OWN LOCATION\forsog2.dta"

recode Category_Orange (2=1)
recode Category_Cola (1=2)

gen CATEGORY=Category_Orange if Category_Orange!=.
replace CATEGORY=Category_Cola if Category_Cola!=.
lab def CAT_LAB 1 "Orange" 2 "Cola"
lab val CATEGORY CAT_LAB

tab CATEGORY

Last edited by Mads Moring; 12 May 2017, 07:08.

Group (Cola2 Orange	freq	perceent	cum
1 1	6	100	100
Total	6	100

Category	Freq	Percent
Cola2	obs	%
Orange	obs	%
Total	obs	%

Categoy	freq	pecent	valid	cum
00
01
10
11

Announcement