problems with egen group() function

Maddie R

Join Date: Jul 2014

Posts: 27
#1

16 Mar 2015, 19:34

Hi!
So the survey data I'm looking at had a question on occupation which included a tick box for each of 8 different types of occupations. Each occupation option has been coded as a separate variable in my data set, however, and each variable records only the 'yes' responses.

Code:

                  type:  numeric (int)
                 label:  q107_1

                 range:  [1,1]                        units:  1
         unique values:  1                        missing .:  415/471

            tabulation:  Freq.   Numeric  Label
                            56         1  Yes
                           415         .

Not great for working out the overall relative proportions of who does what as a job so want to have a new composite variable that lists each occupation as 1,2,3,...8.

Have tried

Code:

egen occupation=group( q107_1 q107_2 q107_3 q107_4 q107_5 q107_6 q107_7 q107_8 q107_9), label
(471 missing values generated)

but what happens is that instead of var q107_1 being coded as 1, q107_2 coded as 2 and so on, it just creates a variable with no observations whatsoever.

Not sure if there's something I'm not doing before the egen command that I should be? Or perhaps if there is a better command to achieve this variable creation?

Not a very confident user of the egen command so any tips and advice would be very appreciated!

Cheers,

Maddie
ben earnhart

Join Date: May 2014

Posts: 1027
#2

16 Mar 2015, 19:56

You are facing the infamous "implicit zero" problem. We don't know for a fact whether the people skipped the question entirely, or if they legitimately meant "no" to each. Assuming you don't have a "none of the above" option, one compromise I have found is to first check to see if they answered at least one, and if they didn't, check to see if they answered the question before the check-all-that-apply and answered the question after the check-all-that-apply. Something like:

Code:

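* rowmean() is missing when every q107_* is missing, i.e. no box was ticked
egen answeredOne=rowmean( q107_1 q107_2 q107_3 q107_4 q107_5 q107_6 q107_7 q107_8 q107_9)
* flag respondents who answered the questions just before and just after q107
gen beforeandAfter=1 if q106!=. & q108!=.
* loop over q107_1 through q107_9, matching the varlist above
forvalues i=1/9 {
    gen occup`i'=1 if q107_`i'==1
    replace occup`i'=0 if q107_`i'!=1 & ((answeredOne>0 & answeredOne!=.) | beforeandAfter==1)
}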

That is, they certainly get a 1 for that occupation if they said they had that occupation. They get a 0 for that occupation if they either answered at least one occupation question or answered the lead-in and lead-out questions. If they answered the questions before and after, it is reasonable (though not certain) that they seriously considered the options in the check all that apply.

No perfect way to solve the implicit zero, but this seems to be a reasonable compromise others I have discussed it with find acceptable. Wish I had an explicit citation for you.

Last edited by ben earnhart; 16 Mar 2015, 19:58.
ben earnhart

Join Date: May 2014

Posts: 1027
#3

16 Mar 2015, 20:03

ps. If you adapted the code, you could just fill in zeros on the original variables (q107_x). Many people just do that. But don't do that! Keeping the original variables intact is very important for numerous reasons; for example, you might later change your criteria for "well, they considered answering the question but didn't find a suitable answer."
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

16 Mar 2015, 20:09

Just to pile on top of Ben here, the check box response is an atrocity and should be banned from surveys entirely. The only way it's even remotely usable is if one of the items at the end of the list is "None of the above." Even then, you never really know what the respondent was thinking.

Ben's advice is good and is quite similar to what I do when confronted with this kind of wretched data in a circumstance where I really can't just eliminate those questions from consideration. But burn it into your memory, and when you are designing your own surveys, don't use check box responses.
ben earnhart

Join Date: May 2014

Posts: 1027
#5

16 Mar 2015, 20:36

pps. If you were a little loose with your terms, and they weren't actually check-boxes (e.g. we can count on them only answering one, and if they didn't answer *any* then they are really truly missing) then it's as simple as:

Code:

egen occup=group(q107*), mi lab

which is what I think you were shooting for. But if they could answer more than one, or legitimately answer none, then my more complex solution is relevant.
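
To illustrate why the original attempt produced 471 missing values, here is a minimal sketch on made-up data (three hypothetical tick-box variables rather than the real nine, with exactly one box ticked per person). By default, egen's group() returns missing for any observation that has a missing value in any of the listed variables; the missing option tells it to treat missing as just another value.

```stata
clear
* toy data: each respondent ticks exactly one box, the other boxes are missing
input q107_1 q107_2 q107_3
1 . .
. 1 .
. . 1
. 1 .
end

* default behaviour: every row has a missing value somewhere,
* so the new variable is missing for every observation
egen occ_a = group(q107_1 q107_2 q107_3)

* with the missing option, missing values count as a value when forming groups
egen occ_b = group(q107_1 q107_2 q107_3), missing label

tabulate occ_b, missing
```

The variable names occ_a and occ_b are placeholders. This only yields one category per occupation when nobody ticked more than one box.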
Maddie R

Join Date: Jul 2014

Posts: 27
#6

16 Mar 2015, 20:48

Excellent, thanks for the feedback everyone! Unfortunately, not my data - have just come on board to help out. You're absolutely right Clyde - wouldn't have done it like that myself but gotta just work with what you've got! Will give it a go and see if I get some better results.
Maddie R

Join Date: Jul 2014

Posts: 27
#7

16 Mar 2015, 21:35

Righto so used your first code Ben and was met with success at creating new binary variables for each occupation.
Tried the egen group() function again with these new variables ....and got the output below.
It seems the respondents were able to select more than one response and the egen command has listed every possible combination of occupations people listed.
Now what I guess I'm stuck with is how to just make the occupation variable say

1 = housework
2 = farming
3 = teacher
4 = .....
.......
9 = (realised there were 9 options!) I guess taking into account that people could have multiple responses.

Any further help would be so appreciated! Had no idea how complicated it was to get around this problem!

. egen occupation= group(occup1 occup2 occup3 occup4 occup5 occup6 occup7 occup8 occup9), label

. tab occupation

group(occup1 |
occup2 occup3 |
occup4 occup5 |
occup6 occup7 |
occup8 occup9) | Freq. Percent Cum.
------------------+-----------------------------------
0 0 0 0 0 0 0 0 0 | 3 0.64 0.64
0 0 0 0 0 0 0 0 1 | 51 10.83 11.46
0 0 0 0 0 0 0 1 0 | 5 1.06 12.53
0 0 0 0 0 0 1 0 0 | 2 0.42 12.95
0 0 0 0 0 1 0 0 0 | 7 1.49 14.44
0 0 0 0 0 1 0 0 1 | 1 0.21 14.65
0 0 0 0 1 0 0 0 0 | 2 0.42 15.07
0 0 0 1 0 0 0 0 0 | 16 3.40 18.47
0 0 1 0 0 0 0 0 0 | 7 1.49 19.96
0 1 0 0 0 0 0 0 0 | 248 52.65 72.61
0 1 0 0 0 0 0 0 1 | 36 7.64 80.25
0 1 0 0 0 0 0 1 0 | 1 0.21 80.47
0 1 0 0 0 0 1 0 0 | 2 0.42 80.89
0 1 0 0 0 0 1 0 1 | 1 0.21 81.10
0 1 0 0 0 1 0 0 0 | 4 0.85 81.95
0 1 0 0 1 0 0 0 0 | 1 0.21 82.17
1 0 0 0 0 0 0 0 0 | 18 3.82 85.99
1 0 0 0 0 0 0 0 1 | 10 2.12 88.11
1 0 0 0 0 0 1 0 0 | 1 0.21 88.32
1 0 0 0 0 1 0 0 0 | 1 0.21 88.54
1 0 0 0 1 0 0 0 0 | 1 0.21 88.75
1 0 0 1 0 0 0 0 0 | 1 0.21 88.96
1 1 0 0 0 0 0 0 0 | 41 8.70 97.66
1 1 0 0 0 0 0 0 1 | 7 1.49 99.15
1 1 0 0 0 0 0 1 0 | 1 0.21 99.36
1 1 0 0 0 0 1 0 0 | 1 0.21 99.58
1 1 0 1 0 0 0 0 0 | 2 0.42 100.00
------------------+-----------------------------------
Total | 471 100.00
ben earnhart

Join Date: May 2014

Posts: 1027
#8

16 Mar 2015, 22:06

There is no proper or pretty solution I can see.

For most purposes, the best you can do is the set of dummies, and say "percentages add to more than 100% since people may claim multiple occupations."
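
A minimal sketch of that kind of summary, assuming the 0/1 variables occup1 through occup9 created earlier in the thread are in memory: the mean of a 0/1 dummy is the proportion of respondents claiming that occupation, and N shows how many respondents were classified at all.

```stata
* proportion ticking each occupation (mean of a 0/1 dummy) and non-missing N
* occup? matches occup1 through occup9
tabstat occup?, statistics(mean n) columns(statistics)
```

Because respondents could tick several boxes, the proportions in the mean column will not sum to 1.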

Apart from that, if you needed to (for example) use the dummies in a regression, randomly assign multiple-category people into one of their categories. I don't have a well-thought-out or peer-tested (note, not peer-reviewed; my before-and-after approach is purely from discussions) approach to splitting somebody who has multiple memberships.
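
One way the random assignment could be sketched, again assuming occup1 through occup9 exist and are coded 0/1: draw an independent uniform random number for each ticked occupation and keep the occupation with the largest draw, so each ticked category is equally likely to be chosen. The names occ_single, best, and u and the seed value are illustrative only, not anything from the data set.

```stata
set seed 12345                 // arbitrary seed, only so the draw is reproducible

gen occ_single = .             // stays missing for anyone with no ticked occupation
gen double best = -1           // largest draw seen so far; runiform() is always above -1

forvalues i = 1/9 {
    * one random draw per respondent, only where occupation i was ticked
    gen double u = runiform() if occup`i' == 1
    * keep occupation i if its draw beats the best draw so far
    replace occ_single = `i' if u < . & u > best
    replace best = u if u < . & u > best
    drop u
}
drop best

tabulate occ_single, missing
```

The result depends on the seed, so any analysis built on occ_single is worth re-running under a few different seeds to see how sensitive it is.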

I'll sleep on it and see if I have any revelations, but realistically, it seems like maybe you need to sit down and see if *substantively* the combos can be lumped. Maybe you end up with 11 categories or something like that. Combos 010000001 and 110000000, for example, seem popular enough that they might deserve their own classifications.

I'm guessing there was inadequate pretesting of this instrument.
Richard Williams

Join Date: Apr 2014

Posts: 4992
#9

16 Mar 2015, 22:10

If there are 9 options and you can pick more than 1, then there are potentially 2^9 = 512 different combos. I guess you should be grateful that only 27 actually showed up in the data!

Maybe you want to stick with the 9 original variables, having them coded 1/0.

Otherwise I guess you have to figure out what to do with all those combos where more than 1 was selected. Do you have any criteria for only choosing 1 of the multiple choices?

FYI, output would be easier to read using code tags. Click on the underlined A on the upper right of the message pane and then click on the # sign.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Maddie R

Join Date: Jul 2014

Posts: 27
#10

16 Mar 2015, 22:18

Whoops, sorry for that guys - bit out of practice on the forum and forgot which was the correct output sign!
Thank you though again for your insight and advice, really appreciate your thoughts.
I think that I will have a chat to the CI on this project - with any luck this variable might not be a big deal and I can stop trying to fiddle around with it and focus on the actual important parts of my data analysis!