generating dummy variables for multiple categories when number of categories is quite large

Nico Ochmann

Join Date: Sep 2015

Posts: 119
#1

04 Aug 2017, 07:31

Dear Forum,

I have individual level data (pidp) and I know where the person was born (country) and whether or not she holds a UK or non-UK degree (degree=1). I would like to define dummies for country of origin of the degree assuming that a non-UK degree is obtained from the country of birth. All UK natives (country=500) received their degree in the UK, hence all UK folks who got a degree from abroad are dropped. As the reference or omitted dummy variable category, I would like to have UK degree. Countries 7 and 8 are just some generic countries.
I have 120 countries in my sample plus the UK (500). Below are the data for four generic individuals 1,2,3, and 4. D7 is a dummy if degree obtained in country 7 and D8 if degree obtained in country 8.
So far, after generating an indicator variable for each country, I have programmed this:
local x 1
while `x'<=120 {

gen D`x'= 0
replace D`x' = 1 if country`x' ==1 & degree==1

local x=`x'+1
}

Now here comes my problem. How do I generate the reference dummy category D500?
Note that it is turned on when either a UK person is present or a non-UK person with a UK degree (see D500 below).
It seems to me that I am doing something wrong, because in the example below I have three location-of-degree dummies but four countries to deal with.
I am not sure I am getting my point across, but I hope I have given all the information people need to see what I am trying to do.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 pidp str2(D7 D8) str4 D500 str7 country str6 degree
"1" "1" "0" "0" "7"   "1"
"2" "0" "1" "0" "8"   "1"
"3" "0" "0" "1" "7"   "0"
"4" "0" "0" "1" "500" "0"
""  ""  ""  ""  ""    ""
end

Help is as usual much appreciated.

Thanks in advance.

Nico
Clyde Schechter

Join Date: Apr 2014

Posts: 30068
#2

04 Aug 2017, 08:48

I suggest an entirely different approach. The data structure you are working with seems inappropriate to your needs. Also, if you are using a modern version of Stata there is almost never any need to actually create dummy variables, as Stata's factor variable notation (-help fvvarlist-) will create "virtual" dummy variables on the fly as you need them in regression and many other commands.

For some reason you have all of your variables as strings, when they clearly (except possibly pidp) have meaningful numeric content that you will be unable to exploit in their current form. So the first step is to convert them to numeric. (I will leave pidp as a string because it looks like its numeric content is merely a coding convention and you will never be relying on numeric properties of those values.)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 pidp str2(D7 D8) str4 D500 str7 country str6 degree
"1" "1" "0" "0" "7"   "1"
"2" "0" "1" "0" "8"   "1"
"3" "0" "0" "1" "7"   "0"
"4" "0" "0" "1" "500" "0"
""  ""  ""  ""  ""    ""
end

ds pidp, not
destring `r(varlist)', replace

This will set you up with numeric variables. Now, let's say you want to code some regression command and you need the country indicators in the model, and you want 500 to be the reference category. Then you can just do this:

Code:

regression_command outcome_variable predictor_variables ib500.country, options

Everything other than the ib500.country term is pseudo-code that you will replace with whatever is appropriate to your situation. The ib500.country term you will use literally: it tells Stata to create virtual "dummy" variables for the values of country and to use the value 500 as the reference category. No muss, no fuss, no errors.
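To see which virtual indicators a term such as ib500.country stands for without running a model, something like the following could be tried. This is only a sketch on the four example individuals from #1, entered here directly as numeric variables; the local macro name expanded is an arbitrary choice for illustration.

```stata
* Four example individuals from #1, entered as numeric variables
clear
input pidp country degree
1   7 1
2   8 1
3   7 0
4 500 0
end

* fvexpand lists the indicators implied by the factor-variable term;
* the level marked with "b" is the base (reference) category
fvexpand ib500.country
local expanded `r(varlist)'
display "`expanded'"
```

No new variables are created in the dataset; the expansion only exists for the duration of the command that uses it.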

Factor variable notation can be used with all official Stata estimation commands, and with most user-written commands of recent vintage. If you find yourself having to use a command that does not support factor variables (it will give you a message telling you that when you try to run it with this notation), then you can repost your original question in that context.
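For the case of a command that does not accept factor variables, a minimal sketch of creating real indicator variables for every observed value of country, without hard-coding the list of 120 countries, might look like this. It assumes country is numeric (as after the destring step) and again uses the four example individuals from #1. The names D7, D8 and D500 are generated from the country codes, and here they flag country of birth only, not degree location.

```stata
* Four example individuals from #1, entered as numeric variables
clear
input pidp country degree
1   7 1
2   8 1
3   7 0
4 500 0
end

* Collect the distinct values of country in a local macro
levelsof country, local(levels)

* One 0/1 indicator per observed value; missing where country is missing
foreach l of local levels {
    gen byte D`l' = (country == `l') if !missing(country)
}

list pidp country D7 D8 D500
```

When the indicators go into a model, the reference category (here D500) would have to be left out of the variable list by hand.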

Last edited by Clyde Schechter; 04 Aug 2017, 08:50.
Nico Ochmann

Join Date: Sep 2015

Posts: 119
#3

04 Aug 2017, 09:46

Dear Sir,

I would like to thank you very much for your immediate reply. It is quite useful, but I do not think I am there yet. I understand everything you write, but how do I generate the dummies as you describe when two pieces of information together determine whether a dummy equals one? You describe the case where only the country of birth defines the indicator. In my case, the dummy depends on two pieces of information: country of birth and UK or non-UK degree. This appears to be more complicated.

Further assistance is much appreciated.

Thanks.

Nico
Clyde Schechter

Join Date: Apr 2014

Posts: 30068
#4

04 Aug 2017, 10:08

So, it seems I don't understand what you are trying to do. Can you write out in words what you want under each of the following conditions:

1. country == 500 and degree from UK
2. country == 500 and degree not from UK
3. country != 500 and degree from UK
4. country != 500 and degree not from UK
Nico Ochmann

Join Date: Sep 2015

Posts: 119
#5

04 Aug 2017, 10:42

Hi,
you are quick and I really appreciate your help. I am sure I am not expressing myself well enough. So let me try again, please. Your conditions are very useful and I shall write this:

2. dropped from the sample, this condition does not apply. For various reasons, I have decided to get rid of these observations.

So, I would like to combine 1., 3., and 4. so that I get dummy variables for the countries in which a non-UK degree was obtained (condition 4), with a dummy for UK degrees as the reference category (conditions 1 and 3).

Basically, I would like to generate dummies for the location or origin of the degree, with UK location or origin being the reference category.

I hope this post helps a bit.

I highly appreciate your further help.

Nico
Clyde Schechter

Join Date: Apr 2014

Posts: 30068
#6

04 Aug 2017, 11:28

Perhaps I understand it a little better now. It seems you want your indicators (dummies) to represent separate categories such as:

French Born Non-UK Degree
French Born UK Degree
German Born Non-UK Degree
German Born UK Degree
etc.
except that the category UK Born Non-UK Degree does not occur.

So, except for the last clause, you are actually asking for the interaction of birth country and uk/non-uk. In factor-variable notation this would be denoted as i.country#i.degree. (I understand your variable degree to mean not whether the person has a degree, but whether degree was a UK degree--do I have that right?) While it is not a good idea to use interactions where many of the cross-classified categories are not instantiated in the data, it sounds as if the only systematically missing group here is UK Born Non-UK Degree. If there aren't many other categories like that which are, just by happenstance, not instantiated in the data, it would be reasonable to use ib500.country#i.degree. This can sometimes lead to technical difficulties, but is workable for most purposes.

Another approach would be to do this:

Code:

egen country_UK_degree = group(country degree)
summ country_UK_degree if country == 500, meanonly
local base_cat `r(mean)'

and then use ib`base_cat'.country_UK_degree to represent the indicator variables.

This code first creates a new variable that assigns a different integer value to each combination of the variables country and degree. The next two lines ascertain which value of this newly created variable corresponds to country == 500 and save that value in the local macro base_cat. Subsequent references to ib`base_cat'.country_UK_degree will cause Stata to create "virtual" indicator variables with UK born as the base category.

The drawback to this last approach is that if you are at some point going to be interested in the separate effects of birth country and UK degree, they are too entangled with each other to separate. But if that is not an issue, then this has the advantage of avoiding indicators that encode empty classifications.
Is this on the right track?
Nico Ochmann

Join Date: Sep 2015

Posts: 119
#7

04 Aug 2017, 12:01

Hello, I think we are almost there, but not entirely.

French Born Non-UK Degree: yes this should be equal to one for a dummy defined as "French degree" and zero otherwise
German Born Non-UK Degree: yes, equal to one for a dummy variable defined as "German degree" and zero otherwise

UK person with UK Degree, French Born UK Degree, German Born UK Degree, Russian Born UK Degree: all these should be combined in one dummy, equal to one for "UK degree" and zero otherwise.

I think we are pretty close now. This is what I want, and thanks for helping me clarify my thoughts.

Thanks again for your help.

Nico
Clyde Schechter

Join Date: Apr 2014

Posts: 30068
#8

04 Aug 2017, 14:17

Code:

* Assumes degree == 1 marks a UK degree; swap the two conditions if it marks a non-UK degree
gen country_degree = country if degree == 0
replace country_degree = 500 if degree == 1

Now refer to ib500.country_degree.
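A minimal sketch of this recoding on the four example individuals from #1, under the assumption that degree == 1 marks a non-UK degree, as the hand-coded D7, D8 and D500 in that example suggest. The conditions are therefore the mirror image of those written for degree == 1 meaning a UK degree. The variable name degree_loc is hypothetical, chosen only for this illustration.

```stata
* Four example individuals from #1, entered as numeric variables
clear
input pidp country degree
1   7 1
2   8 1
3   7 0
4 500 0
end

* Assumption: degree == 1 means a non-UK degree, obtained in the birth country.
* Degree location is the birth country for non-UK degrees and 500 (UK) otherwise;
* observations with missing degree are left missing rather than coded as UK.
gen degree_loc = cond(degree == 1, country, 500) if !missing(degree)

* Persons 1 and 2 get 7 and 8; persons 3 and 4 both get 500
list pidp country degree degree_loc
tabulate degree_loc
```

In an estimation command, ib500.degree_loc would then give one indicator per foreign degree location, with the UK degree as the omitted category.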
Nico Ochmann

Join Date: Sep 2015

Posts: 119
#9

04 Aug 2017, 16:30

Oh yes, that is it. Brilliant! I cannot thank you enough for your help!
Looking at it now, it is not that difficult after all, but things are always easy in hindsight.

Thank you very much.

Have a great weekend!

Nico