Generating dummy variables!

Vijay Kumar

Join Date: Jul 2016

Posts: 24
#1

05 Aug 2016, 11:50

Hi all! I am working on a student school admission data set which has the admission status of each child, the order of preference of the schools applied to, and a set of SES variables for each child. I want to create a dummy variable for each school. To be clear, this is what my data look like:
A, B, C, D, E are school names; a, b, c, d, e are student ids; P1-P5 are the school preferences.
id  P1  P2  P3  P4  P5
a   A   B   C
b   C   B   A   D
c   A   B   C   D   E
d   D   E
e   C   A   E

I now want to create dummy variables for A, B, C, D, E and make my data set look like this:
id  P1  P2  P3  P4  P5  A  B  C  D  E
a   A   B   C           1  1  1  0  0
b   C   B   A   D       1  1  1  1  0
c   A   B   C   D   E   1  1  1  1  1
d   D   E               0  0  0  1  1
e   C   A   E           1  0  1  0  1

Obviously, tab P1, generate(s) doesn't work here.
Thank you!

Nick Cox

Join Date: Mar 2014
Posts: 35520
#2

05 Aug 2016, 12:03

Using dataex (SSC) is preferred please (FAQ Advice #12).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1(id p1 p2 p3 p4 p5)
"a" "A" "B" "C" ""  "" 
"b" "C" "B" "A" "D" "" 
"c" "A" "B" "C" "D" "E"
"d" "D" "E" ""  ""  "" 
"e" "C" "A" "E" ""  "" 
end

foreach v in A B C D E {
     gen `v' = 0
     quietly forval j = 1/5 {
           replace `v' = 1 if p`j' == "`v'"
     }
}


 list 

     +-------------------------------------------------+
     | id   p1   p2   p3   p4   p5   A   B   C   D   E |
     |-------------------------------------------------|
  1. |  a    A    B    C             1   1   1   0   0 |
  2. |  b    C    B    A    D        1   1   1   1   0 |
  3. |  c    A    B    C    D    E   1   1   1   1   1 |
  4. |  d    D    E                  0   0   0   1   1 |
  5. |  e    C    A    E             1   0   1   0   1 |
     +-------------------------------------------------+


Mathias Pedersen Heinze

Join Date: Jun 2015

Posts: 78
#3

05 Aug 2016, 12:25

Dear Vijay,

This should do the trick:

Code:

foreach x in A B C D E {
    gen `x' = 0
    replace `x' = 1 if inlist("`x'", p1, p2, p3, p4, p5)
}
Nick Cox

Join Date: Mar 2014

Posts: 35520
#4

05 Aug 2016, 13:25

Mathias' trick is better. See also (e.g.) http://www.stata-journal.com/sjpdf.h...iclenum=dm0058 for more discussion.

In fact, this will work too:

Code:

foreach x in A B C D E {
    gen `x' = inlist("`x'", p1, p2, p3, p4, p5)
}
Mathias Pedersen Heinze

Join Date: Jun 2015

Posts: 78
#5

05 Aug 2016, 17:24

Originally posted by Nick Cox

See also (e.g.) http://www.stata-journal.com/sjpdf.h...iclenum=dm0058 for more discussion.

That is indeed a great column, Nick. I will make sure to distribute it next time I teach Stata.
Vijay Kumar

Join Date: Jul 2016

Posts: 24
#6

05 Aug 2016, 20:04

Dear Nick and Mathias, thank you so much for the code. However, I am not able to use it as it stands, because I have a large number of schools (503) and preference variables p1-p178. How do I get all my school ids/names into the first line of the code?

foreach x in A B C D E

I am hoping this will work for the second line:

gen `x' = inlist("`x'", p1-p178)

Thanks again!

Vijay Kumar

Join Date: Jul 2016
Posts: 24
#7

05 Aug 2016, 21:15

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double id str7(p1 p2 p3 p4 p5 p6 p7)
20160000015 "1412148" "1413329" "1413240" "1413245" "1413215" "1411253" "1411219"
20160000041 "1821202" "1821210" "1821232" "."       "."       "."       "."      
20160000053 "1617224" "1617166" "1617186" "1617204" "1617220" "1411253" "1617192"
20160000058 "1821202" "1821146" "1821210" "1618189" "."       "."       "."      
20160000088 "1618249" "1821143" "1618232" "1514087" "1720147" "1514073" "1617192"
20160000090 "1720158" "1720136" "1514085" "1514086" "1515105" "1516110" "1516113"
20160000097 "1003262" "1002318" "1002294" "1001197" "1002306" "1003217" "1003225"
20160000106 "1617204" "1617186" "1617192" "1617181" "1617229" "1618253" "1515114"
20160000107 "1617192" "1617204" "."       "."       "."       "."       "."      
20160000115 "1412257" "1412152" "1413330" "1413225" "1413227" "1413329" "1412148"
20160000122 "1617229" "1411253" "1617182" "1617186" "1413283" "1413183" "1413215"
20160000149 "1411219" "."       "."       "."       "."       "."       "."      
20160000150 "1923255" "1720142" "1720145" "."       "."       "."       "."      
20160000157 "1413329" "1411253" "1617229" "."       "."       "."       "."      
20160000228 "1411183" "1411214" "1411201" "1207188" "1411199" "1207186" "1207181"
20160000264 "1617192" "1617229" "1617186" "1617182" "1617181" "1411253" "1514086"
20160000270 "1516116" "."       "."       "."       "."       "."       "."      
20160000276 "1105208" "1106190" "1104301" "1105233" "1106218" "1104310" "1104275"
20160000283 "1411253" "1412135" "1412141" "1412151" "1412156" "1413205" "1413215"
20160000288 "1821141" "1821232" "1821168" "1821145" "1821151" "1821152" "1821143"
end

This is what my data look like, though I have 178 preference variables, p1-p178. Thanks in advance for your help.

Last edited by Vijay Kumar; 05 Aug 2016, 21:20.


Nick Cox

Join Date: Mar 2014

Posts: 35520
#8

06 Aug 2016, 02:31

If you change the question radically the answer may change! You want how many dummies? Is it one for every school mentioned?
Vijay Kumar

Join Date: Jul 2016

Posts: 24
#9

06 Aug 2016, 02:47

Hi Nick. I want to create a dummy for each of my 503 schools. The dummy should take a value 1 if the school is in the preference list of the student and 0 otherwise. Thank you.
Nick Cox

Join Date: Mar 2014

Posts: 35520
#10

06 Aug 2016, 02:52

Are you sure you know how you want to use those variables? You can do it but it may not be a good idea. Do you have the school identifiers anywhere in a single variable or do they appear only in the preference variables?

Vijay Kumar

Join Date: Jul 2016
Posts: 24

#11

06 Aug 2016, 05:25

Hi Nick. I want to run a linear probability or logit model to estimate the predicted probability of admission for each of the 15761 children in my data, with dummies for neighborhoods (1283) and schools (503) as the independent variables. I do have the 503 school identifiers in the variable schoolid, which holds the school identifier if the child was admitted and "." if not. My data, with 7 preferences, schoolid, and the neighborhood variable, look like this.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double id str50 N1 str7(p1 p2 p3 p4 p5 p6 p7 schoolid) float treatment
20160000015 "Rohini Extension" "1412148" "1413329" "1413240" "1413245" "1413215" "1411253" "1411219" "1412148" 1
20160000041 "Kakraula Village" "1821202" "1821210" "1821232" "."       "."       "."       "."       "."       0
20160000053 "JJ Colony 1"      "1617224" "1617166" "1617186" "1617204" "1617220" "1411253" "1617192" "."       0
20160000058 "Sadh Nagar I"     "1821202" "1821146" "1821210" "1618189" "."       "."       "."       "."       0
20160000088 "Uttam Nagar East" "1618249" "1821143" "1618232" "1514087" "1720147" "1514073" "1617192" "."       0
20160000090 "Budh Nagar"       "1720158" "1720136" "1514085" "1514086" "1515105" "1516110" "1516113" "1720158" 1
20160000097 "BlockWb"          "1003262" "1002318" "1002294" "1001197" "1002306" "1003217" "1003225" "1105191" 1
20160000106 "Nihal Vihar"      "1617204" "1617186" "1617192" "1617181" "1617229" "1618253" "1515114" "1617204" 1
20160000107 "GH 6"             "1617192" "1617204" "."       "."       "."       "."       "."       "."       0
20160000115 "Karala"           "1412257" "1412152" "1413330" "1413225" "1413227" "1413329" "1412148" "1413330" 1
20160000122 "Block O"          "1617229" "1411253" "1617182" "1617186" "1413283" "1413183" "1413215" "."       0
20160000149 "Pitampura"        "1411219" "."       "."       "."       "."       "."       "."       "1411219" 1
20160000150 "Block A"          "1923255" "1720142" "1720145" "."       "."       "."       "."       "."       0
20160000157 "Sector 20"        "1413329" "1411253" "1617229" "."       "."       "."       "."       "."       0
20160000228 "Block A"          "1411183" "1411214" "1411201" "1207188" "1411199" "1207186" "1207181" "."       0
20160000264 "GH 6"             "1617192" "1617229" "1617186" "1617182" "1617181" "1411253" "1514086" "."       0
20160000270 "BlockB"           "1516116" "."       "."       "."       "."       "."       "."       "."       0
20160000276 "Taharpur Village" "1105208" "1106190" "1104301" "1105233" "1106218" "1104310" "1104275" "1104284" 1
20160000283 "Block F"          "1411253" "1412135" "1412141" "1412151" "1412156" "1413205" "1413215" "."       0
20160000288 "Palam Village"    "1821141" "1821232" "1821168" "1821145" "1821151" "1821152" "1821143" "1821173" 1
end

Thank you so much!


Nick Cox

Join Date: Mar 2014

Posts: 35520
#12

06 Aug 2016, 05:35

I am not convinced that fitting a model with so many predictors is in your own best interests. Even with plain regression, the ratio of observations to parameters should be much more than about 10; with logit-type models you should want many, many more.

It's a personal rule not to give code willingly for what looks like a bad idea.

But

1. What's clear with so many schools is that the inlist() trick is ruled out.

2. The basic coding principle is just like the code in #2. You clearly don't want to type out all the values, which is trivial with A B C D E, but you can try automating their collection with levelsof.

Nevertheless there are many people here who do this kind of analysis routinely and they should have better and more detailed advice, but I think you would need to start a new thread to get that.
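
As a rough sketch of point 2 only (not code posted by anyone in this thread, and no endorsement of fitting a model with this many dummies), collecting the school identifiers might be automated along the following lines. It assumes string preference variables p1, p2, ... as in the dataex examples, with "." marking an unused preference slot. The prefix s on the new variable names is a hypothetical choice, needed because Stata variable names cannot begin with a digit. The three-observation data set is made up for illustration.

```stata
* Illustrative data only: 3 students, 3 preference variables
clear
input double id str7(p1 p2 p3)
1 "1412148" "1413329" "."
2 "1413329" "."       "."
3 "1617192" "1412148" "1413329"
end

* Collect the distinct school ids found in any preference variable
preserve
keep id p*
reshape long p, i(id) j(rank)
drop if inlist(p, ".", "")
levelsof p, local(schools) clean
restore

* One 0/1 indicator per school, same principle as the loop in #2
foreach s of local schools {
    gen byte s`s' = 0
    foreach v of varlist p1-p3 {
        quietly replace s`s' = 1 if `v' == "`s'"
    }
}

list
```

With the full data, the varlist p1-p3 would become p1-p178, and the loop would create one indicator for every school that appears in at least one preference variable.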

Last edited by Nick Cox; 06 Aug 2016, 05:37.
Vijay Kumar

Join Date: Jul 2016

Posts: 24
#13

06 Aug 2016, 05:45

Thanks a lot Nick. Really appreciate your time and advice. I will rethink my strategy and start a new thread, if required.