  • Generating dummy variables!

    Hi all! I am working on a student school admission data set that records the admission status of each child, the order of preference of the schools applied to, and a set of SES variables for each child. I want to create a dummy variable for each school. To be clear, this is what my data look like:
    A, B, C, D, E are school names; a, b, c, d, e are student IDs; P1-P5 are the school preferences.
    id   P1   P2   P3   P4   P5
    a    A    B    C
    b    C    B    A    D
    c    A    B    C    D    E
    d    D    E
    e    C    A    E
    I now want to create dummy variables for A, B, C, D, E so that my data set looks like this:
    id   P1   P2   P3   P4   P5   A   B   C   D   E
    a    A    B    C              1   1   1   0   0
    b    C    B    A    D         1   1   1   1   0
    c    A    B    C    D    E    1   1   1   1   1
    d    D    E                   0   0   0   1   1
    e    C    A    E              1   0   1   0   1
    Obviously, tab P1, generate(s) doesn't work here.
    Thank you!

  • #2
    Please use dataex (SSC) to show data examples (FAQ Advice #12).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1(id p1 p2 p3 p4 p5)
    "a" "A" "B" "C" ""  "" 
    "b" "C" "B" "A" "D" "" 
    "c" "A" "B" "C" "D" "E"
    "d" "D" "E" ""  ""  "" 
    "e" "C" "A" "E" ""  "" 
    end
    
    foreach v in A B C D E {
         gen `v' = 0
         quietly forval j = 1/5 {
               replace `v' = 1 if p`j' == "`v'"
         }
    }
    
    
     list 
    
         +-------------------------------------------------+
         | id   p1   p2   p3   p4   p5   A   B   C   D   E |
         |-------------------------------------------------|
      1. |  a    A    B    C             1   1   1   0   0 |
      2. |  b    C    B    A    D        1   1   1   1   0 |
      3. |  c    A    B    C    D    E   1   1   1   1   1 |
      4. |  d    D    E                  0   0   0   1   1 |
      5. |  e    C    A    E             1   0   1   0   1 |
         +-------------------------------------------------+



    • #3
      Dear Vijay,

      This should do the trick:
      Code:
      foreach x in A B C D E {
          gen `x' = 0
          replace `x' = 1 if inlist("`x'", p1, p2, p3, p4, p5)
      }



      • #4
        Mathias' trick is better. See also (e.g.) http://www.stata-journal.com/sjpdf.h...iclenum=dm0058 for more discussion.

        In fact, this will work too:

        Code:
        foreach x in A B C D E {
            gen `x' = inlist("`x'", p1, p2, p3, p4, p5)
        }



        • #5
          Originally posted by Nick Cox:
          See also (e.g.) http://www.stata-journal.com/sjpdf.h...iclenum=dm0058 for more discussion.
          That is indeed a great column, Nick. I will make sure to distribute it next time I teach Stata.



          • #6
            Dear Nick and Mathias, thank you so much for the code. However, I cannot use it as it stands, because I have a large number of schools (503) and preferences running from P1 to P178. How do I get all my school IDs/names into the first line of the code?

            foreach x in A B C D E

            I am hoping this will work for the second line:
            gen `x' = inlist("`x'", p1-p178)
            Thanks again!



            • #7
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double id str7(p1 p2 p3 p4 p5 p6 p7)
              20160000015 "1412148" "1413329" "1413240" "1413245" "1413215" "1411253" "1411219"
              20160000041 "1821202" "1821210" "1821232" "."       "."       "."       "."      
              20160000053 "1617224" "1617166" "1617186" "1617204" "1617220" "1411253" "1617192"
              20160000058 "1821202" "1821146" "1821210" "1618189" "."       "."       "."      
              20160000088 "1618249" "1821143" "1618232" "1514087" "1720147" "1514073" "1617192"
              20160000090 "1720158" "1720136" "1514085" "1514086" "1515105" "1516110" "1516113"
              20160000097 "1003262" "1002318" "1002294" "1001197" "1002306" "1003217" "1003225"
              20160000106 "1617204" "1617186" "1617192" "1617181" "1617229" "1618253" "1515114"
              20160000107 "1617192" "1617204" "."       "."       "."       "."       "."      
              20160000115 "1412257" "1412152" "1413330" "1413225" "1413227" "1413329" "1412148"
              20160000122 "1617229" "1411253" "1617182" "1617186" "1413283" "1413183" "1413215"
              20160000149 "1411219" "."       "."       "."       "."       "."       "."      
              20160000150 "1923255" "1720142" "1720145" "."       "."       "."       "."      
              20160000157 "1413329" "1411253" "1617229" "."       "."       "."       "."      
              20160000228 "1411183" "1411214" "1411201" "1207188" "1411199" "1207186" "1207181"
              20160000264 "1617192" "1617229" "1617186" "1617182" "1617181" "1411253" "1514086"
              20160000270 "1516116" "."       "."       "."       "."       "."       "."      
              20160000276 "1105208" "1106190" "1104301" "1105233" "1106218" "1104310" "1104275"
              20160000283 "1411253" "1412135" "1412141" "1412151" "1412156" "1413205" "1413215"
              20160000288 "1821141" "1821232" "1821168" "1821145" "1821151" "1821152" "1821143"
              end
              This is what my data look like, though I have 178 preference variables, p1-p178. Thanks in advance for your help.
              Last edited by Vijay Kumar; 05 Aug 2016, 21:20.



              • #8
                If you change the question radically the answer may change! You want how many dummies? Is it one for every school mentioned?



                • #9
                  Hi Nick. I want to create a dummy for each of my 503 schools. The dummy should take the value 1 if the school is in the student's preference list and 0 otherwise. Thank you.



                  • #10
                    Are you sure you know how you want to use those variables? You can do it but it may not be a good idea. Do you have the school identifiers anywhere in a single variable or do they appear only in the preference variables?



                    • #11
                      Hi Nick. I want to run a linear probability/logit model to estimate the predicted probability of admission for each of the 15761 children in my data, with dummies for neighborhoods (1283 neighborhoods) and schools (503) as the independent variables. I do have the 503 school identifiers in the variable schoolid, which takes the value of the school identifier if the child was admitted and "." if not. My data, with 7 preferences plus the schoolid and neighborhood variables, look like this.
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input double id str50 N1 str7(p1 p2 p3 p4 p5 p6 p7 schoolid) float treatment
                      20160000015 "Rohini Extension" "1412148" "1413329" "1413240" "1413245" "1413215" "1411253" "1411219" "1412148" 1
                      20160000041 "Kakraula Village" "1821202" "1821210" "1821232" "."       "."       "."       "."       "."       0
                      20160000053 "JJ Colony 1"      "1617224" "1617166" "1617186" "1617204" "1617220" "1411253" "1617192" "."       0
                      20160000058 "Sadh Nagar I"     "1821202" "1821146" "1821210" "1618189" "."       "."       "."       "."       0
                      20160000088 "Uttam Nagar East" "1618249" "1821143" "1618232" "1514087" "1720147" "1514073" "1617192" "."       0
                      20160000090 "Budh Nagar"       "1720158" "1720136" "1514085" "1514086" "1515105" "1516110" "1516113" "1720158" 1
                      20160000097 "BlockWb"          "1003262" "1002318" "1002294" "1001197" "1002306" "1003217" "1003225" "1105191" 1
                      20160000106 "Nihal Vihar"      "1617204" "1617186" "1617192" "1617181" "1617229" "1618253" "1515114" "1617204" 1
                      20160000107 "GH 6"             "1617192" "1617204" "."       "."       "."       "."       "."       "."       0
                      20160000115 "Karala"           "1412257" "1412152" "1413330" "1413225" "1413227" "1413329" "1412148" "1413330" 1
                      20160000122 "Block O"          "1617229" "1411253" "1617182" "1617186" "1413283" "1413183" "1413215" "."       0
                      20160000149 "Pitampura"        "1411219" "."       "."       "."       "."       "."       "."       "1411219" 1
                      20160000150 "Block A"          "1923255" "1720142" "1720145" "."       "."       "."       "."       "."       0
                      20160000157 "Sector 20"        "1413329" "1411253" "1617229" "."       "."       "."       "."       "."       0
                      20160000228 "Block A"          "1411183" "1411214" "1411201" "1207188" "1411199" "1207186" "1207181" "."       0
                      20160000264 "GH 6"             "1617192" "1617229" "1617186" "1617182" "1617181" "1411253" "1514086" "."       0
                      20160000270 "BlockB"           "1516116" "."       "."       "."       "."       "."       "."       "."       0
                      20160000276 "Taharpur Village" "1105208" "1106190" "1104301" "1105233" "1106218" "1104310" "1104275" "1104284" 1
                      20160000283 "Block F"          "1411253" "1412135" "1412141" "1412151" "1412156" "1413205" "1413215" "."       0
                      20160000288 "Palam Village"    "1821141" "1821232" "1821168" "1821145" "1821151" "1821152" "1821143" "1821173" 1
                      end
                      Thank you so much!



                      • #12
                        I am not convinced that fitting a model with so many predictors is in your own best interests. Even with plain regression, the ratio of observations to parameters should be much more than about 10; with logit-type models you should want many, many more.
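
                        To put rough numbers on that rule of thumb, here is the arithmetic using the counts given in #11 (15761 children, 1283 neighborhoods, 503 schools). It ignores the constant and the omitted base categories, which is a simplifying assumption:

                        ```stata
                        * Approximate observations per parameter if every neighborhood
                        * and every school gets its own dummy (counts taken from #11)
                        display 15761 / (1283 + 503)
                        ```

                        That works out to roughly 8.8 observations per parameter, which is under the figure of about 10 even for plain regression.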

                        It's a personal rule not to give code willingly for what looks like a bad idea.

                        But

                        1. What's clear with so many schools is that the inlist() trick is ruled out.

                        2. The basic coding principle is just like the code in #2. You clearly don't want to type all the values, which is trivial with A B C D E but not with 503 schools, so you can try automating their collection with levelsof.
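
                        A minimal sketch of point 2, not tested on the full data. It assumes the preferences are string variables named p1-p178 and that "." or an empty string marks an unused slot, as in the example in #11. The prefix s on the new variable names is a hypothetical choice, needed because a bare number such as 1412148 is not a legal Stata variable name. On point 1, inlist() wants its arguments written out one by one and limits how many string arguments it accepts, so inlist("`x'", p1-p178) is not an option.

                        ```stata
                        * Collect every distinct school identifier found in p1-p178
                        local schools
                        foreach v of varlist p1-p178 {
                            quietly levelsof `v' if !inlist(`v', "", "."), local(these) clean
                            * Union of the identifiers seen so far and those in this variable
                            local schools : list schools | these
                        }

                        * One 0/1 indicator per school, following the loop in #2
                        foreach s of local schools {
                            gen byte s`s' = 0
                            quietly foreach v of varlist p1-p178 {
                                replace s`s' = 1 if `v' == "`s'"
                            }
                        }
                        ```

                        To try this on the seven-preference example in #11, replace p1-p178 with p1-p7 in both loops.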

                        Nevertheless there are many people here who do this kind of analysis routinely and they should have better and more detailed advice, but I think you would need to start a new thread to get that.
                        Last edited by Nick Cox; 06 Aug 2016, 05:37.



                        • #13
                          Thanks a lot Nick. Really appreciate your time and advice. I will rethink my strategy and start a new thread, if required.
