Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Split issue

    I have a dataset with data by Congressional district:

    . dataex congdist in 1/5


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 congdist
    "AK-0"
    "AL-1"
    "AL-2"
    "AL-3"
    "AL-4"
    end


    I need to split the district state and district numbers (for merging with another data set) so that the hyphen is dropped and I get, e.g.,

    AK 0
    AL:1,

    etc. Tried split, didn't work. Any help would be appreciated.

    Thanks,

    Ric


  • #2
    Did you specify p(-) in your split command?

    The FAQ Advice spells out that you should give the exact syntax you used and not limit error reports to wording such as “didn’t work”.


    If you want strings such as AK 0 or AK:1 then the function subinstr() should be used.

    Comment


    • #3
      Here is what I tried to do with split:

      . split congdist1,ignore("-") destring
      variable born as string:
      congdist11
      congdist11: contains characters not specified in ignore(); no replace

      I read the manual on subinstr but could not figure out what to do.

      Comment


      • #4
        in #1, I don't understand why the 2 lines you show as desired are different:

        I need to split the district state and district numbers (for merging with another data set) so that the hyphen is dropped and I get, e.g.,

        AK 0
        AL:1
        that is, why is there a space between the "K" and the "0" in the first line but a colon between the "L" and the "1" in the second line?

        Comment


        • #5
          As in #2 the option you need is parse(-) not ignore(-) (assuming that is that you two variables out of one.

          If you want to replace (e.g.) dashes with colons

          Code:
          replace congdist1 = subinstr(congdist1, “-“, “:”,.)

          Comment


          • #6
            Youre looking for parse, not the ignore option.
            Code:
            split congdist, parse("-") destring

            Comment


            • #7
              Thanks.

              Comment


              • #8
                Sorry for another post. I now have data on states and CDs on a a single variable. But they are organized in a way that I cannot use. Here is what I have:

                [CODE]
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str5 congdist1 str22 statecdagg
                "CA 1" "California 1"
                "CA 10" "California 10"
                "CA 11" "California 11"
                "CA 12" "California 12"
                "CA 13" "California 13"
                "CA 14" "California 14"
                "CA 15" "California 15"
                "CA 16" "California 16"
                "CA 17" "California 17"
                "CA 18" "California 18"
                "CA 19" "California 19"
                "CA 2" "California 2"
                "CA 20" "California 20"
                "CA 21" "California 21"
                "CA 22" "California 22"
                "CA 23" "California 23"
                "CA 24" "California 24"
                "CA 25" "California 25"
                "CA 26" "California 26"
                "CA 27" "California 27"
                "CA 28" "California 28"
                "CA 29" "California 29"
                "CA 3" "California 3"
                "CA 30" "California 30"
                "CA 31" "California 31"
                "CA 32" "California 32"
                "CA 33" "California 33"
                "CA 34" "California 34"
                "CA 35" "California 35"
                "CA 36" "California 36"
                "CA 37" "California 37"
                "CA 38" "California 38"
                "CA 39" "California 39"
                "CA 4" "California 4"
                "CA 40" "California 40"
                "CA 41" "California 41"
                "CA 42" "California 42"
                "CA 43" "California 43"
                "CA 44" "California 44"
                en
                The problem is that the 2nd district of California (same problem for NY, NJ Ohio, PA, and Washington comes after the 19th, the 3rd after the 29th, the 4th after the 39th. Changing 2 to 02 doesn't help. Is there a way to get the data organized in the proper numerical order maintaining the state name in a single variable. I tried concatenate and split and could not get either to work.

                Thanks.

                Comment


                • #9
                  The principle here is that if you want something to sort in numerical order, it has to be a number, not a number that is stored as a string. (Strings sort character by character; numbers sort by the entire numeric value.) And, splitting things into separate variables can facilitate sorting.

                  Code:
                  // working from your most recent data example
                  split congdist1, generate(temp) parse(" ")
                  gen str stateabbrev = temp1
                  gen distnum = real(temp2) // -destring- also would work
                  drop temp1 temp2
                  sort stateabbrev distnum


                  Last edited by Mike Lacy; 23 Sep 2019, 11:45.

                  Comment


                  • #10
                    Perfect, thanks Mike.

                    Comment


                    • #11
                      @Mike Lacy's solution is broadly what I would favour here.

                      As a footnote, I address the assertion that "Changing 2 to 02 doesn't help". As commented in #2 it's hard to address such comments in the absence of code that you tried or an explanation of what went wrong.

                      I did this

                      Code:
                      clear
                      input str5 congdist1 str22 statecdagg
                      "CA 1" "California 1" 
                      "CA 10" "California 10"
                      "CA 11" "California 11"
                      "CA 12" "California 12"
                      "CA 13" "California 13"
                      "CA 14" "California 14"
                      "CA 15" "California 15"
                      "CA 16" "California 16"
                      "CA 17" "California 17"
                      "CA 18" "California 18"
                      "CA 19" "California 19"
                      "CA 2" "California 2" 
                      end 
                      
                      foreach v in congdist1 statecdagg { 
                          replace `v' = word(`v',1) + " 0" + word(`v', 2) if length(word(`v', 2)) == 1 
                      }
                      
                      sort congdist1 
                      
                      list, sep(0) 
                      
                           +--------------------------+
                           | congdi~1      statecdagg |
                           |--------------------------|
                        1. |    CA 01   California 01 |
                        2. |    CA 02   California 02 |
                        3. |    CA 10   California 10 |
                        4. |    CA 11   California 11 |
                        5. |    CA 12   California 12 |
                        6. |    CA 13   California 13 |
                        7. |    CA 14   California 14 |
                        8. |    CA 15   California 15 |
                        9. |    CA 16   California 16 |
                       10. |    CA 17   California 17 |
                       11. |    CA 18   California 18 |
                       12. |    CA 19   California 19 |
                           +--------------------------+
                      showing that inserting zeros before single digits 1 to 9 achieves the goal of sorting correctly. The codes 01 to 09 are surely awkward if not ugly otherwise, but that's not the small point.

                      Comment


                      • #12
                        A space would also be sorted before any numbers. Slightly easier for a human to read, but more prone to future errors when referring to specific districts.
                        Code:
                        foreach v in congdist1 statecdagg { 
                            replace `v' = word(`v',1) + "  " + word(`v', 2) if length(word(`v', 2)) == 1 
                        }

                        Comment


                        • #13
                          Sorry when I try to merge two data sets the problem remains.

                          For the master data set, cces2018aggnew., I have the following:

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input str22 statecdnew str2 stateabbrevnew str22 statecdagg str2 stateabbrev str5(congdist congdist1) int congdist2 float(black agnostic)
                          "Alabama 1"     "AK" "Alaska 1"     "AK" "AK-0" "AK 0"  1  .22522523 .06306306
                          "Alabama 2"     "AL" "Alabama 1"    "AL" "AL-1" "AL 1"  2  .24561404 .02631579
                          "Alabama 3"     "AL" "Alabama 2"    "AL" "AL-2" "AL 2"  3   .0879121 .05555556
                          "Alabama 4"     "AL" "Alabama 3"    "AL" "AL-3" "AL 3"  4  .03030303 .04081633
                          "Alabama 5"     "AL" "Alabama 4"    "AL" "AL-4" "AL 4"  5  .11538462 .06153846
                          "Alabama 6"     "AL" "Alabama 5"    "AL" "AL-5" "AL 5"  6       .104      .064
                          "Alabama 7"     "AL" "Alabama 6"    "AL" "AL-6" "AL 6"  7  .55172414 .06034483
                          "Alaska 1"      "AL" "Alabama 7"    "AL" "AL-7" "AL 7"  8  .02247191  .2022472
                          "Arizona 1"     "AR" "Arkansas 1"   "AR" "AR-1" "AR 1"  9  .01904762 .07619048
                          "Arizona 2"     "AR" "Arkansas 2"   "AR" "AR-2" "AR 2" 10    .046875 .11518325
                          "Arizona 3"     "AR" "Arkansas 3"   "AR" "AR-3" "AR 3" 11 .032608695 .10869565
                          "Arizona 4"     "AR" "Arkansas 4"   "AR" "AR-4" "AR 4" 12 .016574586 .07734807
                          "Arizona 5"     "AZ" "Arizona 1"    "AZ" "AZ-1" "AZ 1" 13 .018404908 .08588957
                          "Arizona 6"     "AZ" "Arizona 2"    "AZ" "AZ-2" "AZ 2" 14  .01734104  .0867052
                          "Arizona 7"     "AZ" "Arizona 3"    "AZ" "AZ-3" "AZ 3" 15  .12698413  .0952381
                          "Arizona 8"     "AZ" "Arizona 4"    "AZ" "AZ-4" "AZ 4" 16  .04040404 .05076142
                          "Arizona 9"     "AZ" "Arizona 5"    "AZ" "AZ-5" "AZ 5" 17  .02484472  .0931677
                          "Arkansas 1"    "AZ" "Arizona 6"    "AZ" "AZ-6" "AZ 6" 18  .08333334    .03125
                          "Arkansas 2"    "AZ" "Arizona 7"    "AZ" "AZ-7" "AZ 7" 19  .16528925 .07438017
                          "Arkansas 3"    "AZ" "Arizona 8"    "AZ" "AZ-8" "AZ 8" 20  .03846154 .05384615
                          "Arkansas 4"    "AZ" "Arizona 9"    "AZ" "AZ-9" "AZ 9" 21   .1868132 .04395605
                          "California 1"  "CA" "California 1" "CA" "CA-1" "CA 1" 22          0  .0990991
                          "California 10" "CA" "California 2" "CA" "CA-2" "CA 2" 33 .065789476       .04
                          "California 11" "CA" "California 3" "CA" "CA-3" "CA 3" 44  .04901961 .11764706
                          "California 12" "CA" "California 4" "CA" "CA-4" "CA 4" 55  .06557377 .13114753
                          end
                          label values congdist2 congdist2
                          label def congdist2 1 "AK-0", modify
                          label def congdist2 2 "AL-1", modify
                          label def congdist2 3 "AL-2", modify
                          label def congdist2 4 "AL-3", modify
                          label def congdist2 5 "AL-4", modify
                          label def congdist2 6 "AL-5", modify
                          label def congdist2 7 "AL-6", modify
                          label def congdist2 8 "AL-7", modify
                          label def congdist2 9 "AR-1", modify
                          label def congdist2 10 "AR-2", modify
                          label def congdist2 11 "AR-3", modify
                          label def congdist2 12 "AR-4", modify
                          label def congdist2 13 "AZ-1", modify
                          label def congdist2 14 "AZ-2", modify
                          label def congdist2 15 "AZ-3", modify
                          label def congdist2 16 "AZ-4", modify
                          label def congdist2 17 "AZ-5", modify
                          label def congdist2 18 "AZ-6", modify
                          label def congdist2 19 "AZ-7", modify
                          label def congdist2 20 "AZ-8", modify
                          label def congdist2 21 "AZ-9", modify
                          label def congdist2 22 "CA-1", modify
                          label def congdist2 33 "CA-2", modify
                          label def congdist2 44 "CA-3", modify
                          label def congdist2 55 "CA-4", modify
                          I included the shares of blacks and agnostics to ensure that the ordering makes sense (Alaska has more agnostics, Alabama more blacks).

                          So the variable statecdnew is correct for Alabama and Alaska (Alabama comes first) but not for California (CD 10 comes after 1, 2 comes later--not shown).
                          cdstateagg is correct for California, but not for Alaska/Alabama (and statecdnew is also wrong for other states with more than 10 Congressional districts, 1 always followed by 10 rather than by 2). I need to get this right--to have Alabama first and then to get California right not just for this data set but for another I expect to have to merge with it that is to come.

                          Any help would be appreciated. Sorry for the trouble and thanks.


                          Comment


                          • #14
                            In order to merge you'll have to make sure the district identifier variable or variables are exactly he same across all datasets.

                            So if you applied the code suggested by Mike in post #9 on one of your datasets, you'll need o apply it on all datasets included in the merging.
                            Of if you did what Nick suggested in post #11, apply it to both/all datasets.

                            An as an extra word of warning: don't be tempted to merge on your congdist2 variable. Encoded variables may have different value labels attached to different numeric values.

                            Comment


                            • #15
                              The problem as I stated is that there are not identical variables in the two data sets so I need to create them. I used the code Mike suggested but it did not resolve the problem. I am asking if there is any way to align the codes for the state and CD numbers.

                              Comment

                              Working...
                              X