Split issue

Euslaner

Join Date: Apr 2014

Posts: 189
#1

Split issue

21 Sep 2019, 03:26

I have a dataset with data by Congressional district:

. dataex congdist in 1/5

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str5 congdist "AK-0" "AL-1" "AL-2" "AL-3" "AL-4" end

I need to split the district state and district numbers (for merging with another data set) so that the hyphen is dropped and I get, e.g.,

AK 0
AL:1,

etc. Tried split, didn't work. Any help would be appreciated.

Thanks,

Ric
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

21 Sep 2019, 03:51

Did you specify p(-) in your split command?

The FAQ Advice spells out that you should give the exact syntax you used and not limit error reports to wording such as “didn’t work”.

If you want strings such as AK 0 or AK:1 then the function ^{subinstr() should be used.}
Comment
Euslaner

Join Date: Apr 2014

Posts: 189
#3

21 Sep 2019, 06:10

Here is what I tried to do with split:

. split congdist1,ignore("-") destring
variable born as string:
congdist11
congdist11: contains characters not specified in ignore(); no replace

I read the manual on subinstr but could not figure out what to do.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#4

21 Sep 2019, 07:21

in #1, I don't understand why the 2 lines you show as desired are different:

I need to split the district state and district numbers (for merging with another data set) so that the hyphen is dropped and I get, e.g.,

AK 0
AL:1

that is, why is there a space between the "K" and the "0" in the first line but a colon between the "L" and the "1" in the second line?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

21 Sep 2019, 08:35

As in #2 the option you need is parse(-) not ignore(-) (assuming that is that you two variables out of one.

If you want to replace (e.g.) dashes with colons

Code:

replace congdist1 = subinstr(congdist1, “-“, “:”,.)
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#6

21 Sep 2019, 08:39

Youre looking for parse, not the ignore option.

Code:

split congdist, parse("-") destring
Comment
Euslaner

Join Date: Apr 2014

Posts: 189
#7

21 Sep 2019, 13:25

Thanks.
Comment
Euslaner

Join Date: Apr 2014

Posts: 189
#8

23 Sep 2019, 10:49

Sorry for another post. I now have data on states and CDs on a a single variable. But they are organized in a way that I cannot use. Here is what I have:

[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 congdist1 str22 statecdagg
"CA 1" "California 1"
"CA 10" "California 10"
"CA 11" "California 11"
"CA 12" "California 12"
"CA 13" "California 13"
"CA 14" "California 14"
"CA 15" "California 15"
"CA 16" "California 16"
"CA 17" "California 17"
"CA 18" "California 18"
"CA 19" "California 19"
"CA 2" "California 2"
"CA 20" "California 20"
"CA 21" "California 21"
"CA 22" "California 22"
"CA 23" "California 23"
"CA 24" "California 24"
"CA 25" "California 25"
"CA 26" "California 26"
"CA 27" "California 27"
"CA 28" "California 28"
"CA 29" "California 29"
"CA 3" "California 3"
"CA 30" "California 30"
"CA 31" "California 31"
"CA 32" "California 32"
"CA 33" "California 33"
"CA 34" "California 34"
"CA 35" "California 35"
"CA 36" "California 36"
"CA 37" "California 37"
"CA 38" "California 38"
"CA 39" "California 39"
"CA 4" "California 4"
"CA 40" "California 40"
"CA 41" "California 41"
"CA 42" "California 42"
"CA 43" "California 43"
"CA 44" "California 44"
en
The problem is that the 2nd district of California (same problem for NY, NJ Ohio, PA, and Washington comes after the 19th, the 3rd after the 29th, the 4th after the 39th. Changing 2 to 02 doesn't help. Is there a way to get the data organized in the proper numerical order maintaining the state name in a single variable. I tried concatenate and split and could not get either to work.

Thanks.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#9

23 Sep 2019, 11:43

The principle here is that if you want something to sort in numerical order, it has to be a number, not a number that is stored as a string. (Strings sort character by character; numbers sort by the entire numeric value.) And, splitting things into separate variables can facilitate sorting.

Code:

// working from your most recent data example split congdist1, generate(temp) parse(" ") gen str stateabbrev = temp1 gen distnum = real(temp2) // -destring- also would work drop temp1 temp2 sort stateabbrev distnum

Last edited by Mike Lacy; 23 Sep 2019, 11:45.
Comment
Euslaner

Join Date: Apr 2014

Posts: 189
#10

23 Sep 2019, 12:02

Perfect, thanks Mike.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

#11

23 Sep 2019, 12:03

@Mike Lacy's solution is broadly what I would favour here.

As a footnote, I address the assertion that "Changing 2 to 02 doesn't help". As commented in #2 it's hard to address such comments in the absence of code that you tried or an explanation of what went wrong.

I did this

Code:

clear
input str5 congdist1 str22 statecdagg
"CA 1" "California 1" 
"CA 10" "California 10"
"CA 11" "California 11"
"CA 12" "California 12"
"CA 13" "California 13"
"CA 14" "California 14"
"CA 15" "California 15"
"CA 16" "California 16"
"CA 17" "California 17"
"CA 18" "California 18"
"CA 19" "California 19"
"CA 2" "California 2" 
end 

foreach v in congdist1 statecdagg { 
    replace `v' = word(`v',1) + " 0" + word(`v', 2) if length(word(`v', 2)) == 1 
}

sort congdist1 

list, sep(0) 

     +--------------------------+
     | congdi~1      statecdagg |
     |--------------------------|
  1. |    CA 01   California 01 |
  2. |    CA 02   California 02 |
  3. |    CA 10   California 10 |
  4. |    CA 11   California 11 |
  5. |    CA 12   California 12 |
  6. |    CA 13   California 13 |
  7. |    CA 14   California 14 |
  8. |    CA 15   California 15 |
  9. |    CA 16   California 16 |
 10. |    CA 17   California 17 |
 11. |    CA 18   California 18 |
 12. |    CA 19   California 19 |
     +--------------------------+

showing that inserting zeros before single digits 1 to 9 achieves the goal of sorting correctly. The codes 01 to 09 are surely awkward if not ugly otherwise, but that's not the small point.

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#12

24 Sep 2019, 00:53

A space would also be sorted before any numbers. Slightly easier for a human to read, but more prone to future errors when referring to specific districts.

Code:

foreach v in congdist1 statecdagg { replace `v' = word(`v',1) + " " + word(`v', 2) if length(word(`v', 2)) == 1 }
1 like
Comment

Euslaner

Join Date: Apr 2014
Posts: 189

#13

24 Sep 2019, 06:21

Sorry when I try to merge two data sets the problem remains.

For the master data set, cces2018aggnew., I have the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str22 statecdnew str2 stateabbrevnew str22 statecdagg str2 stateabbrev str5(congdist congdist1) int congdist2 float(black agnostic) "Alabama 1" "AK" "Alaska 1" "AK" "AK-0" "AK 0" 1 .22522523 .06306306 "Alabama 2" "AL" "Alabama 1" "AL" "AL-1" "AL 1" 2 .24561404 .02631579 "Alabama 3" "AL" "Alabama 2" "AL" "AL-2" "AL 2" 3 .0879121 .05555556 "Alabama 4" "AL" "Alabama 3" "AL" "AL-3" "AL 3" 4 .03030303 .04081633 "Alabama 5" "AL" "Alabama 4" "AL" "AL-4" "AL 4" 5 .11538462 .06153846 "Alabama 6" "AL" "Alabama 5" "AL" "AL-5" "AL 5" 6 .104 .064 "Alabama 7" "AL" "Alabama 6" "AL" "AL-6" "AL 6" 7 .55172414 .06034483 "Alaska 1" "AL" "Alabama 7" "AL" "AL-7" "AL 7" 8 .02247191 .2022472 "Arizona 1" "AR" "Arkansas 1" "AR" "AR-1" "AR 1" 9 .01904762 .07619048 "Arizona 2" "AR" "Arkansas 2" "AR" "AR-2" "AR 2" 10 .046875 .11518325 "Arizona 3" "AR" "Arkansas 3" "AR" "AR-3" "AR 3" 11 .032608695 .10869565 "Arizona 4" "AR" "Arkansas 4" "AR" "AR-4" "AR 4" 12 .016574586 .07734807 "Arizona 5" "AZ" "Arizona 1" "AZ" "AZ-1" "AZ 1" 13 .018404908 .08588957 "Arizona 6" "AZ" "Arizona 2" "AZ" "AZ-2" "AZ 2" 14 .01734104 .0867052 "Arizona 7" "AZ" "Arizona 3" "AZ" "AZ-3" "AZ 3" 15 .12698413 .0952381 "Arizona 8" "AZ" "Arizona 4" "AZ" "AZ-4" "AZ 4" 16 .04040404 .05076142 "Arizona 9" "AZ" "Arizona 5" "AZ" "AZ-5" "AZ 5" 17 .02484472 .0931677 "Arkansas 1" "AZ" "Arizona 6" "AZ" "AZ-6" "AZ 6" 18 .08333334 .03125 "Arkansas 2" "AZ" "Arizona 7" "AZ" "AZ-7" "AZ 7" 19 .16528925 .07438017 "Arkansas 3" "AZ" "Arizona 8" "AZ" "AZ-8" "AZ 8" 20 .03846154 .05384615 "Arkansas 4" "AZ" "Arizona 9" "AZ" "AZ-9" "AZ 9" 21 .1868132 .04395605 "California 1" "CA" "California 1" "CA" "CA-1" "CA 1" 22 0 .0990991 "California 10" "CA" "California 2" "CA" "CA-2" "CA 2" 33 .065789476 .04 "California 11" "CA" "California 3" "CA" "CA-3" "CA 3" 44 .04901961 .11764706 "California 12" "CA" "California 4" "CA" "CA-4" "CA 4" 55 .06557377 .13114753 end label values congdist2 congdist2 label def congdist2 1 "AK-0", modify label def congdist2 2 "AL-1", modify label def congdist2 3 "AL-2", modify label def congdist2 4 "AL-3", modify label def congdist2 5 "AL-4", modify label def congdist2 6 "AL-5", modify label def congdist2 7 "AL-6", modify label def congdist2 8 "AL-7", modify label def congdist2 9 "AR-1", modify label def congdist2 10 "AR-2", modify label def congdist2 11 "AR-3", modify label def congdist2 12 "AR-4", modify label def congdist2 13 "AZ-1", modify label def congdist2 14 "AZ-2", modify label def congdist2 15 "AZ-3", modify label def congdist2 16 "AZ-4", modify label def congdist2 17 "AZ-5", modify label def congdist2 18 "AZ-6", modify label def congdist2 19 "AZ-7", modify label def congdist2 20 "AZ-8", modify label def congdist2 21 "AZ-9", modify label def congdist2 22 "CA-1", modify label def congdist2 33 "CA-2", modify label def congdist2 44 "CA-3", modify label def congdist2 55 "CA-4", modify

I included the shares of blacks and agnostics to ensure that the ordering makes sense (Alaska has more agnostics, Alabama more blacks).

So the variable statecdnew is correct for Alabama and Alaska (Alabama comes first) but not for California (CD 10 comes after 1, 2 comes later--not shown).
cdstateagg is correct for California, but not for Alaska/Alabama (and statecdnew is also wrong for other states with more than 10 Congressional districts, 1 always followed by 10 rather than by 2). I need to get this right--to have Alabama first and then to get California right not just for this data set but for another I expect to have to merge with it that is to come.

Any help would be appreciated. Sorry for the trouble and thanks.

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#14

24 Sep 2019, 07:54

In order to merge you'll have to make sure the district identifier variable or variables are exactly he same across all datasets.

So if you applied the code suggested by Mike in post #9 on one of your datasets, you'll need o apply it on all datasets included in the merging.
Of if you did what Nick suggested in post #11, apply it to both/all datasets.

An as an extra word of warning: don't be tempted to merge on your congdist2 variable. Encoded variables may have different value labels attached to different numeric values.
1 like
Comment
Euslaner

Join Date: Apr 2014

Posts: 189
#15

24 Sep 2019, 09:50

The problem as I stated is that there are not identical variables in the two data sets so I need to create them. I used the code Mike suggested but it did not resolve the problem. I am asking if there is any way to align the codes for the state and CD numbers.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment