Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Keep only first part of string up until any number, comma, or before the state name

    Hi All,

    I am trying to separate an area name in a dataset so I just have the particular city and not any station/coordinate info. I have used split before, but usually use it to split before a delimiter.

    Here is a example of my data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float monthly str51 name double prcp float(tavg latitude elevation longitude)
    645 "ABERDEEN 0.7 WSW, WA US"              2.6900000497698784        . 46.97401    6.1 -123.82332
    645 "ABERDEEN 0.9 S, WA US"                  2.90999998152256        .  46.9639    7.9 -123.80592
    645 "ABERDEEN 1.6 WSW, WA US"               2.890000009909272        . 46.97211    4.3 -123.84068
    645 "ABERDEEN, WA US"                       2.430000051856041        .  46.9658      3  -123.8291
    645 "ABERNATHY MOUNTAIN WASHINGTON, WA US"                  0 48.03226    46.35  883.9     -123.1
    645 "ACME 0.0 NNE, WA US"                   1.569999998435378        .  48.7175   92.4 -122.20486
    645 "ALDER RIDGE WASHINGTON, WA US"                         0 44.77419  46.2733 1371.6   -117.495
    645 "ALMIRA 0.2 NNW, WA US"                .07999999821186066        . 47.71303  583.4 -118.94116
    645 "ALPINE MEADOWS, WA US"                 5.299999974668026 45.45161    47.78 1066.8     -121.7
    645 "ANACORTES 0.8 ESE, WA US"               .630000002682209        . 48.48903   65.5 -122.61165
    645 "ANACORTES 0.9 ESE, WA US"             1.5299999993294477        . 48.48629   85.3  -122.6015
    645 "ANACORTES 1.3 NW, WA US"              1.1899999976158142        .  48.5046   61.9  -122.6503
    645 "ANACORTES 1.7 WNW, WA US"              .9499999769032001        .  48.5017   61.9  -122.6635
    645 "ANACORTES 3.0 N, WA US"               1.2300000041723251        .  48.5355   34.1  -122.6265
    645 "ANACORTES 4.7 SSW, WA US"              .9700000081211329        .  48.4268  101.5  -122.6576
    end
    format %tm monthly
    I want to keep just the city/CDP name so I can match to fips data I have. Here, I need anything before the first number, or before the state name.

    the code I want to use is:

    split name, parse(" ") limit(1) gen(place_name)


    But I do not know what to put after the space/delimiter to account for three potential issues.
    Thank you for your help!

    Best,
    Josh

  • #2
    Here is a start:

    Code:
    gen where = length(name) 
    
    replace where = min(where, strpos(name, ",") - 1) if strpos(name, ".") 
    
    forval i = 0/9 { 
          replace where = min(where, strpos(name, "`i'") - 1)  if strpos(name, "`i'") 
    } 
    
    gen wanted = substr(name, 1, where)

    Comment


    • #3
      A regex approach:

      Code:
      gen wanted = ustrregexra(name,"^(.*?)(,| [0-9]).*$","$1")
      This will deal with commas and numbers.

      If you also want to deal with state names (i.e. keep only ABERNATHY MOUNTAIN from ABERNATHY MOUNTAIN WASHINGTON), you will first need to create and load a dataset of state names, after which you can use the following code:

      Code:
      use statenames, clear
      replace statenames = upper(statenames)
      levelsof statename, p(|) local(list) clean
      use your_dataset, clear
      gen wanted = ustrregexra(upper(name),"^(.*?) ?(,|[0-9]|`list').*$","$1")
      Note: sent from mobile device -- watch out for typos.
      Last edited by Ali Atia; 01 Aug 2022, 16:19.

      Comment

      Working...
      X