Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify strings with multiple inclusion and exclusion criteria - strmatch

    Hello all,

    over the last month this forum helped me a lot with my projects, but now I have a problem for which I´m not able to find the answer.

    My dataframe (about 10,000,000 observations) has the variables "ops_ko1" to "ops_ko101", "dat_ops1" to "dat_ops101" and "time_ops1" to "time_ops101". Any of these variables contains a string of numbers and letters, for example "5529s1", "8146y" or "88422q".

    I want to identify, if the variables "ops_ko1" to "ops_ko101" (ops_koX) begin with a specific string - for example "5431" but this string cannot be "54312" or "54312x". If the string matches and is not excluded, I want the respective strings in "dat_opsX" and "time_opsX" extracted into a new variable.

    My inclusion criteria is a set of 60 strings with 4 numbers each. I want only the variables, that begin with these 60 combinations.

    However, as mentioned, there are specific strings/codes, that I want to be excluded. This is a set of over 280 strings with 5 to 7 letters.

    The code I used to identify the strings in "ops_koX" and pull the respective values in "dat_opsX" and "time_ops" works well, is not too slow, and looks like this:

    Code:
    g reop = 0
    g reop_num = 0
    g reop1_date = "0"
    g reop1_time = "0"
    g reop2_date = "0"
    g reop2_time = "0"
    *there are several more time stamps needed
    
    foreach ops in 5433 5431 5549 /* there are 60 codes in total */ {
            forvalues i = 1/101 {
        replace reop_num = reop_num + 1 if strmatch(ops_ko`i', "`ops'*")
        replace reop = 1 if strmatch(ops_ko`i', "`ops'*") & reop == 0
        replace reop1_date = dat_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 1 & reop1_date == "0"
        replace reop1_time = time_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 1 & reop1_time == "0"
        replace reop2_date = dat_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 2 & reop2_date == "0"
        replace reop2_time = time_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 2 & reop2_time == "0"
    *several more time stamps are exluded here
    }
    }
    Now I want to add the codes, that I want to exclude - a set of about 280 individual codes. I tried the following:

    Code:
    foreach ops in 5433 5431 5549 /* there are 60 codes in total */ {
    foreach ops_excl in 543350 54335x 544513 /* there are about 280 codes in total */ {
            forvalues i = 1/101 {
        replace reop_num = reop_num + 1 if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'")
        replace reop = 1 if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'") & reop == 0
        replace reop1_date = dat_ops`i' if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'") & reop_num == 1 & reop1_date == "0"
        replace reop1_time = time_ops`i' if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'") & reop_num == 1 & reop1_time == "0"
    }
    }
    }
    This loop already takes ages in a small subset I created and does not work as expected. Additionally, I think it will take very long on the complete dataset.

    To exclude specific strings using strmatch, I also tried
    Code:
    & !strmatch(ops_ko`i', "543350" | "54335x" | "544513" | /* 280 more */)
    which did not work.

    I also thought about adding each code separately:
    Code:
    & !strmatch(ops_ko`i', "543350") | !strmatch(ops_ko`i', "54335x") | !strmatch(ops_ko`i', "544513")
    but adding 280 strings does not seem to be the right approach and I need this code block several times to identify the time/date of different codes.

    To conclude:
    • How do I best use a large number of inclusion/exclusions criteria with strmatch(), i.e. the string must begin with "abcd" but cannot be "abcdef".
    • How do I make this code very efficient, so I can run it over a lot of observations/variables?

    I very appreciate any help. Thank you in advance!
    Last edited by Josua Decker; 31 Oct 2022, 05:36.

  • #2
    Here's some toy code:

    Code:
    clear
    input str20 opsvar
    "54312"
    "54312x"
    "5431256y"
    "25431"
    "254312x"
    "12345"
    "1234y"
    "12343"
    "123433"
    end
    
    local ex_5431 54312|54312x
    local ex_1234 1234y|12343
    
    local combos 5431 1234
    
    gen reopl_date = "not this one"
    
    foreach ops of local combos {
        replace reopl_date = "got it!" if ustrregexm(opsvar,"^`ops'") & !ustrregexm(opsvar,"^(`ex_`ops'')$")
    }
    which produces:

    Code:
    . li , noobs sep(0)
    
      +-------------------------+
      |   opsvar     reopl_date |
      |-------------------------|
      |    54312   not this one |
      |   54312x   not this one |
      | 5431256y        got it! |
      |    25431   not this one |
      |  254312x   not this one |
      |    12345        got it! |
      |    1234y   not this one |
      |    12343   not this one |
      |   123433        got it! |
      +-------------------------+
    just put the exclusions into the corresponding ex_ macros, separated by |
    Last edited by Hemanshu Kumar; 31 Oct 2022, 06:28.

    Comment


    • #3
      Thanks so much for your helpful response, Hemanshu Kumar!

      Your approach works very well and I especially like the elegant split of the exclusion groups for different beginnings of the strings. This saves so much computation time!

      Comment

      Working...
      X