Hello all,
over the last month this forum helped me a lot with my projects, but now I have a problem for which I´m not able to find the answer.
My dataframe (about 10,000,000 observations) has the variables "ops_ko1" to "ops_ko101", "dat_ops1" to "dat_ops101" and "time_ops1" to "time_ops101". Any of these variables contains a string of numbers and letters, for example "5529s1", "8146y" or "88422q".
I want to identify, if the variables "ops_ko1" to "ops_ko101" (ops_koX) begin with a specific string - for example "5431" but this string cannot be "54312" or "54312x". If the string matches and is not excluded, I want the respective strings in "dat_opsX" and "time_opsX" extracted into a new variable.
My inclusion criteria is a set of 60 strings with 4 numbers each. I want only the variables, that begin with these 60 combinations.
However, as mentioned, there are specific strings/codes, that I want to be excluded. This is a set of over 280 strings with 5 to 7 letters.
The code I used to identify the strings in "ops_koX" and pull the respective values in "dat_opsX" and "time_ops" works well, is not too slow, and looks like this:
Now I want to add the codes, that I want to exclude - a set of about 280 individual codes. I tried the following:
This loop already takes ages in a small subset I created and does not work as expected. Additionally, I think it will take very long on the complete dataset.
To exclude specific strings using strmatch, I also tried
which did not work.
I also thought about adding each code separately:
but adding 280 strings does not seem to be the right approach and I need this code block several times to identify the time/date of different codes.
To conclude:
I very appreciate any help. Thank you in advance!
over the last month this forum helped me a lot with my projects, but now I have a problem for which I´m not able to find the answer.
My dataframe (about 10,000,000 observations) has the variables "ops_ko1" to "ops_ko101", "dat_ops1" to "dat_ops101" and "time_ops1" to "time_ops101". Any of these variables contains a string of numbers and letters, for example "5529s1", "8146y" or "88422q".
I want to identify, if the variables "ops_ko1" to "ops_ko101" (ops_koX) begin with a specific string - for example "5431" but this string cannot be "54312" or "54312x". If the string matches and is not excluded, I want the respective strings in "dat_opsX" and "time_opsX" extracted into a new variable.
My inclusion criteria is a set of 60 strings with 4 numbers each. I want only the variables, that begin with these 60 combinations.
However, as mentioned, there are specific strings/codes, that I want to be excluded. This is a set of over 280 strings with 5 to 7 letters.
The code I used to identify the strings in "ops_koX" and pull the respective values in "dat_opsX" and "time_ops" works well, is not too slow, and looks like this:
Code:
g reop = 0
g reop_num = 0
g reop1_date = "0"
g reop1_time = "0"
g reop2_date = "0"
g reop2_time = "0"
*there are several more time stamps needed
foreach ops in 5433 5431 5549 /* there are 60 codes in total */ {
forvalues i = 1/101 {
replace reop_num = reop_num + 1 if strmatch(ops_ko`i', "`ops'*")
replace reop = 1 if strmatch(ops_ko`i', "`ops'*") & reop == 0
replace reop1_date = dat_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 1 & reop1_date == "0"
replace reop1_time = time_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 1 & reop1_time == "0"
replace reop2_date = dat_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 2 & reop2_date == "0"
replace reop2_time = time_ops`i' if strmatch(ops_ko`i', "`ops'*") & reop_num == 2 & reop2_time == "0"
*several more time stamps are exluded here
}
}
Code:
foreach ops in 5433 5431 5549 /* there are 60 codes in total */ {
foreach ops_excl in 543350 54335x 544513 /* there are about 280 codes in total */ {
forvalues i = 1/101 {
replace reop_num = reop_num + 1 if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'")
replace reop = 1 if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'") & reop == 0
replace reop1_date = dat_ops`i' if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'") & reop_num == 1 & reop1_date == "0"
replace reop1_time = time_ops`i' if strmatch(ops_ko`i', "`ops'*") & !strmatch(ops_ko`i', "`ops_excl'") & reop_num == 1 & reop1_time == "0"
}
}
}
To exclude specific strings using strmatch, I also tried
Code:
& !strmatch(ops_ko`i', "543350" | "54335x" | "544513" | /* 280 more */)
I also thought about adding each code separately:
Code:
& !strmatch(ops_ko`i', "543350") | !strmatch(ops_ko`i', "54335x") | !strmatch(ops_ko`i', "544513")
To conclude:
- How do I best use a large number of inclusion/exclusions criteria with strmatch(), i.e. the string must begin with "abcd" but cannot be "abcdef".
- How do I make this code very efficient, so I can run it over a lot of observations/variables?
I very appreciate any help. Thank you in advance!

Comment