Substring function

Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#1

Substring function

17 Aug 2022, 09:56

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(Diagnosis1 Diagnosis2) str1 Diagnosis3 str4(Diagnosis4 Diagnosis5)
 3 3 "3" "T345" "T345"
 3 3 "4" "T88"  "T77"
34 4 "4" "T76"  "T88"
 3 3 "4" "T76"  "A76"
 3 4 "3" "A89"  "A89"
 3 3 "4" "A09"  "A09"
 4 5 "6" "T89"  "T89"
end

Question:
I would like to loop over my string variables (code below) and, for each one, create a new numeric variable that equals 1 whenever the string value starts with T8.

CODE:

forvalues p = 4/5 {
generate Diagx`p' = 0
replace Diagx`p' = 1 if Diagnosis`p' == substr("T345",2,.) | Diagnosis`p' == substr("T8",1,.)
replace Diagx`p' = 2 if Diagnosis`p' == "T77"
label values Diagx`p' Diagx
}

However, Stata still does not set the new variable to 1 for the values T88 and T89.

What am I doing wrong please?

William Lisowski

Join Date: Dec 2014
Posts: 10150

17 Aug 2022, 10:40

Building on the code from a question you asked earlier at

https://www.statalist.org/forums/for.../1677930-loops

we change a line that originally read

Code:

    replace Diagx`p' = 1 if Diagnosis`p' == "T345" | Diagnosis`p' == "T88"

to the following (the change is in the second condition of the first replace line, which now applies substr() to the variable)

Code:

label define Diagx 1 "stroke" 2 "diabetes" 0 "Other"
forvalues p = 4/5 {
    generate Diagx`p' = 0
    replace Diagx`p' = 1 if Diagnosis`p' == "T345" | substr(Diagnosis`p',1,2) == "T8"
    replace Diagx`p' = 2 if Diagnosis`p' == "T77"
    label values Diagx`p' Diagx
}

Code:

. list, abbreviate(12) separator(0)

     +------------------------------------------------------------------------------------+
     | Diagnosis1   Diagnosis2   Diagnosis3   Diagnosis4   Diagnosis5   Diagx4     Diagx5 |
     |------------------------------------------------------------------------------------|
  1. |          3            3            3         T345         T345   stroke     stroke |
  2. |          3            3            4          T88          T77   stroke   diabetes |
  3. |         34            4            4          T76          T88    Other     stroke |
  4. |          3            3            4          T76          A76    Other      Other |
  5. |          3            4            3          A89          A89    Other      Other |
  6. |          3            3            4          A09          A09    Other      Other |
  7. |          4            5            6          T89          T89   stroke     stroke |
     +------------------------------------------------------------------------------------+

. list, abbreviate(12) separator(0) nolabel

     +----------------------------------------------------------------------------------+
     | Diagnosis1   Diagnosis2   Diagnosis3   Diagnosis4   Diagnosis5   Diagx4   Diagx5 |
     |----------------------------------------------------------------------------------|
  1. |          3            3            3         T345         T345        1        1 |
  2. |          3            3            4          T88          T77        1        2 |
  3. |         34            4            4          T76          T88        0        1 |
  4. |          3            3            4          T76          A76        0        0 |
  5. |          3            4            3          A89          A89        0        0 |
  6. |          3            3            4          A09          A09        0        0 |
  7. |          4            5            6          T89          T89        1        1 |
     +----------------------------------------------------------------------------------+


Nick Cox

Join Date: Mar 2014

Posts: 35721
#3

17 Aug 2022, 11:30

William Lisowski helpfully suggested good code. Here, as a footnote, I expand on what is going wrong with substr(), illustrating a favourite debugging maxim of mine:

Use display with small examples to check what is going on.

In particular, without any access to your dataset, we can still go

Code:

. di substr("T345",2,.)
345

. di substr("T8",1,.)
T8

The first is: show me the substring of "T345" that starts at position 2, going on as long as possible, which is "345".

The second is: show me the substring of "T8" that starts at position 1, going on as long as possible, which is identically "T8".

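You were then comparing those constant strings, "345" and "T8", with your variables and not finding any equalities, because no observation holds exactly either value.

substr() belongs as a function processing your variables, not the constants you compare them with.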
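
The same display check shows why the corrected condition works. This is only a minimal sketch using literal strings in place of variable values; "T88" is taken from the example data in post #1.

```stata
* substr() applied to the value under test: the first two characters of "T88"
display substr("T88", 1, 2)

* the comparison the corrected code relies on; displays 1 (true)
display substr("T88", 1, 2) == "T8"

* the comparison in the original code; displays 0 (false)
display "T88" == substr("T8", 1, .)
```

The first line displays T8, so any value beginning with "T8" passes the second test, whereas the third test can only succeed for a value that is exactly "T8".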
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#4

22 Aug 2022, 04:26

Hi all, thanks for this

I tried the code again. It worked last week, but this week I changed the dataset slightly and now it won't work.

clear
input str3 diagnosis1 str5 diagnosis2 str3 diagnosis3
"A00" "A20" "A50"
"A01" "A20.1" "A64"
"A02" "A28" "A99"
end

Code used:

label define diagx 1 "stroke" 2 "diabetes" 3"other"
forvalues p = 1/3 {
generate diagx`p' = 0
replace diagx`p' = 1 if diagnosis`p' == substr(diagnosis`p',1,2) == "A0"
label values diagx`p' diagx
}

Stata comes up with two problems:
1. A 'type mismatch' error, which I cannot understand, as I used the same code last week and it worked.
2. Stata only creates diagx1 rather than cycling through diagnosis1, diagnosis2 and diagnosis3.

This worked last week, but I closed my do-file by mistake and only took a photo of it, so I can't understand why it is not working now.

Secondly (apologies for posting in the same thread, but it is the same topic), I've also tried assigning the value 1 to all the codes between A20 and A29 using the code below, and Stata says 'invalid command'.

forvalues p = 1/3 {
generate diagx`p' = 0
replace diagx`p' = 1 if diagnosis`p' >= "A20" & diagnosis`p' <= "A29"
label values diagx`p' diagx
}

Is this because I am telling Stata to use greater-than-or-equal and less-than-or-equal comparisons on string values, and that wouldn't work?
Nick Cox

Join Date: Mar 2014

Posts: 35721
#5

22 Aug 2022, 04:56

Your example needs substantive knowledge to be understood fully. Which diagnoses correspond to stroke and which to diabetes?

Also, it seems fortuitous but potentially confusing that you have 3 diagnosis variables and 3 coarse categories.

Also, nothing in your code assigns values 2 or 3 so I can't follow why you are surprised not to get any such values. Otherwise put, your new variables are born as 0 and sometimes replaced with 1 so on your code 2 or 3 could never be a value.

This much seems clear to me.

Code:

if diagnosis`p' == substr(diagnosis`p',1,2) == "A0"

is illegal for the following reason. There are two comparisons there, which will be evaluated in turn (NOT simultaneously).

Code:

diagnosis`p' == substr(diagnosis`p',1,2)

is legal and compares a string variable with a string expression, with numeric result 1 if true and 0 if false. But then either

Code:

1 == "A0"

or

Code:

0 == "A0"

is illegal as a type mismatch.

But it seems that what you want there may be much simpler, say

Code:

if substr(diagnosis`p',1,2) == "A0"
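
Put back into the loop from post #4, the simpler condition might look like the sketch below, which uses the example data from that post. The capture drop line is an addition made here, not part of the original code: in the failed run, generate had already created diagx1 before replace stopped with the type mismatch, which is presumably why only diagx1 appeared, and rerunning the loop without dropping it would stop at generate.

```stata
clear
input str3 diagnosis1 str5 diagnosis2 str3 diagnosis3
"A00" "A20"   "A50"
"A01" "A20.1" "A64"
"A02" "A28"   "A99"
end

forvalues p = 1/3 {
    * added for convenience: remove any diagx`p' left over from a failed run
    capture drop diagx`p'
    generate diagx`p' = 0
    * one comparison only: first two characters of the variable against "A0"
    replace diagx`p' = 1 if substr(diagnosis`p', 1, 2) == "A0"
}

list, clean noobs
```

With this data, diagx1 is 1 in all three observations and diagx2 and diagx3 are 0 throughout, since only the diagnosis1 values start with "A0".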
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#6

22 Aug 2022, 06:08

Nick Cox Thanks. I made the same syntax mistake as in the first post of this thread, even after some in-depth reading, including your article in the Stata Journal.
Many thanks.

Regarding my second question, about replacing all values between A20 and A29:
replace diagx`p' = 1 if diagnosis`p' >= "A20" & diagnosis`p' <= "A29"

I have tried using these commands (as above). Is there another command I should use that will perhaps work with string values?
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#7

22 Aug 2022, 06:59

Code:

h inrange()
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

22 Aug 2022, 07:49

Regarding my second question, about replacing all values between A20 and A29:
replace diagx`p' = 1 if diagnosis`p' >= "A20" & diagnosis`p' <= "A29"

I have tried using these commands (as above). Is there another command I should use that will perhaps work with string values?

Why do you think this will not work with string values? Is it perhaps because you have diagnosis codes such as "A20.1" (from your example above) or "A246" (this is a guess)? Nick Cox wrote in post #5

Your example needs substantive knowledge to be understood fully. Which diagnoses correspond to stroke and which to diabetes?

and you have not addressed this. Is "A20.1" a code for a stroke? Would "A246" or something similar also be a code for a stroke?

Tell us in words: if we look at a code how do we know it is the code for a stroke?
Does it start "A2" - so "A20.1" would be a stroke and "A246" would be a stroke?

Does it start "A" followed by a number at least 20 and less than 30 - so "A20.1" would be a stroke and "A246" would not be a stroke?

???

With that said, perhaps the following approach will start you in a useful direction; you can modify it to suit your definition of the coding for a stroke.

Code:

clear
input str8 (diagnosis1 diagnosis2 diagnosis3)
"A00" "A20"   "A50"
"A01" "A20.1" "A64"
"A02" "A28"   "A246"
end
label define diagx 1 "stroke" 2 "diabetes" 3 "other"
generate letter = ""
generate number = .
forvalues p = 1/3 {
    generate diagx`p' = 0
    replace letter = substr(diagnosis`p',1,1)
    replace number = real(substr(diagnosis`p',2,.))
    replace diagx`p' = 1 if letter=="A" & number>=20 & number<30
    label values diagx`p' diagx
}
drop letter number
list, clean noobs abbreviate(12)

Code:

. list, clean noobs abbreviate(12)

    diagnosis1   diagnosis2   diagnosis3   diagx1   diagx2   diagx3
           A00          A20          A50        0   stroke        0
           A01        A20.1          A64        0   stroke        0
           A02          A28         A246        0   stroke        0

Last edited by William Lisowski; 22 Aug 2022, 07:54.
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#9

22 Aug 2022, 10:29

Thank you all, at last after lots of reading I've managed to fix my code

William Lisowski I was trying to replace 1 with label 'stroke' for all the values between A0 - A28. I had my command for labelling stroke prior to this (not shown in this post)

For anyone who may be using the thread, I used this

forvalues p = 1/3 {
generate diagx`p' = 0
replace diagx`p' = 1 if inrange(diag_0`p', "A2", "A28")
label values diagx`p' diagx
}
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

22 Aug 2022, 10:43

I was trying to replace 1 with label 'stroke' for all the values between A0 - A28

Do you understand that A0, A1, A10, A11, ..., A19 will not be labelled stroke by the code in post #9?
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#11

22 Aug 2022, 11:00

Originally posted by William Lisowski

Do you understand that A0, A1, A10, A11, ..., A19 will not be labelled stroke by the code in post #9?

Fair point! Thanks for highlighting this! Yes, I was trying to label all the values between A2 and A44 (including codes with decimal points, e.g. A2.14, and codes such as A214, because the decimal points aren't coded in the dataset), and wanted to replace them with 1, representing stroke.

clear
input str4 diag_1 str5 diag_2 str3 diag_3 float(diagx1 diagx2 diagx3)
"A1" "A20.1" "B4" 0 1 0
"A2" "A3" "B5" 1 0 0
"A20" "B1" "A29" 1 0 0
"A205" "B2" "B1" 1 0 0
end
label values diagx1 diagx
label values diagx2 diagx
label values diagx3 diagx
label def diagx 1 "stroke", modify

My A3 wasn't labelled

label define diagx 1 "stroke" 2 "diabetes" 3 "other"
forvalues p = 1/3 {
generate diagx`p' = 0
replace diagx`p' = 1 if inrange(diag_`p', "A2", "A28")
label values diagx`p' diagx
}
William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

22 Aug 2022, 11:28

Comparing strings that contain numbers isn't like comparing numbers. The string "19" is not greater than the string "2". That is why I separated out the numeric part of the diagnosis code from the initial letter in post #8, and converted the numeric part to numbers.
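
A few display checks illustrate the difference. This is a hedged sketch using literal values from the example in post #11; the numeric bounds 2 and 28 simply mirror the string bounds "A2" and "A28" used there and are not a statement of which codes denote a stroke.

```stata
* strings are compared character by character: "1" sorts before "2"; displays 0
display "19" > "2"

* "A3" sorts after "A28" because "3" sorts after "2"; displays 0
display inrange("A3", "A2", "A28")

* "A205" sorts before "A28" because "0" sorts before "8"; displays 1
display inrange("A205", "A2", "A28")

* numeric part after dropping the leading letter; displays 3
display real(substr("A3", 2, .))

* the same range test on numbers; displays 1
display inrange(real(substr("A3", 2, .)), 2, 28)
```

This matches the listing in post #11, where "A205" was set to 1 and "A3" was not.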
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#13

22 Aug 2022, 11:57

Thanks, I've already tried the real() function beforehand, but it won't work due to the 1 million rows I have.