
  • Substring function

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(Diagnosis1 Diagnosis2) str1 Diagnosis3 str4(Diagnosis4 Diagnosis5)
     3 3 "3" "T345" "T345"
     3 3 "4" "T88"  "T77" 
    34 4 "4" "T76"  "T88" 
     3 3 "4" "T76"  "A76" 
     3 4 "3" "A89"  "A89" 
     3 3 "4" "A09"  "A09" 
     4 5 "6" "T89"  "T89" 
    end

    Question:
    I would like to loop over my string diagnosis variables (code below) and create indicator variables that flag every value starting with T8.

    CODE:

    forvalues p = 4/5 {
    generate Diagx`p' = 0
    replace Diagx`p' = 1 if Diagnosis`p' == substr("T345",2,.) | Diagnosis`p' == substr("T8",1,.)
    replace Diagx`p' = 2 if Diagnosis`p' == "T77"
    label values Diagx`p' Diagx
    }

    However, Stata still does not set the new variable to 1 for the T88 and T89 values.

    What am I doing wrong, please?

  • #2
    Building on the code from a question you asked earlier at

    https://www.statalist.org/forums/for.../1677930-loops

    we change a line that originally read
    Code:
        replace Diagx`p' = 1 if Diagnosis`p' == "T345" | Diagnosis`p' == "T88"
    to the corresponding line in the code below, where the second condition now applies substr() to the variable
    Code:
    label define Diagx 1 "stroke" 2 "diabetes" 0 "Other"
    forvalues p = 4/5 {
        generate Diagx`p' = 0
        replace Diagx`p' = 1 if Diagnosis`p' == "T345" | substr(Diagnosis`p',1,2) == "T8"
        replace Diagx`p' = 2 if Diagnosis`p' == "T77"
        label values Diagx`p' Diagx
    }
    Code:
    . list, abbreviate(12) separator(0)
    
         +------------------------------------------------------------------------------------+
         | Diagnosis1   Diagnosis2   Diagnosis3   Diagnosis4   Diagnosis5   Diagx4     Diagx5 |
         |------------------------------------------------------------------------------------|
      1. |          3            3            3         T345         T345   stroke     stroke |
      2. |          3            3            4          T88          T77   stroke   diabetes |
      3. |         34            4            4          T76          T88    Other     stroke |
      4. |          3            3            4          T76          A76    Other      Other |
      5. |          3            4            3          A89          A89    Other      Other |
      6. |          3            3            4          A09          A09    Other      Other |
      7. |          4            5            6          T89          T89   stroke     stroke |
         +------------------------------------------------------------------------------------+
    
    . list, abbreviate(12) separator(0) nolabel
    
         +----------------------------------------------------------------------------------+
         | Diagnosis1   Diagnosis2   Diagnosis3   Diagnosis4   Diagnosis5   Diagx4   Diagx5 |
         |----------------------------------------------------------------------------------|
      1. |          3            3            3         T345         T345        1        1 |
      2. |          3            3            4          T88          T77        1        2 |
      3. |         34            4            4          T76          T88        0        1 |
      4. |          3            3            4          T76          A76        0        0 |
      5. |          3            4            3          A89          A89        0        0 |
      6. |          3            3            4          A09          A09        0        0 |
      7. |          4            5            6          T89          T89        1        1 |
         +----------------------------------------------------------------------------------+



    • #3
      William Lisowski helpfully suggested good code. Here as a footnote I expand on what is going wrong with substr() -- illustrating a favourite debugging maxim of mine

      Use display with small examples to check what is going on.
      In particular, without any access to your dataset, we can still go

      Code:
      . di substr("T345",2,.)
      345
      
      . di substr("T8",1,.)
      T8
      The first is: show me the substring of "T345" that starts at position 2, going on as long as possible, which is "345".

      The second is: show me the substring of "T8" that starts at position 1, going on as long as possible, which is identically "T8".

      You were then comparing with variables and not finding any equalities.

      substr() belongs as a function processing your variables.
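
      A minimal sketch of that last point, using literal values rather than the real dataset (the value "T88" is taken from the example data in the first post): applying substr() to the value being tested gives the prefix comparison that was intended, whereas applying it to the literal "T8" just returns "T8".

      Code:
      . di substr("T88",1,2)
      T8
      
      . di substr("T88",1,2) == "T8"
      1
      
      . di "T88" == substr("T8",1,.)
      0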




      • #4
        Hi all, thanks for this

        I've tried the code again. It worked last week, but this week, after I changed the dataset slightly, it no longer works.

        clear
        input str3 diagnosis1 str5 diagnosis2 str3 diagnosis3
        "A00" "A20" "A50"
        "A01" "A20.1" "A64"
        "A02" "A28" "A99"
        end

        Code used:

        label define diagx 1 "stroke" 2 "diabetes" 3"other"
        forvalues p = 1/3 {
        generate diagx`p' = 0
        replace diagx`p' = 1 if diagnosis`p' == substr(diagnosis`p',1,2) == "A0"
        label values diagx`p' diagx
        }


        Stata comes up with two problems:
        1. A 'type mismatch' error, which I cannot understand because I used the same code last week in Stata and it worked.
        2. Stata only produces one variable, diagx1, rather than cycling through diagnosis1, diagnosis2 and diagnosis3.

        This worked last week, but I closed my do-file by mistake and only have a photo of it, so I can't understand why it's not working now.

        Secondly (apologies if this should not be in the same thread, but it is the same topic), I've also tried setting the value to 1 for all the codes between A20 and A29 using the code below, and Stata says 'invalid command'.

        forvalues p = 1/3 {
        generate diagx`p' = 0
        replace diagx`p' = 1 if diagnosis`p' >= "A20" & diagnosis`p' <= "A29"
        label values diagx`p' diagx
        }

        Is this because I am telling Stata to use greater-than-or-equal comparisons on string values, and that does not work?



        • #5
          Your example needs substantive knowledge to be understood fully. Which diagnoses correspond to stroke and which to diabetes?

          Also, it seems fortuitous but potentially confusing that you have 3 diagnosis variables and 3 coarse categories.

          Also, nothing in your code assigns values 2 or 3 so I can't follow why you are surprised not to get any such values. Otherwise put, your new variables are born as 0 and sometimes replaced with 1 so on your code 2 or 3 could never be a value.

          This much seems clear to me.

          Code:
          if diagnosis`p' == substr(diagnosis`p',1,2) == "A0"
          is illegal for the following reason. There are two comparisons there, which will be evaluated in turn (NOT simultaneously).

          Code:
          diagnosis`p' == substr(diagnosis`p',1,2)
          is legal and compares a string variable with a string expression, with numeric result 1 if true and 0 if false. But then either

          Code:
          1 == "A0" 
          or

          Code:
          0 == "A0" 
          is illegal as a type mismatch.

          But it seems that what you want there may be much simpler, say

          Code:
          if substr(diagnosis`p',1,2) == "A0"
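
          To see the two steps separately, here is a small sketch with the literal "A00" standing in for one value of diagnosis`p' (an assumption for illustration). The first comparison is legal but false, so the original expression then goes on to compare the number 0 with "A0". The simpler condition compares two strings and is true.

          Code:
          . di "A00" == substr("A00",1,2)
          0
          
          . di substr("A00",1,2) == "A0"
          1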



          • #6
            Nick Cox Thanks. I made the same syntax mistake as in the first post of this thread, even after some in-depth reading, including your article in the Stata Journal.
            Many thanks.

            Regarding my second question of trying to replace all values between A20 - A29
            replace diagx`p' = 1 if diagnosis`p' >= "A20" & diagnosis`p' <= "A29"

            I have tried using these commands (as above). Is there another command I should use that will perhaps work with string values?



            • #7
              Code:
              h inrange()
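
              A short sketch of how inrange() behaves with string arguments, which are compared in string sort order rather than numerically. The test values are illustrative only: "A20.1" is from the example data, while "A246" and "A29.5" are made up. Note that "A246" sorts between "A20" and "A29", and that "A29.5" sorts after "A29" and so falls outside the range.

              Code:
              . di inrange("A20.1","A20","A29")
              1
              
              . di inrange("A246","A20","A29")
              1
              
              . di inrange("A29.5","A20","A29")
              0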



              • #8
                Regarding my second question of trying to replace all values between A20 - A29
                replace diagx`p' = 1 if diagnosis`p' >= "A20" & diagnosis`p' <= "A29"

                I have tried using these commands (as above). Is there another command I should use that will perhaps work with string values?
                Why do you think this will not work with string values? Is it perhaps because you have diagnosis codes such as "A20.1" (from your example above) or "A246" (this is a guess)? Nick Cox wrote in post #5
                Your example needs substantive knowledge to be understood fully. Which diagnoses correspond to stroke and which to diabetes?
                and you have not addressed this. Is "A20.1" a code for a stroke? Would "A246" or something similar also be a code for a stroke?

                Tell us in words: if we look at a code how do we know it is the code for a stroke?
                • Does it start "A2" - so "A20.1" would be a stroke and "A246" would be a stroke?
                • Does it start "A" followed by a number at least 20 and less than 30 - so "A20.1" would be a stroke and "A246" would not be a stroke?
                • ???
                With that said, perhaps the following approach will start you in a useful direction; you can modify it to suit your definition of the coding for a stroke.
                Code:
                clear
                input str8 (diagnosis1 diagnosis2 diagnosis3)
                "A00" "A20" "A50"
                "A01" "A20.1" "A64"
                "A02" "A28" "A246"
                end
                
                label define diagx 1 "stroke" 2 "diabetes" 3"other"
                generate letter = ""
                generate number = .
                forvalues p = 1/3 {
                    generate diagx`p' = 0
                    replace letter = substr(diagnosis`p',1,1)
                    replace number = real(substr(diagnosis`p',2,.))
                    replace diagx`p' = 1 if letter=="A" & number>=20 & number<30
                    label values diagx`p' diagx
                }
                drop letter number
                list, clean noobs abbreviate(12)
                Code:
                . list, clean noobs abbreviate(12)
                
                    diagnosis1   diagnosis2   diagnosis3   diagx1   diagx2   diagx3  
                           A00          A20          A50        0   stroke        0  
                           A01        A20.1          A64        0   stroke        0  
                           A02          A28         A246        0   stroke        0
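
                For comparison, a sketch of the first reading in the list above (any code starting "A2" counts as a stroke), using the same example data. The variable names prefix1 to prefix3 are made up for illustration. The only value treated differently from the output above is "A246", which is flagged here but not by the numeric rule.

                Code:
                clear
                input str8 (diagnosis1 diagnosis2 diagnosis3)
                "A00" "A20" "A50"
                "A01" "A20.1" "A64"
                "A02" "A28" "A246"
                end
                
                * prefix`p' is 1 if the code starts with "A2" and 0 otherwise
                forvalues p = 1/3 {
                    generate byte prefix`p' = substr(diagnosis`p',1,2) == "A2"
                }
                list, clean noobs abbreviate(12)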
                Last edited by William Lisowski; 22 Aug 2022, 07:54.



                • #9
                  Thank you all, at last after lots of reading I've managed to fix my code

                  William Lisowski I was trying to replace 1 with label 'stroke' for all the values between A0 - A28. I had my command for labelling stroke prior to this (not shown in this post)

                  For anyone who may be using the thread, I used this

                  forvalues p = 1/3 {
                  generate diagx`p' = 0
                  replace diagx`p' = 1 if inrange(diag_0`p', "A2", "A28")
                  label values diagx`p' diagx
                  }



                  • #10
                    I was trying to replace 1 with label 'stroke' for all the values between A0 - A28
                    Do you understand that A0, A1, A10, A11, ..., A19 will not be labelled stroke by the code in post #9?



                    • #11
                      Originally posted by William Lisowski

                      Do you understand that A0, A1, A10, A11, ..., A19 will not be labelled stroke by the code in post #9?
                      Fair point! Thanks for highlighting this. Yes, I was trying to label all the values between A2 and A44 (including codes with decimal points, e.g. A2.14, and codes such as A214, because the decimal points in the dataset aren't coded) and to replace them with 1, representing stroke.

                      clear
                      input str4 diag_1 str5 diag_2 str3 diag_3 float(diagx1 diagx2 diagx3)
                      "A1" "A20.1" "B4" 0 1 0
                      "A2" "A3" "B5" 1 0 0
                      "A20" "B1" "A29" 1 0 0
                      "A205" "B2" "B1" 1 0 0
                      end
                      label values diagx1 diagx
                      label values diagx2 diagx
                      label values diagx3 diagx
                      label def diagx 1 "stroke", modify

                      My A3 wasn't labelled

                      label define diagx 1 "stroke" 2 "diabetes" 3 "other"
                      forvalues p = 1/3 {
                      generate diagx`p' = 0
                      replace diagx`p' = 1 if inrange(diag_`p', "A2", "A28")
                      label values diagx`p' diagx
                      }





                      • #12
                        Comparing strings that contain numbers isn't like comparing numbers. The string "19" is not greater than the string "2". That is why I separated out the numeric part of the diagnosis code from the initial letter in post #8, and converted the numeric part to numbers.
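
                        A few display checks illustrate this, and show why "A3" is not picked up by the inrange() condition with limits "A2" and "A28" in post #11. The test values are taken from the examples in posts #10 and #11.

                        Code:
                        . di "19" > "2"
                        0
                        
                        . di 19 > 2
                        1
                        
                        . di inrange("A3","A2","A28")
                        0
                        
                        . di inrange("A10","A2","A28")
                        0
                        
                        . di inrange("A205","A2","A28")
                        1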



                        • #13
                          Thanks. I've already tried the real() function beforehand, but it won't work because of the 1 million rows I have.
