  • Extract last part (ICD10 code, mix of numbers and letters) of string variable to a new variable

    I'm trying to extract the ICD-10 code from a string variable. The strings vary in content and length, but each consists of the ICD-10 heading followed by the code in brackets. Here is one example: "Related To Length Of Gestation And Fetal Growth (P05–P08)". I'd like to extract P05–P08 into a separate variable.

    I've explored the split command and the regexm/regexs functions but cannot get it right.
    Does anybody know how to do this?

    Hanna

  • #2
    Welcome to Statalist, Hanna. Getting regular expressions right in Stata can be a challenge, but on my third or fourth try, this is what I came up with. Note that it will not work if there are parentheses other than around the ICD-10 code. That would complicate matters.
    Code:
    clear
    input str100 desc
    "Related To Length Of Gestation And Fetal Growth (P05–P08)"
    end
    generate icdm = regexm(desc,"\((.*)\)")
    generate icd = regexs(1) if icdm
    list, clean noobs
    Code:
                                                             desc   icdm       icd  
        Related To Length Of Gestation And Fetal Growth (P05–P08)      1   P05–P08
    I'll note that by using two generate commands, I showed the regexm/regexs functions in their greatest generality. The two commands can be reduced to a single command.
    Code:
    generate icd = regexs(1) if regexm(desc,"\((.*)\)")
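    If other parentheses do turn up in a description, the greedy pattern above would capture everything from the first "(" to the last ")". One possible refinement is to anchor the pattern to the code format at the end of the string. The sketch below assumes Stata 14 or later (for the Unicode regex functions ustrregexm and ustrregexs) and that the separator is either a plain hyphen or an en dash. The second description is made up purely to test the extra-parentheses case.

    ```stata
    * Sketch only: assumes Stata 14+ and a hyphen or en dash between the codes
    clear
    input str100 desc
    "Related To Length Of Gestation And Fetal Growth (P05–P08)"
    "Disorders (Other) Of The Newborn (P90–P96)"
    end
    * Letter + two digits, hyphen or en dash, letter + two digits,
    * enclosed in parentheses at the very end of the string
    generate icd = ustrregexs(1) if ustrregexm(desc, "\(([A-Z][0-9]{2}[-–][A-Z][0-9]{2})\)\s*$")
    list, clean noobs
    ```

    Observations that do not end in a code of this shape are left with an empty icd, which makes it easy to list and inspect the exceptions.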
    Last edited by William Lisowski; 21 Jun 2015, 18:43.


    • #3
      Try something like this:

      Code:
      clear
      set obs 1
      gen str70 v1="Related To Length Of Gestation And Fetal Growth (P05–P08)"
      gen str10 mycode=subinstr(substr(v1,strpos(v1,"(")+1,.),")","",.)
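      To make the nested call easier to follow, here is the same idea split into separate steps. The intermediate variable names start and tail are only illustrative and are not part of the original suggestion.

      ```stata
      * Same logic as the one-liner above, one step per command
      clear
      set obs 1
      gen str70 v1 = "Related To Length Of Gestation And Fetal Growth (P05–P08)"
      * Position just after the opening bracket
      gen start = strpos(v1, "(") + 1
      * Everything from that position to the end of the string
      gen str20 tail = substr(v1, start, .)
      * Remove the closing bracket, leaving only the code
      gen str10 mycode = subinstr(tail, ")", "", .)
      list mycode, clean noobs
      ```

      Like the regular expression approach, this relies on the code being the only bracketed text in the string.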
      Last edited by ben earnhart; 21 Jun 2015, 18:48.


      • #4
        Hi William,
        thanks for your very quick reply. Your suggestion did separate the ICD code (I left out the clear statement, as I do not wish to erase my data).
        Unfortunately I cannot follow your suggestion, as my variable 'diaguppergroup' contains 200 different diagnoses, and "Related To Length Of Gestation And Fetal Growth (P05–P08)" is only one of them. I'm trying to find a command where I do not need to specify what is written in the string (as this differs for all 200 diagnoses). The ICD code is always written at the end, always consists of a letter and two numbers, a hyphen, and another letter and two numbers, and is always within brackets (nothing else is within brackets).

        Thanks so much for trying to think of a solution with me.

        Hanna


        • #5
          Hanna --

          With neither William's code nor mine do you need to clear your data, nor would you want to, for obvious reasons. We just included a little bit of code to generate data like yours. In his case, the action starts on the "generate" line, and in my code, it starts on the second "gen" line.


          • #6
            Thanks so much, you have been very helpful. I tried
            generate icd = regexs(1) if regexm(varname,"\((.*)\)") and it works! H


            • #7
              Hanna -

              As Ben explained, both he and I needed to create data to demonstrate the syntax of the techniques we each described, and to provide you with a reproducible example that you could run to help you understand what each of us recommended. Your post gave just one example, and a very limited idea of what your data is like. (The FAQ linked to at the top of the page contains much good advice about posing questions in a way to maximize your chances of receiving helpful answers.) Listing five or ten examples in Stata with the list command and copying and pasting them into CODE blocks like the ones Ben and I used (again, these are described in the FAQ, and note that they prevent the "escaped characters" from disappearing as they did in the code you pasted into post #6) would have given readers like Ben and myself a better idea of the problem you needed to solve.

              If indeed your variable is named diaguppergroup, then the following will extract the ICD10 code.
              Code:
              generate icd10 = regexs(1) if regexm(diaguppergroup,"\((.*)\)")
              If, however, the ICD10 code is, as you described it, always found in the 8th to 2nd characters from the end of the string, then the following much more direct code will also do the trick.
              Code:
              generate icd10 = substr(trim(diaguppergroup),-8,7)
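              One caveat, offered as an assumption about your setup: in Stata 14 and later, substr() counts bytes rather than characters, and an en dash stored as UTF-8 occupies three bytes. In that case the positions above would be off and the result would likely be truncated. A character-based sketch using usubstr(), which exists only in Stata 14 or later, would be:

              ```stata
              * Sketch only: assumes Stata 14+ and UTF-8 data containing an en dash
              clear
              input str100 diaguppergroup
              "Related To Length Of Gestation And Fetal Growth (P05–P08)"
              end
              * Count characters, not bytes: start 8 characters from the end, take 7
              generate icd10 = usubstr(trim(diaguppergroup), -8, 7)
              list icd10, clean noobs
              ```

              If the separator in your data is a plain hyphen, or you are on Stata 13 or earlier, the substr() version should behave as described.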
              Last edited by William Lisowski; 21 Jun 2015, 20:06.
