  • Group values of a string variable by keywords

    Dear Statalist,

    I am struggling with a specific task of simplifying my data.

    I am working with diseases and, for my further research, I would like to group them by specific keywords.
    As there are over 150,000 possible diseases in my dataset, it is impossible to group them by hand.

    So what I would like to do is search within the string value of my disease variable "disease" and replace the value in my "disease_group" variable with the keyword.

    Here is an example of my data:
    As a remark: disease_group is missing in all observations at the moment.
    I hope I have posted the data in an understandable fashion.


    Code:
    clear
    input str5 icd10 str222 disease str1 disease_group
    "A00"  "Cholera"                                                                                                                                             ""
    "A000" "Cholera durch Vibrio cholerae O:1, Biovar cholerae"                                                                                                  ""
    "A001" "Cholera durch Vibrio cholerae O:1, Biovar eltor"                                                                                                     ""
    "A009" "Cholera, nicht näher bezeichnet"                                                                                                                  ""
    "A01"  "Typhus abdominalis und Paratyphus"                                                                                                                   ""
    "A010" "Typhus abdominalis"                                                                                                                                  ""
    "A011" "Paratyphus A"                                                                                                                                        ""
    "A012" "Paratyphus B"                                                                                                                                        ""
    "A013" "Paratyphus C"                                                                                                                                        ""
    "A014" "Paratyphus, nicht näher bezeichnet"                                                                                                               ""
    "A02"  "Sonstige Salmonelleninfektionen"                                                                                                                     ""
    "A020" "Salmonellenenteritis"                                                                                                                                ""
    "A021" "Salmonellensepsis"                                                                                                                                   ""
    "A022" "Lokalisierte Salmonelleninfektionen"                                                                                                                 ""
    "A028" "Sonstige näher bezeichnete Salmonelleninfektionen"                                                                                                ""
    "A029" "Salmonelleninfektion, nicht näher bezeichnet"                                                                                                     ""
    "A03"  "Shigellose [Bakterielle Ruhr]"                                                                                                                       ""
    "A030" "Shigellose durch Shigella dysenteriae"                                                                                                               ""
    "A031" "Shigellose durch Shigella flexneri"                                                                                                                  ""
    "A032" "Shigellose durch Shigella boydii"                                                                                                                    ""
    "A033" "Shigellose durch Shigella sonnei"                                                                                                                    ""
    end
    As an example, I want to group all diseases containing the keyword "typhus" (as in A011 - A014) into one disease_group with the value "Typhus".
    The goal is to break the diseases down into around 100 disease_groups, so I will have to do this several times for different keywords.

    I tried the foreach command in combination with lookfor, but did not get close to a solution.

    I would really appreciate some help and hope to have explained my problem adequately.

    With kind regards,
    Torben

  • #2
    Torben, you have ICD-10 codes. They are already broken down into a hierarchy; the link shows all the infectious diseases. The first 3 characters correspond to a hierarchy, e.g. anything beginning with "A03" has to be some type of shigellosis.

    You should note that Stata has a built-in command to ease working with ICD-10 codes. For those of us stuck with ICD-9, i.e. the United States, there's another command for that as well, but I digress. The -icd10- command enables you to generate codes from ranges. For example,

    Code:
    icd10 generate shigellosis = icd10, range(A03*)
    icd10 generate paratyphus = icd10, range(A011/A014)
    That's one option to group diseases.
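
    If the three-character level is fine-grained enough, a quick alternative is to cut the code itself. This is only a sketch: it assumes the codes are stored without a dot, as in the example data, and that each three-character parent row (e.g. "A01") is present in the dataset. The names icd3 and category are made up for illustration.

    Code:
    * first three characters identify the ICD-10 category
    generate str3 icd3 = substr(icd10, 1, 3)
    * within each category the parent code sorts first, so borrow its description
    bysort icd3 (icd10): generate category = disease[1]

    In the example data, A010 to A014 would then all carry the description of A01.
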
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Weiwen's solution is the sensible way to go in your situation, but a direct answer to your question is the following:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str5 icd10 str222 disease str33 disease_group
      "A00"  "Cholera"                                              ""
      "A000" "Cholera durch Vibrio cholerae O:1, Biovar cholerae"   ""
      "A001" "Cholera durch Vibrio cholerae O:1, Biovar eltor"      ""
      "A009" "Cholera, nicht näher bezeichnet"                   ""
      "A01"  "Typhus abdominalis und Paratyphus"                    ""
      "A010" "Typhus abdominalis"                                   ""
      "A011" "Paratyphus A"                                         ""
      "A012" "Paratyphus B"                                         ""
      "A013" "Paratyphus C"                                         ""
      "A014" "Paratyphus, nicht näher bezeichnet"                ""
      "A02"  "Sonstige Salmonelleninfektionen"                      ""
      "A020" "Salmonellenenteritis"                                 ""
      "A021" "Salmonellensepsis"                                    ""
      "A022" "Lokalisierte Salmonelleninfektionen"                  ""
      "A028" "Sonstige näher bezeichnete Salmonelleninfektionen" ""
      "A029" "Salmonelleninfektion, nicht näher bezeichnet"      ""
      "A03"  "Shigellose [Bakterielle Ruhr]"                        ""
      "A030" "Shigellose durch Shigella dysenteriae"                ""
      "A031" "Shigellose durch Shigella flexneri"                   ""
      "A032" "Shigellose durch Shigella boydii"                     ""
      "A033" "Shigellose durch Shigella sonnei"                     ""
      end
      replace  disease_group ="Typhus" if substr(disease, strpos(lower(disease),"typhus"), .)!=""
      Resulting in

      Code:
             icd10                                              disease   diseas~p  
        1.     A00                                              Cholera             
        2.    A000   Cholera durch Vibrio cholerae O:1, Biovar cholerae             
        3.    A001      Cholera durch Vibrio cholerae O:1, Biovar eltor             
        4.    A009                     Cholera, nicht näher bezeichnet             
        5.     A01                    Typhus abdominalis und Paratyphus     Typhus  
        6.    A010                                   Typhus abdominalis     Typhus  
        7.    A011                                         Paratyphus A     Typhus  
        8.    A012                                         Paratyphus B     Typhus  
        9.    A013                                         Paratyphus C     Typhus  
       10.    A014                  Paratyphus, nicht näher bezeichnet     Typhus  
       11.     A02                      Sonstige Salmonelleninfektionen             
       12.    A020                                 Salmonellenenteritis             
       13.    A021                                    Salmonellensepsis             
       14.    A022                  Lokalisierte Salmonelleninfektionen             
       15.    A028   Sonstige näher bezeichnete Salmonelleninfektionen             
       16.    A029        Salmonelleninfektion, nicht näher bezeichnet             
       17.     A03                        Shigellose [Bakterielle Ruhr]             
       18.    A030                Shigellose durch Shigella dysenteriae             
       19.    A031                   Shigellose durch Shigella flexneri             
       20.    A032                     Shigellose durch Shigella boydii             
       21.    A033                     Shigellose durch Shigella sonnei
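
      Since the plan is to repeat this for many keywords, the same idea could be put in a loop. A minimal sketch, assuming two parallel lists (the keywords and labels below are only illustrative, not a vetted classification) and that an observation should keep the first group it matches:

      Code:
      * hypothetical keyword stems and the group label for each, in matching order
      local keywords "cholera typhus salmonell shigell"
      local labels   "Cholera Typhus Salmonellosis Shigellosis"
      local n : word count `keywords'
      forvalues i = 1/`n' {
          local k : word `i' of `keywords'
          local g : word `i' of `labels'
          * strpos() returns 0 when the keyword is absent; only fill groups that are still empty
          replace disease_group = "`g'" if strpos(lower(disease), "`k'") > 0 & disease_group == ""
      }

      Note that lower() only changes ASCII letters, so keywords containing umlauts may be safer with ustrlower() (Stata 14 or newer).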



      • #4
        Thank you both a lot!
        That really helped
