Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Manipulating with string variables

    Hi all,

    I am using STATA 15.1 and I have a new problem and it is a challenge, at least for me if I want to solve it in most efficient way. After searching the forum for answers its seems that has not been illustrated in details and examples. So, let me ask these please:

    1) How to check if the string varlist var1-var16 has period e.g C45.1 - D25.9 etc.?
    2) How to remove that period and keep the digits that followed it. e.g It should be C451 and D259? Preferably without destring.

    3) Now regarding the data; I have ICD10 variables for diseases s_40006_0_0 to s_40006_16_0 BUT s_40006_12_0 and s_40006_14_0 are not there. Also, I have diagnosis age variables n_40008_0_0 up to n_40008_31_0 "Sequentially completed".

    Here what I have tried:
    Code:
    //// Cases ////
    generate hasC61= 0
    foreach var of varlist s_40006_0_0- s_40006_16_0{
    replace  hasC61= 1 if strpos(`var',"C61")
    }
    tab hasC61
    It worked but if I want to add other diseases codes along with C61 it returns zero obs which is not true.
    Code:
    //// Cases ////
    generate cases= 0
    foreach var of varlist s_40006_0_0 - s_40006_16_0 {
    replace  cases= 1 if strpos(`var',"C61 D400")
    }
    tab cases
    The second inquiry is I tried the codes I have for numeric variables, thanks to Clyde Schechter which is to pair the lowest age value with corresponding diagnosis code using " " but it doesn't work here. Also, I wonder if it is possible to do paring where there are two missing part in s_40006 as mentioned above?

    The code I used for coding age is:
    Code:
    ds n_40006*
    local n40006s `r(varlist)'
    
    gen Age = .
    
    foreach v of varlist `n40006s' {
    local pairing: subinstr local v "40006" "40008"
    confirm numeric var `pairing'
    replace Age = `pairing' if `v' == "C61" & `pairing' < Age
    The sample data as follows:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5(s_40006_0_0 s_40006_1_0 s_40006_2_0 s_40006_3_0 s_40006_4_0 s_40006_5_0) double(n_40008_0_0 n_40008_1_0 n_40008_2_0 n_40008_3_0 n_40008_4_0 n_40008_5_0)
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    "C509" ""     "" "" "" ""              56.4 56.4064       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    "C504" ""     "" "" "" ""              68.5 68.5024 68.5024       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    "C187" ""     "" "" "" ""              57.8 57.7507 57.7507       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    "C64"  ""     "" "" "" ""              56.5 56.5022 56.5022       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    "C187" "C508" "" "" "" "" 37.20000000000001    54.4 54.4323 37.2027 54.4323 .
    ""     ""     "" "" "" ""                 .       .       .       .       . .
    end

    Any suggestions would be highly appreciated.
    Last edited by Mohammad Aladwani; 12 Nov 2019, 17:12.

  • #2
    For your questions 1 and 2, it's this:

    Code:
    foreach v of varlist var1-var16 {
        replace `v' = subinstr(`v', ".", "', .)
    }
    Notice that there is no need for a separate command to "check" for periods. You just delete them all.

    As for the rest, I do not understand what you want to do. In the previous situation there was a correspondence between diagnostic code variables and corresponding age variables. But your present data does not have such a correspondence. So how do you find ages from non-exitent age variables or ages for non-existent diagnostic codes? Clearly I'm missing something here.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      For your questions 1 and 2, it's this:

      Code:
      foreach v of varlist var1-var16 {
      replace `v' = subinstr(`v', ".", "', .)
      }
      Notice that there is no need for a separate command to "check" for periods. You just delete them all.

      As for the rest, I do not understand what you want to do. In the previous situation there was a correspondence between diagnostic code variables and corresponding age variables. But your present data does not have such a correspondence. So how do you find ages from non-exitent age variables or ages for non-existent diagnostic codes? Clearly I'm missing something here.
      Thanks Clyde Schechter for your support as usual!.

      I managed to find a command to see whether there is a comma or not, but you have to put each var separately. I will put it just in case someone might need it.
      Code:
      1 var1 if strpos(var1, ".")
      Now for the rest,

      1) I want to search through the string varlist if it contain a range of string variables. For example, I want to generate a variable called benign that contain codes from D10 to D36. When I put:

      Code:
      gen benign= 0
      
      foreach var of varlist s_40006_0_0 - s_40006_16_0 {
       replace benign= 1 if inlist(substr(`var',1,4), "D10" - "D36") & n_31_0_0 !=0
       }
      Stata said type mismatch!

      I even tried:
      Code:
       generate ben = 0
      
      
      foreach var of varlist s_40006_0_0 - s_40006_16_0 {
      
      replace  ben = 1 if strpos(`var',"D10-D36") & n_31_0_0 !=0
      
      }
      But it returns zero matches!

      2) The variables for cancer codes are:
      Code:
      s_40006_0_0 s_40006_1_0 s_40006_2_0 s_40006_3_0 s_40006_4_0 s_40006_5_0 s_40006_6_0 s_40006_7_0 s_40006_8_0 s_40006_9_0 s_40006_10_0 s_40006_11_0 s_40006_13_0 s_40006_15_0 s_40006_16_0
      And the variables for age at cancer diagnosis are:
      Code:
      n_40008_0_0 n_40008_1_0 n_40008_2_0 n_40008_3_0 n_40008_4_0 n_40008_5_0 n_40008_6_0 n_40008_7_0 n_40008_8_0 n_40008_9_0 n_40008_10_0 n_40008_11_0 n_40008_12_0 n_40008_13_0 n_40008_14_0 n_40008_15_0 n_40008_16_0 n_40008_17_0 n_40008_18_0 n_40008_19_0 n_40008_20_0 n_40008_21_0 n_40008_22_0 n_40008_23_0 n_40008_24_0 n_40008_25_0 n_40008_26_0 n_40008_27_0 n_40008_28_0 n_40008_29_0 n_40008_30_0 n_40008_31_0
      If you notice, age variable are from 0_0 to 31_0 with no missing variables. And cancer code variable are from 0_0 to 16_0 with two missing which are 12_0 and 14_0. So, I want to pair the lowest age variable like this:
      Code:
      s_40006_0_0 = n_40008_0_0 (If s_40006_0_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_1_0 = n_40008_1_0 (If s_40006_1_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_2_0 = n_40008_2_0 (If s_40006_2_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_3_0 = n_40008_3_0 (If s_40006_3_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_4_0 = n_40008_4_0 (If s_40006_4_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_5_0 = n_40008_5_0 (If s_40006_5_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_6_0 = n_40008_6_0 (If s_40006_6_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_7_0 = n_40008_7_0 (If s_40006_7_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_8_0 = n_40008_8_0 (If s_40006_8_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_9_0 = n_40008_9_0 (If s_40006_9_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_10_0 = n_40008_10_0 (If s_40006_10_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_11_0 = n_40008_11_0 (If s_40006_11_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_13_0 = n_40008_13_0 (If s_40006_13_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_15_0 = n_40008_15_0 (If s_40006_15_0 == "C61 | D400") & n_31_0_0 !=0
      s_40006_16_0 = n_40008_16_0 (If s_40006_16_0 == "C61 | D400") & n_31_0_0 !=0
      Notice that we skipped 12_0 and 14_0 here as well as age variables above 16_0 because we don't want it. Also similar to the previous case if possible, I want the lowest age values because there are some obs that has overlapping.


      Thanks in Advance!

      Comment


      • #4
        Basically the code for age is like this:
        Code:
        gen age0_0= n_40008_0_0 if s_40006_0_0 =="C61" 
        gen age1_0= n_40008_1_0 if s_40006_1_0 =="C61"
        gen age2_0= n_40008_2_0 if s_40006_2_0 =="C61"
        gen age3_0= n_40008_3_0 if s_40006_3_0 =="C61"
        gen age4_0= n_40008_4_0 if s_40006_4_0 =="C61"
        gen age5_0= n_40008_5_0 if s_40006_5_0 =="C61"
        gen age6_0= n_40008_6_0 if s_40006_6_0 =="C61"
        gen age7_0= n_40008_7_0 if s_40006_7_0 =="C61"
        gen age8_0= n_40008_8_0 if s_40006_8_0 =="C61"
        gen age9_0= n_40008_9_0 if s_40006_9_0 =="C61"
        gen age10_0= n_40008_10_0 if s_40006_10_0 =="C61"
        gen age11_0= n_40008_11_0 if s_40006_11_0 =="C61"
        gen age13_0= n_40008_13_0 if s_40006_13_0 =="C61"
        gen age15_0= n_40008_15_0 if s_40006_15_0 =="C61"
        gen age16_0= n_40008_16_0 if s_40006_16_0 =="C61"
        
        gen age0_1= n_40008_0_0 if s_40006_0_0 == "D400" 
        gen age1_1= n_40008_1_0 if s_40006_1_0 == "D400" 
        gen age2_1= n_40008_2_0 if s_40006_2_0 == "D400" 
        gen age3_1= n_40008_3_0 if s_40006_3_0 == "D400" 
        gen age4_1= n_40008_4_0 if s_40006_4_0 == "D400" 
        gen age5_1= n_40008_5_0 if s_40006_5_0 == "D400" 
        gen age6_1= n_40008_6_0 if s_40006_6_0 == "D400" 
        gen age7_1= n_40008_7_0 if s_40006_7_0 == "D400" 
        gen age8_1= n_40008_8_0 if s_40006_8_0 == "D400" 
        gen age9_1= n_40008_9_0 if s_40006_9_0 == "D400" 
        gen age10_1= n_40008_10_0 if s_40006_10_0 == "D400" 
        gen age11_1= n_40008_11_0 if s_40006_11_0 == "D400" 
        gen age13_1= n_40008_13_0 if s_40006_13_0 == "D400" 
        gen age15_1= n_40008_15_0 if s_40006_15_0 == "D400" 
        gen age16_1= n_40008_16_0 if s_40006_16_0 == "D400"
        
        egen Age1 = rowmin(age0_0- age16_0)
        egen Age2 = rowmin( age0_1 - age16_1 )
        egen Age0 = rowmin( Age1 - Age2 )

        Comment


        • #5
          1) I want to search through the string varlist if it contain a range of string variables. For example, I want to generate a variable called benign that contain codes from D10 to D36.
          Sometimes these code has three digits like D10.5 and D21.4 but in the data its like D105 and D214. So I want the range include them as well no matter what comes after D10 and D36. Thanks!

          Comment


          • #6
            Originally posted by Mohammad Aladwani View Post

            Sometimes these code has three digits like D10.5 and D21.4 but in the data its like D105 and D214. So I want the range include them as well no matter what comes after D10 and D36. Thanks!
            Please help! I am stuck now on how to generate new variable that contains only this range of string values and the rest is missing. I would appreciate any suggestions as I have been searching and trying for several days now.

            Comment


            • #7
              It might help to know about inlist() so that

              Code:
              ... if inlist(whatever, "a", "b", "c")


              is another way to write

              Code:
              ... if whatever == "a" | whatever == "b" | whatever == "c"
              whereas


              Code:
              ... if whatever == "a" | "b" | "c"
              is never legal and

              Code:
              ... if whatever == "a | b | c"
              is legal, but not at all equivalent.

              Note that the three dots ... mean "other stuff goes here" and are not part of the syntax.


              Last edited by Nick Cox; 14 Nov 2019, 08:13.

              Comment


              • #8
                Originally posted by Nick Cox View Post
                It might help to know about inlist() so that

                Code:
                … if inlist(whatever, "a", "b", "c")


                is another way to write

                Code:
                if whatever == "a" | whatever == "b" | whatever == "c"
                whereas


                CODE]
                if whatever == "a" | "b" | "c"
                [/CODE]

                is never legal and

                CODE]
                if whatever == "a | b | c"
                [/CODE]

                is legal, but not at all equivalent.






                Thanks Nick! I was waiting you to reply on one of my questions .

                I wonder if inlist will work with range of values? i.e finding only values "D10" up to "D36" and other is missing from variables s_40006_0_0 to s_40006_16_0.

                To give you an example of what I want:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str5(s_40006_0_0 s_40006_1_0 s_40006_2_0 s_40006_3_0 s_40006_4_0 s_40006_5_0)
                "D352" "C449" "" "" "" ""
                "D043" "D329" "" "" "" ""
                "C509" "D352" "" "" "" ""
                "D352" ""     "" "" "" ""
                "D333" "C445" "" "" "" ""
                "D320" "C449" "" "" "" ""
                "C750" "D351" "" "" "" ""
                "D320" ""     "" "" "" ""
                "D333" ""     "" "" "" ""
                "D320" ""     "" "" "" ""
                "D352" ""     "" "" "" ""
                "D333" ""     "" "" "" ""
                "D320" ""     "" "" "" ""
                "D320" "C443" "" "" "" ""
                "D352" ""     "" "" "" ""
                end

                In obs #1, obs#3, obs# 5, obs#6, obs# 7 and obs#14 it should be 0 because there are other codes than the range D10-D36. In obs#2 it is =1 as well as the rest where there is/are only the range of D10-D36.

                Having said that, what if I want to include other range to the previous range. For example, to generate a new variable that contain only string values from D10 to D36 AND O00 to O08?



                Hope I explain what I want properly.
                Last edited by Mohammad Aladwani; 14 Nov 2019, 08:30.

                Comment


                • #9
                  inlist() requires a list. inrange() might be closer to what you want.

                  Disclaimer: I haven't read most of this thread given (1) the shortness of life (2) a multiplicity of commitments (3) Clyde Schechter 's outstanding support.

                  Comment


                  • #10
                    Completely understandable. Thanks anyway Nick!. I guess I will have to wait Clyde Schechter 's support. Meanwhile I will continue searching.

                    Comment


                    • #11
                      Sorry, but being flippant about my answer doesn't mean that I thought it was useless. Do you believe #9 is irrelevant? Did you understand it? Did you even try it?

                      Comment


                      • #12
                        Not at all Nick. Well, I consider myself a beginner in STATA, especially programming and manipulating with data. If you give courses I would love to learn. Anyway I tried this:

                        Code:
                        gen hasD=0
                        replace hasD=1 if inrange (s_40006_0_0, "D10", "D36")
                        it works but I can't put varlist like s_40006_* or s_40006_0_0 - s_40006_16_0. The other thing is that it also returns observations where there are other codes in the rest of varlist. Also, I want to include other range with it which is (O00 to O08). Clearly I am missing something here and it is beyond my capabilities and that's why I consider myself a beginner. I have construct this silly data to show what I want.

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input str4(var1 var2 var3) float hasD_and_O
                        "D105" " "    ""     1
                        "C69"  "D24"  ""     0
                        "O08"  "D300" ""     1
                        "D40"  "D14"  ""     0
                        "D32"  "D181" ""     1
                        "D160" "D246" "D305" 1
                        end


                        Thanks again!

                        Comment


                        • #13
                          There are several questions floating around in this thread, and I find it confusing to put them all together in a single post. So I'll just respond here to the problem raised in #12. The code you show in your upper code box works well for a single variable. Now all you have to do is wrap that inside a loop over variables:

                          Code:
                          gen hasD = 0
                          foreach v of varlist s_40006_* { // OR WHATEVER VARLIST IS APPRPORIATE
                              
                          replace hasD=1 if inrange (`v', "D10", "D36") }


                          Now, I'm going to point out a potential problem with your data and this code.

                          Code:
                          . display inrange("D101", "D10", "D16")
                          1
                          In other words, because Stata sorts strings alphabetically and does not recognize sequences of digits as numbers in this context, for Stata, D101 is between D10 and D16. Now if D101 is actually an abbreviation for D10.1, then this is perhaps no problem. But if you have some four character codes where the last three actually represent an integer > 99, you are going to get wrong answers here. I can't even propose a workaround for this because there is no way within the data, to distinguish these two possible interpretations of a letter followed by three digits. It could only be done by having external knowledge of how this code system works: and Stata does not have that.

                          Perhaps you are lucky and this is not even an issue: perhaps there are no codes with an integer > 99 (or, at least, none in your particular data set.) But if there are, then you may have to go back to the beginning and re-generate your data set so as to preserve the decimal points. You will also then need to switch from using -inrange()- to looking for string matches because "D101" is still, in the Stata sort order, between D10.0 and D16.0. So, for example, the -replace statement might have to change to something like:

                          Code:
                          replace hasD = 1 if strmatch(`v', "D1?.0") & inrange(substr(`v', 3, 1), "1", "6")
                          Now, I think I remember somewhere in the thread you may have mentioned that these are ICD10 codes. If so, you probably can simplify your life considerably by making use of Stata's -icd10- commands. Read both -help icd10- and the section of the PDF manual that is linked (in blue) near the top of that page. It can, in particular, facilitate the tasks of checking and validating these codes, and also it has a command that can help you generate variables that select out particular groups of codes, including ranges.

                          Comment


                          • #14
                            All sorted out. Thanks a lot Clyde Schechter for your time and inputs.

                            Comment

                            Working...
                            X