Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Looping over multiple levels of multiple variables

    First of all, my solution works but it is slow. Maybe I should just be content that it works but I am imagining an increasingly complicated dataset where my solution would be painfully slow! Is Stata efficient with foreach loops? Is there a more efficient data form? In R I would use matrices.

    I am using public-use CDC multiple cause mortality data. http://www.nber.org/data/vital-stati...-of-death.html They are very large files.

    The multiple causes of deaths are coded (by ICD-10 codes) in 20 variables named record_1 to record_20. I would like to select only observations that have certain multiple causes of death (at least one of list A AND at least one of list B). There are hundreds (thousands?) levelsof(record_*) but I am only interested in these 3 + 15. record_* variables are type string and potentially have the same possible levels but not all of the levels are observed in all of the variables.

    Code:
    capture drop OD
    g OD = .
    
    foreach var_A of varlist record_* {
        foreach MCD_A in T402 T403 T404 {
                    replace OD = 1 if `var_A' == "`MCD_A'"
                    }
                }
    
    keep if OD == 1
    
    foreach var_B of varlist record_* {
        foreach MCD_B in X40 X41 X42 X43 X44 X60 X61 X62 X63 X64 X85 Y10 Y11 Y12 Y13 Y14 {
            replace OD = 2 if `var_B' == "`MCD_B'"
            }
        }
    
    keep if OD == 2
    So I loop over the first 3 of from List A for each of the 20 record_* variables. Then I keep only those that match. This first step reduces the number of observations from 2,394,871 to ~10,000 Then I loop over the 15 from list B in each of the 20 record_* variables. I don't care where they show up in those variables or in which order. Then I keep all observations that meet the second criterion.

    These for loops work (I made them based on Nicholas Cox's excellent for loop "fortitude" tutorial). But it seems inefficient. Because then I do this for each of the files from 1999-2014. Any thoughts? Posting sample data would be unwieldy, I looked for a relevant toy dataset but I couldn't really find anything helpful.

    Thanks!

  • #2
    Well, the inner loops are unnecessary and can be handled with the -inlist()- function. That, in turn, is part of Stata's C-coded core and should be more efficient than repeatedly interpreting comparisons. I don't know how much time it will really save you, but I can't imagine it will hurt.

    Code:
    capture drop OD
    g OD = .
    
    foreach var_A of varlist record_* {
        replace OD = 1 if inlist(`var_A', "T402", "T403", "T404")
    }
    
    keep if OD == 1
    
    foreach var_B of varlist record_* {
        replace OD = 2 if inlist(`var_B', "X40", "X41", "X42", "X43", "X44", "X60", "X61", "X62", "X63") ///
            | inlist(`var_B', "X64", "X85", "Y10", "Y11", "Y12", "Y13", "Y14")
    }
    
    keep if OD == 2
    An additional improvement might come from -encode-ing the record_* variables and then comparing to the appropriate numeric values; numeric comparisons should be faster than string comparisons. Of course, there will be the time used doing the -encode- command itself, but my guess is that doing that once will be faster than all those repeated string comparisons.

    Comment


    • #3
      I don't follow the context completely, but it looks like egen and bysort would work very well here..?

      Comment


      • #4
        Clyde's solution was very fast. I had shied away from inlist because of the 10 strings maximum and had not considered combining inlists with | .

        Comment


        • #5
          Originally posted by Melvin Donaldson View Post
          Clyde's solution was very fast. I had shied away from inlist because of the 10 strings maximum and had not considered combining inlists with | .
          Combining multiple insist() functions is a good work around. I prefer to use the regexm() function instead. So that might look like:

          Code:
           replace test = 1 if regexm(var_B, `"^T402$|^T403$|^T404$|^X40$|^X41$|^X42$|^X43$|^X44$|^X60$|^X61$|^X62$|^X63$|^X64$|^X85$|^Y10$|^Y11$|^Y12$|^Y13$|^Y14$"')
          instead of
          Code:
           
          replace OD = 1 if inlist(`var_A', "T402", "T403", "T404")
          *and
          replace OD = 2 if inlist(`var_B', "X40", "X41", "X42", "X43", "X44", "X60", "X61", "X62", "X63") ///
                  | inlist(`var_B', "X64", "X85", "Y10", "Y11", "Y12", "Y13", "Y14")
          I probably wouldn't type all the elements in the regexm() list though, I'd use something in the loop to build a macro containing the list with the regular expression anchors around each item in the list, something like:

          Code:
          local keeplist `""T402" "T403" "T404" "X40" "X41" "X42" "X43" "X44" "X60" "X61" "X62" "X63" "X64" "X85" "Y10" "Y11" "Y12" "Y13" "Y14""' //you can use this later if needed
              **process keeplist for regex -- adds the anchors for regexm**
                local keeplist2: subinstr local keeplist `"" ""' `"$|^"', all
                local keeplist2: subinstr local keeplist2 `"""' `"^"', 
                local keeplist2: subinstr local keeplist2 `"""' `"$"',     
          foreach var_A of varlist record_* {
              replace OD = 1 if regexm(`var_A', `"`keeplist2'"')
          }
          Putting the list of items in a macro (like 'keeplist' above) helps you recycle the list for other commands later in the command file.

          It's worth noting that separating out the replace command using 2 or 3 inlist() functions (because of the 9 item limit) slows down the loop that generates the variable OD above, and when I run the version using regexm() for all the elements it's about 1 second slower on my computer than the inlist() approach (even though a single inlist() function is definitely faster by itself). You can see this yourself by using the toy do-file code below which downloads, unzips, imports, combines the nber/mortality data and then creates "OD" using both the inlist() and the regexm() approaches.

          At minimum, this code might shorten the work for downloading and combining more years of nber/mortality data for your study. I wrote it because I was curious about the speed difference in inlist() vs. regexm() and thought I'd share it:

          Code:
          **--1.  make a master file of mortality data to work with**
              clear
              sa "masterfile.dta", replace emptyok //we'll use this later
          
              
          **--2.  get data for 99-2001 and combine it into one file**
              *! you can expand this to all years you need !*
          forval n = 1999/2001 { //takes a while to run, can expand to other years
              cap confirm file "mort`n'.dta"
              di _rc " for `n'" //no need to redownload/unzip if you have it.
          if _rc {
              copy http://www.nber.org/mortality/`n'/mort`n'.dta.zip "mort`n'.dta.zip", replace public   //may take a bit
              unzipfile "mort`n'.dta.zip", replace
                  }
              u `"mort`n'.dta"', clear 
              g fileyear = `n'
               cap tostring divstoc, replace  //doesnt match format across years
              append using "masterfile.dta"
           sa "masterfile.dta", replace
          }
          
          
          
          
          **--3.  here is the 1999/2001 data to manage:
          u "masterfile.dta", clear
              ta fileyear //combined for 3 years
              set rmsg on    //take a look at the timing difference
          
           **here are elements to loop over**
           local keeplist `""T402" "T403" "T404" "X40" "X41" "X42" "X43" "X44" "X60" "X61" "X62" "X63" "X64" "X85" "Y10" "Y11" "Y12" "Y13" "Y14""' //you can use this later if needed
          
          
          //--FIRST APPROACH IS WITH INLIST--//
          capture drop OD
          g OD = .
          
          foreach var_A of varlist record_* {
              replace OD = 1 if inlist(`var_A', "T402", "T403", "T404")
               replace OD = 1 if inlist(`var_A', "X40", "X41", "X42", "X43", "X44", "X60", "X61", "X62", "X63") ///
             | inlist(`var_A', "X64", "X85", "Y10", "Y11", "Y12", "Y13", "Y14")
          }
          
          
          
          //--NOW WITH REGEXM--//
          cap drop test
          g test = .
              **first, process keeplist for regexm**
              local keeplist2: subinstr local keeplist `"" ""' `"$|^"', all
              local keeplist2: subinstr local keeplist2 `"""' `"^"', 
              local keeplist2: subinstr local keeplist2 `"""' `"$"', 
          foreach var_A of varlist record_* {
              replace test = 1 if regexm(`var_A', `"`keeplist2'"')
          }
          
          assert OD == test //check equivalent
          ta OD test, mi
          
          
          
          
          //--COMPARE WITH 2 LOOPS  & INLIST: LOOPING OVER VALUES::
              **this takes nearly twice as long to process.
          g x =.
          foreach var_A in`keeplist' {
              foreach var_B of varlist record_* {
              replace x = 1 if inlist("`var_A'", `var_B') //switched elements
          }
          }
          
          ta OD x, mi




          It's probably a matter of preference, but I will caution/mention that if you leave off the anchors for regexm() you can run into issues with regexm(var, `"(T10|T11)"') matching things like "T101" or "T1111" ...inlist() avoids this potential mistake/issue by specifying the exact string you want to match. The contrary view is just as true though, that sometimes you want regexm() because of it's flexibility to match things like "T[0-9][0-9]x" where inlist() cannot.


          ______________
          One last thing I was curious about was if there was some kind of internal limit on regexm() in the same way there is for inlist(). For the example below (with anchors), the limit appears to be 1,992 elements. I'm not sure where this comes from (those elements create a regular expression that is 12,837 characters long, so I don't think that is related to any internal limits on macros/chars/stringL/etc?). Just fyi.

          Try for yourself:

          Code:
          **how many regexm elements**
          **Test regexm**
          
          clear
          set obs 10000
          g x = string(_n)
          
          
          count if inlist(x, "1", "2" , "3", "4", "5", "6", "7", "8", "9") //lim=9
          count if inlist(x, "1", "2" , "3", "4", "5", "6", "7", "8", "9") | inlist(x, "10") //limit is 9
          
          
          count if regexm(x, `"^1$|^2$|^3$|^4$|^5$|^6$|^7$|^8$|^9$"')  //works
          count if regexm(x, `"^1$|^2$|^3$|^4$|^5$|^6$|^7$|^8$|^9$|^10$|^11$|^12$|^13$|^14$|^15$|^16$|^17$|^18$|^19$"')  //works
          
          forval n = 1/1992 {
              loc nn `"`nn'|^`n'$"'
              }
              local nn:subinstr local nn "|" ""
              di `"`nn'"'
          count if regexm(x, `"`nn'"') //still works with 1,992 limit for elements ; fails for 1,993








          Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

          Comment

          Working...
          X