Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • parallel usage

    I sometimes run many regressions on each of 1000s of genes or proteins where subjects from an experiment are measured for each protein. I sometimes use Stata, sometimes R. In this case I'm using Stata because I want an interval censored Cox model. Stata offers a better implementation of this model in my opinion than R solutions. My question has to do with whether I can use the parallel package. Below is example code, that uses third party commands xsvmat and xframeappend to grab r(table) from each model and append into one results frame. xsvmat and xframeappend are from Roger Newson. To speed this up, I broke the 2755 genes into groups of 500 or so and ran those subsets in 5 separate Stata instances because I have a Windows laptop with 32gb of RAM and 14 cores. I manually appended the results into a single results file and ran model diagnostics on top biomarkers. While this approach worked just fine, I was curious about the use of parallel.

    Has anyone had a similar problem dataset (long formatted dataset with many endpoints) and used parallel on a Windows machine to collect model results for each endpoint?

    Code:
    frame reset    
    use mydata.dta, clear
    frame create cox_results
    local setA = "c.x1 i.x2"
    quietly {
        forvalues i = 1/2755 {
            noisily display `i'
            stintcox baseline_ntx `setA' if gene_id == `i', ///
                interval(ltime rtime) favorspeed
            xsvmat, from(r(table)') rowname(parm) names(col) idnum(`i') ///
                frame(outframe, replace)
            frame change cox_results
            xframeappend outframe, fast
            frame change default
        }
    }
    frame change cox_results
    save cox_setA_cov_results.dta, replace
    
    /*
    The mydata.dta variables are:
    
    gene_id subject ltime rtime baseline_ntx x1 x2
    
    2755 gene_id, ~800 subjects measured for each protein (gene_id)
    A cox model is run for each gene_id using a for loop
    */
    Last edited by Dave Airey; 04 Aug 2025, 08:49.

  • #2
    Hi Dave,

    You can certainly do something like we do in this example. You need to write a program that holds all the logic, iterating through gene ids after listing them. It would be something like this (my Stat is a bit rusty!):

    Code:
    prog def parfor
      /* some code to list available genes */
      /* and store it in genelist */
      foreach i in $genelist {
        ... your fancy code ...
        ... A step to save the i-th gene results...
      } 
    end
    
    parallel initialize 10
    parallel, prog(parfor): parfor
    
    /* Collect all the results */

    Comment


    • #3
      Thank you George Vega! So it is possible. I will give it a try. Cheers!

      Comment


      • #4
        It worked great George Vega! Just trying 500 genes in a stintcox model with no covariates and using 8 cores, the parallel code ran in 53 seconds whereas the sequential code ran in 184 seconds, 3.5x difference. Pretty nice.

        Code:
        clear all
        use mydata.dta, clear
        keep if gene_id <= 500
        
        program define parfor
        levelsof gene, local(mygenes)
        global mygenes = r(levels)
        foreach i of global mygenes {
            stintcox baseline_ntx if gene == "`i'", /// 
                interval(ltime rtime) favorspeed
            xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") /// 
                saving(.\parallel\gene_`i', replace)
        }
        end
        
        timer clear
        timer on 1
        parallel initialize 8
        parallel, prog(parfor): parfor
        timer off 1
        timer list
        
        use mydata.dta, clear
        keep if gene_id <= 500
        levelsof gene, local(mygenes)
        global mygenes = r(levels)
        timer on 2
        quietly {
        foreach i of global mygenes {
            stintcox baseline_ntx if gene == "`i'", /// 
                interval(ltime rtime) favorspeed
            xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") /// 
                saving(.\parallel\gene_`i', replace)
        }
        }
        timer off 2
        
        timer list
        . timer list
        1: 53.05 / 1 = 53.0550
        2: 184.05 / 1 = 184.0460
        Last edited by Dave Airey; 04 Aug 2025, 13:21.

        Comment


        • #5
          Hmmm.... this should be in StataNow! A nice complement to Stata MP, which I have and like. Oh, and I did check results against each other and they were the same.
          Last edited by Dave Airey; 04 Aug 2025, 13:51.

          Comment


          • #6
            Actually when I checked more carefully by a merge operation with assert commands, I discovered that 4/500 of the genes' data had been split across cores. There are 800+ observations on the same subjects per gene. This results in models being estimated in part of the 800 observations per gene. So the parfor program needs more information to split the data across cores but by gene.

            Comment


            • #7
              I think you can solve the issue by just reloading the original dataset within your program. By doing so, each gene is tested on the original, unsplit dataset.
              Best wishes

              Stata 18.0 MP | ORCID | Google Scholar

              Comment


              • #8
                Thanks Felix Bittmann , I solved this just now using a different method, but one illustrated in George Vega's gallery. I used one of the parallel macros "$pll_instance". I created a variable pid assigning an integer from 1 to the number of cores I wanted to each gene randomly. I then used that in the program to subset the data to a core. Now all my genes get processed and match the sequential results using a series of assert commands (not shown).

                Code:
                clear all
                use mydata.dta, clear
                keep if gene_id <= 500
                
                bysort gene: generate pid = runiformint(1,8) if _n == 1
                bysort gene: replace pid = pid[1]
                save mytempdata.dta, replace
                
                program define parfor
                use if pid == $pll_instance using mytempdata.dta, clear
                levelsof gene, local(mygenes)
                global mygenes = r(levels)
                foreach i of global mygenes {
                    stintcox baseline_ntx if gene == "`i'", ///
                        interval(ltime rtime) favorspeed
                    xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") ///
                        saving(.\parallel\gene_`i', replace)
                }
                end
                
                timer clear
                timer on 1
                parallel initialize 8
                parallel, prog(parfor): parfor
                timer off 1
                timer list
                
                use mytempdata.dta, clear
                levelsof gene, local(mygenes)
                global mygenes = r(levels)
                timer on 2
                quietly {
                foreach i of global mygenes {
                    stintcox baseline_ntx if gene == "`i'", ///
                        interval(ltime rtime) favorspeed
                    xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") ///
                        saving(.\parallel\gene_`i', replace)
                }
                }
                timer off 2
                
                timer list
                Last edited by Dave Airey; Yesterday, 11:40.

                Comment


                • #9
                  Actually, George pointed out offline there is a by option that keeps the by variable from being split up across child processes. Slapped forehead. That will work too.

                  Comment

                  Working...
                  X