Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • parallel usage

    I sometimes run many regressions on each of 1000s of genes or proteins where subjects from an experiment are measured for each protein. I sometimes use Stata, sometimes R. In this case I'm using Stata because I want an interval censored Cox model. Stata offers a better implementation of this model in my opinion than R solutions. My question has to do with whether I can use the parallel package. Below is example code, that uses third party commands xsvmat and xframeappend to grab r(table) from each model and append into one results frame. xsvmat and xframeappend are from Roger Newson. To speed this up, I broke the 2755 genes into groups of 500 or so and ran those subsets in 5 separate Stata instances because I have a Windows laptop with 32gb of RAM and 14 cores. I manually appended the results into a single results file and ran model diagnostics on top biomarkers. While this approach worked just fine, I was curious about the use of parallel.

    Has anyone had a similar problem dataset (long formatted dataset with many endpoints) and used parallel on a Windows machine to collect model results for each endpoint?

    Code:
    frame reset    
    use mydata.dta, clear
    frame create cox_results
    local setA = "c.x1 i.x2"
    quietly {
        forvalues i = 1/2755 {
            noisily display `i'
            stintcox baseline_ntx `setA' if gene_id == `i', ///
                interval(ltime rtime) favorspeed
            xsvmat, from(r(table)') rowname(parm) names(col) idnum(`i') ///
                frame(outframe, replace)
            frame change cox_results
            xframeappend outframe, fast
            frame change default
        }
    }
    frame change cox_results
    save cox_setA_cov_results.dta, replace
    
    /*
    The mydata.dta variables are:
    
    gene_id subject ltime rtime baseline_ntx x1 x2
    
    2755 gene_id, ~800 subjects measured for each protein (gene_id)
    A cox model is run for each gene_id using a for loop
    */
    Last edited by Dave Airey; 04 Aug 2025, 08:49.

  • #2
    Hi Dave,

    You can certainly do something like we do in this example. You need to write a program that holds all the logic, iterating through gene ids after listing them. It would be something like this (my Stat is a bit rusty!):

    Code:
    prog def parfor
      /* some code to list available genes */
      /* and store it in genelist */
      foreach i in $genelist {
        ... your fancy code ...
        ... A step to save the i-th gene results...
      } 
    end
    
    parallel initialize 10
    parallel, prog(parfor): parfor
    
    /* Collect all the results */

    Comment


    • #3
      Thank you George Vega! So it is possible. I will give it a try. Cheers!

      Comment


      • #4
        It worked great George Vega! Just trying 500 genes in a stintcox model with no covariates and using 8 cores, the parallel code ran in 53 seconds whereas the sequential code ran in 184 seconds, 3.5x difference. Pretty nice.

        Code:
        clear all
        use mydata.dta, clear
        keep if gene_id <= 500
        
        program define parfor
        levelsof gene, local(mygenes)
        global mygenes = r(levels)
        foreach i of global mygenes {
            stintcox baseline_ntx if gene == "`i'", /// 
                interval(ltime rtime) favorspeed
            xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") /// 
                saving(.\parallel\gene_`i', replace)
        }
        end
        
        timer clear
        timer on 1
        parallel initialize 8
        parallel, prog(parfor): parfor
        timer off 1
        timer list
        
        use mydata.dta, clear
        keep if gene_id <= 500
        levelsof gene, local(mygenes)
        global mygenes = r(levels)
        timer on 2
        quietly {
        foreach i of global mygenes {
            stintcox baseline_ntx if gene == "`i'", /// 
                interval(ltime rtime) favorspeed
            xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") /// 
                saving(.\parallel\gene_`i', replace)
        }
        }
        timer off 2
        
        timer list
        . timer list
        1: 53.05 / 1 = 53.0550
        2: 184.05 / 1 = 184.0460
        Last edited by Dave Airey; 04 Aug 2025, 13:21.

        Comment


        • #5
          Hmmm.... this should be in StataNow! A nice complement to Stata MP, which I have and like. Oh, and I did check results against each other and they were the same.
          Last edited by Dave Airey; 04 Aug 2025, 13:51.

          Comment


          • #6
            Actually when I checked more carefully by a merge operation with assert commands, I discovered that 4/500 of the genes' data had been split across cores. There are 800+ observations on the same subjects per gene. This results in models being estimated in part of the 800 observations per gene. So the parfor program needs more information to split the data across cores but by gene.

            Comment


            • #7
              I think you can solve the issue by just reloading the original dataset within your program. By doing so, each gene is tested on the original, unsplit dataset.
              Best wishes

              Stata 18.0 MP | ORCID | Google Scholar

              Comment


              • #8
                Thanks Felix Bittmann , I solved this just now using a different method, but one illustrated in George Vega's gallery. I used one of the parallel macros "$pll_instance". I created a variable pid assigning an integer from 1 to the number of cores I wanted to each gene randomly. I then used that in the program to subset the data to a core. Now all my genes get processed and match the sequential results using a series of assert commands (not shown).

                Code:
                clear all
                use mydata.dta, clear
                keep if gene_id <= 500
                
                bysort gene: generate pid = runiformint(1,8) if _n == 1
                bysort gene: replace pid = pid[1]
                save mytempdata.dta, replace
                
                program define parfor
                use if pid == $pll_instance using mytempdata.dta, clear
                levelsof gene, local(mygenes)
                global mygenes = r(levels)
                foreach i of global mygenes {
                    stintcox baseline_ntx if gene == "`i'", ///
                        interval(ltime rtime) favorspeed
                    xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") ///
                        saving(.\parallel\gene_`i', replace)
                }
                end
                
                timer clear
                timer on 1
                parallel initialize 8
                parallel, prog(parfor): parfor
                timer off 1
                timer list
                
                use mytempdata.dta, clear
                levelsof gene, local(mygenes)
                global mygenes = r(levels)
                timer on 2
                quietly {
                foreach i of global mygenes {
                    stintcox baseline_ntx if gene == "`i'", ///
                        interval(ltime rtime) favorspeed
                    xsvmat, from(r(table)') rowname(parm) names(col) idstr("`i'") ///
                        saving(.\parallel\gene_`i', replace)
                }
                }
                timer off 2
                
                timer list
                Last edited by Dave Airey; 05 Aug 2025, 11:40.

                Comment


                • #9
                  Actually, George pointed out offline there is a by option that keeps the by variable from being split up across child processes. Slapped forehead. That will work too.

                  Comment


                  • #10
                    Hello everyone,

                    I am also trying to use the parallel function but I am stuggling to make it work. Could you point me what am I missing?

                    I am running a very simple code like this one:

                    Code:
                    clear all
                    
                    set obs 1000
                    gen y=rnormal()
                    gen x1=rnormal()
                    gen x2=rnormal()
                    
                    
                    program define parfor
                    
                    foreach n of numlist 1/100 {
                    reghdfe y x1 x2
                    estimates save "$path/OUTPUT/try_loop/try_`n'", replace
                    }
                    end
                    
                    parallel initialize 4
                    parallel, prog(parfor): parfor
                    But I keep having an error as one of the child (or more) stops because cannot write the .ster file. Moreover, from the log files, it seems to me that each child performs all the loop, while I was expecting some child to do some part like from n=1/25 , the other n=25/50 etc.

                    Is there something I am doing wrong and something I am not grasping from how parallel work?

                    Comment


                    • #11
                      The problem is that parallel works a bit differently as you suppose in this example. parallel cannot magically read your dofile and find all relevant loops and then split your file up intelligently (sadly). What it does, it splits the dataset into 4 parts and then applies your code to all. Then it recombines the results into a single dataset. This can be helpful, but in your example. you want to distribute the loop across multiple threads but using the original complete datafile. This could be done as such:

                      Code:
                      clear all
                      global totalcores = 2        //This MUST be a global to work
                      parallel init $totalcores
                      set obs 1000
                      gen y = runiform()
                      gen x = runiform()
                      gen id = _n
                      save maindata, replace        //Must be stored, so each instance can start with a complete dataset
                      
                      program parfor
                          use maindata, clear
                          di $pll_instance
                          local inst = $pll_instance - 1        //Parallel starts enumerating at 1, so this is required for mod() to work
                          forvalues i = 1/10 {        //Here you define your main loop
                              if mod(`i', $totalcores) == `inst' {
                                  reg y x if id > `i'        //Some condition I guess
                                  estimates save test_`i'
                              }
                          }
                      end
                      
                      
                      parallel, prog(parfor): parfor
                      Best wishes

                      Stata 18.0 MP | ORCID | Google Scholar

                      Comment

                      Working...
                      X