
  • Error parallelising foreach command

    Hi,

    I'm trying to split a master dataset into its constituent country parts. I'm working with large datasets (~300 GB) that currently take more than 48 hours to run, so any help speeding up the process would be much appreciated.

    Using this data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str20 bvd_id_number strL main_activity str2 countrycode
    "CN9360430024" "Manufacturing" "CN"
    "AU072891993"  "Services"      "AU"
    "US149668182L" "Manufacturing" "US"
    "US133096011L" "Services"      "US"
    "CA32531NC"    "Services"      "CA"
    end

    and this Stata code:

    Code:
    
    //create country list
    glevelsof countrycode, local(countries)
    
    //timer on
    
    timer on 1
    
    parallel: foreach c of local countries {  
        use overviews.dta, clear
        keep if countrycode == "`c'"
        save `c', replace    
    }
    
    timer off 1
    
    timer list 1


    Code:
     //timer on
    . 
    . timer on 1
    
    . 
    . parallel: foreach c of local countries {  
    --------------------------------------------------------------------------------
    Parallel Computing with Stata (by GVY)
    Clusters   : 4
    pll_id     : rp2wznupm1
    Running at : D:\Firmographics\overviews\parallell_test
    Randtype   : datetime
    Waiting for the clusters to finish...
      -3621
    cluster 0004 has exited without error...
      -3621
    cluster 0001 has exited without error...
      -3621
    cluster 0002 has exited without error...
      -3621
    cluster 0003 has exited without error...
    --------------------------------------------------------------------------------
    Enter -parallel printlog #- to checkout logfiles.
    --------------------------------------------------------------------------------
                    unlink():  3621  attempt to write read-only file
    parallel_recursively_rm():     -  function returned error
            parallel_clean():     -  function returned error
                     <istmt>:     -  function returned error
    r(3621);
    
    end of do-file

    Thanks

    Ciaran

  • #2
    Well, I think this computation is probably I/O bound and I'm not sure how much you can speed it up. But there are a few things that can be done. For one thing, you are reading in the entire data set over and over again for each country. Second, you are iterating over the levels of countrycode, requiring you to apply an -if countrycode == "`c'"- qualifier to every observation in the complete data set at each iteration. Both of these problems can be overcome by using -runby-.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str20 bvd_id_number strL main_activity str2 countrycode
    "CN9360430024" "Manufacturing" "CN"
    "AU072891993"  "Services"      "AU"
    "US149668182L" "Manufacturing" "US"
    "US133096011L" "Services"      "US"
    "CA32531NC"    "Services"      "CA"
    end
    
    capture program drop one_country
    program define one_country
        // -runby- calls this program once per by-group, with only that
        // group's observations in memory, so the first value of
        // countrycode identifies the current country
        local c = countrycode[1]
        save `c', replace
        // clear so this group contributes no observations back to the
        // dataset -runby- assembles in memory
        clear
        exit
    end
    
    runby one_country, by(countrycode) status
    -runby- is written by Robert Picard and me, and is available from SSC. Note: the status option causes -runby- to give you a progress report periodically. It will tell you how many countries have been processed so far, in how much time, and give an estimate of the time remaining.

    As I say, given all the file saving you have to do, I'm not sure how much time can be saved here. But at least you'll only have to read the whole big file in once, and you will not have to evaluate any -if- qualifiers. I'm sure it will be noticeable, but it may not be dramatic.
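A quick way to sanity-check the result afterward (a minimal sketch, assuming the master file is `overviews.dta` and the per-country files were saved under their country codes, as in the code above):

Code:
    // Because one_country -clear-s each group, -runby- leaves memory
    // empty when it finishes, so reload the master file first to
    // recover the list of countries.
    use overviews.dta, clear
    levelsof countrycode, local(countries)
    foreach c of local countries {
        // report the variable and observation counts of each saved file
        describe using "`c'.dta", short
    }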



    • #3
      Hi Clyde,

      I tried this on one of the smaller files just now and the speed increase was dramatic! Thank you!
