  • How can I make this faster?

    Hello everyone,

    I am working with a large dataset and need to run the same task many times, but the script I have written for it is quite slow. Do you have any suggestions for making it faster? A toy example of the task:
    Code:
    clear
    set obs 50
    generate time = _n
    expand 2, generate(kind)
    expand 100, generate(copied)
    generate v1 = runiform(0,1)
    generate v2 = runiform(0,1)
    generate v3 = runiform(0,1)
    
    // Read time range
    summarize time
    local tmin = `r(min)'
    local tmax = `r(max)'
    
    // Iterate over all variables
    ds time kind copied, not
    foreach v of varlist `r(varlist)' {
      // Create empties to store estimates
      generate pe_`v' = .
      generate lb_`v' = .
      generate ub_`v' = .
    
      // Iterate over kind and time
      forvalues k = 0/1 {
        forvalues t = `tmin'/`tmax' {
          // Point estimate
          * Check is non-missing
          count if `v' != . & copied == 0 & time == `t' & kind == `k'
          if (`r(N)' > 0) { // Not all are missing
            summarize `v' if copied == 0 & time == `t' & kind == `k'
            replace pe_`v' = `r(mean)' if time == `t' & kind == `k'
          }
    
          // Confidence intervals
          * Check is non-missing
          count if `v' != . & copied > 0 & time == `t' & kind == `k'
          if (`r(N)' > 0) { // Not all are missing
            _pctile `v' if copied > 0 & time == `t' & kind == `k', p(2.5 97.5)
            replace lb_`v' = `r(r1)' if time == `t' & kind == `k'
            replace ub_`v' = `r(r2)' if time == `t' & kind == `k'
          }
        }
      }
    }
    
    // Keep relevant variables & observations
    keep time kind pe_* lb_* ub_*
    bysort time kind: keep if _n == 1
    My real task has many more observations and variables. I suspect that bysort: could make this faster, but I cannot see how to apply it here.

    All suggestions are welcome.

    Thanks!

  • #2
    Arnau, the code below may substantially cut the running time.

    Code:
    clear
    set obs 50
    generate time = _n
    expand 2, generate(kind)
    expand 100, generate(copied)
    generate v1 = runiform(0,1)
    generate v2 = runiform(0,1)
    generate v3 = runiform(0,1)
    
    
    sort time kind copied
    
    foreach v of varlist v1-v3 {
        by time kind: egen pe_`v' = mean(`v') if copied == 0
    
        by time kind: egen lb_`v' = pctile(`v') if copied > 0, p(2.5)
        by time kind: replace lb_`v' = lb_`v'[_N] if copied == 0
    
        by time kind: egen ub_`v' = pctile(`v') if copied > 0, p(97.5)
        by time kind: replace ub_`v' = ub_`v'[_N] if copied == 0
    }
    
    keep time kind pe_* lb_* ub_*
    by time kind: keep if _n == 1
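
    Why the by: prefix can replace the loops: under by time kind:, egen computes the statistic once per group and writes it to every observation in that group that satisfies the if condition. Observations excluded by if are left missing, which is why lb_ and ub_ are copied into the copied == 0 rows afterwards. A minimal sketch on made-up data (the names group, x, m, and m_big are illustrative only and not part of the dataset above):

    Code:
    clear
    input group x
    1 10
    1 20
    2 5
    2 15
    end

    // One mean per group, filled into every row of that group
    bysort group: egen m = mean(x)
    assert m == 15 if group == 1
    assert m == 10 if group == 2

    // With an if qualifier, excluded rows stay missing
    bysort group: egen m_big = mean(x) if x > 5
    assert missing(m_big) if x <= 5
    assert m_big == 15 if group == 1
    assert m_big == 15 if group == 2 & x > 5

    If every assert passes silently, the group-wise fill behaves as described.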



    • #3
      Awesome! It is faster. For comparison (I added a few missing values, as is the case in my real data):
      Code:
      timer clear
      
      timer on 1
      clear
      set obs 50
      generate time = _n
      expand 2, generate(kind)
      expand 100, generate(copied)
      generate v1 = runiform(0,1)
      generate v2 = runiform(0,1)
      generate v3 = runiform(0,1)
      generate v4 = runiform(0,1)
      replace v4 = . if time > 20 & time < 25
      
      
      sort time kind copied
      
      ds time kind copied, not
      foreach v of varlist `r(varlist)' {
          by time kind: egen pe_`v' = mean(`v') if copied == 0
      
          by time kind: egen lb_`v' = pctile(`v') if copied > 0, p(2.5)
          by time kind: replace lb_`v' = lb_`v'[_N] if copied == 0
      
          by time kind: egen ub_`v' = pctile(`v') if copied > 0, p(97.5)
          by time kind: replace ub_`v' = ub_`v'[_N] if copied == 0
      }
      
      keep time kind pe_* lb_* ub_*
      by time kind: keep if _n == 1
      timer off 1
      
      timer on 2
      clear
      set obs 50
      generate time = _n
      expand 2, generate(kind)
      expand 100, generate(copied)
      generate v1 = runiform(0,1)
      generate v2 = runiform(0,1)
      generate v3 = runiform(0,1)
      generate v4 = runiform(0,1)
      replace v4 = . if time > 20 & time < 25
      
      // Read time range
      summarize time
      local tmin = `r(min)'
      local tmax = `r(max)'
      
      // Iterate over all variables
      ds time kind copied, not
      foreach v of varlist `r(varlist)' {
        // Create empties to store estimates
        generate pe_`v' = .
        generate lb_`v' = .
        generate ub_`v' = .
      
        // Iterate over kind and time
        forvalues k = 0/1 {
          forvalues t = `tmin'/`tmax' {
            // Point estimate
            * Check is non-missing
            count if `v' != . & copied == 0 & time == `t' & kind == `k'
            if (`r(N)' > 0) { // Not all are missing
              summarize `v' if copied == 0 & time == `t' & kind == `k'
              replace pe_`v' = `r(mean)' if time == `t' & kind == `k'
            }
      
            // Confidence intervals
            * Check is non-missing
            count if `v' != . & copied > 0 & time == `t' & kind == `k'
            if (`r(N)' > 0) { // Not all are missing
              _pctile `v' if copied > 0 & time == `t' & kind == `k', p(2.5 97.5)
              replace lb_`v' = `r(r1)' if time == `t' & kind == `k'
              replace ub_`v' = `r(r2)' if time == `t' & kind == `k'
            }
          }
        }
      }
      
      // Keep relevant variables & observations
      keep time kind pe_* lb_* ub_*
      bysort time kind: keep if _n == 1
      timer off 2
      
      timer list
      The improvement (timer 1 is the by/egen version, timer 2 is my original loop):
      Code:
      . timer list
         1:      0.22 /        1 =       0.2240
         2:      1.95 /        1 =       1.9540
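
      The ratio of the two timings can be computed rather than read off by eye, because timer list leaves each timer's elapsed seconds in r(t1), r(t2), and so on. With the figures above, 1.954 / 0.224 is roughly 8.7. A minimal sketch, in which the sleep calls are only stand-ins for the two versions of the script:

      Code:
      timer clear

      timer on 1
      sleep 100          // stand-in for the faster version (milliseconds)
      timer off 1

      timer on 2
      sleep 300          // stand-in for the slower version
      timer off 2

      // timer list stores the elapsed seconds in r(t1) and r(t2)
      timer list
      display "timer 2 / timer 1 = " %4.1f r(t2)/r(t1)

      Because each timed block above also regenerates the data, the ratio in the real comparison slightly understates the gain in the estimation step itself.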
