Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Is -fastsample- still available? (A faster alternative to -sample- for large datasets)

    Does anyone know of a faster (perhaps more recent) alternative to -sample- for large datasets?

    I could not find the 2014 package:
    https://www.stata.com/statalist/arch.../msg00008.html

  • #2
    Perhaps you can use the Stata code by Andrew Maurer (who wrote `fastsample`) provided in the Statalist link you showed:
    Code:
    ************** Define mata functions **********************
    mata
    
    void fastbsample(real scalar n)
    // faster alternative to stata's bsample.
    // .2s vs 55.6s in one test
    {
        
        real scalar origN
        real vector allnum, allstr
        
        // check for errors
        if ((n-ceil(n)!=0) | (n <= 0)) _error(9,"n must be a positive integer")
        if (st_nobs() == 0) _error(9,"cannot sample empty dataset")
        
        // declare objects
        allnum = J(1,0,.)
        allstr = J(1,0,.)
        v = st_nvar()
    
        // separate string and numeric variables
        for (i=1;i<=v;i++) if (st_isnumvar(i)==1) allnum = allnum, i; else allstr = allstr, i;
        
        st_view(Nvars=.,.,allnum)
        st_sview(Svars=.,.,allstr)
        
        origN = rows(Nvars)
        
        // manually add extra obs if origN < n
        if (origN < n) {
            st_addobs(n - origN,1)
            st_view(Nvars=.,.,allnum)
            st_sview(Svars=.,.,allstr)
        }
    
        /*
        Below: slightly less efficient to store vector obstokeep than direct
        subscripting. We only have to if Nvars and Svars are both nonempty
        */
        if (cols(allstr) == 0) Nvars[|1,.\n,.|] = Nvars[ceil(origN*runiform(n,1)),.]
        else if (cols(allnum) == 0) Svars[|1,.\n,.|] = Svars[ceil(origN*runiform(n,1)),.]
        else {
            real vector obstokeep
            obstokeep = ceil(origN*runiform(n,1))
            Nvars[|1,.\n,.|] = Nvars[obstokeep,.]
            Svars[|1,.\n,.|] = Svars[obstokeep,.]
        }
    
        st_keepobsin((1,n))
        
    }
    
    void fastsample(real scalar N)
    // faster alternative to stata's sample.
    // 1.6s vs 58.9s in one test
    {
        
        real scalar origN, L, n, i
        real vector allnum, allstr, obstokeep
        
        // check for errors
        L = st_nobs()
        if ((N-ceil(N)!=0) | (N <= 0)) _error(9,"N must be a positive integer")
        if (L < N) _error(9,"cannot sample more observations than entire dataset")
        
        // initialize index    
        I = J(L,1,0) // index of rows to keep
        i = 0
    
        // first try - there could be collisions (the same row being called twice)
        obstokeep = ceil(L*runiform(N,1))
        I[obstokeep] = J(N,1,1)
    
        // iterate until we have a sample of N index values of L
        while (n!=0) {
            R = selectindex(!I) // remaining indices that may be chosen
            l = length(R) // total subindices
            n = N - (L-l) // remaining obs to get
    
            obstokeep = R[ceil(l*runiform(n,1))]
            I[obstokeep] = J(n,1,1)
            i++
        }
        printf("total iterations: %f\n", i)
    
        obstokeep = selectindex(I)
        st_keepobsin(obstokeep)
            
    }
    
    end
    ************** End function definitions *******************
    
    
    *********** Benchmark Stata's bsample vs fastbsample ********
    local reps 5
    tempfile temp
    
    set obs 10000000
    gen r=runiform()
    save `temp', replace
    
    
    forval i = 1/`reps' {
        use `temp', clear
        timer on 1
        bsample 5000
        //sample 5000, count
        timer off 1
    }
    
    forval i = 1/`reps' {
        use `temp', clear
        timer on 2
        mata: fastbsample(5000)
        //mata: fastsample(5000)
        timer off 2
    }
    
    timer list
    *********** End benchmark ***********************************
    I assume that `fastsample` did essentially use these Mata functions.

    I think I could put it into an .ado program, but I am sure others here will be able to do this much faster and more efficiently.
    Last edited by Dirk Enzmann; Yesterday, 08:12. Reason: Added suggestion to write a user-written program.

    Comment


    • #3
      Thanks, Dirk. I would need some time to check that implementation. I am not sure yet if i is completely correct or if it has any issues or biases. Would you feel confident just copying, pasting, and running that program as is?

      Comment


      • #4
        A good question -- others are certainly better placed to judge that. And I think that integration into a user-written program shouldn't be particularly complicated. Perhaps someone would consider this an interesting challenge?

        Comment


        • #5
          I also did a bit of Googling today but did not find any dofiles. Somewhere I read that a C++ plugin is involved, so maybe the implementation might be a bit more difficult. I suggest to either email the original author or contact KitBaum. Apparently, the ados were on ssc in the past, maybe they are in some archives?
          Best wishes

          Stata 18.0 MP | ORCID | Google Scholar

          Comment


          • #6
            Although not sure I don't believe that the "original" fastsample explicitely used a C++ plugin. I tried the two Mata functions with 10,000,000 cases and compared them each to sample and bsample drawing 5,000 cases for each (with set seed 12345 and each time new start of Stata). On my computer (what ever) and using Stata SE, the results are:
            Code:
            * sample vs. fastsample:
            . timer list
               1:     20.93 /        5 =       4.1862
               2:      1.31 /        5 =       0.2626
            
            * bsample vs. bfastsample:
            . timer list
               1:     43.47 /        5 =       8.6934
               2:      0.38 /        5 =       0.0750
            To validate I would compare the randomness of sampling using sample (or bsample) to the Mata function `fastsample` (or `fastbsample`).

            Comment


            • #7
              I'd also point here to -gsample-, -net describe gsample-
              That package is versatile in various ways, and on my machine completed a trimmed-down version of the task from above in about 1 sec. vs. 14 sec. It also can create samples by generating a frequency variable, which should be faster by avoiding the need to clear/use the entire data set before each sampling, a time cost not accounted for in the example above.

              Comment


              • #8
                Mike Lacy : Very promising! Do you (or anybody else) see any possibility to use this with bootstrap -- or is there an example on how to use it instead of bootstrap?

                Comment


                • #9
                  Thanks Dirk Enzmann. I was thinking of something like this:
                  Code:
                  sysuse auto, clear
                  local reps = 1000
                  local N = _N
                  // just for demo
                  gsample `N', gen(freq)
                  tab freq
                  drop freq
                  //
                  cap frame drop bootframe
                  frame create bootframe double mean
                  forval i = 1/`reps' {
                     gsample `N', gen(freq)
                     qui summ price [fweight=freq]
                     frame post bootframe (r(mean))
                     if mod(`i', 100) == 0 di "." _continue
                     drop freq
                  }[
                  di ""
                  frame bootframe: summ mean
                  di "Bootstrap SE = " r(sd)
                  This casual example was much slower than -bootstrap-, but I suppose that might not be true with a large sample size, where use/clear could be slow.
                  Last edited by Mike Lacy; Yesterday, 15:16.

                  Comment


                  • #10
                    Thank you, Dirk, Felix and Mike, for the insightful discussion.

                    Comment

                    Working...
                    X