Is -fastsample- still available? (A faster alternative to -sample- for large datasets)

Tiago Pereira

Join Date: Jan 2016

Posts: 409
#1

Is -fastsample- still available? (A faster alternative to -sample- for large datasets)

Yesterday, 06:11

Does anyone know of a faster (perhaps more recent) alternative to -sample- for large datasets?

I could not find the 2014 package:
https://www.stata.com/statalist/arch.../msg00008.html
Tags: None

Dirk Enzmann

Join Date: Apr 2014
Posts: 579

Yesterday, 07:56

Perhaps you can use the Stata code by Andrew Maurer (who wrote `fastsample`) provided in the Statalist link you showed:

Code:

************** Define mata functions **********************
mata

void fastbsample(real scalar n)
// faster alternative to stata's bsample.
// .2s vs 55.6s in one test
{
    
    real scalar origN
    real vector allnum, allstr
    
    // check for errors
    if ((n-ceil(n)!=0) | (n <= 0)) _error(9,"n must be a positive integer")
    if (st_nobs() == 0) _error(9,"cannot sample empty dataset")
    
    // declare objects
    allnum = J(1,0,.)
    allstr = J(1,0,.)
    v = st_nvar()

    // separate string and numeric variables
    for (i=1;i<=v;i++) if (st_isnumvar(i)==1) allnum = allnum, i; else allstr = allstr, i;
    
    st_view(Nvars=.,.,allnum)
    st_sview(Svars=.,.,allstr)
    
    origN = rows(Nvars)
    
    // manually add extra obs if origN < n
    if (origN < n) {
        st_addobs(n - origN,1)
        st_view(Nvars=.,.,allnum)
        st_sview(Svars=.,.,allstr)
    }

    /*
    Below: slightly less efficient to store vector obstokeep than direct
    subscripting. We only have to if Nvars and Svars are both nonempty
    */
    if (cols(allstr) == 0) Nvars[|1,.\n,.|] = Nvars[ceil(origN*runiform(n,1)),.]
    else if (cols(allnum) == 0) Svars[|1,.\n,.|] = Svars[ceil(origN*runiform(n,1)),.]
    else {
        real vector obstokeep
        obstokeep = ceil(origN*runiform(n,1))
        Nvars[|1,.\n,.|] = Nvars[obstokeep,.]
        Svars[|1,.\n,.|] = Svars[obstokeep,.]
    }

    st_keepobsin((1,n))
    
}

void fastsample(real scalar N)
// faster alternative to stata's sample.
// 1.6s vs 58.9s in one test
{
    
    real scalar origN, L, n, i
    real vector allnum, allstr, obstokeep
    
    // check for errors
    L = st_nobs()
    if ((N-ceil(N)!=0) | (N <= 0)) _error(9,"N must be a positive integer")
    if (L < N) _error(9,"cannot sample more observations than entire dataset")
    
    // initialize index    
    I = J(L,1,0) // index of rows to keep
    i = 0

    // first try - there could be collisions (the same row being called twice)
    obstokeep = ceil(L*runiform(N,1))
    I[obstokeep] = J(N,1,1)

    // iterate until we have a sample of N index values of L
    while (n!=0) {
        R = selectindex(!I) // remaining indices that may be chosen
        l = length(R) // total subindices
        n = N - (L-l) // remaining obs to get

        obstokeep = R[ceil(l*runiform(n,1))]
        I[obstokeep] = J(n,1,1)
        i++
    }
    printf("total iterations: %f\n", i)

    obstokeep = selectindex(I)
    st_keepobsin(obstokeep)
        
}

end
************** End function definitions *******************


*********** Benchmark Stata's bsample vs fastbsample ********
local reps 5
tempfile temp

set obs 10000000
gen r=runiform()
save `temp', replace


forval i = 1/`reps' {
    use `temp', clear
    timer on 1
    bsample 5000
    //sample 5000, count
    timer off 1
}

forval i = 1/`reps' {
    use `temp', clear
    timer on 2
    mata: fastbsample(5000)
    //mata: fastsample(5000)
    timer off 2
}

timer list
*********** End benchmark ***********************************

I assume that `fastsample` did essentially use these Mata functions.

I think I could put it into an .ado program, but I am sure others here will be able to do this much faster and more efficiently.

Last edited by Dirk Enzmann; Yesterday, 08:12. Reason: Added suggestion to write a user-written program.

Comment

Tiago Pereira

Join Date: Jan 2016

Posts: 409
#3

Yesterday, 08:29

Thanks, Dirk. I would need some time to check that implementation. I am not sure yet if i is completely correct or if it has any issues or biases. Would you feel confident just copying, pasting, and running that program as is?
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 579
#4

Yesterday, 08:57

A good question -- others are certainly better placed to judge that. And I think that integration into a user-written program shouldn't be particularly complicated. Perhaps someone would consider this an interesting challenge?
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 743
#5

Yesterday, 10:05

I also did a bit of Googling today but did not find any dofiles. Somewhere I read that a C++ plugin is involved, so maybe the implementation might be a bit more difficult. I suggest to either email the original author or contact KitBaum. Apparently, the ados were on ssc in the past, maybe they are in some archives?

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 579
#6

Yesterday, 10:50

Although not sure I don't believe that the "original" fastsample explicitely used a C++ plugin. I tried the two Mata functions with 10,000,000 cases and compared them each to sample and bsample drawing 5,000 cases for each (with set seed 12345 and each time new start of Stata). On my computer (what ever) and using Stata SE, the results are:

Code:

* sample vs. fastsample: . timer list 1: 20.93 / 5 = 4.1862 2: 1.31 / 5 = 0.2626 * bsample vs. bfastsample: . timer list 1: 43.47 / 5 = 8.6934 2: 0.38 / 5 = 0.0750

To validate I would compare the randomness of sampling using sample (or bsample) to the Mata function `fastsample` (or `fastbsample`).
2 likes
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2423
#7

Yesterday, 13:41

I'd also point here to -gsample-, -net describe gsample-
That package is versatile in various ways, and on my machine completed a trimmed-down version of the task from above in about 1 sec. vs. 14 sec. It also can create samples by generating a frequency variable, which should be faster by avoiding the need to clear/use the entire data set before each sampling, a time cost not accounted for in the example above.
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 579
#8

Yesterday, 14:14

Mike Lacy : Very promising! Do you (or anybody else) see any possibility to use this with bootstrap -- or is there an example on how to use it instead of bootstrap?
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2423

Yesterday, 15:12

Thanks Dirk Enzmann. I was thinking of something like this:

Code:

sysuse auto, clear
local reps = 1000
local N = _N
// just for demo
gsample `N', gen(freq)
tab freq
drop freq
//
cap frame drop bootframe
frame create bootframe double mean
forval i = 1/`reps' {
   gsample `N', gen(freq)
   qui summ price [fweight=freq]
   frame post bootframe (r(mean))
   if mod(`i', 100) == 0 di "." _continue
   drop freq
}[
di ""
frame bootframe: summ mean
di "Bootstrap SE = " r(sd)

This casual example was much slower than -bootstrap-, but I suppose that might not be true with a large sample size, where use/clear could be slow.

Last edited by Mike Lacy; Yesterday, 15:16.

Comment

Tiago Pereira

Join Date: Jan 2016

Posts: 409
#10

Yesterday, 15:53

Thank you, Dirk, Felix and Mike, for the insightful discussion.
Comment

Announcement