Difference between use using if and use ... keep if ...

Jan Ringling

Join Date: Jul 2024

Posts: 3
#1

20 Jul 2024, 08:45

I was wondering how sample selection with "use using (dataset) if (condition)" differs from "use (dataset)" followed by "keep if (condition)".

For example, in the following code:

Code:

clear
set obs 10
gen var1 = _n
gen var2 = _n * 2
gen var3 = _n * 3
gen var4 = _n * 4
gen var5 = _n * 5
tempfile test_data
save `test_data', replace

How does the implementation differ between

Code:

use `test_data', clear
keep if var1>=5

and

Code:

use using `test_data' if var1>=5, clear

Is there any efficiency / speed / memory advantage, or is the latter essentially doing the former in a more compact syntax? It obviously doesn't matter for such a small dataset, but when I have a couple of million observations, a quicker selection would be amazing.

Full code for easy copying:

Code:

clear
set obs 10
gen var1 = _n
gen var2 = _n * 2
gen var3 = _n * 3
gen var4 = _n * 4
gen var5 = _n * 5
tempfile test_data
save `test_data', replace

use `test_data', clear
keep if var1>=5

use using `test_data' if var1>=5, clear

Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#2

20 Jul 2024, 09:58

Judging from my knowledge of Stata and Python, Stata still needs to read in the data to know which rows to delete. So I'm not sure there's an obvious speed gain from either approach, but I'm more than open to being corrected, perhaps by someone from StataCorp.

Tiago Pereira

Join Date: Jan 2016

Posts: 389
#3

20 Jul 2024, 10:36

Small files

Code:

timer clear 1
timer clear 2

timer on 1
set seed 123456

forvalues i = 1/100000 {
    qui clear
    qui set obs 10
    gen var1 = _n
    gen var2 = _n * 2
    gen var3 = _n * 3
    gen var4 = _n * 4
    gen var5 = _n * 5
    tempfile test_data
    qui save `test_data', replace
    qui use `test_data', clear
    qui keep if var1 >= 5
}

timer off 1

timer on 2
set seed 123456

forvalues i = 1/100000 {
    qui clear
    qui set obs 10
    gen var1 = _n
    gen var2 = _n * 2
    gen var3 = _n * 3
    gen var4 = _n * 4
    gen var5 = _n * 5
    tempfile test_data
    qui save `test_data', replace
    qui use using `test_data' if var1 >= 5, clear
}

timer off 2
timer list 1
timer list 2

Code:

. timer list 1
   1:     10.57 /        1 =      10.5690

. timer list 2
   2:     10.48 /        1 =      10.4760

Large files

Code:

timer clear 1
timer clear 2

timer on 1
set seed 123456

forvalues i = 1/100 {
    qui clear
    qui set obs 1000000
    gen var1 = _n
    gen var2 = _n * 2
    gen var3 = _n * 3
    gen var4 = _n * 4
    gen var5 = _n * 5
    tempfile test_data
    qui save `test_data', replace
    qui use `test_data', clear
    qui keep if var1 >= 50000
}

timer off 1

timer on 2
set seed 123456

forvalues i = 1/100 {
    qui clear
    qui set obs 1000000
    gen var1 = _n
    gen var2 = _n * 2
    gen var3 = _n * 3
    gen var4 = _n * 4
    gen var5 = _n * 5
    tempfile test_data
    qui save `test_data', replace
    qui use using `test_data' if var1 >= 50000, clear
}

timer off 2
timer list 1
timer list 2

Code:

. timer list 1
   1:      7.88 /        1 =       7.8820

. timer list 2
   2:      8.63 /        1 =       8.6280

Hence, both approaches seem to be comparable in terms of speed/time.
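Note that the timers above also cover generating and saving the data on every iteration, which may mask any difference in the load step itself. A minimal sketch that builds the file once and times only the two loading strategies could look like the following. The file size, the cutoff of 500000, and the 20 repetitions are illustrative assumptions, not values from the runs above, and the -if- qualifier is written in the documented order (before -using-).

```stata
* Sketch only: sizes, cutoff, and repetition count are illustrative assumptions.
* Build the test file once, outside the timed region.
clear
set obs 1000000
gen var1 = _n
gen var2 = _n * 2
gen var3 = _n * 3
gen var4 = _n * 4
gen var5 = _n * 5
tempfile test_data
quietly save `test_data'

timer clear
forvalues i = 1/20 {
    * Timer 1: load the whole file, then drop the unwanted observations
    timer on 1
    quietly use `test_data', clear
    quietly keep if var1 >= 500000
    timer off 1

    * Timer 2: filter while loading
    timer on 2
    quietly use if var1 >= 500000 using `test_data', clear
    timer off 2
}

* Each line reports total time / number of runs = average per run
timer list
```

Because each timer is switched on and off inside the loop, -timer list- reports an average per load rather than one total that includes data creation.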

Clyde Schechter

Join Date: Apr 2014

Posts: 30114
#4

20 Jul 2024, 13:21

The places where you can get a substantial speed-up with -use ... using...- in reading large files are:

1. -use varlist using some_file- will be appreciably faster than -use some_file- followed by -keep varlist- if the variables in varlist take up substantially less memory than those in the full file.

2. -use in #/# using some_file- will be appreciably faster than -use some_file- followed by -keep in #/#-.

As noted in #2, Stata still needs to read every observation to see which ones satisfy the -if- condition. But Stata does not need to read any excluded observations to apply an -in- restriction. In fact, this applies more generally: any command that can be set up using -in- will be faster than the corresponding one using -if-. If that command is repeated many times, the time savings can be appreciable, even astronomical, because you change an O(N) process into an O(1) process.
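The two partial-load forms described above can be sketched as follows. This is an illustration only: the tempfile name big, the file size, and the ranges are hypothetical, and how much time is actually saved will depend on the file and the machine.

```stata
* Sketch only: the tempfile name, size, and ranges are hypothetical.
clear
set obs 1000000
forvalues j = 1/5 {
    gen var`j' = _n * `j'
}
tempfile big
quietly save `big'

* Load only two of the five variables
use var1 var2 using `big', clear
describe, short

* Load only observations 1 through 1000
use in 1/1000 using `big', clear
count

* Both restrictions can be combined in one command
use var1 in 1/1000 using `big', clear
describe, short
```

After the last command, only var1 for the first 1000 observations is in memory.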
Tiago Pereira

Join Date: Jan 2016

Posts: 389
#5

21 Jul 2024, 05:07

I have consistently used

Code:

import delimited using

when dealing with a large number of simulations (e.g., 10^10 or more). It is the fastest way to import data, as far as I can tell. Clyde has contributed a lot to this discussion previously: https://www.statalist.org/forums/for...asets-in-stata