  • Difference between use using if and use ... keep if ...

    I was wondering how the sample selection with "use using (dataset) if (condition)" vs. "use (dataset) keep (condition)" differs.

    For example, in the following code:

    Code:
    clear
    set obs 10
    gen var1 = _n
    gen var2 = _n * 2
    gen var3 = _n * 3
    gen var4 = _n * 4
    gen var5 = _n * 5
    tempfile test_data
    save `test_data', replace
    How does the implementation differ between

    Code:
    use `test_data', clear
    keep if var1>=5
    and

    Code:
    use using `test_data' if var1>=5, clear
    Is there any efficiency / speed / memory advantage or is the latter essentially doing the former but in a more compact syntax? It obviously doesn't matter for such a small dataset, but when I have a couple million observations a quicker selection would be amazing.

    Full code for easy copying:

    Code:
    clear
    set obs 10
    gen var1 = _n
    gen var2 = _n * 2
    gen var3 = _n * 3
    gen var4 = _n * 4
    gen var5 = _n * 5
    tempfile test_data
    save `test_data', replace
    
    use `test_data', clear
    keep if var1>=5
    
    use using `test_data' if var1>=5, clear

  • #2
    Based on what I know of Stata and Python, Stata still has to read the data in order to know which rows to drop. So I'm not sure there is an obvious speed gain from either approach, but I'm more than open to being corrected, perhaps by someone from StataCorp.


    • #3
      Small files

      Code:
      timer clear 1
      timer clear 2
      
      timer on 1
      set seed 123456
      
      forvalues i = 1/100000 {
          qui clear
          qui set obs 10
          gen var1 = _n
          gen var2 = _n * 2
          gen var3 = _n * 3
          gen var4 = _n * 4
          gen var5 = _n * 5
          tempfile test_data
          qui save `test_data', replace
          qui use `test_data', clear
          qui keep if var1 >= 5
      }
      
      timer off 1
      
      timer on 2
      set seed 123456
      
      forvalues i = 1/100000 {
          qui clear
          qui set obs 10
          gen var1 = _n
          gen var2 = _n * 2
          gen var3 = _n * 3
          gen var4 = _n * 4
          gen var5 = _n * 5
          tempfile test_data
          qui save `test_data', replace
          qui use using `test_data' if var1 >= 5, clear
      }
      
      timer off 2
      timer list 1
      timer list 2

      Code:
      . timer list 1
         1:     10.57 /        1 =      10.5690

      . timer list 2
         2:     10.48 /        1 =      10.4760

      Large files

      Code:
      timer clear 1
      timer clear 2

      timer on 1
      set seed 123456

      forvalues i = 1/100 {
          qui clear
          qui set obs 1000000
          gen var1 = _n
          gen var2 = _n * 2
          gen var3 = _n * 3
          gen var4 = _n * 4
          gen var5 = _n * 5
          tempfile test_data
          qui save `test_data', replace
          qui use `test_data', clear
          qui keep if var1 >= 50000
      }

      timer off 1

      timer on 2
      set seed 123456

      forvalues i = 1/100 {
          qui clear
          qui set obs 1000000
          gen var1 = _n
          gen var2 = _n * 2
          gen var3 = _n * 3
          gen var4 = _n * 4
          gen var5 = _n * 5
          tempfile test_data
          qui save `test_data', replace
          qui use using `test_data' if var1 >= 50000, clear
      }

      timer off 2
      timer list 1
      timer list 2

      Code:
      . timer list 1
         1:      7.88 /        1 =       7.8820

      . timer list 2
         2:      8.63 /        1 =       8.6280

      Hence, both approaches seem to be comparable in terms of speed/time.
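
      One caveat: each timer above also covers generating and saving the data on every iteration, and that work may account for much of the elapsed time. A possible way to isolate the load step is sketched below. The file is built once and only the two load strategies are timed. The repetition count (20) and the cutoff (500000) are arbitrary choices for illustration, and the sketch uses the documented order -use if ... using ...-.

      ```stata
      * Build the test file once, outside the timed region
      clear
      set obs 1000000
      forvalues j = 1/5 {
          gen var`j' = _n * `j'
      }
      tempfile test_data
      save "`test_data'"

      timer clear
      forvalues i = 1/20 {
          * Timer 1: load the whole file, then drop observations
          timer on 1
          quietly use "`test_data'", clear
          quietly keep if var1 >= 500000
          timer off 1

          * Timer 2: apply the -if- condition while loading
          timer on 2
          quietly use if var1 >= 500000 using "`test_data'", clear
          timer off 2
      }

      * Each line reports total seconds / number of runs = average per run
      timer list
      ```

      Because the timers are switched on and off inside the loop, -timer list- reports an average per load rather than a single total. Disk caching can still favor whichever approach runs later, so alternating the two within each iteration, as done here, is only a partial safeguard.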


      • #4
        The places where you can get a substantial speed-up with -use ... using...- in reading large files are:
        1. -use varlist using some_file- will be appreciably faster than -use some_file- followed by -keep varlist- if the variables in varlist take up substantially less memory than those in the full file.
        2. -use in #/# using some_file- will be appreciably faster than -use some_file- followed by -keep in #/#-. As noted in #2, Stata still needs to read every observation to see which ones satisfy the -if- condition. But Stata does not need to read any excluded observations to apply an -in- restriction. In fact, this applies more generally: any command that can be set up using -in- will be faster than the corresponding one using -if-. If that command is repeated many times, the time savings can be appreciable, even astronomical because you change an O(N) process into an O(1) process.
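
        To make the two cases concrete, here is a minimal sketch. The shipped auto dataset stands in for a large file, and the variables and observation range are picked only for illustration.

        ```stata
        * Save a copy of the auto data to stand in for a large file on disk
        sysuse auto, clear
        tempfile big
        save "`big'"

        * Case 1: read only the variables that are needed
        use price mpg using "`big'", clear

        * Case 2: read only a range of observations
        use in 1/10 using "`big'", clear

        * The two restrictions can be combined in a single -use-
        use price mpg in 1/10 using "`big'", clear
        describe, short
        ```

        On a file this small no timing difference will be visible. The point is only the syntax, which carries over unchanged to a file with millions of observations.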


        • #5
          I have consistently used
          Code:
          import delimited using
          when dealing with a large number of simulations (e.g., 10^10 or more). It is the fastest way to import data, as far as I can tell. Clyde contributed a lot to an earlier discussion of this: https://www.statalist.org/forums/for...asets-in-stata
