We have a large amount of data and tables stored in text files that we want to write into Stata data files for further processing. I had always assumed Mata would be faster for this, especially if everything is collected in one matrix and then exported to Stata once (with -getmata-) rather than through a loop of many -replace- statements. However, in testing the sandbox code below, collecting the lines in a Mata matrix and then calling -getmata- turns out to be much slower.
Any thoughts or guidance on a better workflow to speed this up or optimize it via Mata? (N.B., I'm very new to Mata.)
Also, if the bottleneck here is actually our use of -file read-, we are open to other, faster approaches.
Example:
Code:
********************Creating file with a bunch of strings similar to our external data:
clear all
gen input=""
set trace off
forvalues i=1/10000 {
    set obs `=_N+1'
    * ceil() keeps the word index in 1..26; trunc() could return 0, and word 0 is empty
    replace input="`:word `=ceil(runiform()*26)' of `c(ALPHA)''`:word `=ceil(runiform()*26)' of `c(ALPHA)''`:word `=ceil(runiform()*26)' of `c(ALPHA)''`:word `=ceil(runiform()*26)' of `c(ALPHA)''`:word `=ceil(runiform()*26)' of `c(ALPHA)''" in `i'
}
file open test using test.txt, write replace
forvalues i=1/10000 {
    file write test "`=input[`i']'" _n
}
file close test
type test.txt, lines(10)
********************Testing original method: dump each line of the file into an observation in the dataset
clear
timer on 1
gen input=""
file open test using "test.txt", read
local runct=0
file read test line
while r(eof)==0 {
    local ++runct
    qui set obs `=_N+1'
    replace input=`"`line'"' in `runct'
    file read test line
}
timer off 1
* close the handle so the block can be rerun without a "file already open" error
file close test
type test.txt, lines(10)
********************Testing new method: append to a Mata string vector and then convert into observations
clear
timer on 2
file close _all
file open testm using "test.txt", read
* start from a 0 x 1 string vector so no empty first row ends up in the data
mata: input=J(0, 1, "")
file read testm line
while r(eof)==0 {
    mata: input=input \ `"`line'"'
    file read testm line
}
file close testm
getmata input
timer off 2
mata: input
desc
timer list 1
timer list 2
******************************Testing for numbers to see if it is the string slowing things down*******************************************
clear
gen input=.
set trace off
forvalues i=1/1000 {
    set obs `=_N+1'
    replace input=`=trunc(runiform()*26)' in `i'
}
file open test2 using test2.txt, write replace
forvalues i=1/1000 {
    file write test2 "`=input[`i']'" _n
}
file close test2
clear
timer on 3
gen input=.
file open test2 using "test2.txt", read
local runct=0
file read test2 line
while r(eof)==0 {
    local ++runct
    qui set obs `=_N+1'
    replace input=`line' in `runct'
    file read test2 line
}
timer off 3
* close the handle so the block can be rerun without a "file already open" error
file close test2
clear
timer on 4
file open testm2 using "test2.txt", read
* start from a 0 x 1 real vector so every line is appended the same way
mata: input=J(0, 1, .)
file read testm2 line
while r(eof)==0 {
    mata: input=input \ `line'
    file read testm2 line
}
file close testm2
getmata input
timer off 4
timer list 1
timer list 2
timer list 3
timer list 4
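
One alternative that might be worth timing alongside the four tests above, offered as a rough sketch rather than a tested solution: Mata's cat() function reads an entire text file into a string column vector in one call. That avoids both the -file read- loop and the repeated `input \ ...` appends, where each append copies the whole vector, so the work grows roughly with the square of the number of lines. Timer number 5 is an arbitrary choice that is not used above, and the sketch assumes test.txt from the first block still exists in the working directory.

```stata
* Sketch only: read test.txt in a single Mata call instead of line by line
clear
timer on 5
* cat() returns a string column vector with one element per line of the file
mata: input = cat("test.txt")
* as in tests 2 and 4, getmata turns the Mata vector into the variable input
getmata input
timer off 5
timer list 5
desc
```

If this turns out to be faster, the remaining per-line parsing could presumably be done on the vector inside Mata before the single -getmata- call, but that is a guess and would need to be timed on the real files.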
