We are finding that our runtime slows way down if our dataset is large, even if we are only analyzing a small subset of the data at a time. What can we do about this?
We are developing a new Stata command called mgbe. We are testing it on data from all US counties. There are about 3,100 counties and we have 15 rows of data per county, so in all there are roughly 46,500 rows of data.
We are running the command on one county at a time, like this:
bysort county: mgbe <snip>
And it runs really, really slowly, taking several minutes per county.
But here's the strange thing: if we drop a lot of the data, say keeping only 10 counties, then the same command runs much, much faster, in just a few seconds per county.
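To make the comparison concrete, the timing experiment we ran looks roughly like this (a sketch; the county codes in the keep condition are hypothetical, and <snip> stands for the options we actually pass to mgbe):

```stata
* Time the full ~46,500-row dataset: several minutes per county
timer clear 1
timer on 1
bysort county: mgbe <snip>
timer off 1

* Time a 10-county subset: just a few seconds per county
preserve
keep if inlist(county, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  // hypothetical codes
timer clear 2
timer on 2
bysort county: mgbe <snip>
timer off 2
restore

timer list
```

The per-county work is identical in both runs; only the number of rows sitting in memory differs.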
What's going on? Why does the total number of counties in the dataset affect runtime when we're only running one county at a time? And given that this is the case, is there a trick we can use to speed up runtime without dropping most of the data?
You will notice I haven't provided any detail on what mgbe does. That's not because it's secret. It's because I suspect the issue I'm describing is general and the details of mgbe wouldn't help and might be distracting. I can say, though, that mgbe relies on "ml model" and that it implements the estimator described in this paper: http://smx.sagepub.com/content/early...81175015599807