
  • Hardware to optimize processing large text files

    Hi list, I'm buying new machines for my lab and would like to optimize the speed at which we process large datasets. At the moment one of our projects generates 5+ GB of nested JSON with each data export, and a 10-core iMac Pro takes about 20 minutes to process the raw data into something useful. I'm wondering whether there's any general principle for increasing processing speed in Stata, since we need to run this step often: should I increase the number of cores? The clock speed of each core? Does it matter if I add a high-end GPU to the machine, or does Stata not use GPUs to boost performance? Are there any other items I should deal with on the hardware end? (Memory isn't an issue; our machines all have 64+ GB of RAM.)

    thanks!

  • #2
    For an authoritative answer to this question, I'd suggest you address it to Stata Technical Services.

    The document at the following link, last updated in late 2017, discusses Stata MP performance in great detail.

    https://www.stata.com/statamp/statamp-20171003.pdf



    • #3
      There might also be something to learn in Joseph Canner and Eric Schneider's "Optimizing Stata for Analysis of Large Data Sets." https://econpapers.repec.org/paper/bocnorl13/10.htm Sorry, I don't have a more formal citation to hand.

      Another source on this topic is https://www.nber.org/stata/efficient/
      What I've read in these sources has suggested choices and strategies that surprised me.



      • #4
        Before Stata: could you skip JSON altogether, or use a faster JSON parser?
        In Stata: is the file to be read by Stata a single JSON file, or one or several flat files? Which Stata command do you use to read the file(s)?
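
        To illustrate the flat-file suggestion: one common approach is to flatten the nested JSON into a delimited file outside Stata, since a flat CSV typically imports much faster than parsed JSON. Below is a minimal Python sketch of that idea; the record shape and field names are hypothetical, and this is only one way to do the flattening, not a description of the original poster's pipeline.

        ```python
        import csv
        import json

        def flatten(obj, prefix=""):
            """Recursively flatten a nested dict into a single-level dict
            with dot-separated keys; lists are kept as JSON strings."""
            flat = {}
            for key, value in obj.items():
                name = f"{prefix}{key}"
                if isinstance(value, dict):
                    flat.update(flatten(value, name + "."))
                elif isinstance(value, list):
                    flat[name] = json.dumps(value)
                else:
                    flat[name] = value
            return flat

        def json_to_csv(records, outfile):
            """Write an iterable of JSON objects to a flat CSV file
            whose columns are the union of all flattened keys."""
            rows = [flatten(r) for r in records]
            fieldnames = sorted({k for row in rows for k in row})
            with open(outfile, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(rows)
        ```

        The resulting CSV could then be read into Stata with import delimited, which avoids parsing the nested structure on every run.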

        You may contact [email protected].



        • #5
          Thanks for these resources. Yes, it does seem that the choices of specific commands have surprising effects. I've also found that some commands are an order of magnitude faster on a 10-core vs 4-core machine, whereas others run comparably fast regardless of the machine's performance; this is consistent with what's in Stata's official documentation. I think I'll have to check out the optimization papers in more detail to see where I can shave off time in my code while also figuring out whether I can speed things up with a fancier machine.

          There's only a small amount of preprocessing that I can do before the data get into Stata, as we need the datasets to serve a few functions: in addition to providing results to do science with, the JSON output serves as the "official" record of what each participant did in a study, so ideally we aren't doing much with the raw output from our experiments before doing actual analyses. (This is for data generated on themusiclab.org, if anyone's interested in being a citizen scientist in our studies.)

          For what it's worth, my collaborators who are active in the R community think I'm rather a troglodyte for doing analyses in Stata, but in my (limited) piloting of similar preprocessing in R, things were quite a bit faster in Stata. I'm not a strong enough R user to make a general claim that Stata is faster at this kind of processing, but the initial runs were enough slower in R that I just went with Stata for this set of tasks. If our site starts generating enough data that my current processing takes on the order of several days rather than 20-30 minutes, then I suppose I'll reconsider, as I could run the processing in R on a cluster rather than on local machines, in a fashion that I think Stata probably can't handle. But for now I'm pretty happy with Stata for these tasks.
