  • Comparison of `compress` vs disk write/read execution time improvement

    Say my script needs to write a data set to disk, to be picked up later in the process. Now, imagine I want to improve the performance of this script.

    Is it more efficient to:
    • simply save, then use the data set,
    • or compress the data set before saving, then use it?

    In other words, are compress and save/use (plus their underlying disk read/write operations) of the same order of magnitude in execution time?

    The question is whether the time “invested” in compress is paid back by faster save/use thanks to the smaller dataset size.

    I'm interested in a generic answer, but it might help to specify that I have an SSD and “regular” data sets (E+3–8 by E+1).

  • #2
    Welcome to Statalist. You didn't get a quick answer. You will increase your chances of a useful answer by following the FAQ on asking questions.

    I don't quite know why you ask this question when it should be easy for you to answer it empirically on your computer. Run the two approaches on large enough data sets to see the time differences and see which comes out best.

    Comment


    • #3
      Originally posted by Stephan Scheffelberg View Post

      The question is to know whether “investing” time in compress is compensated by improvement in save/use speed due to smaller memory size.

      I'm interested in a generic answer.
      There is probably not a generic answer, only a bunch of case-specific answers. What a lot of folks do (including me) is to compress before saving a large dataset that will be read many more times than it is written. That is very easy (lazy) for the user to do, and likely to speed the read times up.

      The best practice in terms of efficiency (though it requires some user effort) is to remember that Stata defaults to single-precision reals when generating variables, even if you are just creating a binary indicator. In that case, just put byte in your command and you won't need to compress:

      Code:
      . generate byte x = y > z
      So that one is win-win in terms of run-time although it increases the time spent coding (which is often the more valuable of the two).
      Last edited by John Eiler; 01 May 2020, 11:58.

      Comment


      • #4
        John Eiler

        Thank you for the rule of thumb. The more one loads a dataset, the higher the ROI indeed.

        As for defining the correct data type when creating a variable, I agree it is a best practice. In my case, however, I also tend to manipulate strings, so compress still comes in handy when I have to shorten the maximum length of a string variable.
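
        For instance, a minimal sketch (the variable and its contents are made up, purely to illustrate the point) of compress reclaiming space once string contents have been shortened:
        Code:
        clear
        set obs 5
        generate str80 note = "case " + string(_n) + " -- a long comment that gets trimmed later"
        replace note = substr(note, 1, 6)   // keep only the first few characters
        compress note                       // the storage type can now drop from str80 to str6
        describe note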

        ----

        Phil Bromiley

        Thank you for your welcome. I should be grateful if you could indicate which part of the FAQ I have misread, so I can improve my questions in the future. (A PM is fine, if we want to avoid polluting the thread.)

        it should be easy for you to answer it empirically on your computer
        I believe you are overestimating my skills. I agree I could try to test it myself to answer the question, but since I have asked the question, I am afraid this means that I don't know how to do it; would you then be so kind as to offer some pointers in the right direction?
        Last edited by Stephan Scheffelberg; 01 May 2020, 12:49.

        Comment


        • #5
          The output of
          Code:
          help timer
          shows the technique for recording the elapsed time of a sequence of commands. Try generating a large dataset of the size and type of data you expect to have, and then save it both ways.
          Code:
          . clear
          
          . set obs 1000000
          number of observations (_N) was 0, now 1,000,000
          
          . set seed 666
          
          . timer clear
          
          . generate int len = runiformint(1,100)
          
          . generate str300 gnxl = len*"a"
          
          . * and so on for more variables
          . timer on 11
          
          . tempfile a
          
          . save `a'
          file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_03481.00000f saved
          
          . timer off 11
          
          . timer on 21
          
          . compress
            variable len was int now byte
            variable gnxl was str300 now str100
            (201,000,000 bytes saved)
          
          . tempfile b
          
          . save `b'
          file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_03481.00000g saved
          
          . timer off 21
          
          . timer on 12
          
          . use `a', clear
          
          . timer off 12
          
          . timer on 22
          
          . use `b', clear
          
          . timer off 22
          
          . timer list
            11:      0.30 /        1 =       0.2970
            12:      0.17 /        1 =       0.1740
            21:      0.57 /        1 =       0.5730
            22:      0.04 /        1 =       0.0440
          
          .
          and in this case it took 0.276 seconds more to do the compress and save, which cut 0.130 seconds off the read time. So whether compress is worth doing depends on how many times you're going to read the dataset.
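
          To make the trade-off concrete, the break-even point for this particular run can be worked out directly from the timer listing (0.276 extra seconds to compress and save, 0.130 seconds saved per read):
          Code:
          display (0.573 - 0.297) / (0.174 - 0.044)
          which comes to roughly 2.1, so in this example compress pays for itself once the dataset is read three or more times; with different data and hardware the break-even point will of course differ.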

          As John wrote, there is probably not the generic answer you asked for, just experiential anecdotes. You'll need to gather some anecdotes suited to your particular circumstances to draw a conclusion relevant to those circumstances.

          As Phil wrote, you went almost 24 hours without an answer, which suggests that readers found they could not contribute anything to what you had written. Members generally post only when they can help; the lack of replies suggests that the question did not give anyone enough to work with, given the obvious dependence on the characteristics of your data. Section 12 in the FAQ begins "Help us to help you by producing self-contained questions with reproducible examples that explain your data, your code, and your problem." Your response suggests that you need help doing so, and with that evident, I demonstrated a technique for exploring the problem.

          Comment


          • #6
            Originally posted by Stephan Scheffelberg View Post
            Say my script needs to write a data set on the disk, to be picked up later in the process.
            1. In Stata 16.0 or later you can put your data in a different frame to avoid disk I/O operations; a minimal sketch follows after point 2.

            2. compress deals with inefficiencies of storage. Rather than compressing the data before saving, review the process that generates the data (if any) to see whether that code can be improved (more strongly typed) so that compress will not be needed at all. Often it is simply typing
            Code:
            generate byte dummy=...
            instead of
            Code:
            generate dummy=...
            that can save you tons of space.
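
            To illustrate point 1, here is a minimal sketch of the frame approach (the frame name stash is made up for the example; this requires Stata 16 or later):
            Code:
            frame copy default stash      // keep an in-memory copy instead of saving a tempfile
            * ... destructive work on the data in the default frame ...
            frame change stash            // switch to the stashed copy when it is needed again
            If only some of the variables need to be set aside, frame put with a varlist and the into() option is an alternative.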

            Comment


            • #7
              Thanks for your answers.
              William Lisowski's answer is notably helpful in detailing how to time specific snippets.

              My take on these is:
              • improve the performance of the script by avoiding compress altogether:
                • one can reduce the need to compress by using more strongly typed variables,
                • using frames removes the need to temporarily store datasets on disk in Stata 16+;
              • if compress really is needed, it and the I/O operations (save, use) are of the same order of magnitude in execution time. Thus, “investing” in compress can make sense if:
                • the dataset is bound to be loaded multiple times (greater return for a given investment),
                • optimising disk space is critical.


              FWIW, I phrased my question this way because I come from communities where it is preferred to ask about the “real problem” rather than a sub-issue (to avoid the XY problem and to favour questions of general interest over implementation-specific ones). I also imagined that the answer could have been a common best practice I was unaware of due to my lack of experience.
              Anyway, I now stand corrected — thanks for the explanations!

              Comment


              • #8
                To post #6 from Sergiy Radyakin let me add that if you have string variables whose contents range from very short to very long strings, or which contain long strings with duplicated values, there is perhaps something to be gained by using strL rather than str# for storage, as described in section 12.4.8 of the Stata User's Guide PDF included in your Stata installation and accessible from Stata's Help menu.
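
                For example, a minimal sketch (reusing the gnxl variable from the log in post #5) of switching a variable to strL so that memory use can be compared; whether this actually saves space depends on how much duplication your strings contain:
                Code:
                describe gnxl        // currently stored as str100
                recast strL gnxl     // try strL storage instead
                memory               // compare the data memory figure with what it was before the recast
                If strL turns out not to help, the variable can be recast back to str#.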

                I would describe the difference between
                Code:
                generate dummy=
                generate byte dummy=
                as one of explicit typing rather than strong typing, since if a type is not given explicitly on the generate command, then float is implicit, so there is no difference between
                Code:
                generate dummy=
                generate float dummy=
                Since you mention the XY problem, I'll note that in post #6 Sergiy Radyakin addresses what appears to be your real question - how to speed up saving and recalling datasets - rather than the question you posed. The Statalist community is also sensitive to the XY problem, and in my case at least, that was a reason for moving on after reading your initial post. Once the question became "how can I test this empirically" you had my interest.

                Also, while the communities you come from may understand "and 'regular' data sets (E+3–8 by E+1)", that is a locution I have not seen on Statalist before. I'm not sure what would constitute an "irregular" Stata dataset, and the only guess I can make about the parenthesized expression is datasets with on the order of 1,000 to 100,000,000 observations and 10 variables.

                If that is indeed the size of your dataset, note that the gains from using a frame to save your dataset are reduced to the extent that the total size of your data in memory (in the default frame and the frame holding the saved copy) exceeds the memory available to Stata. Also, speed gains from efficient storage of a large dataset are likely to be dwarfed by the gains from using Stata/MP on the analytical part of the problem.

                Comment
