
  • where does -compress- store its "bytes saved" value?

    Those familiar with -compress- know that running it on a full dataset prints a list of changes (e.g., variable var1 was str5 now str2) and then, at the end, reports (150,000 bytes saved) or whatever the number actually is. I'm looking for how that number is calculated and where it is stored (e.g., as an e-class or r-class scalar). I've looked for an .ado file, but it appears that -compress- is written as a .dlg file, which I don't know how to read yet. Any help would be much appreciated!

  • #2
    -compress- is a built-in Stata command, so there is no ado-file you can read. Unfortunately, it does not seem to store the reported number of bytes saved anywhere we can use.

    The .dlg file merely controls the dialog box that you can access from the menu: Data > Data utilities > Optimize variable storage. You can peek into it if you like; it is a plain text file that opens in any text editor.

    • #3
      -compress- doesn't return anything other than what it prints to the Results window, and it is usually uninteresting to capture that information. That said, you can manually recreate the calculation using the number of observations in your dataset, the number of bytes each storage type uses (see -help data types-), and the storage type of each altered variable before and after running -compress-. As a simple example, say your dataset contains 10 observations and a single string variable was compressed from str10 to str9. This is a savings of 10 bytes (= 1 byte * 10 obs). The calculations are straightforward for all but -strL-. (In the background, strL appears to allocate only as many blocks of memory as the current string contents need, rather than the maximum width of the variable.)
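
      The arithmetic in that example can be reproduced on a toy dataset. This is only a sketch: the variable name s and its contents are made up for illustration.

      ```stata
      clear
      set obs 10

      * hypothetical variable: 9 characters stored in a str10
      generate str10 s = "abcdefghi"

      * saving worked out by hand: 1 byte per observation * 10 observations
      display "expected bytes saved: " _N * (10 - 9)

      * let -compress- report its own figure for comparison
      compress
      ```

      If -compress- narrows s to str9, the figure it prints should agree with the hand calculation.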

      So if you really want this information, you have two approaches.
      1) Get the list of variables and their storage types before and after -compress-, and work out the calculation yourself. This is not ideal if you have -strL- variables, for the reason mentioned above, but is otherwise accurate.
      2) Send just the output of -compress- to a log file, then read that text back in and parse out which variables were changed and by how much.
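
      For approach 1, one possible shortcut (assuming there are no -strL- variables) is to let -describe- do the bookkeeping: it stores the number of observations in r(N) and the width of one observation in bytes in r(width), so their product is the size of the fixed-width data. A minimal sketch using the auto dataset; the scalar names are arbitrary.

      ```stata
      sysuse auto, clear

      * size of the fixed-width data before compressing
      quietly describe
      scalar bytes_before = r(N) * r(width)

      quietly compress

      * same calculation after compressing
      quietly describe
      scalar bytes_after = r(N) * r(width)

      display "bytes saved: " bytes_before - bytes_after
      ```

      For the uncompressed auto data this works out to 74 observations * 43 bytes = 3,182 bytes, which matches the Data line that -memory- reports in post #5.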

      • #4
        What about

        Code:
        memory
        return list, all
        which returns various measures of memory used. Call it before and after the -compress- statement to see how much memory is saved. See https://www.stata.com/manuals13/u6.pdf

        • #5
          Here's an example following post #4, for those (like myself) who have forgotten about the -memory- command.
          Code:
          . sysuse auto, clear
          (1978 automobile data)
          
          . memory
          
          Memory usage
                                                   Used                Allocated
          ----------------------------------------------------------------------
          Data                                    3,182               67,108,864
          strLs                                       0                        0
          ----------------------------------------------------------------------
          Data & strLs                            3,182               67,108,864
          
          ----------------------------------------------------------------------
          Data & strLs                            3,182               67,108,864
          Variable names, %fmts, ...              4,178                   68,030
          Overhead                            1,081,344                1,082,136
          
          Stata matrices                              0                        0
          ado-files                               5,465                    5,465
          Stored results                              0                        0
          
          Mata matrices                               0                        0
          Mata functions                              0                        0
          
          set maxvar usage                   38,830,849               38,830,849
          
          Other                                   4,660                    4,660
          ----------------------------------------------------------------------
          Total                              39,925,014              107,100,004
          
          . return list, all
          
          Hidden; names and contents could change in future versions of Stata
          
          scalars:
                  r(data_data_u) =  3182
                  r(data_data_a) =  67108864
                  r(data_strl_u) =  0
                  r(data_strl_a) =  0
                   r(data_etc_a) =  68030
                   r(data_etc_u) =  4178
                    r(data_oh_a) =  1082136
                    r(data_oh_u) =  1081344
             r(stata_matrices_a) =  0
                  r(stata_ado_a) =  5465
                   r(stata_sr_a) =  0
              r(mata_matrices_a) =  0
             r(mata_functions_a) =  0
                     r(maxvar_a) =  38830849
                      r(other_a) =  4660
          
          .
          Last edited by William Lisowski; 21 Aug 2022, 08:04.

          • #6
            Oh wow, I didn't know about this at all!

            And I can confirm that the number of bytes saved, as reported by -compress-, equals the difference between the values of the scalar r(data_data_u) before and after running the command.
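
            That comparison can be scripted. A minimal sketch, assuming the hidden result r(data_data_u) keeps its current name (the output in #5 warns that hidden results could change in future versions of Stata); the scalar names are arbitrary.

            ```stata
            sysuse auto, clear

            * bytes used by the data before compressing
            quietly memory
            scalar used_before = r(data_data_u)

            compress

            * bytes used by the data after compressing
            quietly memory
            scalar used_after = r(data_data_u)

            display "bytes saved: " used_before - used_after
            ```

            With -strL- variables, r(data_strl_u) could be tracked the same way, though this thread does not confirm whether -compress- includes that in its reported total.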

            • #7
              Thank you very much, everyone. The -memory- command will fit my needs perfectly!
