  • Strange results reported by compress

    I wonder if anyone can help me interpret the following results (see below) of using the compress command.
    My confusion arises from the fact that Stata allocates a total of 253,813,893 bytes (see the grand total value) across all uses, and yet compress then reports saving more memory than this grand total (457,062,876 bytes saved).
    IMHO the savings should be capped at the grand total, or more specifically at the grand total used.

    Furthermore, this doesn't agree with the actual savings (if the data are saved to a file and the file sizes before and after are compared).


    Code:
    . memory
    
      Memory usage
                                                used                allocated
        ---------------------------------------------------------------------
        data                             154,974,000              234,881,024
        strLs                                      0                        0
        ---------------------------------------------------------------------
        data & strLs                     154,974,000              234,881,024
    
        ---------------------------------------------------------------------
        data & strLs                     154,974,000              234,881,024
        var. names, %fmts, ...                 1,188                   34,832
        overhead                          13,647,920               13,648,076
    
        Stata matrices                             0                        0
        ado-files                             13,484                   13,484
        stored results                             0                        0
    
        Mata matrices                              0                        0
        Mata functions                             0                        0
    
        set maxvar usage                   5,231,728                5,231,728
    
        other                                  4,749                    4,749
        ---------------------------------------------------------------------
        grand total                      173,867,693              253,813,893
    
    . save "C:\temp\industry_before.dta", replace
    file C:\temp\industry_before.dta saved
    
    . 
    . recast strL industrystr
    
    . compress
      industrystr is strL now coalesced
      (457,062,876 bytes saved)
    
    . memory
    
      Memory usage
                                                used                allocated
        ---------------------------------------------------------------------
        data                              53,904,000              301,989,888
        strLs                             80,352,806               80,352,806
        ---------------------------------------------------------------------
        data & strLs                     134,256,806              382,342,694
    
        ---------------------------------------------------------------------
        data & strLs                     134,256,806              382,342,694
        var. names, %fmts, ...                 1,188                   34,832
        overhead                          13,647,936               13,648,076
    
        Stata matrices                             0                        0
        ado-files                             13,484                   13,484
        stored results                             0                        0
    
        Mata matrices                              0                        0
        Mata functions                             0                        0
    
        set maxvar usage                   5,231,728                5,231,728
    
        other                                  4,744                    4,744
        ---------------------------------------------------------------------
        grand total                      153,150,510              401,275,558
    
    . save "C:\temp\industry_after.dta", replace
    file C:\temp\industry_after.dta saved
    
    . 
    end of do-file

  • #2
    Modern versions of Stata adjust memory on the go, so I think you have a step missing:

    Code:
    . recast strL industrystr
    . compress
      industrystr is strL now coalesced
      (457,062,876 bytes saved)
    -recast strL- will increase memory usage, so you should check memory after the -recast- command but before -compress-.



    • #3
      Dear Andrew, thank you very much for your answer. This is an excellent point: I should have inspected the memory immediately before and after the compress command. Unfortunately, it still doesn't explain Stata's behavior (see the listing below).

      The do-file for reproducing the problem is here:
      Code:
      do "http://www.radyakin.org/statalist/2017/1370890-strange-results-reported-by-compress.do"
      Thank you, Sergiy

      Code:
      . memory
      
        Memory usage
                                                  used                allocated
          ---------------------------------------------------------------------
          data                              53,904,000              301,989,888
          strLs                             80,587,132               80,587,132
          ---------------------------------------------------------------------
          data & strLs                     134,491,132              382,577,020
      
          ---------------------------------------------------------------------
          data & strLs                     134,491,132              382,577,020
          var. names, %fmts, ...                 1,188                   34,832
          overhead                          13,647,936               13,648,076
      
          Stata matrices                             0                        0
          ado-files                             13,484                   13,484
          stored results                             0                        0
      
          Mata matrices                              0                        0
          Mata functions                             0                        0
      
          set maxvar usage                   5,231,728                5,231,728
      
          other                                  4,744                    4,744
          ---------------------------------------------------------------------
          grand total                      153,384,836              401,509,884
      
      . compress
        industrystr is strL now coalesced
        (456,241,150 bytes saved)
      
      . memory
      
        Memory usage
                                                  used                allocated
          ---------------------------------------------------------------------
          data                              53,904,000              301,989,888
          strLs                             80,352,806               80,352,806
          ---------------------------------------------------------------------
          data & strLs                     134,256,806              382,342,694
      
          ---------------------------------------------------------------------
          data & strLs                     134,256,806              382,342,694
          var. names, %fmts, ...                 1,188                   34,832
          overhead                          13,647,936               13,648,076
      
          Stata matrices                             0                        0
          ado-files                             13,484                   13,484
          stored results                             0                        0
      
          Mata matrices                              0                        0
          Mata functions                             0                        0
      
          set maxvar usage                   5,231,728                5,231,728
      
          other                                  4,744                    4,744
          ---------------------------------------------------------------------
          grand total                      153,150,510              401,275,558
      
      . save "C:\temp\industry_after.dta", replace
      file C:\temp\industry_after.dta saved



      • #4
        Have you checked actual memory usage in the task manager? It was a long time ago, but I remember looking into memory usage during loops and so on, and I seem to recall that it wasn't particularly reliable.



        • #5
          Sergiy, I now see your point. I have no idea what the default -coalesce- option is doing here and whether it has any implications for actual memory usage. However, the printed results after compress seem to match what you will get using the -nocoalesce- option.

          Code:
          compress, nocoalesce
          As Jesse suggests, it may be worth exploring other methods of measuring memory usage.
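          One way to put the two reported figures side by side is sketched below. It assumes industrystr has already been recast to strL, as in post #3, and it uses preserve/restore only so that both calls start from the same data; it says nothing about what compress is actually counting.

          ```stata
          * Sketch only: compare the savings printed with and without coalescing.
          * Assumes industrystr is already strL (recast as in post #3).
          preserve
          compress, nocoalesce
          restore

          * Same starting data, default behavior (strLs are coalesced)
          compress
          ```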



          • #6
            Jesse, I have checked with the task manager, and no, the value reported by memory is different from the actual memory consumption (screenshot based on a different dataset).

            Andrew, the coalesce option makes an extra effort to store repeated strings more efficiently: basically, it stores duplicates only once. And yes, I would like to use this option to store the file as efficiently as possible on disk.

            It seems that neither compress nor memory gives an accurate indication of the memory savings (or I fail to interpret their output correctly). Hence, when choosing between the alternative layouts, I will resort to saving one file for each of the data layouts, layout1 and layout2, and then comparing the file sizes as reported by the operating system.
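            A minimal sketch of that file-size comparison, done from within Stata rather than through the operating system, might look like this. It assumes the variable is industrystr as in post #1; the tempfile names layout1 and layout2 and the scalars size1 and size2 are illustrative. It relies on checksum leaving the file length in r(filelen).

            ```stata
            * Sketch only: compare on-disk sizes of two layouts of the same data.
            * layout1/layout2 and size1/size2 are illustrative names.
            tempfile layout1 layout2

            * Layout 1: the data as they are now
            save "`layout1'"

            * Layout 2: industrystr held as a coalesced strL
            recast strL industrystr
            compress
            save "`layout2'"

            * checksum leaves each file's length in bytes in r(filelen)
            checksum "`layout1'"
            scalar size1 = r(filelen)
            checksum "`layout2'"
            scalar size2 = r(filelen)

            display "layout1: " %15.0fc scalar(size1) " bytes"
            display "layout2: " %15.0fc scalar(size2) " bytes"
            display "saved:   " %15.0fc (scalar(size1) - scalar(size2)) " bytes"
            ```

            Note that the recast changes the data in memory, and that checksum reads each file in full, so it may take a while on large files.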

            Thank you, Sergiy



            [Screenshot: stata_memory_usage.png]



            • #7
              Does the value presented by compress represent the difference in memory as observed by the task manager? E.g.

              Code:
              set niceness 10
              compress
              This actually reminds me, what happens if you set niceness to 10 and then execute the memory-compress-memory, perhaps with a sleep 5000 after the compress? I don't think it will make a difference, but who knows?
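              The sequence described above could be sketched as follows, assuming industrystr has already been recast to strL as in post #3. The 5000 ms pause is simply the value mentioned in this post.

              ```stata
              * Sketch only: memory-compress-memory with maximum niceness and a pause.
              * Niceness 10 asks Stata to hand unused memory back to the OS promptly.
              set niceness 10

              * Snapshot before compress
              memory

              compress

              * Wait 5 seconds before looking again
              sleep 5000

              * Snapshot after compress
              memory
              ```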



              • #8
                Jesse,
                1. The manual is silent about what compress actually reports. But the manual describes what compress is doing: "compress attempts to reduce the amount of memory used by your data." Hence, I anticipate that compress reports the difference in the data-related part of allocated memory: (data & strLs + varnames & formats + overhead) in the output reported by memory.
                2. I believe the amount of memory reported by Stata's memory command may, for various reasons, be different from the one reported by the memory manager, but I don't see how it can be larger. If Stata reports that it has claimed X bytes from the OS, then the OS should reflect this. Stata may have claimed something else for other purposes, but at least the data portion of the memory should be reflected.
                3. I have watched the amount of memory, and it didn't change within a minute after the do-file was run.
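                As a rough check of the expectation in point 1, the data-related allocated figures from the two memory listings in post #3 can be differenced directly. This is only a sketch using numbers copied from those listings (the scalar names are illustrative). On those numbers the difference comes to 234,326 bytes, nowhere near the 456,241,150 bytes that compress printed.

                ```stata
                * Allocated bytes copied from the two -memory- listings in post #3:
                * data & strLs + var. names, %fmts + overhead
                scalar alloc_before = 382577020 + 34832 + 13648076
                scalar alloc_after  = 382342694 + 34832 + 13648076

                * Change in data-related allocated memory across -compress-
                display %15.0fc (scalar(alloc_before) - scalar(alloc_after))

                * Saving printed by -compress- in the same listing, for comparison
                display %15.0fc 456241150
                ```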
                Thank you, Sergiy



                • #9
                  Very bizarre
