  • Memory management for large data: choice of strL vs strF

    Dear all,

    I am working with a large dataset in Stata 13 and am interested in storing the data in the most compact form. The dataset has a number of string variables of various lengths (currently all of them fit within the older 244-character limit). Converting some of them to strLs seems to reduce the file size, but I am unsure whether converting all of them would be beneficial. All further processing will use these existing variables in read-only mode, so their lengths and contents will not change.

    I am interested in empirical guidance on the tradeoff between strL and strF in Stata 13. For example, it follows from the dta-117 specification that employing strL for a str8 variable can only increase the file size. However, it is not clear what happens for, say, str20. I do anticipate a large number of duplicates in these strings, but mostly among the short ones; the longer values are likely to be unique sequences.
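    For concreteness, the kind of empirical check I have in mind is something like this (a minimal sketch; the observation count, string content, and file names are made up):

    Code:
    * compare the saved size of the same str20 variable stored as str# vs strL
    clear
    set obs 200000
    generate str20 s = "ABCDEFGHIJKLMNOPQRST"
    save "as_strf", replace

    recast strL s
    save "as_strl", replace

    * inspect the resulting file sizes on disk
    dir as_str*.dta
    Comparing the two .dta files on disk would show whether strL pays off for a particular pattern of lengths and duplicates.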

    Should I work out my own strategy for data storage, or should I trust that compress finds the best storage types? For example, I don't see it ever changing types from strF to strL, and I can't come up with an explanation why not. It would, imho, be beneficial for long datasets with repetitive long content, or for datasets with a few long strings and mostly short ones; this inspires me to write supcompress to achieve better storage than compress does. Is there something non-obvious I should be aware of?

    I am also interested in finding out what exactly r(width) is for datasets containing strLs, given that the length of a strL is not fixed.
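    One can at least observe the value empirically (a minimal sketch on the auto data), though that does not tell me how it is defined:

    Code:
    sysuse auto, clear
    quietly describe
    display r(width)     // width while all strings are fixed-length
    recast strL make
    quietly describe
    display r(width)     // width after one variable has become a strL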

    Finally, I am also interested in finding out how to capture programmatically the size of the dataset in memory (neither the memory command nor the describe command, both of which report this size, saves this value).

    Thank you, Sergiy Radyakin

  • #2
    For all of the shorter strings that contain a large number of duplicates, would encode be helpful? You might be able to get, say, a 20-byte string down to an int or even a byte after encoding them. Even a long would be some savings.
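    Something along these lines (a quick sketch on the auto data; the variable names are only illustrative):

    Code:
    sysuse auto, clear
    describe make                        // str18 fixed-length string
    encode make, generate(make_code)     // labelled numeric copy, stored as long
    compress make_code                   // shrinks further when the levels fit in int or byte
    drop make
    describe make_code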

    Also, for programmatically capturing dataset size (and memory allocated or used by strLs), see the do-file results below. (But note the warning from StataCorp about possible changes.)

    For your other questions, I'm afraid that you're way ahead of me--about the only suggestion I could offer in the absence of a response from StataCorp is to experiment and see whether you can formulate some practical rules empirically, and you've already contemplated that.

    . version 13.1

    . 
    . clear *

    . set more off

    . 
    . sysuse auto
    (1978 Automobile Data)

    . 
    . generate strL happy = "My happy days!"

    . 
    . memory

    Memory usage
                                           used          allocated
    -----------------------------------------------------------------
    data                                  3,774         67,108,864
    strLs                                 1,839              1,839
    -----------------------------------------------------------------
    data & strLs                          5,613         67,110,703

    -----------------------------------------------------------------
    data & strLs                          5,613         67,110,703
    var. names, %fmts, ...                1,874             26,173
    overhead                          1,081,352          1,082,144

    Stata matrices                            0                  0
    ado-files                             7,003              7,003
    stored results                            0                  0

    Mata matrices                             0                  0
    Mata functions                            0                  0

    set maxvar usage                  1,431,736          1,431,736

    other                                 1,963              1,963
    -----------------------------------------------------------------
    grand total                       2,527,241         69,659,722

    . 
    . *
    . * Used memory
    . *
    . display in smcl as text r(data_data_u)
    3774

    . 
    . *
    . * Allocated/used memory for strL variables
    . *
    . display in smcl as text r(data_strl_a)
    1839

    . 
    . *
    . * More generally
    . *
    . return list, all

    Hidden; names and contents could change in future versions of Stata

    scalars:
             r(data_data_u) =  3774
             r(data_data_a) =  67108864
             r(data_strl_u) =  1839
             r(data_strl_a) =  1839
              r(data_etc_a) =  26173
              r(data_etc_u) =  1874
               r(data_oh_a) =  1082144
               r(data_oh_u) =  1081352
        r(stata_matrices_a) =  0
             r(stata_ado_a) =  7003
              r(stata_sr_a) =  0
         r(mata_matrices_a) =  0
        r(mata_functions_a) =  0
               r(maxvar_a) =  1431736
                r(other_a) =  1963

    . 
    . exit

    end of do-file


    . 



    • #3
      Dear Joseph,

      Thank you very much for this wonderful hint! It is perfect! Now my supcompress prototype does not have to save the data just to measure the file size, and it works much faster.

      Here is an experiment:
      Code:
      set seed 123456789
      clear
      set obs `=2e5'
      
      generate x="A"
      generate y="@ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
      generate z=cond(runiform()<0.97, "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789","C")
      generate a="abcdefghijklm"
      
      memory
      save "gen_before", replace
      
      compress
      memory
      save "gen_before_c", replace
      
      supcompress
      memory
      save "gen_after", replace
      
      generate uuu=z
      
      memory
      save "gen_after_u", replace
      
      compress
      memory
      save "gen_after_u_c", replace
      
      supcompress
      memory
      save "gen_after_u_sc", replace
      The saved files make it possible to track the progress; here is a snapshot of the file sizes:



      Basically, the data file could be reduced by a factor of 2-2.5 without any damage to the data (or at least I haven't noticed any yet).
      I know this is based on strategically generated example data (see the program listing above), but it shows that good old compress can now have some competition.

      On the real dataset it gives a modest 25% decrease in file size (which is all I care about at the moment; the total memory size does not shrink by the same proportion).

      Correction: on the real-real dataset it actually gave a six-fold saving in file size (sorry for the confusion); the rest is correct, and I still don't care about the memory size for now.

      PS: I would like to avoid using labels as much as possible for a few variables (ids and lists of ids) that might later be used for merging the data. For the other (categorical) variables I have, of course, created numeric labelled equivalents.

      Thank you, Sergiy Radyakin
      Last edited by Sergiy Radyakin; 03 Jun 2014, 18:45. Reason: Compared wrong set of files, so obtained a rather conservative estimate, corrected myself.



      • #4
        Apologies for necroing the thread, but like Sergiy I am surprised that -compress- does not consider promoting strfs to strLs. Starting with a 140MB dataset, -compress- did nothing, but -recast-ing all my strfs as strLs and then -compress-ing knocked off 70MB. If there's a problem with having that be default behavior, then it seems like an option would be in order. I'm sympathetic to concerns of option-bloat, but this technique can make a heck of a difference.
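        In case it is useful to anyone, the recipe is essentially the following (a sketch only; picking out the string variables with -ds- and the file name are illustrative, and it assumes the string variables are currently plain str#):

        Code:
        * promote every fixed-length string variable to strL, then re-compress
        ds, has(type string)
        foreach v of varlist `r(varlist)' {
            recast strL `v'
        }
        compress
        save "mydata_strL", replace   // compare the file size against the original .dta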
        Last edited by Nils Enevoldsen; 14 Jan 2015, 08:24.



        • #5
          Having -compress- promote to strLs by default when it would save space could be a problem in that strL variables cannot be used as part of the key in a -merge- operation. They also cannot be used with -fillin- ([U] 12.4.6). Later in that same passage, we are told that -compress- will automatically change a strL to a str# when that would save space, but it does not do the reverse, precisely because of the concern about -merge- and -fillin-.

          Making this an option sounds like a better idea. For my part, I rarely deal with data sets so large that I need to worry about the size of my files beyond what -compress- routinely does for me. But if after saving my data set I found that I could no longer -merge- on a key variable, I would tear my hair out! On the other hand, I can certainly see that other users have reasons to put a premium on the smallest file size.
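          For those who do convert to strLs to save space, one possible workaround before merging (a sketch; the key variable idcode and the file name are hypothetical) is to lean on the -compress- behaviour just described:

          Code:
          compress idcode                    // changes a strL key back to str# when that saves space
          describe idcode                    // confirm that the key is no longer a strL
          merge 1:1 idcode using "otherfile"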

