  • Is it a general rule that numeric variables take up less file space than strings?

    I was thinking about how to reduce the size of my 100GB dataset.

    I suspected that numeric variables might take less space.

    So I experimented first with toy data.

    Code:
    clear
    set obs 1000
    * str8 string: 8 bytes per observation
    gen test = "12345678"
    save string, replace
    * convert to numeric and save again for comparison
    destring test, replace
    save numeric, replace

    string.dta is 9KB
    numeric.dta is 6KB.

    Is it generally true that numeric variables always take up less space?

  • #2
    I think it depends on the width of the string variable, but as a general rule you are right: numeric data use less space than the same values held as strings.
    Have a look at: https://blog.stata.com/2012/04/02/th...-to-precision/ which provides information on the storage types of numeric data.

    The storage of strings is explained in the manual: https://www.stata.com/manuals/u12.pdf#u12.4.7
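
    A rough sketch of the width point, not taken from the posts: the variable names wide and narrow are made up for illustration, and r(width) is the per-observation width in bytes that describe leaves behind. A str8 shrinks when converted, but a str1 holding a single digit is already as small as a byte.

    Code:
    clear
    set obs 1000
    * hypothetical variables: an 8-character string and a 1-character string
    generate wide = "12345678"
    generate narrow = "5"
    quietly describe
    * 8 + 1 = 9 bytes per observation as strings
    display r(width)
    destring wide narrow, replace
    compress
    quietly describe
    * long (4 bytes) + byte (1 byte) = 5 bytes per observation
    display r(width)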
    Last edited by Marc Kaulisch; 07 Dec 2022, 00:14.

    • #3
      Storing numeric values as strings tends to consume (much) more memory than storing them efficiently as numbers, i.e. in the smallest storage type that preserves precision; see also compress.
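
      As an illustration of what compress does (a sketch under my own assumptions; the variable name x is hypothetical): a whole number held as a double is demoted to the smallest integer type that can hold it exactly.

      Code:
      clear
      set obs 1000
      generate double x = 12345678
      * x starts as an 8-byte double
      describe x
      compress
      * 12345678 is too large for an int, so x should now be a 4-byte long
      describe x
      assert "`: type x'" == "long"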

      • #4
        Don't forget encode or the user-written sencode (R. Newson, from SSC), the latter of which can save a step or two and yields an even smaller 3 KB file for me.

        Code:
        clear
        set obs 1000
        gen test = "12345678"
        sencode test, replace
        save numeric2, replace
        Code:
        . ssc describe sencode
        
        ----------------------------------------------------------------------------------------------
        package sencode from http://fmwww.bc.edu/repec/bocode/s
        ----------------------------------------------------------------------------------------------
        
        TITLE
              'SENCODE': module to encode a string variable non-alphanumerically into a numeric
              variable

        DESCRIPTION/AUTHOR(S)

              sencode is a sequential version of encode. It takes, as input,
              a string variable, and generates, as output, a numeric
              variable, with value labels corresponding to values of the
              string variable. Unlike encode, sencode orders the numeric
              values corresponding to string values in the sequential
              order of appearance in the input string variable in the data set,
              or in another order specified by the user, instead of
              ordering them in alphanumeric order of the string value, as
              encode does. The mapping from numeric code values to string
              labels may be one-to-one (coded in order of first appearance
              of the string value) or many-to-one (coded in each observation
              by the position of that observation in the data set, or in the
              user-specified ordering).

              KW: data manipulation

              Requires: Stata version 10.0

              Distribution-Date: 20130930

              Author: Roger Newson, King's College London
              Support: email [email protected]

        
        INSTALLATION FILES                              (type net install sencode)
              sencode.ado
              sencode.sthlp
        ----------------------------------------------------------------------------------------------
        (type ssc install sencode to install)
        David Radwin
        Senior Researcher, California Competes
        californiacompetes.org
        Pronouns: He/Him

        • #5
          Thank you so much! Do you mean that encode and sencode reduce file size more than destring followed by compress? Even if I run "compress" after "destring", will the file still be bigger than with "encode" or "sencode" alone?

          And does "sencode" reduce file size even more than "encode"? If so, how is that possible?

          • #6
            If destring makes sense then encode doesn't hold any advantage, as at best holding the same data as value labels as well as holding it as values is pointless. At worst, encode is a machine for generating garbage, as a test case of "1000", "200", "3" will show.
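
            The test case above can be sketched as follows (the variable names s, coded and value are invented here). encode numbers the distinct strings in alphanumeric order, so the codes bear no relation to the numbers the strings spell out, whereas destring recovers them.

            Code:
            clear
            input str4 s
            "1000"
            "200"
            "3"
            end
            encode s, generate(coded)
            destring s, generate(value)
            * nolabel shows the underlying codes: 1, 2, 3 rather than 1000, 200, 3
            list s coded value, nolabel
            assert coded == _n
            assert value == real(s)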

            Testing against a case in which only one value label is needed is testing an extreme case.


            The larger point is that very rarely, if ever, are destring and encode even competing solutions. "Have string, want numeric" covers different problems with different solutions, superficially hidden by the same vague wording.

            The odd detail here for me, as the original author of destring, is that it was written precisely because a recurrent problem could not be solved by encode.

            • #7
              Thank you. I think you are right. This time I tested with this example instead:

              Code:
              clear
              set obs 1000
              gen test=_n
              tostring test, replace
              save string,replace
              destring test, replace
              compress
              save numeric, replace
              
              clear
              set obs 1000
              gen test=_n
              tostring test, replace
              save string,replace
              encode test, generate(test2)
              drop test
              rename test2 test
              compress
              save numeric2, replace
              
              clear
              set obs 1000
              gen test=_n
              tostring test, replace
              save string,replace
              sencode test, replace
              compress
              save numeric3, replace
              Now numeric is 4KB, numeric2 is 16KB, and numeric3 is 15KB.

              So I guess that in most cases where each observation has a different value, destring keeps the file smaller than encode or sencode.

              • #8
                I don't think testing is needed here. The difference is between holding numeric values and holding those values together with those values as value labels. Any other set-up is not comparable.
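
                To make the comparison concrete, here is a small sketch (the variable names are my own): after encode, the dataset holds one integer code per observation plus a value label entry for every distinct string, which is the extra baggage that destring avoids.

                Code:
                clear
                set obs 1000
                generate test = string(_n)
                encode test, generate(coded)
                * the value label is named after the new variable; r(k) counts its entries
                quietly label list coded
                display r(k)
                assert r(k) == 1000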
