  • Creating variable to record filesize of dataset

    Is there any way to have a variable record the file size of a dataset?

    I'm trying to compare the number of observations to the file size for several hundred spreadsheets (a daily dataset recording store purchases).

    Thanks!!

  • #2
    You can use Stata's -checksum- command to get the size of a file. For example:

    Code:
    . checksum auto.dta
    Checksum for auto.dta = 2622619601, size = 12207
    
    . ret list
    
    scalars:
                r(version) =  1
                r(filelen) =  12207
               r(checksum) =  2622619601
    
    . di "`r(filelen)'"
    12207
    r(filelen) stores the length of the file in bytes. For more details, see

    Code:
    help checksum
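
    To get the variable asked for in #1, the size can be collected file by file and posted to a results dataset. This is only a sketch, assuming the files are Stata datasets (.dta) in the current directory; the variable names bytes and nobs are made up for illustration:

    ```stata
    * Sketch: one row per dataset with its size in bytes and its observation count.
    * Assumption: all files of interest are .dta files in the current directory.
    local files : dir "." files "*.dta"

    tempfile results
    tempname memhold
    postfile `memhold' str244 filename double(bytes nobs) using "`results'"

    foreach f of local files {
        * checksum reads the whole file and leaves its length in r(filelen)
        quietly checksum "`f'"
        local bytes = r(filelen)

        * describe using reports the observation count without loading the data
        quietly describe using "`f'"
        local nobs = r(N)

        post `memhold' ("`f'") (`bytes') (`nobs')
    }

    postclose `memhold'

    * Note: this replaces the data currently in memory
    use "`results'", clear
    list filename bytes nobs
    ```

    For files that are not in Stata format, checksum still reports the size, but the observation count would have to come from importing each file instead of describe using.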

    • #3
      You can also use filelist (from SSC) to get the size of any file. Unlike checksum, it does not have to read the whole file to determine its size. To install filelist, type in Stata's Command window:

      Code:
      ssc install filelist
      For example, to get the file size of all datasets installed with Stata, change the current directory to the Stata directory (see help cd) and type

      Code:
      filelist , pattern(*.dta)
      This will recursively find all datasets.
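
      To put the observation count next to the file size, the list that filelist builds can be extended with a loop. This is a rough sketch, assuming that filelist leaves one row per file in memory with the variables dirname, filename and fsize (check help filelist for your installed version); nobs is a made-up name:

      ```stata
      * Sketch: file size from filelist plus observation count from describe.
      * Assumption: filelist creates the variables dirname, filename and fsize.
      clear
      filelist, dir(".") pattern("*.dta") norecursive

      generate double nobs = .
      forvalues i = 1/`=_N' {
          * Build the full path of the file in row i
          local f = dirname[`i'] + "/" + filename[`i']

          * describe using reads only the file header, not the data
          quietly describe using "`f'"
          quietly replace nobs = r(N) in `i'
      }

      list filename fsize nobs
      ```

      The norecursive option limits the search to the current directory; drop it to include subdirectories.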

      • #4
        The one-line solution by Hua Peng is usually the most readable and the easiest to maintain, but checksum produces the file size only as a by-product: the command reads and processes the whole file. This makes it slower than the alternatives:

        Code:
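        * Path to a large test file (adjust to your system)
        local benchmark "U:\QUERY13.dta"
        timer clear
        
        * Method 1: checksum reads the entire file
        timer on 1
        checksum `"`benchmark'"'
        timer off 1
        display %21.0g r(filelen)
        
        * Method 2: jump to the end of the file and read the position
        timer on 2
        file open fh using `"`benchmark'"', read binary
        file seek fh eof
        * Store r(loc) right away, before any later command can change r()
        local size = r(loc)
        file close fh
        timer off 2
        display %21.0g `size'
        
        timer list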
        For a modest file (0.6 GB), the timings differ as follows:
        Code:
        . timer list
           1:      6.95 /        1 =       6.9540
           2:      0.00 /        1 =       0.0000
        The checksum method is expected to be especially slow on a network drive or an external USB drive. The second method should take a trivial amount of time regardless of the storage medium.
        The benefit of checksum is that it can check not only the size but also the content of the file!

        Best, Sergiy Radyakin
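
        For repeated use, the file seek approach can be wrapped in a small r-class program. This is a minimal sketch; the program name filebytes and the result name r(bytes) are hypothetical, not part of Stata or of any package mentioned in this thread:

        ```stata
        * Sketch: return the size of a file in bytes without reading its content.
        * filebytes and r(bytes) are illustrative names, not an existing command.
        capture program drop filebytes
        program define filebytes, rclass
            args path
            tempname fh

            * Open in binary mode and move to the end of the file;
            * file seek then reports the position in r(loc)
            file open `fh' using `"`path'"', read binary
            file seek `fh' eof
            local bytes = r(loc)
            file close `fh'

            return scalar bytes = `bytes'
        end
        ```

        It could then be called on any existing file, for example:

        ```stata
        filebytes "auto.dta"
        display %21.0g r(bytes)
        ```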



        • #5
          And filelist uses the same method that Sergiy just proposed (implemented in Mata), so it has the same performance advantage over checksum.

          • #6
            Both Robert and Sergiy are right if performance becomes a concern.
