Creating variable to record filesize of dataset

Nathan Weil

Join Date: Aug 2015

Posts: 1
#1

10 Aug 2015, 13:46

Is there any way to have a variable record the file size of a dataset?

I'm trying to compare the number of observations to the file size for several hundred spreadsheets (a daily dataset recording store purchases).

Thanks!!
Tags: filesize, Generate, metadata
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#2

10 Aug 2015, 13:53

You can use Stata's -checksum- command to get the size of a file. For example:

Code:

. checksum auto.dta
Checksum for auto.dta = 2622619601, size = 12207

. ret list

scalars:
            r(version) =  1
            r(filelen) =  12207
           r(checksum) =  2622619601

. di "`r(filelen)'"
12207

r(filelen) stores the length of the file in bytes. For more details see

Code:

help checksum
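
To get one observation per file, the checksum call can be wrapped in a loop. The following is only a sketch, assuming the files are .dta datasets in the current directory. The handle name sizes, the tempfile results, and the variable names filesize and nobs are illustrative and do not come from this thread. For spreadsheets, the observation count would have to come from importing each file rather than from describe using.

```stata
* Sketch: one observation per .dta file in the current directory,
* holding the file name, its size in bytes, and its observation count.
tempname sizes
tempfile results
local files : dir "." files "*.dta"

postfile `sizes' str244 filename double filesize long nobs using `results'

foreach f of local files {
    quietly checksum `"`f'"'
    local bytes = r(filelen)          // save before r() is overwritten
    quietly describe using `"`f'"'    // leaves r(N) without loading the data
    post `sizes' (`"`f'"') (`bytes') (r(N))
}
postclose `sizes'

use `results', clear
list
```

Posting to a tempfile keeps the results file out of the directory being scanned.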
Robert Picard

Join Date: Mar 2014

Posts: 1536
#3

10 Aug 2015, 14:18

You can also use filelist (from SSC) to get the size of any file. Unlike checksum, it does not have to read the file to determine its size. To install filelist, type in Stata's command window:

Code:

ssc install filelist

For example, to get the file size of all datasets installed with Stata, change the current directory to the Stata directory (help cd) and type

Code:

filelist , pattern(*.dta)

This will recursively find all datasets.
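
As a rough sketch of what can be done with the result: filelist leaves a dataset in memory with one observation per file, and the size in bytes is stored in the variable fsize, alongside dirname and filename. The variable name mb below is an illustrative addition.

```stata
* Sketch: list the files found by filelist, largest first.
filelist , pattern("*.dta")

generate double mb = fsize / 1024^2    // size in megabytes
gsort -fsize
list dirname filename fsize mb
```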
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#4

10 Aug 2015, 15:02

The one-line solution by Hua Peng is usually the most readable and the easiest to maintain, but checksum produces the file size only as a by-product: it reads and processes the whole file. This makes it slower than the alternatives:

Code:

local benchmark "U:\QUERY13.dta"
timer clear

timer on 1
checksum `"`benchmark'"'
timer off 1
display %21.0g r(filelen)

timer on 2
file open fh using `"`benchmark'"', read binary
file seek fh eof
file close fh
timer off 2
display %21.0g r(loc)

timer list

For a modest file (0.6 GB), the timings in seconds look like this:

Code:

. timer list
   1:      6.95 /        1 =       6.9540
   2:      0.00 /        1 =       0.0000

The checksum method can be expected to be especially slow on a network drive or an external USB drive. The second method should give trivial timings regardless of the storage medium.
The benefit of checksum is that it can check not only the size, but also the content of the file!
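
For repeated use, the seek-to-end method can be wrapped in a small r-class program. This is only a sketch: the program name filesize and the returned scalar r(bytes) are hypothetical, and a temporary name is used for the file handle so that it cannot clash with a handle that is already open.

```stata
* Sketch: return a file's size in bytes without reading its contents.
capture program drop filesize
program define filesize, rclass
    args fname
    tempname fh
    file open `fh' using `"`fname'"', read binary
    file seek `fh' eof               // r(loc) now holds the byte offset of the end
    local bytes = r(loc)
    file close `fh'
    return scalar bytes = `bytes'
end

filesize "auto.dta"
display %21.0g r(bytes)
```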

Best, Sergiy Radyakin
Robert Picard

Join Date: Mar 2014

Posts: 1536
#5

10 Aug 2015, 15:07

And filelist uses the same method that Sergiy just proposed (in Mata) and therefore has the same performance advantage over checksum.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#6

10 Aug 2015, 15:13

Both Robert and Sergiy are right if performance becomes a concern.