
  • Most efficient way to append multiple large .dta

    Dear all,
    I have 72 .dta files with an average size of 200 MB (15 GB overall). They are all stored in a subdirectory named "segment1". I want to append them all into a single master file.

    Currently, I am using this
    Code:
    local file2append: dir "segment1" files "*.dta"
    foreach file of local file2append {
        preserve
        use "segment1/`file'", clear
        save temporary, replace
        restore
        append using temporary, force
    }
    The code works, however:
    1. It is incredibly slow.
    2. Stata keeps "not responding", even though it seems it is still performing the task.
    Do you believe my code can be streamlined? Do I need to specify anything about Stata's memory settings to speed up the code? I have 132 GB of RAM; is Stata fully exploiting it?

    Any help would be really appreciated.
    Best,
    Edoardo

    PS: 1) OS: Windows Server 2019 Standard; 2) Stata: SE 16.0

  • #2
    -append- lets you specify all files to be appended at once, so you could shorten your code as shown below, which saves the unnecessary caching of data copies and the disk I/O of saving intermediate files. (By the way, temporary files should be specified using -tempfile-.) Stata will otherwise use all available memory (provided it's not limited by some server setting, or by -set max_memory-).

    Code:
    cd "segment1"
    local file2append: dir . files "*.dta"
    append using `file2append', force
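
    If you want to confirm that nothing is capping Stata's memory on the server, a minimal sketch (the 100g figure is only an illustration):

    Code:
    * show the current memory settings (max_memory, segmentsize, ...) and usage
    query memory
    memory
    * raise the cap only if it turns out to have been set below what you need, e.g.
    * set max_memory 100g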



    • #3
      Yes, it can be sped up considerably.

      Code:
      clear
      // file2append is the local defined in #1; the directory below is just my own
      foreach f of local file2append {
          append using `"C:\Users\clyde\Documents\Sandbox/`f'"'
      }
      save final_appended_files, replace
      will be much better.

      That said, if the individual files are large, Stata may be "not responding" during the course of reading in each individual file. That is normal Stata operation--I don't know why, but it is.

      Finally, I will just warn you that trying to do a mass append like this is looking for trouble unless you have carefully reviewed and cleaned each of the files individually. Otherwise, you are likely to find lots of incompatibilities and inconsistencies among the files with respect to value labels, data storage types, and variable names. There is a handy little program, -precombine-, by Mark Chatfield, available from SSC, that can check the files before you attempt this and will let you know about such incompatibilities so you can fix them ahead of time. There are few things more frustrating than setting out to append all these files together, getting a substantial part of the way there, and then having Stata halt and tell you that some variable that was previously numeric has been found to be a string in the using data!
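
      For illustration, a sketch of how that check might look, assuming -precombine- takes the list of .dta files to be combined (see -help precombine- after installing for the exact syntax):

      Code:
      * locate and install -precombine-, then check the segments before appending
      search precombine
      local file2append: dir "segment1" files "*.dta"
      local filelist
      foreach f of local file2append {
          local filelist `filelist' "segment1/`f'"
      }
      * assumed usage: pass the files to be combined and review the report
      precombine `filelist'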

      Added: Crossed with #2. Note that using the -force- option is dangerous. Instead of Stata stopping and complaining when it finds incompatibilities in the files, it will simply blunder on silently and discard some of the offending data instead. The incompatibilities get swept under the rug, not fixed. Unless you have verified (either manually or with -precombine-) ahead of time that there are no such incompatibilities, you can easily end up with a data set that is missing some part of the data you need, and you won't even know it happened until you start getting screwy results later on. (Maybe you'll never know about it, and other people will rely, to their detriment, on your incorrect results.) My experience is that even well-curated series of data sets from well-respected, reliable sources often contain these kinds of problems. If you have verified that the files are all compatible before appending them, then the -force- option is unnecessary.
      Last edited by Clyde Schechter; 02 Mar 2021, 12:29.



      • #4
        Thanks to both of you!
        I really appreciate your prompt and helpful reply.

        Best regards,
        Edoardo



        • #5
          One minor correction to Clyde Schechter: -precombine- is from the Stata Journal and is not available at SSC; best to use -search- or -findit- to locate and install it.



          • #6
            Thank you Rich Goldstein for correcting my error.



            • #7
              Dear all,
              I have a related follow up question.

              I am using Leonardo Guizzetti's code (without the force option as Clyde suggested - since I conducted a preliminary cleaning of the segments to append).
              Code:
              clear
              cd "segment1"
              local file2append: dir . files "*.dta"
              append using `file2append'
              save appended.dta
              Before running the code, my Windows Task Manager says Stata is basically using no RAM (which is good).
              After running the code, the RAM usage tracked by the Task Manager starts increasing. So far so good: the more segments are appended, the more memory Stata takes up. However, even though the overall size of the database is 15 GB, Stata uses all the available memory in the system: the Task Manager reports I am using 100% of my RAM (132 GB).

              How come? Shouldn't there be a one-to-one mapping between the RAM used and the size of the files appended? How is it that appending 15 GB of .dta files depletes my 132 GB of RAM?!

              Any help would be much appreciated.
              Best,
              Edoardo



              • #8
                Edoardo, it is possible for the dataset to become much larger than 15 GB. If all of your appended datasets contain precisely the same variables, each stored as the same type, and nothing else, then the appended dataset's size is the sum of the sizes of the component datasets. This is the minimum size possible. If there are differences, the size can be severely inflated. The following may help you to understand.

                For simplicity, suppose that I have two datasets named A and B, each with exactly 10 observations and 2 variables of type double. (Each dataset also uses the smallest possible storage type for its data; otherwise -compress- could make the starting dataset smaller.) Each -double- requires 8 bits of storage per observation, so each dataset has (10 observations) * (2 double variables) * (8 bits per double variable) = 160 bits. Let's consider what happens when the two datasets are appended.

                Case 1: All variables the same
                This is the minimum size possible assuming that each variable is already the smallest necessary storage type.
                Datasets A and B each have two variables: -first- and -second-. The resulting dataset, AB, has the same two variables, with A's 10 observations plus B's 10 observations. The total size is then 320 bits (= 2 * 160 bits).

                Code:
                . desc, detail
                
                Contains data
                  obs:            20                          
                 vars:             2                          
                width:            16                          
                ------------------------------------------------------------
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------
                first           double  %10.0g                
                second          double  %10.0g                
                ------------------------------------------------------------
                Case 2: All variables different
                This is the maximum possible size after appending.
                Dataset A contains variables -first- and -second-, and Dataset B contains -third- and -fourth-. When A and B are appended, missing values must be added to A's observations for the (non-existent) variables -third- and -fourth-, and likewise missing values are added to B's observations for -first- and -second-. Now the size contributions from A and B are each twice as large, as each observation now carries 4 variables, each of width 8 bits. The total size is now 640 bits (= 4 * 160 bits).

                Code:
                . desc, detail
                
                Contains data
                  obs:            20                          
                 vars:             4                          
                width:            32                          
                ------------------------------------------------------------
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------
                first           double  %10.0g                
                second          double  %10.0g                
                third           double  %10.0g                
                fourth          double  %10.0g                
                ------------------------------------------------------------
                (example code)
                Code:
                clear *
                
                tempfile a b
                
                /* Case 1 */
                set obs 10
                gen double first = runiform()
                gen double second = runiform()
                desc, detail
                save `a', replace
                
                clear
                set obs 10
                gen double first = runiform()
                gen double second = runiform()
                desc, detail
                save `b', replace
                
                clear
                append using `a' `b'
                desc, detail
                
                /* Case 2 */
                clear
                set obs 10
                gen double first = runiform()
                gen double second = runiform()
                desc, detail
                save `a', replace
                
                clear
                set obs 10
                gen double third = runiform()
                gen double fourth = runiform()
                desc, detail
                save `b', replace
                
                clear
                append using `a' `b'
                desc, detail



                • #9
                  Dear Leonardo,
                  thanks for your thorough response. I learnt a lot from it.

                  My 72 segments all contain the same 9 variables. It is true that they are stored in different formats, and appending will keep the widest one to accommodate longer values (e.g., str15 in one chunk and maybe str40 in another, because the second chunk has an observation with 40 characters). However, once I append all 72 segments and save, the resulting appended database is 22 GB, still much smaller than my 132 GB of RAM.
                  How come all 132 GB of RAM are depleted during the appending process?

                  Cordially,
                  Edoardo



                  • #10
                    Note that in #8, I said bits but I meant to write bytes. The point still stands though.

                    Re #9, I can't say if -append- is really the cause, or why it does what it does internally. I am certain that -append- doesn't compress or check for the smallest possible storage type at any time, only expanding the storage type as required. This makes me think that your string variables may not be made into the smallest storage type prior to appending, and this causes some unnecessary inflation. (Perhaps you have strings stored as strL which don't need to be?) This is all speculation at this point.

                    If you wanted to be sure, and you really care about this optimization because you might be doing it regularly and not just once, you could either open each dataset, -compress- it, and then resave it, ensuring that you are using as little memory as possible; or you could adapt Clyde's code and run -compress- on the product after, say, every 10th dataset. Whether this level of investigation and "optimization" is worth the trade-off of diminishing returns is really up to you.
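
                    For concreteness, a minimal sketch of the first suggestion, reusing the "segment1" layout from #1 (adjust the paths to your setup):

                    Code:
                    * compress and resave each segment in place so every variable uses its
                    * smallest storage type, then append the already-minimized files
                    clear
                    cd "segment1"
                    local file2append: dir . files "*.dta"
                    foreach f of local file2append {
                        use "`f'", clear
                        compress
                        save "`f'", replace
                    }
                    clear
                    append using `file2append'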



                    • #11
                      Hello everyone!
                      I used the code mentioned on this page for appending multiple dta files from the years 1998 to 2021 stored in a global folder. The following is my code.

                      Code:
                      global datafolder `"C:/Users/..."'
                      local files: dir "$datafolder" files "*.dta"
                      foreach f of local files {
                          append using "$datafolder/`f'"
                      }
                      save "$datafolder/data_2021_1998", replace


                      The code runs fine, but the year 1998 gets appended twice. I have tried my best to understand what is happening.

                      Can anyone help me in this regard?

                      Thanks in advance!



                      • #12
                        I can think of two explanations for this.

                        1. The data for year 1998 is already in active memory when you run this code. This could be fixed by -clear-ing before running the loop--the loop should begin with no data set in memory (see the sketch after this list).
                        2. There are two files in that directory which, though they have different names, contain the 1998 data.
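
                        A minimal sketch of the loop from #11 with the fix for explanation 1 applied (paths as in #11):

                        Code:
                        clear    // start the loop with no data in memory
                        global datafolder `"C:/Users/..."'
                        local files: dir "$datafolder" files "*.dta"
                        foreach f of local files {
                            append using "$datafolder/`f'"
                        }
                        save "$datafolder/data_2021_1998", replace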



                        • #13
                          Thanks a lot. I found out that your first explanation was the reason behind the issue I faced. I did not put clear before running the loop. The code for cleaning the Excel file of 1998 and storing it as a dta file was just before the loop code.

                          Thank you once again.
                          Last edited by Chinmay Korgaonkar; 27 Sep 2023, 17:02.



                          • #14
                            Hi,

                            I have a similar problem. I have one Stata file for each US county, named alphabetically, i = 1/3010. How can I append them using a loop? I have separate year folders for the Stata files, so I need to change the directory on every iteration of the loop.
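
                            A minimal sketch of one way this could be done, assuming one subfolder per year under a common working directory, with one .dta file per county inside each; the year range and output file name below are only illustrative:

                            Code:
                            * illustrative layout: ./1998/*.dta, ./1999/*.dta, ..., one file per county
                            clear
                            forvalues y = 1998/2021 {
                                local countyfiles : dir "`y'" files "*.dta"
                                foreach f of local countyfiles {
                                    append using "`y'/`f'"
                                }
                            }
                            save all_counties_all_years, replace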

