  • Rearranging data from a set of files into a different set of files via identifier variable

    Hi Statalisters,

    I am working with a fairly large dataset (160GB) that takes the form of roughly 1,500 individual .csv files. Each of those .csv files contains an identifier variable that takes on one of roughly 40,000 unique values, and within each .csv file every identifier from a subset (or occasionally even the full set) of those 40,000 values shows up multiple times.
    My objective is to rearrange the data so that I end up with 40,000 .dta files, each one containing the entirety of the data associated with the corresponding identifier. There is no need to preserve the information contained in the initial "grouping" into 1,500 .csv files.

    Currently my plan is to:
    • import the first of my 1,500 .csv files
    • use -levelsof- to store the identifier values appearing in this first .csv file in a local macro
    • loop over the ids in this local macro by (1) clearing everything, (2) re-importing the first .csv, (3) dropping all observations whose identifier does not coincide with the current identifier in the loop, and (4) saving the result in a .dta file that carries the identifier in its name
    • import the second of my 1,500 .csv files
    • use -levelsof- to store the identifier values appearing in this second .csv file in a local macro
    • loop over the ids in this local macro by (1) clearing everything, (2) re-importing the second .csv, (3) dropping all observations whose identifier does not coincide with the current identifier in the loop, (4) appending the result to the .dta file that carries the identifier in its name, and (5) replacing said .dta file
    • continue this way
    If I were to proceed in this fashion (roughly as sketched below), I'd loop through 1,500 files, where each iteration would feature a nested loop entailing close to 40,000 import, drop, append, and save operations. I'd guess that this is bound to take ages.
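
    In pseudo-code terms, the plan would look roughly like the following sketch (the file names and the identifier name id_code are placeholders, and the identifier is assumed to be numeric):
    Code:
    forvalues f = 1/1500 {
        import delimited using "file_`f'.csv", clear
        levelsof id_code, local(ids)
        foreach id of local ids {
            import delimited using "file_`f'.csv", clear
            keep if id_code == `id'
            // append to this id's file if it already exists, then (re)save it
            capture confirm file "`id'.dta"
            if !_rc append using "`id'.dta"
            save "`id'.dta", replace
        }
    }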

    Since my objective seems fairly straightforward and since I am completely new to Stata, however, I am somewhat hopeful that I am overlooking a more efficacious way. I'd very much appreciate any advice.

    Thanks!

  • #2
    In addition to it taking ages, you may run into problems trying to use -levelsof-. With 40,000 different identifiers, depending on how long they are, you may exceed the length limit for the local macro returned by -levelsof-.

    I would take a different approach. First, I would import those 1,500 .csv files and append them all together into a single big file. Then I would use -runby- to write out the 40,000 separate new files. This avoids reading the same data over and over again. It would look something like this:
    Code:
    local source_files : dir "." files "*.csv"    // or however you build the list of the 1,500 .csv file names
    frame create inputs
    
    foreach sf of local source_files {
        frame inputs {
            import delimited using "`sf'", clear
            // cleaning code goes here
        }
        frameappend inputs    // append the just-imported data in frame inputs to the default frame
    }
    
    //  THE DEFAULT FRAME NOW CONTAINS THE DATA FROM ALL 1,500 CSV FILES COMBINED
    
    
    // maybe some extra code here to recast the variables in the big
    // appended file to whatever type you need them to be -- this may
    // require checking for values that are not admissible in numeric
    // variables but are nevertheless present.
    
    capture program drop one_identifier
    program define one_identifier
        // -runby- calls this program once per by-group, with only that
        // group's observations in memory
        local id = identifier_variable[1]    // all observations in the group share this value
        save `"`id'"', replace               // write one .dta file named after the identifier
        exit
    end
    
    runby one_identifier, by(identifier_variable) status
    -frameappend- is written by Jeremy Freese, with help from Daniel Fernandes and Roger Newson. -runby- is written by Robert Picard and me. Both are available from SSC.
    Note: This code requires version 16 or later because it relies crucially on frames. While a -tempfile- based variant of this code could produce the same results, it would be orders of magnitude slower because it would be constantly thrashing the disk.

    The execution time for this will be much faster than the approach you described. However, it is still going to take a very long time just given the huge volume of files and data to be processed. When the code reaches the -runby- command, you will get a periodic progress report showing how many id's have been processed and an estimate of the time remaining to completion. There will be no such updates during the first part of the code that reads and appends the .csv files--you will just have to be patient.

    Finally, there may well be serious problems putting all these files together. In my experience, it would be miraculous if all 1,500 files are truly compatible for being combined, and harmonizing a mass of this many files is not going to be easy. You might consider using the -stringcols(_all)- option on the -import delimited- command. By bringing everything in as strings, you avoid the appending process choking on a variable that came in as numeric in some files and string in others. Then, after the appending is done but before you save the files, you can check all the variables that are supposed to be numeric for non-numeric content, deal with that, and then -destring- them. Similarly, you might want to -compress- some variables to smaller storage types.

    If you cannot anticipate all of the problems that need to be cleaned up, then it is probably worth splitting this process into two stages. The first stage reads and appends all the .csv files and then saves the whole thing as one giant .dta file. You can then take your time exploring that .dta file to figure out what you need to do to finish cleaning the data. Once it is cleaned up, read it back into Stata and finish the job with the -runby- part. This adds one very long file write and one very long file read to the process, but it would be very surprising if you know this mass of data well enough to anticipate all of the cleaning, stick the cleaning code into the appending process, and have it work properly on the first pass.
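
    A minimal sketch of that two-stage idea, reusing the frame-based loop from above (the file names and the variable name should_be_numeric_var are illustrative, and the cleaning step is only indicated):
    Code:
    * Stage 1: bring everything in as strings and save one big raw file.
    local source_files : dir "." files "*.csv"
    frame create inputs
    foreach sf of local source_files {
        frame inputs {
            import delimited using "`sf'", stringcols(_all) clear
        }
        frameappend inputs
    }
    save "all_data_raw", replace
    
    * Stage 2, after exploring the saved file: repair inadmissible values in
    * variables that should be numeric, then convert and shrink storage types.
    use "all_data_raw", clear
    destring should_be_numeric_var, replace
    compress
    save "all_data_clean", replace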

    Good luck!

    Last edited by Clyde Schechter; 29 Mar 2023, 16:31.

    • #3
      Many, many thanks for getting back to me so quickly. Your suggestions have been incredibly helpful.

      As part of my preliminary data-cleaning efforts, I looped over my original 1,500 .csv files to generate a corresponding set of cleaned and perfectly compatible .dta files. During this step I was also able to collapse the data along a couple of dimensions (where granularity is presumably not going to benefit my analysis), leaving me with a total of roughly 93GB. Appending those .dta files via -frameappend- did, however, turn out to be a little slower than expected (at a decreasing rate of about 2 minutes per file for the first 50 files). As some sort of benchmark, I also ran
      Code:
      local source_files : dir "." files "*.dta"
      
      append using `source_files'
      
      save "data_single_file", replace
      which, somewhat surprisingly, yielded a single 93GB .dta file in no more than 38 minutes.

      Now, with this file in hand, turning to the exercise of writing out the 40,000 separate files:
      Based on a batch of simulated data, I have, in fact, been able to convince myself that -runby- is orders of magnitude faster than anything else I have tried so far. However, when running
      Code:
      ssc install runby
      
      use "data_single_file"
      
      capture program drop one_identifier
      program define one_identifier
          local id = id_code[1]
          save `"instance_`id'"', replace
          exit
      end
      
      runby one_identifier, by(id_code) status
      with the actual data, the issue would seem to be that I immediately run out of memory. Is there any sort of back-of-the-envelope calculation I could do to figure out a ballpark for the memory I'd need to request for things to go through? I am currently working with 170GB.

      Thanks so much!

      • #4
        As part of my preliminary data-cleaning efforts, I looped over my original 1,500 .csv files to generate a corresponding set of cleaned and perfectly compatible .dta files. During this step I was also able to collapse the data along a couple of dimensions (where granularity is presumably not going to benefit my analysis), leaving me with a total of roughly 93GB. Appending those .dta files via -frameappend- did, however, turn out to be a little slower than expected (at a decreasing rate of about 2 minutes per file for the first 50 files). As some sort of benchmark, I also ran
        Code:

        local source_files : dir "." files "*.dta"
        append using `source_files'
        save "data_single_file", replace
        which, somewhat surprisingly, yielded a single 93GB .dta file in no more than 38 minutes.
        This does not surprise me. If I had known that you had already created the 1,500 separate .dta files, I would never have recommended doing this with -frameappend-. A direct -append- of all 1,500 .dta files is much faster. I only made that recommendation because I thought you had only the .csv files to work with, and I expected that the frames-based approach to getting them all into one big .dta file would be faster than writing out 1,500 .dta files to disk. In other words, you had already done the really slow part--whereas I assumed it still remained ahead of you.

        However, when running
        Code:

        ssc install runby
        
        use "data_single_file"
        
        capture program drop one_identifier
        program define one_identifier
            local id = id_code[1]
            save `"instance_`id'"', replace
            exit
        end
        
        runby one_identifier, by(id_code) status
        with the actual data, the issue would seem to be that I immediately run out of memory. Is there any sort of back-of-the-envelope calculation I could do to figure out a ballpark for the memory I'd need to request for things to go through? I am currently working with 170GB.
        Stata's automatic memory management system is almost surely going to be better for you than any direct -set memory- commands you might try. When memory gets scarce, Stata requests more allocations from your OS. If you are running out of memory, then it either means that you literally do not have enough (physical) RAM to accommodate this task, or your OS is refusing to provide it (perhaps due to the requirements of the OS itself and other programs that are running concurrently, perhaps in the background.)
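
        As a rough first check, Stata's standard reporting commands show what is currently sitting in memory and what has been allocated (this is only a starting point, not a prediction of -runby-'s peak usage):
        Code:
        describe, short    // size of the data set currently in memory
        memory             // report of memory used and allocated by Stata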

        I see two distinct scenarios that could be causing the memory problem here. -runby- creates a copy of the data in memory as a Mata matrix, and then works off of that matrix. If you are truly running out of memory right at the start, then what you need to do is break up data_single_file.dta into chunks corresponding to batches of id_code values and process them separately in -runby-. You can write a simple loop over the chunk files to do that.
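
        A sketch of that chunking idea (the number of batches and the file names are illustrative; it assumes the one_identifier program already defined, ideally with the -clear- discussed in the next paragraph):
        Code:
        * Number the ids 1..N, split them into batches, save one chunk file
        * per batch, then loop -runby- over the chunk files.
        use "data_single_file", clear
        egen long id_group = group(id_code)
        summarize id_group, meanonly
        local n_batches 6
        local batch_size = ceil(r(max)/`n_batches')
        save "data_single_file_grouped", replace
        
        forvalues b = 1/`n_batches' {
            local lo = (`b'-1)*`batch_size' + 1
            local hi = `b'*`batch_size'
            use if inrange(id_group, `lo', `hi') using "data_single_file_grouped", clear
            save "chunk_`b'", replace
        }
        
        forvalues b = 1/`n_batches' {
            use "chunk_`b'", clear
            runby one_identifier, by(id_code) status
        }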

        However, there is another possibility. The way I constructed the one_identifier program, the original data set is gradually re-built in memory (which, in your situation, serves no useful purpose), while the Mata matrix does not shrink in size. So if you are able to run for a while but then, in medias res, find yourself running out of memory, the simple solution is to add a -clear- command before -exit- in that program. That way, the original data set will not be re-constructed in memory as you proceed. You will then only need enough memory to hold the Mata matrix alongside the biggest single batch of observations sharing a value of id_code.
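
        Concretely, the adjusted program is the same as before, with only the -clear- added:
        Code:
        capture program drop one_identifier
        program define one_identifier
            local id = id_code[1]
            save `"instance_`id'"', replace
            clear    // discard this group's observations so the original data set is not rebuilt in memory
            exit
        end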

        Based on a batch of simulated data, I have, in fact, been able to convince myself that -runby- is orders of magnitude faster than anything else I have tried so far.
        That's precisely why Robert Picard and I created -runby-: to speed up processes that iterate over the values of a varlist in large data sets.

        • #5
          It seems that the first scenario you laid out was, in fact, the bottleneck in my exercise. Breaking the single data file up into roughly 30GB chunks works like a charm. Thank you so much for all your help; -runby- makes all the difference in the world!
