Hi Statalisters,
I am working with a fairly large dataset (160GB) that takes the form of roughly 1,500 individual .csv-files. Each one of those .csv-files contains a variable that encodes an identifier which takes on one of roughly 40,000 unique values. So, in each .csv-file each identification code from a subset (or occasionally even the full set) of identifiers shows up multiple times.
My objective is to rearrange the data so that I end up with 40,000 .dta-files, each one containing the entirety of the data associated with the corresponding identifier. There is no need to preserve the information contained in the initial "grouping" into 1,500 .csv-files.
Currently my plan is to:
- import the first one of my 1,500 .csv-files
- define a local macro containing the - levelsof - of the identifier variable in this first .csv-file
- loop through the ids in this local macro by (1) clearing everything (2) re-importing the first .csv (3) dropping all observations where the identifier does not match the current identifier in my loop (4) saving the result in a .dta-file that contains the identifier in its name
- import the second of my 1,500 .csv-files
- define a local macro containing the - levelsof - of the identifier variable in this second .csv-file
- loop through the ids in this local macro by (1) clearing everything (2) re-importing the second .csv (3) dropping all observations where the identifier does not match the current identifier in my loop (4) appending the existing .dta-file with the identifier in its name (5) replacing said .dta-file with the result
- continue this way
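In code, I imagine the plan would look roughly like the sketch below. It is untested, and several things in it are placeholders of mine: the variable name id, the output names id_*.dta, the assumption that the identifier is numeric, and the assumption that all .csv-files sit in the current working directory.

```stata
* Rough sketch of the plan above (untested).
* Assumptions: identifier variable is called "id" and is numeric;
* all .csv-files are in the working directory;
* no id_*.dta files exist before the first run.

local csvfiles : dir "." files "*.csv"

foreach f of local csvfiles {
    * import the file once to collect the identifiers it contains
    import delimited using "`f'", clear
    quietly levelsof id, local(ids)

    foreach i of local ids {
        * re-import the same file and keep only the current identifier
        import delimited using "`f'", clear
        keep if id == `i'

        * append the existing per-identifier file, if there is one already
        capture confirm file "id_`i'.dta"
        if _rc == 0 {
            append using "id_`i'.dta"
        }

        * save (or replace) the per-identifier file
        save "id_`i'.dta", replace
    }
}
```

If the identifier turns out to be a string, the condition would presumably have to be written with quotes around the macro, and I realize that - append - may complain if a variable is imported as a string from one .csv-file and as numeric from another.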
Since my objective seems fairly straightforward and I am completely new to Stata, I am hopeful that I am overlooking a more efficient way. I'd very much appreciate any advice.
Thanks!