Is it possible to append several .csv files at the same time in stata?

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#1

Is it possible to append several .csv files at the same time in stata?

06 Sep 2023, 10:01

Hi everyone,

I just have a quick question. I would like to append several csv files (approx. 30 files. Those files have the same variables) on a dta file for later use in stata. My question is:
Do I have to append one-by-one all the csv files, or is there a better and less cumbersome way to do this please?

Thank you in advance.

Michael
Tags: None
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#2

06 Sep 2023, 10:34

Essentially you have two options. Push all your data from 30 csv files into a single csv file (omitting the repeated header). This is not recommended and is a pain.

The second option is import and append in a loop, it's not really cumbersome at all. The details will vary according to your dataset names, etc but you can adapt the code. It would be helpful if the files were named sequentially, but not essential. Here's a rough outline of the loop you can use. Obviously I have omitted details about importing and filename retrieval since those have not been provided.

Code:

mkf Data tempfile afile forvalues i = 1/30 { import delimited using ..... , clear // you need to fill this bit in appropriately save `afile', replace frame Data: append using `afile', nonotes } cwf Data
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#3

06 Sep 2023, 10:36

If they were .dta files, you could append all 30 in a single command. But with .csv files, no.

One option is to first write a loop to import all the files and save each as a .dta file, and then -append- them all at once.
The second option is to write a loop that imports the files and appends them as you go.

I strongly recommend the first approach. The reason is that although you believe all the files have the same variables, this is probably not true. Large collections of files typically have some incompatibilities that defeat the process of appending them. This is the case even when the files are all provided by the same source, and the source is a data curator with a reputation for high quality. There may be subtle variations in variable names due to changes in case, or spelling errors, etc. More commonly, a variable that is numeric in some of the files may be non-numeric in others. If you try to append these all together, either Stata will fail to notice the incompatibility (different names for what you think is one and the same variable), resulting in your one variable's information being split over two (or more) different variables, or it will notice the incompatibility (numeric vs non-numeric) and will abort execution.

So I recommend importing each of the 30 files separately and then running the -precombine- command (by Mark Chatfield, from SSC). -precombine- will alert you to any important incompatibilities among the data sets. Then you can fix those problems, and, finally, append all of the fixed .dta files using a single -append- command: -append using list_of_all_the_filenames-.

Added: Crossed with #2, which illustrates the second option, the one I do not recommend.

Last edited by Clyde Schechter; 06 Sep 2023, 10:39.
1 like
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#4

06 Sep 2023, 10:44

I agree with Clyde’s comments. I took OP at face value with they suggested the csv files have the same structure. Option #1 is more defensive, which is never a bad idea.
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#5

07 Sep 2023, 00:19

Dear Leonardo Guizzetti, Dear Clyde Schechter,

Thank you so much for your help.

Best,

Michael
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#6

07 Sep 2023, 03:51

Hi Clyde Schechter,

I have been successful in create individual stata datasets from mine initial csv files through a loop. Thank you very much.

But I am not sure how to use precombine. I have read the paper "precombine: A command to examine n>=2 datasets before combining" (2015), but the examples provided are rather simple, and contains only two datasets.

How could I use the command precombine with approximately 30 .dta files (from January 2021 to July 2023) please?

Should I do the following?

Code:

precombine export_telemedida_202101 export_telemedida_202102 export_telemedida_202103 ... export_telemedida_202307, describe(variabledetaillist) uniquevars

Then, how could I know if I can go ahead with the -append process?

If I can go with the -append process, the code is the following, right?

Code:

append export_telemedida_202101 export_telemedida_202102 export_telemedida_202103 ... export_telemedida_202307

Is there a better looking form to do that please (through a loop, for example)? Or is the only existing solution?

Thank you again for the help provided.
Best,

Michael
Comment

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

07 Sep 2023, 04:22

Actually,

I have written this loop, but not sure it's correct, as it takes iteratively each "filename.dta" :

Code:

local filepath = "`c(pwd)'" 
local files : dir "`filepath'" files "*.csv"

// Display list of files to import data from
di `"`files'"' 



foreach x of local files {
    // Display file name
    di "`x'" 

    //Import each file and generate id var (filename without ".csv")
    import delimited "`x'", delimiter(",")  case(preserve) clear 
    // <-- ! Change delimiter() if vars are separated by ";" or "tab"
    
    local outfile = subinstr("`x'", ".csv", "", .)
    
    save "../stata_raw_data/`outfile'", replace

    /* 
    *****************************************************************
    Use -precombine- to check files before appending them 
    (if variable names match, etc.)
    *****************************************************************
    */
    cd "../stata_raw_data"
    precombine `outfile'.dta, clear uniquevars describe(type)
    
}

Could you give me some feedback about it please? Thank you so much.

Michael

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#8

07 Sep 2023, 10:19

Here's how I would do this:

Code:

local files: dir "." files "*.csv" foreach f of local files { display `"`f'"' import delimited `"`x'"', delimiter(",") case(preserve) clear local outfile: subinstr local f ".csv" "", all save "../stata_raw_data/`outfile'", replace } local files: dir "../stata_raw_data" files "export_telemedia_*.dta" precombine `files', uniquevars

I should add that you should definitely be logging your Stata output when you run -precombine-. (Well, really, you should always log your output, but if that is not your normal habit, be sure to do it now because you will need to refer to the output of -precombine- while you fix any problems -precombine- uncovers.)

The -precombine- output gives you both the good news and the bad news. Since you expect the files to all have the same variables, anything -precombine- identifies as not being in all of the files is a problem and needs your attention. Next, direct your attention to the part of the output that begins "When merging/appending, expect an error/warning message for the following variable(s) being a string in one data set and numeric in another:" If there is no such section, then there are no such problems, and nothing more for you to do. But you may instead find such a section and it will show you the variables and which type they are in each data set. You then need to decide on a common data type that will work for you and go about changing the data to that type in the non-conforming data sets. (Use -destring- to convert string variables that look like numbers to human eyes into numeric variables. Use -tostring- to convert numeric variables to strings that look like those numbers to human eyes. While -encode- is an important data management command in Stata, it has no role here--don't even think about using it. Even if you have categorical variables that need to be converted to numeric variables with value labels attached, that should not be done until after all of the data sets have been combined.)

-precombine- is also able to detect problems with value labels that can cause truly vexatious bugs--but as these data sets are coming directly from -import delimited-, there won't be any value labels, so you don't have to worry about that.

Once you have fixed up whatever problems you find:

Code:

clear cd "../stata_raw_data" local files: dir "." files "export_telemedia_*.dta" append using `files' save big_combined_file, replace

Note: In this code, I am assuming that there are no files with names matching export_telemedia_*.dta sitting around in the stata_raw_data directory, and that all of the files you are creating have a name that matches export_telemedia_*.dta. When you are dealing with a large group of files that you are going to handle collectively, it is easier if you collect their names into a local macro. If export_telemedia_*.dta doesn't cover them all, or includes extraneous files, then you need some other way to do this. Perhaps there is some other wildcard or combination of wildcards that will do it. Or perhaps some "surgery" can be performed on local macro files as created in the code shown by removing or adding some filenames with separate commands modifying local macro files.
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#9

08 Sep 2023, 00:45

Hi Clyde Schechter,

Thank you so much for your incredible explanation. All is clear now.

Best,

Michael
Comment

Announcement

Is it possible to append several .csv files at the same time in stata?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment