Reading and merging files from multiple directories (a question on compound double quotes)

Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#1

Reading and merging files from multiple directories (a question on compound double quotes)

27 Jun 2014, 16:11

Dear all,

I am asking this question because I am trying to write a program that connects data stored in various directories to a dataset. I suspect that the reason why the program doesn't yet work relates to my (poor) understanding of double compound quotes in Stata.

The program should do something quite simple. The first dataset contains two columns, observation IDs and years. For each observationID-year pair there is other information stored in other files located in a certain directory. I would like to join each observationID-year once for each file in that directory and append everything in a kind of long file. Let me explain you what I mean with an example.

Code:

use dataset.dta // this is my focal dataset preserve foreach y of numlist 2006/2012 { local allfiles`y': dir "G:\folder\"`y'"\folder\" files "*.dta" // I use numbers to reference folders; here is where -set trace- starts getting errors such as 'directory G:\.... not found' restore, preserve //I take back my dataset and use a chunk of it that corresponds to a certain year keep if year==`y' tempfile "portion`y'" // I store the chunk in a tempfile save `"portion`y'"' foreach f in `"allfiles`y'"' { joinby ID year using `f', unmatched(master) // Join the chunk of my dataset to each file in the folder and I append the result drop _merge capture confirm file finaljoin.dta if _rc!=0 { save finaljoin.dta } else { append using finaljoin.dta save using finaljoin.dta, replace clear use "portion`y'" //I reuse the same chunk of my dataset to join in the next round of the loop to another file located within the same folder } } }

I hope what I am trying to do is clear. If it's not I will provide some more information.

Thank you for your help.
Riccardo

Last edited by Riccardo Valboni; 27 Jun 2014, 16:16.
Tags: None
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#2

27 Jun 2014, 16:22

You shouldn't need to put quotes around the macros that are embedded within your path. You will, however, want to make sure that you're using a slash (/) rather than a backslash (\) anytime it appears before a macro.

Try:
local allfiles`y': dir "G:/folder/`y'/folder/" files "*.dta"
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#3

27 Jun 2014, 16:27

Also, for your foreach loop I don't think you need quotes, but you do need to make sure that you're referencing a macro. What you have now is valid compound quotes but Stata doesn't know you're looking to substitute in the contents of a macro.
You want:
foreach f in `allfiles`y'' {
Comment
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#4

27 Jun 2014, 16:48

Try:
local allfiles`y': dir "G:/folder/`y'/folder/" files "*.dta"

Hi Sarah, thanks a lot. I spent 3 hours trying to figure out why this didn't work.

...Stata doesn't know you're looking to substitute in the contents of a macro.

What do you mean by that?

You want:
foreach f in `allfiles`y'' {

Ok. It's the first time I find a nested 'macro quote'. I still struggle to understand the general rule...When should one use the 'macro quote" (`abc'), when the compound quote ("`abc'") and when the nested quotes you just showed me (`abc`y'')?

Many thanks again for your help!
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#5

27 Jun 2014, 17:02

A compound quote is used in exactly the same instances when you'd use regular quotation marks. The only different is that using a compound quotation mark lets you include quotation marks within your quoted string. For example.

Code:

display `"Sally said, "this is an example where you need compound quotes because it contains quoted text.""'

I find that I very rarely need compound quotes.

In your case you have to nest the macros because you used a macro when creating the second macro.
You create the local macro allfiles`y' in a loop.
So you end up with a series of macros named: allfiles2006, allfiles2007, allfiles2008 etc.
You need to let Stata know that those are macros by referencing them by their full name. So you can reference `allfiles2006'. Or you can reference `allfiles`y''.
You need to nest the macros so that Stata knows to lookfor for the macro `allfiles`y'' after it's first expanded `y'.
If you use compound quotes `"allfiles`y'"' the first time through the loop your foreach line will expand to

Code:

foreach f in `"allfiles2006"' {

Stata doesn't know that you're asking for a macro so instead of looping through the file names in `allfiles2006' Stata will instead go through the foreach loop exactly once and substitute in the string "allfiles2006" every time `f' appears, which isn't what you want.
Comment
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#6

28 Jun 2014, 01:18

Hi Sarah, crystal clear. Thanks a lot for the explanation.
Comment
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#7

28 Jun 2014, 04:04

Another question on tempfiles. I found this old thread http://www.stata.com/statalist/archi.../msg00915.html where to -save- the tempfile both macro and normal quotation marks are used. To -use- it instead, only macro quotes were used. Can someone explain why?
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#8

28 Jun 2014, 06:16

Originally posted by riccardovalboni View Post

I found this old thread http://www.stata.com/statalist/archi.../msg00915.html where to -save- the tempfile both macro and normal quotation marks are used. To -use- it instead, only macro quotes were used. Can someone explain why?

There is nothing different about use versus save in this regard. Before explaining that, however, let me suggest that you not think of ` and ' as "macro quotes," but rather characters which trigger macro expansion. There are only two types of string quotes in Stata: standard double quotes and compound double quotes (`" and "', which, as Sarah explained, may be used when your string contains an embedded double quote and for nested constructions).

When referring to a file (e.g., with use or save), you can get by without putting the name (including the path, if necessary) in quotes if you're certain that it doesn't contain any spaces. That is what you're doing when you do this
`myfile'
where the macro myfile contains the filename or complete file path. However, this is dangerous when you obtain the filename from tempfile, since these paths may have spaces in them (especially under Windows). Thus, you should always enclose these filenames in quotes, e.g.,
tempfile myfile
"`myfile'"
In fact, since users can set the location of their tempfiles arbitrarily, I always use compound double quotes in production code when referring to files obained via tempfile
`"`myfile'"'
which protects against the possibility that the user has set his or her temporary directory to a path containing a double quote (an admittedly unlikely possibility).

Last edited by Phil Schumm; 28 Jun 2014, 06:26.
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#9

28 Jun 2014, 06:42

BTW, I just looked at your original post, and it seems likely that (1) it is not doing what you want it to, and/or (2) what you're doing can be done more efficiently. First, however, you should explain what it is you're trying to do. That is, in words, explain exactly what is in each file and what you want the result to look like. Then, people can make some specific suggestions for how to proceed.
Comment
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#10

28 Jun 2014, 07:21

Hi Phil, thank you for the clarification. It's always scary when someone tells you that your routine is not doing what you think. But on the other hand it is a good way to learn. Let me explain what I want better so that it can be assessed whether what I did is correct.

Suppose I have a file, let's call it (A), where I have a list of individuals and some years in which I want to observe them. Imagine that these individuals are a subset of an entire population. On the other hand, I have a multitude of files that contain monthly recorded information about all individuals in the population for all years. These files are divided into folders named after the year to which information refers (e.g. H:/2006/data/). In each of these folders there are 12 files (1 per month) containing the -using- data for the -joinby-. What my routine does is that it takes file (A) and selects all people I want to observe in a certain year (this is the tempfile portion `y'). Then, it opens a folder corresponding to the same year and joins the selected individuals first to the January file. Then it -append-s the result of the -joinby- to the file finaljoin.dta. Next, the procedure takes again those individuals (portion`y') and joins them to the February file. Again, it -append-s the result to finaljoin.dta. This goes on until month files are terminated. At that point, the procedure moves to the following year. The loop terminates when years in the numlist are finished.

I hope it's clear. Please tell if it is not.
Thank you so much for your help.
Riccardo

PS I think I fixed the problems with the quotes because the program is running without reporting any error. However, if you see some mistakes please notify that so that I -break- .and correct

Last edited by Riccardo Valboni; 28 Jun 2014, 07:25.
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#11

28 Jun 2014, 09:37

Thanks for the clarification; let me make sure I have this right. Your first file (A) contains only id and year, and its only purpose is to determine for each year, which individuals will appear for that year in the final file (i.e., those who appear in A for that year). Your other files (1 per month organized in subdirectories by year) also contain both id and year, plus additional variables. You want one final long-format file in which each observation represents one month for a given individual, containing only data for those individual/year combinations appearing in A. Is that right?

Also, 3 more quick questions:
Do your monthly files contain a variable indicating which month they refer to, or do you need to add that when combining?

Can you assume that the monthly files all have the same variables stored in the same way, or at least, that there are no inconsistencies among them (i.e., a variable that appears in two or more files with the same name but contains different information, is mapped to a different value label, or has a different storage type in the two files)?

How big are these files (i.e., approximate number of observations in A and number of variables in the monthly files)?
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#12

28 Jun 2014, 11:05

Assuming I understand the problem correctly, and that the answers to 1–3 are "yes," "yes" and "not too big," then I might do it like this:

Code:

1 clear 2 tempfile result 3 save `"`result'"', emptyok 4 5 forv y = 2006/2012 { 6 clear 7 loc flist: dir "G:/<path component>/`y'/<path component>" files "*.dta" 8 append using `flist' 9 merge ID year using dataset, keep(match) nogen 10 append using `"`result'"' 11 save `"`result'"', replace 12 }

In general, note the use of merge instead of joinby for this problem. Note also that the approach above involves reading all of the data for a given year into memory, which might not be practical or efficient if the dataset is very large (see below). Finally, if you want to make sure that data are available for all of the person/year combinations in A, you could replace line 9 with
merge ID year using dataset, assert(master match) keep(match) nogen
to test this condition, or, if this is not true but you want to retain empty records with missing values in such cases, you could replace it with
merge ID year using dataset, keep(match using) nogen
If the dataset is so large that this approach isn't possible, then you could replace lines 8–11 with

Code:

foreach file of local flist { use dataset if year==`y', clear merge 1:1 ID year using `file', keep(match) append using `"`result'"' save `"`result'"', replace }

(line 6 could then be omitted). The issues regarding which options you pass to the merge command described above apply here as well.

Last edited by Phil Schumm; 28 Jun 2014, 11:08.
Comment
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#13

29 Jun 2014, 04:51

Originally posted by Phil Schumm View Post

Thanks for the clarification; let me make sure I have this right. Your first file (A) contains only id and year, and its only purpose is to determine for each year, which individuals will appear for that year in the final file (i.e., those who appear in A for that year). Your other files (1 per month organized in subdirectories by year) also contain both id and year, plus additional variables. You want one final long-format file in which each observation represents one month for a given individual, containing only data for those individual/year combinations appearing in A. Is that right?

Also, 3 more quick questions:
Do your monthly files contain a variable indicating which month they refer to, or do you need to add that when combining?

Can you assume that the monthly files all have the same variables stored in the same way, or at least, that there are no inconsistencies among them (i.e., a variable that appears in two or more files with the same name but contains different information, is mapped to a different value label, or has a different storage type in the two files)?

How big are these files (i.e., approximate number of observations in A and number of variables in the monthly files)?

Hi Phil, thanks a lot for the extensive explanation. It taught me a couple of things I didn't know.

I here answer your questions because to one of them the answer is 'no' - and this creates a problem that I am now trying to resolve.

First, yes, each of these files has an indication of which month the data refers to.
Second, the answer is no. Variables all seem to be called in the same way even though I am not completely sure about this because -using- files are about 120 and it's hard to scrutinize each of them by hand. The problem that I am having now pops up when I subset the -using- files. The program goes on for a couple of years and then the -keep- function fails to find one variable and everything stops. Is there a way to avoid this to happen?
Third, -using- files are pretty large. About 10mln observations each per 50 vars. Each file is approximately 1.5GB.

Let me ask you another question on one thing you pointed out. Why using merge instead of joinby? What is the benefit?

Many, many thanks for your help.
Riccardo
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#14

29 Jun 2014, 06:46

Originally posted by riccardovalboni View Post

I here answer your questions because to one of them the answer is 'no' - and this creates a problem that I am now trying to resolve.

Ok, your data files are at least large enough that you may not want to read all of the files (in their entirety) for a given year into memory. Also, it seems that some files have variables that other files do not, and moreover, that you only want to retain a subset of these variables in your final dataset. To do this, you can modify what I gave you before like this:

Code:

1 clear 2 tempfile result 3 save `"`result'"', emptyok 4 5 loc varstokeep <varlist> 6 7 forv y = 2006/2012 { 8 loc flist: dir "G:/<path component>/`y'/<path component>" files "*.dta" 9 foreach file of local flist { 10 use dataset if year==`y', clear 11 qui des using `file', varlist 12 loc fvars `r(varlist)' 13 loc fvars: list varstokeep & fvars 14 merge 1:1 ID year using `file', keep(match) keepusing(`fvars') 15 append using `"`result'"' 16 save `"`result'"', replace 17 } 18 }

where you replace <varlist> with the list of variables (in unabbreviated form) you want to retain from the files (make sure you include your month indicator in this list; you may also include ID and year, but you don't have to). Lines 11–13 construct a list of variables for use in the keepusing() option that is the intersection of the variables you want and the variables available in the file, so that the error you reported in your other post won't occur.

Originally posted by riccardovalboni View Post

Let me ask you another question on one thing you pointed out. Why using merge instead of joinby?

The command joinby generates all pairwise combinations of observations in both files, within the varlist you specify (i.e., the Cartesian product). You don't want (or need) that here. In contrast, merge simply matches each observation from one dataset with (at least) one corresponding observation from the other. That's what you want here—you want to grab all of the records from your data files that correspond to an individual in A. Please consult the documentation on merge and joinby if you still have questions.

Last edited by Phil Schumm; 29 Jun 2014, 06:51.
1 like
Comment
Riccardo Valboni

Join Date: Jun 2014

Posts: 123
#15

29 Jun 2014, 07:22

Hi Phil, thank you so much. I couldn't imagine a more complete answer.
Comment

Announcement

Reading and merging files from multiple directories (a question on compound double quotes)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment