Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reading and merging files from multiple directories (a question on compound double quotes)

    Dear all,

    I am asking this question because I am trying to write a program that connects data stored in various directories to a dataset. I suspect that the reason why the program doesn't yet work relates to my (poor) understanding of double compound quotes in Stata.

    The program should do something quite simple. The first dataset contains two columns, observation IDs and years. For each observationID-year pair there is other information stored in other files located in a certain directory. I would like to join each observationID-year once for each file in that directory and append everything in a kind of long file. Let me explain you what I mean with an example.

    Code:
    use dataset.dta // this is my focal dataset
    preserve
    
    foreach y of numlist 2006/2012 {
    
    local allfiles`y': dir "G:\folder\"`y'"\folder\" files "*.dta" // I use numbers to reference folders;  here is where -set trace- starts getting errors such as 'directory G:\.... not found'
    restore, preserve //I take back my dataset and use a chunk of it that corresponds to a certain year
    keep if year==`y'
    tempfile "portion`y'" // I store the chunk in a tempfile
    save `"portion`y'"'
    
    foreach f in `"allfiles`y'"' {
    joinby ID year using `f', unmatched(master) // Join the chunk of my dataset to each file in the folder and I append the result
    drop _merge
    capture confirm file finaljoin.dta
    
    if _rc!=0 {
    
    save finaljoin.dta
    
    }
    
    else {
    
    append using finaljoin.dta
    save using finaljoin.dta, replace
    clear
    use "portion`y'" //I reuse the same chunk of my dataset to join in the next round of the loop to another file located within the same folder
    
    
    }
    
    }
    
    }
    I hope what I am trying to do is clear. If it's not I will provide some more information.

    Thank you for your help.
    Riccardo
    Last edited by Riccardo Valboni; 27 Jun 2014, 16:16.

  • #2
    You shouldn't need to put quotes around the macros that are embedded within your path. You will, however, want to make sure that you're using a slash (/) rather than a backslash (\) anytime it appears before a macro.

    Try:
    local allfiles`y': dir "G:/folder/`y'/folder/" files "*.dta"

    Comment


    • #3
      Also, for your foreach loop I don't think you need quotes, but you do need to make sure that you're referencing a macro. What you have now is valid compound quotes but Stata doesn't know you're looking to substitute in the contents of a macro.
      You want:
      foreach f in `allfiles`y'' {

      Comment


      • #4
        Try:
        local allfiles`y': dir "G:/folder/`y'/folder/" files "*.dta"
        Hi Sarah, thanks a lot. I spent 3 hours trying to figure out why this didn't work.

        ...Stata doesn't know you're looking to substitute in the contents of a macro.
        What do you mean by that?

        You want:
        foreach f in `allfiles`y'' {
        Ok. It's the first time I find a nested 'macro quote'. I still struggle to understand the general rule...When should one use the 'macro quote" (`abc'), when the compound quote ("`abc'") and when the nested quotes you just showed me (`abc`y'')?

        Many thanks again for your help!

        Comment


        • #5
          A compound quote is used in exactly the same instances when you'd use regular quotation marks. The only different is that using a compound quotation mark lets you include quotation marks within your quoted string. For example.
          Code:
          display `"Sally said, "this is an example where you need compound quotes because it contains quoted text.""'
          I find that I very rarely need compound quotes.

          In your case you have to nest the macros because you used a macro when creating the second macro.
          You create the local macro allfiles`y' in a loop.
          So you end up with a series of macros named: allfiles2006, allfiles2007, allfiles2008 etc.
          You need to let Stata know that those are macros by referencing them by their full name. So you can reference `allfiles2006'. Or you can reference `allfiles`y''.
          You need to nest the macros so that Stata knows to lookfor for the macro `allfiles`y'' after it's first expanded `y'.
          If you use compound quotes `"allfiles`y'"' the first time through the loop your foreach line will expand to
          Code:
          foreach f in `"allfiles2006"' {
          Stata doesn't know that you're asking for a macro so instead of looping through the file names in `allfiles2006' Stata will instead go through the foreach loop exactly once and substitute in the string "allfiles2006" every time `f' appears, which isn't what you want.

          Comment


          • #6
            Hi Sarah, crystal clear. Thanks a lot for the explanation.

            Comment


            • #7
              Another question on tempfiles. I found this old thread http://www.stata.com/statalist/archi.../msg00915.html where to -save- the tempfile both macro and normal quotation marks are used. To -use- it instead, only macro quotes were used. Can someone explain why?

              Comment


              • #8
                Originally posted by riccardovalboni View Post
                I found this old thread http://www.stata.com/statalist/archi.../msg00915.html where to -save- the tempfile both macro and normal quotation marks are used. To -use- it instead, only macro quotes were used. Can someone explain why?
                There is nothing different about use versus save in this regard. Before explaining that, however, let me suggest that you not think of ` and ' as "macro quotes," but rather characters which trigger macro expansion. There are only two types of string quotes in Stata: standard double quotes and compound double quotes (`" and "', which, as Sarah explained, may be used when your string contains an embedded double quote and for nested constructions).

                When referring to a file (e.g., with use or save), you can get by without putting the name (including the path, if necessary) in quotes if you're certain that it doesn't contain any spaces. That is what you're doing when you do this
                `myfile'
                where the macro myfile contains the filename or complete file path. However, this is dangerous when you obtain the filename from tempfile, since these paths may have spaces in them (especially under Windows). Thus, you should always enclose these filenames in quotes, e.g.,
                tempfile myfile
                "`myfile'"
                In fact, since users can set the location of their tempfiles arbitrarily, I always use compound double quotes in production code when referring to files obained via tempfile
                `"`myfile'"'
                which protects against the possibility that the user has set his or her temporary directory to a path containing a double quote (an admittedly unlikely possibility).
                Last edited by Phil Schumm; 28 Jun 2014, 06:26.

                Comment


                • #9
                  BTW, I just looked at your original post, and it seems likely that (1) it is not doing what you want it to, and/or (2) what you're doing can be done more efficiently. First, however, you should explain what it is you're trying to do. That is, in words, explain exactly what is in each file and what you want the result to look like. Then, people can make some specific suggestions for how to proceed.

                  Comment


                  • #10
                    Hi Phil, thank you for the clarification. It's always scary when someone tells you that your routine is not doing what you think. But on the other hand it is a good way to learn. Let me explain what I want better so that it can be assessed whether what I did is correct.

                    Suppose I have a file, let's call it (A), where I have a list of individuals and some years in which I want to observe them. Imagine that these individuals are a subset of an entire population. On the other hand, I have a multitude of files that contain monthly recorded information about all individuals in the population for all years. These files are divided into folders named after the year to which information refers (e.g. H:/2006/data/). In each of these folders there are 12 files (1 per month) containing the -using- data for the -joinby-. What my routine does is that it takes file (A) and selects all people I want to observe in a certain year (this is the tempfile portion `y'). Then, it opens a folder corresponding to the same year and joins the selected individuals first to the January file. Then it -append-s the result of the -joinby- to the file finaljoin.dta. Next, the procedure takes again those individuals (portion`y') and joins them to the February file. Again, it -append-s the result to finaljoin.dta. This goes on until month files are terminated. At that point, the procedure moves to the following year. The loop terminates when years in the numlist are finished.

                    I hope it's clear. Please tell if it is not.
                    Thank you so much for your help.
                    Riccardo

                    PS I think I fixed the problems with the quotes because the program is running without reporting any error. However, if you see some mistakes please notify that so that I -break- .and correct
                    Last edited by Riccardo Valboni; 28 Jun 2014, 07:25.

                    Comment


                    • #11
                      Thanks for the clarification; let me make sure I have this right. Your first file (A) contains only id and year, and its only purpose is to determine for each year, which individuals will appear for that year in the final file (i.e., those who appear in A for that year). Your other files (1 per month organized in subdirectories by year) also contain both id and year, plus additional variables. You want one final long-format file in which each observation represents one month for a given individual, containing only data for those individual/year combinations appearing in A. Is that right?

                      Also, 3 more quick questions:
                      1. Do your monthly files contain a variable indicating which month they refer to, or do you need to add that when combining?
                      2. Can you assume that the monthly files all have the same variables stored in the same way, or at least, that there are no inconsistencies among them (i.e., a variable that appears in two or more files with the same name but contains different information, is mapped to a different value label, or has a different storage type in the two files)?
                      3. How big are these files (i.e., approximate number of observations in A and number of variables in the monthly files)?

                      Comment


                      • #12
                        Assuming I understand the problem correctly, and that the answers to 1–3 are "yes," "yes" and "not too big," then I might do it like this:
                        Code:
                         1  clear
                         2  tempfile result
                         3  save `"`result'"', emptyok
                         4  
                         5  forv y = 2006/2012 {
                         6      clear
                         7      loc flist: dir "G:/<path component>/`y'/<path component>" files "*.dta"
                         8      append using `flist'
                         9      merge ID year using dataset, keep(match) nogen
                        10      append using `"`result'"'
                        11      save `"`result'"', replace
                        12  }
                        In general, note the use of merge instead of joinby for this problem. Note also that the approach above involves reading all of the data for a given year into memory, which might not be practical or efficient if the dataset is very large (see below). Finally, if you want to make sure that data are available for all of the person/year combinations in A, you could replace line 9 with
                        merge ID year using dataset, assert(master match) keep(match) nogen
                        to test this condition, or, if this is not true but you want to retain empty records with missing values in such cases, you could replace it with
                        merge ID year using dataset, keep(match using) nogen
                        If the dataset is so large that this approach isn't possible, then you could replace lines 8–11 with
                        Code:
                        foreach file of local flist {
                            use dataset if year==`y', clear
                            merge 1:1 ID year using `file', keep(match)
                            append using `"`result'"'
                            save `"`result'"', replace
                        }
                        (line 6 could then be omitted). The issues regarding which options you pass to the merge command described above apply here as well.
                        Last edited by Phil Schumm; 28 Jun 2014, 11:08.

                        Comment


                        • #13
                          Originally posted by Phil Schumm View Post
                          Thanks for the clarification; let me make sure I have this right. Your first file (A) contains only id and year, and its only purpose is to determine for each year, which individuals will appear for that year in the final file (i.e., those who appear in A for that year). Your other files (1 per month organized in subdirectories by year) also contain both id and year, plus additional variables. You want one final long-format file in which each observation represents one month for a given individual, containing only data for those individual/year combinations appearing in A. Is that right?

                          Also, 3 more quick questions:
                          1. Do your monthly files contain a variable indicating which month they refer to, or do you need to add that when combining?
                          2. Can you assume that the monthly files all have the same variables stored in the same way, or at least, that there are no inconsistencies among them (i.e., a variable that appears in two or more files with the same name but contains different information, is mapped to a different value label, or has a different storage type in the two files)?
                          3. How big are these files (i.e., approximate number of observations in A and number of variables in the monthly files)?

                          Hi Phil, thanks a lot for the extensive explanation. It taught me a couple of things I didn't know.

                          I here answer your questions because to one of them the answer is 'no' - and this creates a problem that I am now trying to resolve.

                          First, yes, each of these files has an indication of which month the data refers to.
                          Second, the answer is no. Variables all seem to be called in the same way even though I am not completely sure about this because -using- files are about 120 and it's hard to scrutinize each of them by hand. The problem that I am having now pops up when I subset the -using- files. The program goes on for a couple of years and then the -keep- function fails to find one variable and everything stops. Is there a way to avoid this to happen?
                          Third, -using- files are pretty large. About 10mln observations each per 50 vars. Each file is approximately 1.5GB.

                          Let me ask you another question on one thing you pointed out. Why using merge instead of joinby? What is the benefit?

                          Many, many thanks for your help.
                          Riccardo

                          Comment


                          • #14
                            Originally posted by riccardovalboni View Post
                            I here answer your questions because to one of them the answer is 'no' - and this creates a problem that I am now trying to resolve.
                            Ok, your data files are at least large enough that you may not want to read all of the files (in their entirety) for a given year into memory. Also, it seems that some files have variables that other files do not, and moreover, that you only want to retain a subset of these variables in your final dataset. To do this, you can modify what I gave you before like this:
                            Code:
                            1 clear
                            2 tempfile result
                            3 save `"`result'"', emptyok
                            4
                            5 loc varstokeep <varlist>
                            6
                            7 forv y = 2006/2012 {
                            8     loc flist: dir "G:/<path component>/`y'/<path component>" files "*.dta"
                            9     foreach file of local flist {
                            10         use dataset if year==`y', clear
                            11         qui des using `file', varlist
                            12         loc fvars `r(varlist)'
                            13         loc fvars: list varstokeep & fvars
                            14         merge 1:1 ID year using `file', keep(match) keepusing(`fvars')
                            15         append using `"`result'"'
                            16         save `"`result'"', replace
                            17     }
                            18 }
                            where you replace <varlist> with the list of variables (in unabbreviated form) you want to retain from the files (make sure you include your month indicator in this list; you may also include ID and year, but you don't have to). Lines 11–13 construct a list of variables for use in the keepusing() option that is the intersection of the variables you want and the variables available in the file, so that the error you reported in your other post won't occur.

                            Originally posted by riccardovalboni View Post
                            Let me ask you another question on one thing you pointed out. Why using merge instead of joinby?
                            The command joinby generates all pairwise combinations of observations in both files, within the varlist you specify (i.e., the Cartesian product). You don't want (or need) that here. In contrast, merge simply matches each observation from one dataset with (at least) one corresponding observation from the other. That's what you want here—you want to grab all of the records from your data files that correspond to an individual in A. Please consult the documentation on merge and joinby if you still have questions.
                            Last edited by Phil Schumm; 29 Jun 2014, 06:51.

                            Comment


                            • #15
                              Hi Phil, thank you so much. I couldn't imagine a more complete answer.

                              Comment

                              Working...
                              X