  • Appending files in a loop within another loop

    Hello,

    I have a very large series of files (approx 200k) that I am trying to append together (in a previous step I insheeted them from CSVs). To avoid the 'too many filenames' problem, I have now put them into folders organized by year, so I have files in folders named 1990, 1991, ..., 2023.

    Now I want to append them together into annual files (before finally appending all the year files together).

    My code to append right now is failing and I am not sure why. Here is the syntax:

    Code:
    foreach yr of numlist 1990/2023 {
        local files: dir `"xx/_data/`yr'"' files "*.dta"
        local dir1 "xx/_data/"
        qui display `"`files'"'
        foreach f of local files {
            append using `"`f'"', force
            drop if agglvl_code==17 | agglvl_code==57
        }
        save `"`dir1'/`yr'_allplaces.dta"', replace, replace
    }
    When I run this, I get an error saying a specific file in the 1990 folder cannot be found. But on checking manually, the file is there.

    And if I break out just the first loop and display the results, i.e.:
    Code:
    foreach yr of numlist 1990/2023 {
        local files: dir `"xx/_data/`yr'"' files "*.dta"
        local dir1 "xx/_data/"
        display `"`files'"'
    }
    ...it lists the files fully, without error. So presumably the error is in the nested inner loop, but I am not sure where.

    Any thoughts on where I am going wrong?

    Thanks!

  • #2
    why do you have ", replace , replace" ?

    I think it may be a save issue. When you replace a file that doesn't exist, Stata tells you "cannot be found".

    • #3
      Thanks George. The double replace was a copy and paste error - it was not in the original code.

      Even with a single 'replace' in the code, I get the same error. It seems unlikely to me that it is a save error, as the file that cannot be found does not match the name I tell Stata to use for the saved file.

      • #4
        Is there data loaded when the error appears? If not, then it's not save.

        • #5
          Have you tried just using a single year with the inner loop?

          • #6
            I'm sure someone here knows the answer. Maybe try this.

            Code:
            foreach yr of numlist 1990/2023 {
                local files: dir `"xx/_data/`yr'"' files "*.dta"
                local dir1 "xx/_data/"
                display `"`files'"'
                foreach f of local files {
                    di `"`f'"'
                }
            }

            • #7
              Hi Tom,

              When you say "a specific file" from the 1990 directory, do you mean the very first file the loop encounters? The second line of your code only gives the bare file names, not the fully specified paths. Shouldn't your 6th line include the fully specified path? It doesn't look like you are otherwise updating the working directory...

              Code:
              append using "xx/_data/`yr'/`f'", force
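
              Putting that fix into the full loop, a sketch might look like the following. The `clear` at the start of each year is my addition, not from the original post; without it, each year's output file would also contain all earlier years' data. -force- is also omitted here; see the caution in #8 below.

              Code:
              local dir1 "xx/_data"
              foreach yr of numlist 1990/2023 {
                  local files: dir `"`dir1'/`yr'"' files "*.dta"
                  * start each year from an empty dataset
                  clear
                  foreach f of local files {
                      * -dir- returns bare file names, so qualify each with its folder
                      append using `"`dir1'/`yr'/`f'"'
                  }
                  drop if agglvl_code==17 | agglvl_code==57
                  save `"`dir1'/`yr'_allplaces.dta"', replace
              }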

              • #8
                On a tangential note, I just want to be sure that you are aware that the use of the -force- option on your -append- command means that you will lose data if there are incompatibilities in the files being appended. Moreover, in a collection of 200k (!) files there almost certainly will be some inconsistencies. Remember that -force- doesn't fix problems; it just sweeps them under the rug, leaving a damaged data set in its wake without even warning you about it.

                I think that no matter how you slice it, it will be an onerous task to track down and fix those inconsistencies, but it will be harder in the fully appended mess than when checking each file one at a time, because in the fully appended mess some of the problematic data will simply have been jettisoned during the mass append. You also need to worry about appending files in which the same discrete variable has different codings (value labels). That data won't be lost, but it will simply be thrown together, and the variable becomes unusable because the same value means different things in different observations.

                There is a tool written by Mark Chatfield, called -precombine-, available from SSC, that scans batches of files and notifies you of incompatibilities. I don't know whether it can handle 200,000 files in a single batch; frankly, it would surprise me if it could. But doing it within years should be feasible, just as your mass append is done within years. I would run it before putting those files together, and fix whatever problems it turns up before appending. Even within a single year I don't know how well this will work, since on average you will have roughly 6,000 files per year, and the output of -precombine- will be voluminous and probably too large to review by hand. I'm not sure there is a good solution to this problem. Whatever you end up doing, I think you need to be very mistrustful of the validity of the data in the fully appended data set.
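
                A per-year check along those lines might be sketched as follows. This is only a sketch: it assumes -precombine- accepts a list of .dta file names as its argument and that no file names contain spaces; see -help precombine- after installing it for the actual syntax and options.

                Code:
                * ssc install precombine
                foreach yr of numlist 1990/2023 {
                    local files: dir "xx/_data/`yr'" files "*.dta"
                    * build a space-separated list of fully qualified file names
                    local full
                    foreach f of local files {
                        local full `full' xx/_data/`yr'/`f'
                    }
                    display as text "Checking year `yr'"
                    precombine `full'
                }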

                Anyway, good luck with this task.

                • #9
                  Thanks Daniel and Clyde.

                  Daniel - your suggestion worked!

                  Clyde - your warning is heeded. Scary given how much data is here. I think there may be a different place to start data-wise that will give me more confidence about the final product. I will shift my energies there.

                  Again many thanks.
