
  • Most efficient way to append multiple large .dta

    Dear all,
    I have 72 .dta files with average size of 200 MB (overall 15 GB). They are all stored in a subdirectory named "segment1". I want to append them all into a single master file.

    Currently, I am using this:
    Code:
    local file2append: dir "segment1" files "*.dta"
    foreach file of local file2append {
        preserve
        use "segment1/`file'", clear
        save temporary, replace
        restore
        append using temporary, force
    }
    The code works fine, however:
    1. It is incredibly slow.
    2. Stata keeps "not responding", even though it appears to still be performing the task.
    Do you believe my code can be streamlined? Do I need to adjust any Stata memory settings to speed it up? I have 132 GB of RAM; is Stata fully exploiting it?

    Any help would be really appreciated.
    Best,
    Edoardo

    PS: 1) OS: Windows Server 2019 Standard; 2) Stata: SE 16.0

  • #2
    -append- lets you specify all of the files to be appended at once, so you could shorten your code like this, which saves the unnecessary caching of data copies and the disk I/O of saving intermediate files. (By the way, temporary files should be specified using -tempfile-.) Stata will otherwise use all available memory (unless it is limited by some server setting or by -set max_memory-).

    Code:
    cd "segment1"
    local file2append: dir . files "*.dta"
    append using `file2append', force
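    If you'd rather not -cd- into the directory, the same one-shot -append- can be built with explicit paths (a sketch, assuming the filenames contain no embedded spaces):

    Code:
    local file2append : dir "segment1" files "*.dta"
    local files2use
    foreach f of local file2append {
        local files2use `files2use' "segment1/`f'"
    }
    append using `files2use', force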



    • #3
      Yes, it can be sped up considerably.

      Code:
      clear
      local file2append : dir "C:\Users\clyde\Documents\Sandbox" files "*.dta"
      foreach f of local file2append {
          append using `"C:\Users\clyde\Documents\Sandbox/`f'"'
      }
      save final_appended_files, replace
      will be much better.

      That said, if the individual files are large, Stata may be "not responding" during the course of reading in each individual file. That is normal Stata operation--I don't know why, but it is.

      Finally, I will just warn you that trying to do a mass append like this is asking for trouble unless you have carefully reviewed and cleaned each of the files individually. Otherwise, you are likely to find lots of incompatibilities and inconsistencies among the files with respect to value labels, data storage types, and variable names. There is a handy little program, -precombine-, by Mark Chatfield, available from SSC, that can check the files before you attempt this and will let you know about such incompatibilities so you can fix them ahead of time. There are few things more frustrating than setting out to append all these files together, getting a substantial part of the way there, and then having Stata halt and tell you that some variable that was previously numeric has been found to be a string in the using data!
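      As a sketch of the kind of pre-check that -precombine- automates, you could at least list each segment's variable and observation counts with -describe using- before appending (this uses the "segment1" directory from #1, and is a quick manual check, not a substitute for -precombine-):

      Code:
      local file2append : dir "segment1" files "*.dta"
      foreach f of local file2append {
          quietly describe using "segment1/`f'"
          display as text "`f': " as result r(k) " variables, " r(N) " observations"
      }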

      Added: Crossed with #2. Note that using the -force- option is dangerous. Instead of Stata stopping and complaining when it finds incompatibilities in the files, it will simply blunder on silently and discard some of the offending data instead. The incompatibilities get swept under the rug, not fixed. Unless you have verified (either manually or with -precombine-) ahead of time that there are no such incompatibilities, you can easily end up with a data set that is missing some part of the data you need, and you won't even know it happened until you start getting screwy results later on. (Maybe you'll never know about it, and other people will rely, to their detriment, on your incorrect results.) My experience is that even well-curated series of data sets from well-respected, reliable sources often contain these kinds of problems. If you have verified that the files are all compatible before appending them, then the -force- option is unnecessary.
      Last edited by Clyde Schechter; 02 Mar 2021, 12:29.



      • #4
        Thanks to both of you!
        I really appreciate your prompt and helpful reply.

        Best regards,
        Edoardo



        • #5
          One minor correction to Clyde Schechter: -precombine- is from the Stata Journal and is not available at SSC; best to use -search- or -findit- to locate and install it.



          • #6
            Thank you Rich Goldstein for correcting my error.



            • #7
              Dear all,
              I have a related follow up question.

              I am using Leonardo Guizzetti's code (without the force option as Clyde suggested - since I conducted a preliminary cleaning of the segments to append).
              Code:
              clear
              cd "segment1"
              local file2append: dir . files "*.dta"
              append using `file2append'
              save appended.dta
              Before running the code, my Windows Task Manager says Stata is basically using no RAM (which is good).
              After running the code, the RAM usage tracked by the Task Manager starts increasing. So far so good: the more segments are appended, the more memory Stata takes up. However, even though the overall size of the database is 15 GB, Stata uses all the available memory in the system: the Task Manager reports I am using 100% of my RAM (132 GB).

              How come? Shouldn't there be a one-to-one mapping between the RAM used and the size of the files appended? How is it that appending 15 GB of .dta files depletes my 132 GB of RAM?!

              Any help would be much appreciated.
              Best,
              Edoardo



              • #8
                Edoardo, it is possible for the appended dataset to become much larger than 15 GB. If all of your datasets contain precisely the same variables, with each variable stored as the same type, and nothing else, then the appended dataset's size is the sum of the sizes of the component datasets. This is the minimum size possible. If there are differences, the size can be severely inflated. The following may help you understand.

                For simplicity, suppose that I have two datasets named A and B, each with exactly 10 observations and 2 variables of type double. (Each dataset also uses the smallest possible storage type for the data; otherwise -compress- could make the starting dataset smaller.) Each -double- requires 8 bits of storage per observation. So each dataset takes (10 observations) * (2 double variables) * (8 bits per double variable) = 160 bits. Let's consider what happens when the two datasets are appended.

                Case 1: All variables the same
                This is the minimum size possible assuming that each variable is already the smallest necessary storage type.
                Dataset A and B each have two variables: -first- and -second-. The resulting dataset, AB has the same variables, with A's 10 observations + B's 10 observations. The total size is then 320 bits (= 2 * 160 bits).

                Code:
                . desc, detail
                
                Contains data
                  obs:            10                          
                 vars:             2                          
                width:            16                          
                ------------------------------------------------------------
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------
                first           double  %10.0g                
                second          double  %10.0g                
                ------------------------------------------------------------
                Case 2: All variables different
                This is the maximum possible size after appending.
                Dataset A contains variables -first- and -second-, and Dataset B contains -third- and -fourth-. When A and B are appended, missing values for -third- and -fourth- must be added to A's observations, and likewise missing values for -first- and -second- are added to B's observations. Now the size contributions from A and B are each twice as large, as each observation now has 4 variables, each of width 8 bits. The total size is now 640 bits (= 4 * 160 bits).

                Code:
                . desc, detail
                
                Contains data
                  obs:            20                          
                 vars:             4                          
                width:            32                          
                ------------------------------------------------------------
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------
                first           double  %10.0g                
                second          double  %10.0g                
                third           double  %10.0g                
                fourth          double  %10.0g                
                ------------------------------------------------------------
                (example code)
                Code:
                clear *
                
                tempfile a b
                
                /* Case 1 */
                set obs 10
                gen double first = runiform()
                gen double second = runiform()
                desc, detail
                save `a', replace
                
                clear
                set obs 10
                gen double first = runiform()
                gen double second = runiform()
                desc, detail
                save `b', replace
                
                clear
                append using `a' `b'
                desc, detail
                
                /* Case 2 */
                clear
                set obs 10
                gen double first = runiform()
                gen double second = runiform()
                desc, detail
                save `a', replace
                
                clear
                set obs 10
                gen double third = runiform()
                gen double fourth = runiform()
                desc, detail
                save `b', replace
                
                clear
                append using `a' `b'
                desc, detail



                • #9
                  Dear Leonardo,
                  thanks for your thorough response. I learnt a lot from it.

                  My 72 segments all contain the same 9 variables. It is true that they are stored in different formats, and appending will keep the widest one to accommodate longer values (e.g., str15 in one chunk and str40 in another, because the second chunk contains an observation with 40 characters). However, once I append all 72 segments and save, the resulting appended database is 22 GB, still much smaller than my 132 GB of RAM.
                  How come the appending process depletes all 132 GB of RAM?

                  Cordially,
                  Edoardo



                  • #10
                    Note that in #8, I said bits but I meant to write bytes. The point still stands though.

                    Re #9, I can't say if -append- is really the cause, or why it does what it does internally. I am certain that -append- doesn't compress or check for the smallest possible storage type at any time, only expanding the storage type as required. This makes me think that your string variables may not be made into the smallest storage type prior to appending, and this causes some unnecessary inflation. (Perhaps you have strings stored as strL which don't need to be?) This is all speculation at this point.

                    If you wanted to be sure, and you really care about this optimization because you might be doing it regularly and not just once, you could open each dataset, -compress- it, and then resave it, ensuring that you are using as little memory as possible. Or you could adapt Clyde's code and run -compress- on the product after, say, every 10th dataset. Whether this level of investigation and "optimization" is worth the trade-off of diminishing returns is really up to you.
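                    For example, here is a sketch of the compress-first approach, using the "segment1" directory from #1 (note that it rewrites each segment in place, so keep backups if that matters):

                    Code:
                    local file2append : dir "segment1" files "*.dta"
                    foreach f of local file2append {
                        use "segment1/`f'", clear
                        compress
                        save "segment1/`f'", replace
                    }
                    clear
                    foreach f of local file2append {
                        append using "segment1/`f'"
                    }
                    compress
                    save appended, replace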

