
  • Opening huge data files in Stata

    Dear Statalist,

    I am having difficulty opening and working with several huge data files in Stata. These are .txt files, each upwards of 15 GB, with around 300 variables and up to 16 million observations. I have around 20 of them (representing different years of data), which ideally I would like to append into one master database. Does anyone have any advice on how to do this within Stata, or on which database software to use so that I can then analyse the data in Stata?

    Best Wishes


  • #2
    h infile


    • #3
      Some generic possibilities:
      - run Stata on a server with hardware that can handle data and computation at this scale
      - load your data into a PostgreSQL installation and query it from Stata
      - collapse/summarize each file and append the collapsed files
      - take smaller random samples from each file and append those
      It depends on how much detail you really need from the data in the 15 GB files.
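      If collapsing works for your purposes, here is a minimal sketch of the collapse-and-append idea (the file names, variables, and grouping here are hypothetical, for illustration only):

          clear
          save "collapsed_all.dta", emptyok replace
          foreach f in year1 year2 year3 {
             import delimited using "`f'.txt", clear
             collapse (mean) x1 x2 (sum) q5, by(region year)
             append using "collapsed_all.dta"
             save "collapsed_all.dta", replace
          }

      Each collapsed file is small, so the combined result stays manageable even though the raw inputs do not fit in memory together.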


      • #4
        To explain things a bit more: Enrico is asking you to type

        help infile
        to see documentation for the infile command. You can use infile with just a subset of your variables. You can also limit the number of observations.
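        For example, a sketch of reading only three variables and the first million observations from a free-format text file (variable names and positions are hypothetical; _skip() jumps over the unwanted columns):

            infile id x1 x2 _skip(297) using "data1990.txt" in 1/1000000, clear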

        Please use the code delimiters to show code and results - use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

        Please use the command -dataex- to show a representative sample of data; it is installed already if you have Stata 14.2 or 15.1, else you can install it by typing

        ssc install dataex


        • #5
          Elaborating on #2 and #4: you might be better off with -import delimited- if it is a delimited text file, or with -infix- if it is a fixed-width file.

          The master database is going to be on the order of 300 GB. You had better be working on a system with a lot of RAM to work with a file that big.
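          If the files are delimited, -import delimited- can also read just a slice of each file; a sketch (the column and row ranges are hypothetical):

              import delimited using "data1990.txt", colrange(1:50) rowrange(1:1000000) clear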


          • #6
            All good advice above, but I have a perhaps pedestrian suggestion about how to possibly solve this on an ordinary machine, using Stata alone. Do you need to use all 300 variables in any one analysis? (Pretty unusual; I usually find that I don't use all 300 variables in *all* my analyses put together.) If not, what about creating master subsets of variables that will be used together? The subsets might overlap, of course. For one subset, I'm thinking of something like this:

            local filelist "thisfile thatfile someotherfile"
            local vars1 "x1 z3 q5 r10"
            clear
            save "master1.dta", emptyok replace
            foreach f of local filelist {
               import delimited using `f', clear  // ..... plus whatever options you need
               keep `vars1'
               // Make "accidental" doubles into floats *if* this is ok
               ds, has(type double)
               if "`r(varlist)'" != "" recast float `r(varlist)'
               append using "master1.dta"
               save "master1.dta", replace
            }
            You could make multiple subsets, as needed.
            Last edited by Mike Lacy; 17 Jun 2019, 16:26.


            • #7
              Limiting the number of variables is a good point of course, but I'd call collapsing or random samples fairly pedestrian solutions as well.
              And PostgreSQL can be installed locally on a laptop or PC as well. Not entirely straightforward, but not hugely complicated, and it does allow you to work quickly with datasets that well exceed your laptop's RAM.
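              A sketch of pulling a query result into Stata via -odbc- (the DSN name, table, and column names are hypothetical, and this assumes you have configured an ODBC data source pointing at your PostgreSQL installation):

                  odbc load, exec("SELECT year, x1, x2 FROM bigtable WHERE year = 1990") dsn("mypostgres") clear

              This way only the rows and columns you query ever have to fit in Stata's memory.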