
  • Stata and very large data sets

    Hello,

    I have a question regarding Stata and large data sets. Is there any limiting factor, besides physical memory, that could prevent Stata from opening a data set? Is there a maximum amount of memory that Stata can use?

    The maximum size of a single data file would be around 400-500 GB.

    Thanks in advance.

  • #2
    There are limits on the number of observations and the number of variables (see help limits), but no limitation on memory aside from your physical memory limitation.



    • #3
      Thanks. So the limit on observations is fixed and can't be extended? Why is there even a limit if memory itself is unbounded?



      • #4
        The 2,147,483,647 limit on the number of observations is a theoretical one, presumably a function of one or more implementation details. However, you're likely to run out of memory before you hit that. Note that in Mata, the size of a matrix is limited only by the available memory, so you could (in theory) work with even larger datasets.
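        For reference, you can check these limits from within Stata itself; a quick check using the built-in c() values:

        display c(max_N_theory)    // theoretical maximum number of observations
        display c(max_k_theory)    // theoretical maximum number of variables
        creturn list               // full list of system values and limits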

        Data files of 400-500GB are indeed large, and whether it makes sense to use Stata in this case depends entirely on what you're trying to do. Maybe you could share a few more details?
        Last edited by Phil Schumm; 12 Jun 2014, 06:04.



        • #5
          Some more details: we use very large databases for research. Normally we work with SAS to handle this amount of data and to extract subsamples for other researchers, who mostly use Stata.
          The idea is that other researchers could send us Stata do-files which they can't run themselves because of hardware limitations, but which we could run on a Stata server with 1 TB of memory.
          Regarding Mata: can a matrix be filled directly from a file, or does the data first have to be loaded into Stata and then into Mata?



          • #6
            2,147,483,647 = 2^31-1. So my guess is that there is some 4-byte data structure used internally to reference observations, and 1 bit is reserved for some other purpose. That would preclude having more than 2,147,483,647 observations, regardless of the size of memory. In effect, it's how much Stata can "address" rather than the size of memory that is the limiting factor.
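            You can verify the arithmetic directly in Stata:

            display %12.0f 2^31 - 1    // 2147483647, the 4-byte signed-integer maximum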



            • #7
              Originally posted by JonasKr
              Regarding Mata: can a matrix be filled directly from a file, or does the data first have to be loaded into Stata and then into Mata?
              Yes, you can fill a matrix in Mata directly from a file. It sounds like what you're exploring is whether it makes sense to use Stata to handle the data management (i.e., to extract subsets of data for analysis from one or more large datasets). The advantage, of course, is that once you have the data in Stata/Mata, then they are all ready to be accessed by the do-files from your users. One potential disadvantage is that, depending on your existing pipeline(s) and your proficiency in Mata, it might be slower and/or awkward to do this processing in Stata. An alternative would be to write Stata .dta files directly, which could then be used for analysis (assuming that the actual analyses would not bump up against the observation limit).
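              For instance, here is a minimal Mata sketch that reads a whitespace-delimited numeric text file into a matrix row by row (the filename is made up, and for files of your size you would want to preallocate rather than grow the matrix as done here):

              mata:
              fh = fopen("mydata.txt", "r")
              M = J(0, 0, .)                               // start empty
              while ((line = fget(fh)) != J(0, 0, "")) {   // fget() returns J(0,0,"") at end of file
                  row = strtoreal(tokens(line))            // split on whitespace, convert to numeric
                  M = rows(M) ? (M \ row) : row            // append as a new row
              }
              fclose(fh)
              end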
              Last edited by Phil Schumm; 12 Jun 2014, 06:51.



              • #8
                Sounds very good. My plan as of now is to set up a server with the data set loaded into a Mata matrix. The next step is to write a new command which extracts the data needed by the submitted do-file (in 99% of cases researchers only need a subsample) from the matrix, converts it to .dta, and loads the submitted do-file for execution. The third step is to save the results in a new .dta file. Any concerns about speed? So far I think it should be quite fast once the Mata matrix is stored in memory.
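                Roughly like this, perhaps (a sketch only; the column number, variable names, and filename are made up, and it assumes the matrix M is already in Mata memory):

                clear
                mata:
                S = select(M, M[., 1] :== 2013)      // keep rows where column 1 equals 2013
                st_addobs(rows(S))                   // create matching observations in Stata
                idx = st_addvar("double", ("year", "id", "value"))   // one name per column of S
                st_store(., idx, S)                  // copy the subsample into the dataset
                end
                save subsample.dta, replace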



                • #9
                  I must admit that I don't have any personal experience using Mata in this way (i.e., stretching it to its limits); there are some who may have such experience (e.g., Sergiy) who may be able to comment. That said, this strikes me as using the wrong tool for the job. It sounds like these data should be stored in a database, from which you can retrieve subsets, as necessary, and generate Stata-format analysis files. You might explore Stata's ODBC capability here. Alternatively, a Java plugin might also work nicely, since you could then use one of many existing Java libraries to connect to the database.
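                  For example, the retrieval step could be as simple as this (the DSN, table, and variable names are hypothetical):

                  odbc load, exec("SELECT id, year, income FROM panel WHERE year = 2013") dsn("researchdb") clear
                  save subsample2013.dta, replace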



                  • #10
                    It may be relevant to add that a Mata matrix is fairly bare. It doesn't have row names or column names, for example. It's not really another way to read in a dataset unless you are happy with row and column indexes as the only metadata.



                    • #11
                      Following Phil's advice, but sticking with Stata, have you tried use if <condition> using <file> to subset your data sets? I don't know if the observation limit applies in this situation, since you're not actually reading the entire data set into memory. Even if the subset cannot be nicely defined using a single if condition, you might be able to create a subset which is small enough for Stata but which is a superset of the one you want, from which you can then create the subset of interest. As mentioned above, specific examples might help.
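                      For example (file and variable names made up), this keeps only the listed variables and the matching observations while reading from disk:

                      use id year income if year == 2013 using bigfile.dta, clear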



                      • #12
                        My main concern with this approach is the vast amount of time it takes to read big data files into Stata. So far my guess is that once I store all the data in a Mata matrix, the process of reading data should speed up quite a bit.
                        Thanks a lot for all the help; I'll keep you updated on further progress.



                        • #13
                          When I execute the command query memory I see max_memory: 32m-1600g. It seems to me that the maximum memory limit for Stata/MP is 1.6 TB, but I might be wrong. Although I don't have the luxury of buying that much RAM to test it, in my experience it is a piece of cake to deal with hundreds of millions of observations.
                          The 2 billion observation limit applies only to 32-bit systems; for 64-bit, the limit is 281 trillion according to StataCorp.
                          I never use Mata for this kind of operation. Whereas Mata can be used for "clean" data, for real-world data you need a lot of metadata to help you describe and differentiate each data point. And if you just need matrix operations, there is no point in using Stata; many programming languages might be faster without the overhead of invoking the client at the start.
                          Regarding loading a big file into memory: you can always load just a chunk of rows, though of course it depends on what kind of operation you are doing. I also don't like the idea of storing the whole dataset in one file and calling it a database (the SAS approach). On a typical file system you can only store 16 TB in one file, which means your database cannot be larger than 16 TB. Instead, store each column in its own file, or even smaller pieces (a subset of one column per file, with the index included), and combine them when needed, rather than storing everything in one big file and worrying about loading time. The advantage is that you can distribute the data files across different disks without needing a Storage Area Network (SAN).
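                          A minimal sketch of the one-file-per-column idea in Stata terms (filenames and variables are made up): save each variable together with a common row index, then merge back only the columns a given job needs:

                          * split: one file per column, each carrying the row index
                          use idx income using master.dta, clear
                          save col_income.dta, replace

                          * combine: rebuild just the columns needed
                          use col_income.dta, clear
                          merge 1:1 idx using col_age.dta, nogenerate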

