  • Running out of memory while merging .dta

    Dear all,
    I am encountering a problem similar to this post (running out of memory while merging). In particular, I want to merge these two datasets:
    1. Segment1: it contains only one variable (Identifiers), a strL variable with 70 million rows (the .dta file is around 4 GB).
    2. Segment2: it contains a "superset" of the variable Identifiers plus 8 other numeric variables (the .dta file is around 8 GB). By "superset" I mean that Segment2 contains the same observations as Segment1 plus some additional ones (overall, Segment2 contains 71 million rows).
    Logic: I want to shrink Segment2 to only those observations listed in Segment1.
    Code:
    clear _all
    use Segment1.dta
    merge 1:1 Identifiers using Segment2.dta
    I am using a powerful Remote Stata Server with 132GB of RAM. However, I obtain this error message:
    Code:
    I/O error writing .dta file
    Usually such I/O errors are caused by the disk or file system being full.
    r(693);
    Am I doing anything wrong? If not, I am surprised I am depleting my 132GB of RAM; after all, the .dta files are not that "big". Do you think it is a problem of the remote machine I am using?

    Any help would be much appreciated.

    Best,
    Edoardo

  • #2
    Regardless of that error, my version of Stata (15.1) indicates "Key variables cannot be strLs." I presume this limitation is also true of v. 16. Can someone else verify this? If so, I'm surprised that -merge- didn't complain of that first.

    I'd try fixing that problem first by deriving some reasonable identifier from the one you have. A 5 character string, for example, could easily identify billions of observations.
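    Alternatively, a hedged sketch (assuming the identifier values contain no binary data and none exceeds 2045 characters): -compress- will demote a strL to an ordinary str# whenever the values fit, which removes the strL key limitation:

    ```stata
    * Sketch: demote the strL key so it can serve as a -merge- key.
    * Works only if every value fits a fixed-width str# (no binary, <= 2045 chars).
    use Segment1.dta, clear
    compress Identifiers
    save Segment1_fixed.dta, replace
    ```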
    Last edited by Mike Lacy; 08 Mar 2021, 21:34.



    • #3
      Mike's advice is worth taking into consideration. It's usually a bad idea to use long strings as merge keys.

      Aside from this, the error message tells you that you have run out of disk space, not RAM (not everything is about RAM). I am betting that the environment variable STATATMP points to a location on a disk with little available space, while the drive you use for your data has lots of space. Stata uses the location pointed to by STATATMP to store temporary datasets, such as those used while merging. See this FAQ.
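      On a Unix-like server, the fix might look roughly like this (a sketch; the path and the Stata executable name are illustrative assumptions):

      ```
      # Point Stata's temporary-file directory at a disk with plenty of free space.
      export STATATMP=/bigdisk/$USER/statatmp
      mkdir -p "$STATATMP"
      stata-mp        # launch Stata after setting the variable
      ```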



      • #4
        Dear Mike and Leonardo,
        thanks a lot. Your replies are very helpful.
        1. In response to Mike Lacy's point:
          • I was not aware that the strL format could not be used for merge identifiers. Thanks!
          • I cleaned the dataset before merging. Now my identifier is stored in "str85" format (I know this is not optimal, but I can't do better).
          • After merging, I still obtain the aforementioned (see #1) error message.
        2. Therefore, the problem is probably caused by what Leonardo Guizzetti was pointing out. In response to his point:
          • I am aware it is not optimal to merge on a "str85" identifier, but trust me, I can't do much better. For the sake of clarity, in #1 I specified that the identifier in "Segment2" is a superset of the one in "Segment1". In reality, however, I might have some identifiers that appear in Segment1 but do not appear in Segment2. Therefore, I can't simply -sort- and -gen long order = _n-, say, to create more efficient identifiers.
          • Concerning memory use, what you are saying is very interesting: Stata does not keep tempfiles in RAM but stores them in the folder specified in the FAQ. I checked that (hidden) folder on my node (i.e. the Stata server), and it turns out it is empty. However, the disk space of the node as a whole is almost depleted, probably because other users' tempfile folders are full of junk (this is probably what is causing the problem). This creates a bottleneck for users of that node like me.
          • I will try to follow the advice highlighted in the FAQ: I will tell Stata to save tempfiles in my current directory (a folder mapped to my node and located on a different machine of the same university network, with effectively unlimited disk space) rather than on the node's saturated local disk. However, I am afraid I don't have enough admin/root access to do that.
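          A quick way to confirm where Stata is actually writing tempfiles (a sketch; the -df- shell command assumes a Unix-like node):

          ```stata
          display c(tmpdir)        // directory Stata uses for temporary datasets
          !df -h `c(tmpdir)'       // free space on that file system
          ```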
        Thanks again Mike Lacy and Leonardo Guizzetti, your answers are always very helpful.

        Best,
        Edoardo



        • #5
          You might be surprised by this behaviour of -merge-, but the command long predates frames. As such, one temporary dataset is inevitably written to disk in order to conduct the merge. Frames only arrived in Stata 16, so they are not an option for you.
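          For readers on Stata 16 or later, a rough sketch of a frames-based alternative (hedged: -frlink- also rejects strL keys, so the identifier would first have to be converted to an ordinary string variable):

          ```stata
          * Sketch for Stata 16+: link the two datasets in memory, no merge to disk.
          use Segment1.dta, clear
          frame create seg2
          frame seg2: use Segment2.dta
          frlink 1:1 Identifiers, frame(seg2)    // adds a linkage variable
          frget *, from(seg2)                    // copy Segment2's variables across
          ```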



          • #6
            If Segment1 contains only the identifier variable, and the goal is to keep only those observations in Segment2 that also appear in Segment1, you could try this:

            Code:
            use segment1, clear
            gen byte source = 1
            append using segment2
            by Identifiers, sort: keep if _N == 2 & missing(source)
            I am not entirely sure that -append- avoids tempfiles (and, for that matter, I'm not certain that the -sort- required in the last line does either), but it might work.



            • #7
              (Incorrect suggestion deleted.)
              Last edited by Mike Lacy; 09 Mar 2021, 13:57. Reason: I posted what turned out to be an incorrect suggestion about using Mata's -hash1- function as useful here. Please ignore; my apologies.



              • #8
                Mike Lacy

                Actually, the package is still there, but Orange no longer accepts http. One now has to do:

                Code:
                net from https://jean-claude-arbaut.pagesperso-orange.fr/stata/



                • #9
                  Jean-Claude Arbaut responded helpfully to my (deleted) mistaken post, but now his post is an orphan. I had suggested his -hash- package as useful here. Jean-Claude, can you clarify: Do some or all of the kinds of hash functions contained in your -hash- *package* fulfill this goal of producing distinct codes? If so, maybe I'm not completely wrong here :-}.

                  I tried to use Mata's -hash1- function, but it did *not* produce distinct hash codes for all input variable values, which was the problem with my post and why I deleted it. The context was about using hashing to make a shortened version of a very long identifier.
                  Last edited by Mike Lacy; 09 Mar 2021, 14:15.



                  • #10
                    Regarding hash codes: any function that maps arbitrary strings of length N to strings of length M with M < N cannot be injective; that's a mathematical fact. However, cryptographic hash functions are usually safe, within an acceptable risk of collision (i.e. two identifiers that get the same hash).
                    For instance, a SHA-1 hash is used by the French national health insurance system to anonymize health data.
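                    The collision risk can be quantified with the approximate birthday bound for n items hashed to b bits:

                    ```latex
                    P(\text{collision}) \approx \frac{n^2}{2^{b+1}}
                    ```

                    For n = 7.1e7 identifiers and b = 160 (SHA-1), this is roughly (7.1e7)^2 / 2^161, on the order of 10^-33, i.e. negligible.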

                    Here I'm not sure it's really helpful to hash the identifier variable, but it would at least reduce the identifier's size. For instance, SHA-1 is 160 bits, i.e. 40 hexadecimal digits, so we could reduce the length from 85 to 40 characters, with a small risk of collision. But is the size of the variable the real problem here? I have never used such big data with Stata, but I have with R recently, also on health data, and my impression is that the problem is often not the data size but the algorithm. If -merge- somehow uses too much space, maybe try something else? Another possibility would be to use Python data structures from within Stata. Just an idea; for large datasets it may be slow, but it's worth a try.
                    And if the data are really, really too large, one may resort to external sorting (that is, sorting on disk) followed by a merge: that's basically what SAS and SPSS do to keep memory usage low. However, the problem here seems to be disk space, not memory.
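                    A minimal sketch of the hashing idea in Python (usable standalone or via Stata's -python- integration; the function name short_id is my own):

                    ```python
                    import hashlib

                    def short_id(identifier: str) -> str:
                        """Map an arbitrarily long identifier to a fixed 40-character hex SHA-1 digest."""
                        return hashlib.sha1(identifier.encode("utf-8")).hexdigest()

                    # An 85-character key shrinks to a fixed 40 characters:
                    print(len(short_id("x" * 85)))  # 40
                    ```

                    The digest is deterministic, so hashing the key in both files yields matching 40-character identifiers.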

                    Leonardo Guizzetti's answer makes sense: it may happen that Stata is trying to store temporary data in /tmp, and that directory may be mounted on a small partition or have a small quota. The suggestion to specify a location for temporary files seems promising.
