  • Merging huge databases

    Hi all,

    I have two very large databases. DB1 has 30,181,537 observations and looks like the following:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(docdb_family_id cited_docdb_family_id) int distance long id_mas
    14655790   184 9999   1
    22435239   273 9999   2
    23413025   277 9999   3
    25032836   376 9999   4
    25236037   796 9999   5
    25389490   967 9999   6
     6148394  1224 9999   7
     8213391  1335 9999   8
     8213486  1335 9999   9
     9378178  1354 9999  10
     6122783  1903 9999  11
    25136035  2207 9999  12
    38622262  2570    4  13
    47664232  2629    3  14
     9435050  2701 9999  15
    24755188  3048 9999  16
    23564926  3091 9999  17
    10955460  3143 9999  18
    25479825  3218 9999  19
     6146064  3280 9999  20
     4209965  3325 9999  21
    23820210  3363 9999  22
    22659293  4045 9999  23
     7921670  4107 9999  24
     6145713  4412 9999  25
     6198118  4476 9999  26
    17647544  6071    3  27
    25347708  7261 9999  28
    53479553  7307    3  29
    23326895  9929 9999  30
     4237610 10702 9999  31
     3518621 11028 9999  32
     6099169 11330 9999  33
     9477906 11394 9999  34
    40863376 11682    4  35
     7645211 11797    9  36
    69187708 12355    1  37
    49167122 13043    4  38
    32175886 13050 9999  39
    32314360 13050 9999  40
     3518621 13811 9999  41
    21782807 14102 9999  42
     8195431 15222 9999  43
    38777234 15335    9  44
    37030119 15335 9999  45
    26277664 15706 9999  46
    23045239 15782 9999  47
    39571342 15864    5  48
    35735976 16016 9999  49
     6158000 16612 9999  50
    11097108 16780   10  51
    34835567 16780 9999  52
     8193485 17091 9999  53
     3883243 17278 9999  54
    38103827 17362    3  55
    25962891 17374 9999  56
     7787500 17374 9999  57
     6066778 17795 9999  58
    25371852 17838 9999  59
    45998613 17917    7  60
    29225942 17991 9999  61
    45420805 18025    5  62
     8026829 18069 9999  63
    32046109 18221 9999  64
    24266686 18445 9999  65
    11191737 18555 9999  66
    11191728 18555 9999  67
    27440849 19159 9999  68
    25692576 19226 9999  69
     4252040 19226 9999  70
     4215870 19226 9999  71
    19854428 19321 9999  72
     9144723 19346 9999  73
     6045892 19523 9999  74
    26329156 19527 9999  75
     9277003 20067 9999  76
     9245880 20613 9999  77
    26144032 20914 9999  78
     4376343 21021 9999  79
    11312063 21131 9999  80
     9932116 21877 9999  81
    32684183 21921    5  82
    67810824 22078 9999  83
    39434350 22219 9999  84
    11049942 22698 9999  85
    38331742 22827 9999  86
    11162045 22875 9999  87
    23135237 22956 9999  88
     9205154 22982 9999  89
    25380722 23001    2  90
    43991091 23142 9999  91
    43991094 23142 9999  92
     3507359 23367 9999  93
     9465611 23400 9999  94
    37547534 23616 9999  95
    14311934 24196 9999  96
     9229318 24197 9999  97
     6493957 24411    0  98
     9309425 24648 9999  99
     8005863 24872 9999 100
    end
    DB2 consists of 2,434,786 observations and looks like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long cited_docdb_family_id str71 ctry_docdb float id_us
     569328 "['AT' 'DE' 'WO' 'US' 'EP']"                            1
     574660 "['JP' 'AU' 'AT' 'DE' 'WO' 'US' 'EP']"                  2
    1187498 "['DE' 'RU' 'US' 'WO']"                                 3
    1226468 "['DE' 'US' 'ES' 'EP' 'WO']"                            4
    1236571 "['WO' 'DE' 'CA' 'US' 'EP']"                            5
    1239098 "['WO' 'CA' 'DE' 'EP' 'US']"                            6
    1239277 "['WO' 'DE' 'US' 'EP']"                                 7
    1239483 "['DE' 'EP' 'US' 'WO']"                                 8
    1239622 "['DE' 'EP' 'WO' 'FI' 'US']"                            9
    1239624 "['EP' 'US' 'DE' 'WO']"                                10
    1239749 "['EP' 'JP' 'US' 'DE' 'WO']"                           11
    1334477 "['DE']"                                               12
    1340405 "['US' 'EP' 'WO']"                                     13
    1340418 "['EP' 'WO' 'US' 'GB']"                                14
    1340462 "['WO' 'US' 'EP']"                                     15
    1340471 "['US' 'WO' 'EP']"                                     16
    1340485 "['EP' 'JP' 'AU' 'US' 'CA' 'WO']"                      17
    1340488 "['WO' 'CA' 'EP' 'US']"                                18
    1340508 "['US' 'EP' 'WO' 'DE']"                                19
    1340519 "['EP' 'US' 'WO']"                                     20
    1340541 "['US' 'AU' 'EP' 'WO']"                                21
    1340647 "['EP' 'DE' 'WO' 'CA' 'US']"                           22
    1340659 "['WO' 'NZ' 'FI' 'AU' 'NO' 'EP' 'BR']"                 23
    1340673 "['EP' 'WO' 'FI' 'US']"                                24
    1340826 "['WO' 'EP']"                                          25
    1341006 "['EP' 'US' 'WO']"                                     26
    1341170 "['US' 'EP' 'WO']"                                     27
    1341175 "['EP' 'US' 'WO']"                                     28
    1341188 "['JP' 'EP' 'US' 'WO']"                                29
    1578920 "['NO' 'US' 'CH' 'FI' 'FR' 'SE' 'DK' 'GB' 'IL' 'AT']"  30
    1763337 "['WO' 'US']"                                          31
    2185207 "['US']"                                               32
    2267654 "['US']"                                               33
    2410050 "['US']"                                               34
    2456718 "['US']"                                               35
    2463680 "['US' 'DE' 'FI' 'EP' 'WO' 'CN']"                      36
    2588926 "['US']"                                               37
    2590654 "['US']"                                               38
    2856998 "['US']"                                               39
    3411930 "['US' 'WO' 'EP' 'CA']"                                40
    3459794 "['US']"                                               41
    3459918 "['US']"                                               42
    3460123 "['US']"                                               43
    3460237 "['US']"                                               44
    3460351 "['US']"                                               45
    3460409 "['US']"                                               46
    3460621 "['AT' 'EP' 'DE' 'CA' 'AU' 'WO' 'HK' 'US' 'AP' 'CN']"  47
    3460636 "['AP']"                                               48
    3460679 "['AP']"                                               49
    3460785 "['AR' 'US']"                                          50
    3460796 "['CA' 'AR' 'AU' 'BR' 'MX' 'US' 'EP']"                 51
    3460811 "['AR' 'MX' 'PE' 'US' 'BR' 'CO']"                      52
    3460812 "['AT' 'AR' 'US' 'ES' 'EP' 'DE']"                      53
    3460829 "['CN' 'BR' 'MX' 'US' 'AR' 'IT' 'CO']"                 54
    3460972 "['BR' 'US']"                                          55
    3460979 "['US' 'JP' 'BR' 'MX']"                                56
    3460980 "['US']"                                               57
    3460981 "['US']"                                               58
    3461002 "['AU' 'DE' 'AT' 'US' 'ZA' 'ES' 'EP' 'NZ']"            59
    3461010 "['BR' 'IL' 'US' 'EP']"                                60
    3461018 "['DE' 'ES' 'EP' 'AU' 'ZA']"                           61
    3461040 "['US' 'BR']"                                          62
    3461044 "['US' 'BR']"                                          63
    3461068 "['US']"                                               64
    3461084 "['DE' 'JP' 'US' 'FR']"                                65
    3461088 "['US']"                                               66
    3461099 "['US']"                                               67
    3461127 "['US' 'BR']"                                          68
    3461128 "['IT' 'BR' 'ES' 'US']"                                69
    3461146 "['US']"                                               70
    3461172 "['UY' 'BR' 'EP' 'US']"                                71
    3461173 "['US' 'BR']"                                          72
    3461192 "['AR' 'US']"                                          73
    3461205 "['DE' 'AT' 'AU' 'JP' 'US' 'EP' 'CA' 'BR']"            74
    3461208 "['US' 'AU' 'CN' 'BR' 'CA' 'EP']"                      75
    3461508 "['US']"                                               76
    3461552 "['US']"                                               77
    3461559 "['BR' 'US' 'DE']"                                     78
    3461677 "['US']"                                               79
    3461734 "['US']"                                               80
    3462350 "['US' 'BR']"                                          81
    3463151 "['US']"                                               82
    3463640 "['AU' 'DE' 'FR' 'ES' 'IT' 'AR' 'US']"                 83
    3463716 "['BR' 'US' 'AR']"                                     84
    3464348 "['GB' 'ES' 'DE' 'AU' 'FR' 'BR' 'AR' 'CA' 'US' 'IT']"  85
    3464855 "['AR' 'BR' 'CA' 'US' 'ES' 'IT']"                      86
    3466402 "['AR' 'US']"                                          87
    3466535 "['AR' 'CA' 'US']"                                     88
    3467766 "['AR' 'US']"                                          89
    3467767 "['AR' 'US']"                                          90
    3467833 "['AR' 'MX' 'AU' 'US' 'ZA' 'ES']"                      91
    3467977 "['AR' 'DE' 'JP' 'US' 'DK' 'GB' 'FR']"                 92
    3467987 "['DE' 'US' 'GB' 'FR' 'AR' 'IT' 'CA']"                 93
    3468001 "['AR' 'US']"                                          94
    3468888 "['BR' 'AR' 'US']"                                     95
    3469170 "['BR' 'US' 'AR']"                                     96
    3469171 "['AR' 'US']"                                          97
    3469272 "['ES' 'FR' 'AR' 'US']"                                98
    3470490 "['US' 'MX' 'NZ' 'BR' 'ZA' 'AU' 'AR' 'CA']"            99
    3471419 "['FR' 'JP' 'DE' 'IT' 'US' 'GB' 'AR']"                100
    end
    In DB1 I have more observations, since several docdb_family_id values can cite the same docdb_family_id (see cited_docdb_family_id). DB2 instead consists of unique docdb_family_id values. What I would like to do is simply merge m:1 DB1 with DB2, but this operation is computationally quite expensive. I have also tried -reclink- on the two files, but it does not seem to scale. I wonder whether Stata offers some parallel tool to cope with my situation, or whether I am missing some other solution to the problem.
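
    For concreteness, the operation in question is a plain m:1 merge on the shared key. A minimal sketch, assuming the two datasets are saved as db1.dta and db2.dta (placeholder file names):

    Code:
    use db1.dta, clear
    merge m:1 cited_docdb_family_id using db2.dta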

    Thank you

  • #2
    I don't know whether Stata has a tool that does merging in parallel, but let me think through where the computational costs are coming from. My intuition is that an m:1 merge should run in linear O(m + n) time if at least one of the underlying datasets is already sorted on the key, and in roughly O(m * n) time otherwise. So my guess is that Stata presorts a dataset in roughly O(n log n) time and then merges in O(n + m) time, for a merge algorithm in roughly O(n log n + n + m) time overall. That is actually quite good in terms of computational efficiency, and it could in principle be parallelized as well.

    The other big cost has to be memory management. High-level languages in general are notoriously bad at efficient memory management (R is well known to be one of the worst offenders here), so it really matters whether the algorithm is written in ado and interpreted; written in Mata, compiled to bytecode, and interpreted; or written in C and compiled to machine code. Assuming I'm not making any major mental mistakes, and assuming Stata's -merge- command is implemented in C and well optimized, you might do a bit better by parallelizing, particularly for very large datasets, but otherwise the algorithm should be about as fast as one can expect. In fact, I would bet that if you have one of the more expensive flavors of Stata, merge operations like this are already parallelized, though I'm not sure about that.

    So if it's not the theoretical efficiency, then what? I don't know your setup, but if I had to guess, I'd guess that you are consuming all of your available memory and spilling over into virtual memory on your hard drive. Remember, you have to store all of the data for both datasets, plus preallocate space for a dataset to contain the merged data. The total memory required doesn't even need to exceed your total RAM for this to happen, because your operating system, as well as any daemons running in the background, consumes memory too. If that is the case, parallelization won't really help you, because the major bottleneck is your SATA cable.

    Unfortunately, if it is a memory problem, I don't have a practical solution for you other than "you might need to let it run over the weekend." You could upgrade your hardware, get access to some kind of Stata-enabled computational cluster, or, if you know C, rewrite the merge algorithm yourself to optimize for your hardware. But speaking practically, you may just need to let it run over the weekend.
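
    If you want to check whether memory is the issue before committing a weekend to it, Stata's built-in reporting commands will tell you how much memory the data are using and how much Stata has allocated. A minimal sketch, using only official commands:

    Code:
    * how much memory Stata is currently using, and its allocation settings
    memory
    query memory

    * size and variable count of the dataset currently in memory
    describe, short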



    • #3
      Daniel Schaefer thanks a lot for the great reply. I think the issue is that I have to pre-sort the datasets. However, I agree with you that there may also be a memory issue, and it might take the weekend to run. Let's see what happens with the pre-sorting.



      • #4
        Great idea! A sort probably runs as part of the merge algorithm anyway, but if you have already presorted the data, then you get the best-case scenario for that sort, which I believe runs in O(n) time, giving you a (very fast) algorithm in roughly 2n + m time. Basically, the algorithm will probably still need to check that the data are sorted, but on average that is much faster than actually sorting them. Technically speaking, I'm not sure it's any faster overall to presort yourself and then merge, but it is still worthwhile: the faster the merge executes while running unattended over the weekend, the less likely it is that something will go wrong and interfere with the process.
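
        To make the presorting idea concrete, here is a minimal sketch, assuming the files are saved as db1.dta and db2.dta (placeholder names) and that cited_docdb_family_id is the merge key:

        Code:
        use db1.dta, clear
        sort cited_docdb_family_id
        save db1_sorted.dta, replace

        use db2.dta, clear
        sort cited_docdb_family_id
        save db2_sorted.dta, replace

        use db1_sorted.dta, clear
        merge m:1 cited_docdb_family_id using db2_sorted.dta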



        • #5
          Something's funny here. I created a simple model of Federico's situation, with the master file having 30,000,000 observations (3 obs. per ID) and one float variable, and the using file having 2,000,000 observations with one float variable. I put both files in random order with respect to ID (type long) and then did the m:1 merge. The run time on my 10-year-old Windows machine with Stata 15.1 and 8 GB of memory was only about a minute.
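
          A sketch of the kind of test I have in mind (the seed, variable names, and exact randomization are illustrative rather than the precise code I ran):

          Code:
          clear all
          set seed 12345

          * "using" file: 2,000,000 unique IDs, one float variable, in random order
          set obs 2000000
          generate long id = _n
          generate float x2 = runiform()
          generate double shuffle = runiform()
          sort shuffle
          drop shuffle
          tempfile using_file
          save `using_file'

          * "master" file: 30,000,000 observations (3 per ID), one float variable, in random order
          clear
          set obs 30000000
          generate long id = ceil(_n/3)
          generate float x1 = runiform()
          generate double shuffle = runiform()
          sort shuffle
          drop shuffle

          * time the m:1 merge
          timer clear 1
          timer on 1
          merge m:1 id using `using_file'
          timer off 1
          timer list 1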



          • #6
            The first thing I noticed about your question is that you said databases.

            If you in fact do have tables in a database and are comfortable programming in your database's dialect of SQL, I'd do the merge (join) there as well as any other data manipulation you might need to do. If you eventually plan to aggregate (-collapse-) your data, do that in SQL too. Looking at your ctry_docdb variable in the second table, that column might actually be stored as an array column in the database table. Databases have functions for dealing with arrays. Stata does not have an array variable type, so you'd be stuck doing a bunch of string manipulation.

            Long story short, Stata is tops when it comes to estimation and statistics. But if your data is in a database, use SQL to get the dataset into a form for statistical analysis. Modern databases are ridiculously fast and are designed for joining, aggregating, etc. I tear through terabytes of data in seconds in SQL. Use the best tool for the job.
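
            As a concrete illustration: if the tables are reachable through ODBC, you can have the database do the join and load only the result into Stata. This is just a sketch; the DSN ("mydatabase") and the table names (db1, db2) are placeholders for whatever your database actually contains:

            Code:
            * build the query in a local macro, then let the database perform the join
            local sql SELECT a.docdb_family_id, a.cited_docdb_family_id, a.distance, b.ctry_docdb
            local sql `sql' FROM db1 AS a LEFT JOIN db2 AS b
            local sql `sql' ON a.cited_docdb_family_id = b.cited_docdb_family_id
            odbc load, exec("`sql'") dsn("mydatabase") clear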



            • #7
              That is funny! I have no idea how to explain that. I wonder if Federico Nutarelli could let us know how long he let the merge run before giving up?

              Runtime doesn't have to scale linearly with input size, so there are cases where a modest increase in the number of observations produces a disproportionately large increase in total runtime. That said, I doubt that's what's happening here. Mike Lacy, would you be willing to confirm by rerunning your test with 30,200,000 observations in the master file and 2,500,000 observations in the using file? I doubt it will take much more than a minute.



              • #8
                Brian Poi absolutely; one big advantage of SQL is that, depending on how the database engine implements the join, matching keys can be looked up in roughly constant time with hashtable-style lookups. That said, I doubt the OP is using SQL here.

                If you're looking for array-like objects in Stata, you can get much of the same functionality out of -matrix- objects and expressions. A 1-D matrix is essentially an array with a few extra properties. It's a different set of primitives than you get in SQL, but it's useful to know about if you're coming from another language, and it isn't obvious from most of Stata's materials and documentation.
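
                A tiny sketch of the idea, using only built-in -matrix- commands (the names are arbitrary, and note that Stata matrices hold numbers, not strings):

                Code:
                * a 1 x 4 row vector used like an array
                matrix v = (10, 20, 30, 40)

                * read a single element (row 1, column 3)
                display v[1, 3]

                * overwrite an element in place, then list the result
                matrix v[1, 3] = 99
                matrix list v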



                • #9
                  Daniel Schaefer : I tried with 36,000,000 observations in db1 and 3,000,000 in db2, and it took about 90 sec.



                  • #10
                    Thanks Mike! That's fast. I think it might now be safe to say that the OP's problem is not processing speed...



                    • #11
                      Mike Lacy Daniel Schaefer thanks for the replies. Fuzzy matching still takes forever on my computer. I am using the -reclink- command. The -merge- command actually works. The problem (and I am sorry for this) is that there were two files having the same name. They were the same in essence, but the file I was merging to contained many more string variables (about 500) than the other one. I don't know whether that makes such a huge difference to the -merge- command, though...



                      • #12
                        Federico, I'm finding your last post a little confusing, and I'm hoping you can clarify a few points.

                        The problem (and I am sorry for this) is that there were two files having the same name.
                        So to clarify, you've correctly identified the problem - you were trying to merge the string-heavy version of the file - and merging the correct version of the file solved the problem?

                        Fuzzy matching still takes forever on my computer.
                        How long did fuzzy matching take? Or if you gave up at some point, how long did you wait for it to halt before giving up?



                        • #13
                          So to clarify, you've correctly identified the problem - you were trying to merge the string-heavy version of the file - and merging the correct version of the file solved the problem?
                          Exactly!

                          How long did fuzzy matching take? Or if you gave up at some point, how long did you wait for it to halt before giving up?
                          I actually gave up after around 40 minutes
