  • Merging huge databases

    Hi all,

    I have two very large databases. DB1 has 30,181,537 observations and looks like the following:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(docdb_family_id cited_docdb_family_id) int distance long id_mas
    14655790   184 9999   1
    22435239   273 9999   2
    23413025   277 9999   3
    25032836   376 9999   4
    25236037   796 9999   5
    25389490   967 9999   6
     6148394  1224 9999   7
     8213391  1335 9999   8
     8213486  1335 9999   9
     9378178  1354 9999  10
     6122783  1903 9999  11
    25136035  2207 9999  12
    38622262  2570    4  13
    47664232  2629    3  14
     9435050  2701 9999  15
    24755188  3048 9999  16
    23564926  3091 9999  17
    10955460  3143 9999  18
    25479825  3218 9999  19
     6146064  3280 9999  20
     4209965  3325 9999  21
    23820210  3363 9999  22
    22659293  4045 9999  23
     7921670  4107 9999  24
     6145713  4412 9999  25
     6198118  4476 9999  26
    17647544  6071    3  27
    25347708  7261 9999  28
    53479553  7307    3  29
    23326895  9929 9999  30
     4237610 10702 9999  31
     3518621 11028 9999  32
     6099169 11330 9999  33
     9477906 11394 9999  34
    40863376 11682    4  35
     7645211 11797    9  36
    69187708 12355    1  37
    49167122 13043    4  38
    32175886 13050 9999  39
    32314360 13050 9999  40
     3518621 13811 9999  41
    21782807 14102 9999  42
     8195431 15222 9999  43
    38777234 15335    9  44
    37030119 15335 9999  45
    26277664 15706 9999  46
    23045239 15782 9999  47
    39571342 15864    5  48
    35735976 16016 9999  49
     6158000 16612 9999  50
    11097108 16780   10  51
    34835567 16780 9999  52
     8193485 17091 9999  53
     3883243 17278 9999  54
    38103827 17362    3  55
    25962891 17374 9999  56
     7787500 17374 9999  57
     6066778 17795 9999  58
    25371852 17838 9999  59
    45998613 17917    7  60
    29225942 17991 9999  61
    45420805 18025    5  62
     8026829 18069 9999  63
    32046109 18221 9999  64
    24266686 18445 9999  65
    11191737 18555 9999  66
    11191728 18555 9999  67
    27440849 19159 9999  68
    25692576 19226 9999  69
     4252040 19226 9999  70
     4215870 19226 9999  71
    19854428 19321 9999  72
     9144723 19346 9999  73
     6045892 19523 9999  74
    26329156 19527 9999  75
     9277003 20067 9999  76
     9245880 20613 9999  77
    26144032 20914 9999  78
     4376343 21021 9999  79
    11312063 21131 9999  80
     9932116 21877 9999  81
    32684183 21921    5  82
    67810824 22078 9999  83
    39434350 22219 9999  84
    11049942 22698 9999  85
    38331742 22827 9999  86
    11162045 22875 9999  87
    23135237 22956 9999  88
     9205154 22982 9999  89
    25380722 23001    2  90
    43991091 23142 9999  91
    43991094 23142 9999  92
     3507359 23367 9999  93
     9465611 23400 9999  94
    37547534 23616 9999  95
    14311934 24196 9999  96
     9229318 24197 9999  97
     6493957 24411    0  98
     9309425 24648 9999  99
     8005863 24872 9999 100
    end
    DB2 consists of 2,434,786 observations and looks like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long cited_docdb_family_id str71 ctry_docdb float id_us
     569328 "['AT' 'DE' 'WO' 'US' 'EP']"                            1
     574660 "['JP' 'AU' 'AT' 'DE' 'WO' 'US' 'EP']"                  2
    1187498 "['DE' 'RU' 'US' 'WO']"                                 3
    1226468 "['DE' 'US' 'ES' 'EP' 'WO']"                            4
    1236571 "['WO' 'DE' 'CA' 'US' 'EP']"                            5
    1239098 "['WO' 'CA' 'DE' 'EP' 'US']"                            6
    1239277 "['WO' 'DE' 'US' 'EP']"                                 7
    1239483 "['DE' 'EP' 'US' 'WO']"                                 8
    1239622 "['DE' 'EP' 'WO' 'FI' 'US']"                            9
    1239624 "['EP' 'US' 'DE' 'WO']"                                10
    1239749 "['EP' 'JP' 'US' 'DE' 'WO']"                           11
    1334477 "['DE']"                                               12
    1340405 "['US' 'EP' 'WO']"                                     13
    1340418 "['EP' 'WO' 'US' 'GB']"                                14
    1340462 "['WO' 'US' 'EP']"                                     15
    1340471 "['US' 'WO' 'EP']"                                     16
    1340485 "['EP' 'JP' 'AU' 'US' 'CA' 'WO']"                      17
    1340488 "['WO' 'CA' 'EP' 'US']"                                18
    1340508 "['US' 'EP' 'WO' 'DE']"                                19
    1340519 "['EP' 'US' 'WO']"                                     20
    1340541 "['US' 'AU' 'EP' 'WO']"                                21
    1340647 "['EP' 'DE' 'WO' 'CA' 'US']"                           22
    1340659 "['WO' 'NZ' 'FI' 'AU' 'NO' 'EP' 'BR']"                 23
    1340673 "['EP' 'WO' 'FI' 'US']"                                24
    1340826 "['WO' 'EP']"                                          25
    1341006 "['EP' 'US' 'WO']"                                     26
    1341170 "['US' 'EP' 'WO']"                                     27
    1341175 "['EP' 'US' 'WO']"                                     28
    1341188 "['JP' 'EP' 'US' 'WO']"                                29
    1578920 "['NO' 'US' 'CH' 'FI' 'FR' 'SE' 'DK' 'GB' 'IL' 'AT']"  30
    1763337 "['WO' 'US']"                                          31
    2185207 "['US']"                                               32
    2267654 "['US']"                                               33
    2410050 "['US']"                                               34
    2456718 "['US']"                                               35
    2463680 "['US' 'DE' 'FI' 'EP' 'WO' 'CN']"                      36
    2588926 "['US']"                                               37
    2590654 "['US']"                                               38
    2856998 "['US']"                                               39
    3411930 "['US' 'WO' 'EP' 'CA']"                                40
    3459794 "['US']"                                               41
    3459918 "['US']"                                               42
    3460123 "['US']"                                               43
    3460237 "['US']"                                               44
    3460351 "['US']"                                               45
    3460409 "['US']"                                               46
    3460621 "['AT' 'EP' 'DE' 'CA' 'AU' 'WO' 'HK' 'US' 'AP' 'CN']"  47
    3460636 "['AP']"                                               48
    3460679 "['AP']"                                               49
    3460785 "['AR' 'US']"                                          50
    3460796 "['CA' 'AR' 'AU' 'BR' 'MX' 'US' 'EP']"                 51
    3460811 "['AR' 'MX' 'PE' 'US' 'BR' 'CO']"                      52
    3460812 "['AT' 'AR' 'US' 'ES' 'EP' 'DE']"                      53
    3460829 "['CN' 'BR' 'MX' 'US' 'AR' 'IT' 'CO']"                 54
    3460972 "['BR' 'US']"                                          55
    3460979 "['US' 'JP' 'BR' 'MX']"                                56
    3460980 "['US']"                                               57
    3460981 "['US']"                                               58
    3461002 "['AU' 'DE' 'AT' 'US' 'ZA' 'ES' 'EP' 'NZ']"            59
    3461010 "['BR' 'IL' 'US' 'EP']"                                60
    3461018 "['DE' 'ES' 'EP' 'AU' 'ZA']"                           61
    3461040 "['US' 'BR']"                                          62
    3461044 "['US' 'BR']"                                          63
    3461068 "['US']"                                               64
    3461084 "['DE' 'JP' 'US' 'FR']"                                65
    3461088 "['US']"                                               66
    3461099 "['US']"                                               67
    3461127 "['US' 'BR']"                                          68
    3461128 "['IT' 'BR' 'ES' 'US']"                                69
    3461146 "['US']"                                               70
    3461172 "['UY' 'BR' 'EP' 'US']"                                71
    3461173 "['US' 'BR']"                                          72
    3461192 "['AR' 'US']"                                          73
    3461205 "['DE' 'AT' 'AU' 'JP' 'US' 'EP' 'CA' 'BR']"            74
    3461208 "['US' 'AU' 'CN' 'BR' 'CA' 'EP']"                      75
    3461508 "['US']"                                               76
    3461552 "['US']"                                               77
    3461559 "['BR' 'US' 'DE']"                                     78
    3461677 "['US']"                                               79
    3461734 "['US']"                                               80
    3462350 "['US' 'BR']"                                          81
    3463151 "['US']"                                               82
    3463640 "['AU' 'DE' 'FR' 'ES' 'IT' 'AR' 'US']"                 83
    3463716 "['BR' 'US' 'AR']"                                     84
    3464348 "['GB' 'ES' 'DE' 'AU' 'FR' 'BR' 'AR' 'CA' 'US' 'IT']"  85
    3464855 "['AR' 'BR' 'CA' 'US' 'ES' 'IT']"                      86
    3466402 "['AR' 'US']"                                          87
    3466535 "['AR' 'CA' 'US']"                                     88
    3467766 "['AR' 'US']"                                          89
    3467767 "['AR' 'US']"                                          90
    3467833 "['AR' 'MX' 'AU' 'US' 'ZA' 'ES']"                      91
    3467977 "['AR' 'DE' 'JP' 'US' 'DK' 'GB' 'FR']"                 92
    3467987 "['DE' 'US' 'GB' 'FR' 'AR' 'IT' 'CA']"                 93
    3468001 "['AR' 'US']"                                          94
    3468888 "['BR' 'AR' 'US']"                                     95
    3469170 "['BR' 'US' 'AR']"                                     96
    3469171 "['AR' 'US']"                                          97
    3469272 "['ES' 'FR' 'AR' 'US']"                                98
    3470490 "['US' 'MX' 'NZ' 'BR' 'ZA' 'AU' 'AR' 'CA']"            99
    3471419 "['FR' 'JP' 'DE' 'IT' 'US' 'GB' 'AR']"                100
    end
    In DB1 I have more observations, since several docdb_family_id values can cite the same docdb_family_id (see cited_docdb_family_id). DB2 instead consists of unique docdb_family_id values. What I would like to do is simply merge m:1 DB1 with DB2, but this operation is computationally quite expensive. I have also tried -reclink- on the two files, but it does not seem to scale. I wonder whether Stata offers some parallel tool to cope with my situation, or whether I am missing some other solution to the problem.
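
    For concreteness, the operation in question is a plain m:1 merge on the shared key. A minimal sketch, assuming the two datasets are saved as db1.dta and db2.dta (placeholder file names):

    Code:
    use db1.dta, clear
    merge m:1 cited_docdb_family_id using db2.dta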

    Thank you

  • #2
    I don't know whether Stata has a tool that does merging in parallel, but let me think through where the computational costs are coming from. My intuition is that an m:1 merge should run in linear O(m + n) time if at least one of the underlying datasets is already sorted on the key, and in roughly O(m * n) time otherwise. So my guess is that Stata presorts a dataset in roughly O(n log n) time and then merges in O(n + m) time, for a merge algorithm in roughly O(n log n + n + m) time overall. That is actually quite good in terms of computational efficiency, and it could in principle be parallelized as well.

    The other big cost has to be memory management. High-level languages in general are notoriously bad at efficient memory management (R is well known to be one of the worst offenders here), so it really matters whether the algorithm is written in ado and interpreted; written in Mata, compiled to bytecode, and interpreted; or written in C and compiled to machine code. Assuming I'm not making any major mental mistakes, and assuming Stata's -merge- command is implemented in C and well optimized, you might do a bit better by parallelizing, particularly for very large datasets, but otherwise the algorithm should be about as fast as one can expect. In fact, I would bet that if you have one of the more expensive flavors of Stata, merge operations like this are already parallelized, though I'm not sure about that.

    So if it's not the theoretical efficiency, then what? I don't know your setup, but if I had to guess, I'd guess that you are consuming all of your available memory and spilling over into virtual memory on your hard drive. Remember, you have to store all of the data for both datasets, plus preallocate space for a dataset to contain the merged data. The total memory required doesn't even need to exceed your total RAM for this to happen, because your operating system, as well as any daemons running in the background, consumes memory too. If that is the case, parallelization won't really help you, because the major bottleneck is your SATA cable.

    Unfortunately, if it is a memory problem, I don't have a practical solution for you other than "you might need to let it run over the weekend." You could upgrade your hardware, get access to some kind of Stata-enabled computational cluster, or, if you know C, rewrite the merge algorithm yourself to optimize for your hardware. But speaking practically, you may just need to let it run over the weekend.
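
    If you want to check whether memory is the issue before committing a weekend to it, Stata's built-in reporting commands will tell you how much memory the data are using and how much Stata has allocated. A minimal sketch, using only official commands:

    Code:
    * how much memory Stata is currently using, and its allocation settings
    memory
    query memory

    * size and variable count of the dataset currently in memory
    describe, short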



    • #3
      Daniel Schaefer thanks a lot for the great reply. I think the issue is that I have to pre-sort the datasets. However, I agree with you that there may also be a memory issue, and it might take the weekend to run. Let's see what happens with the pre-sorting.



      • #4
        Great idea! A sort probably runs as part of the merge algorithm anyway, but if you have already presorted the data, then you get the best-case scenario for that sort, which I believe runs in O(n) time, giving you a (very fast) algorithm in roughly 2n + m time. Basically, the algorithm will probably still need to check that the data are sorted, but on average that is much faster than actually sorting them. Technically speaking, I'm not sure it's any faster overall to presort yourself and then merge, but it is still worthwhile: the faster the merge executes while running unattended over the weekend, the less likely it is that something will go wrong and interfere with the process.
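
        To make the presorting idea concrete, here is a minimal sketch, assuming the files are saved as db1.dta and db2.dta (placeholder names) and that cited_docdb_family_id is the merge key:

        Code:
        use db1.dta, clear
        sort cited_docdb_family_id
        save db1_sorted.dta, replace

        use db2.dta, clear
        sort cited_docdb_family_id
        save db2_sorted.dta, replace

        use db1_sorted.dta, clear
        merge m:1 cited_docdb_family_id using db2_sorted.dta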



        • #5
          Something's funny here. I created a simple model of Federico's situation, with the master file having 30,000,000 observations (3 obs. per ID) and one float variable, and the using file having 2,000,000 observations with one float variable. I put both files in random order with respect to ID (type long) and then did the m:1 merge. The run time on my 10-year-old Windows machine with Stata 15.1 and 8 GB of memory was only about a minute.
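
          A sketch of the kind of test I have in mind (the seed, variable names, and exact randomization are illustrative rather than the precise code I ran):

          Code:
          clear all
          set seed 12345

          * "using" file: 2,000,000 unique IDs, one float variable, in random order
          set obs 2000000
          generate long id = _n
          generate float x2 = runiform()
          generate double shuffle = runiform()
          sort shuffle
          drop shuffle
          tempfile using_file
          save `using_file'

          * "master" file: 30,000,000 observations (3 per ID), one float variable, in random order
          clear
          set obs 30000000
          generate long id = ceil(_n/3)
          generate float x1 = runiform()
          generate double shuffle = runiform()
          sort shuffle
          drop shuffle

          * time the m:1 merge
          timer clear 1
          timer on 1
          merge m:1 id using `using_file'
          timer off 1
          timer list 1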



          • #6
            The first thing I noticed about your question is that you said databases.

            If you in fact do have tables in a database and are comfortable programming in your database's dialect of SQL, I'd do the merge (join) there as well as any other data manipulation you might need to do. If you eventually plan to aggregate (-collapse-) your data, do that in SQL too. Looking at your ctry_docdb variable in the second table, that column might actually be stored as an array column in the database table. Databases have functions for dealing with arrays. Stata does not have an array variable type, so you'd be stuck doing a bunch of string manipulation.

            Long story short, Stata is tops when it comes to estimation and statistics. But if your data is in a database, use SQL to get the dataset into a form for statistical analysis. Modern databases are ridiculously fast and are designed for joining, aggregating, etc. I tear through terabytes of data in seconds in SQL. Use the best tool for the job.
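
            As a concrete illustration: if the tables are reachable through ODBC, you can have the database do the join and load only the result into Stata. This is just a sketch; the DSN ("mydatabase") and the table names (db1, db2) are placeholders for whatever your database actually contains:

            Code:
            * build the query in a local macro, then let the database perform the join
            local sql SELECT a.docdb_family_id, a.cited_docdb_family_id, a.distance, b.ctry_docdb
            local sql `sql' FROM db1 AS a LEFT JOIN db2 AS b
            local sql `sql' ON a.cited_docdb_family_id = b.cited_docdb_family_id
            odbc load, exec("`sql'") dsn("mydatabase") clear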



            • #7
              That is funny! I have no idea how to explain that. I wonder if Federico Nutarelli could let us know how long he let the merge run before giving up?

              Runtime doesn't have to scale linearly with input size, so there are cases where a modest increase in the number of observations produces a disproportionately large increase in total runtime. That said, I doubt that's what's happening here. Mike Lacy, would you be willing to confirm by rerunning your test with 30,200,000 observations in the master file and 2,500,000 observations in the using file? I doubt it will take much more than a minute.



              • #8
                Brian Poi absolutely; one big advantage of SQL is that, depending on how the database engine implements the join, matching keys can be looked up in roughly constant time with hashtable-style lookups. That said, I doubt the OP is using SQL here.

                If you're looking for array-like objects in Stata, you can get much of the same functionality out of -matrix- objects and expressions. A 1-D matrix is essentially an array with a few extra properties. It's a different set of primitives than you get in SQL, but it's useful to know about if you're coming from another language, and it isn't obvious from most of Stata's materials and documentation.
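
                A tiny sketch of the idea, using only built-in -matrix- commands (the names are arbitrary, and note that Stata matrices hold numbers, not strings):

                Code:
                * a 1 x 4 row vector used like an array
                matrix v = (10, 20, 30, 40)

                * read a single element (row 1, column 3)
                display v[1, 3]

                * overwrite an element in place, then list the result
                matrix v[1, 3] = 99
                matrix list v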



                • #9
                  Daniel Schaefer : I tried with 36,000,000 observations in db1 and 3,000,000 in db2, and it took about 90 sec.



                  • #10
                    Thanks Mike! That's fast. I think it might now be safe to say that the OP's problem is not processing speed...



                    • #11
                      Mike Lacy Daniel Schaefer thanks for the replies. Fuzzy matching still takes forever on my computer. I am using the -reclink- command. The -merge- command actually works. The problem (and I am sorry for this) is that there were two files having the same name. They were the same in essence, but the file I was merging to contained many more string variables (about 500) than the other one. I don't know whether that makes such a huge difference to the -merge- command, though...



                      • #12
                        Federico, I'm finding your last post a little confusing, and I'm hoping you can clarify a few points.

                        The problem (and I am sorry for this) is that there were two files having the same name.
                        So to clarify, you've correctly identified the problem - you were trying to merge the string-heavy version of the file - and merging the correct version of the file solved the problem?

                        Fuzzy matching still takes forever on my computer.
                        How long did fuzzy matching take? Or if you gave up at some point, how long did you wait for it to halt before giving up?



                        • #13
                          So to clarify, you've correctly identified the problem - you were trying to merge the string-heavy version of the file - and merging the correct version of the file solved the problem?
                          Exactly!

                          How long did fuzzy matching take? Or if you gave up at some point, how long did you wait for it to halt before giving up?
                          I actually gave up after around 40 minutes
