
  • Merge Databases

    Hello everyone,
    I'm trying to combine the following datasets (this is a subsample)

    DATASET 1
    Code:
    id    value
    2435646    7894568
    2435646    6589044
    2435646    5673686
    6557789    6577856
    6557789    2535346
    8799085    5632546
    8799085    8678658
    8799085    9788885
    8799085    8325155
    3422345    4334356
    1443259    6453431
    1443259    2331344 
    DATASET 2
    Code:
    id    value2
    1443259    132445
    1443259    5674567
    1443259    756789
    1443259    96965
    7333445    5837357
    2435646    789678
    2435646    5676266
    6557789    336856
    3322445    6567848
    9900343    8969769
    8799085    245535
    8799085    643563 
    In each dataset I have two variables, and the id is the common identifier in both. For example, the id 2435646 (the first one) is repeated three times in DATASET 1 and two times in DATASET 2; the id 6557789 appears two times in DATASET 1 and one time in DATASET 2, and so on. DATASET 1 has over a million observations, while DATASET 2 has around 50 thousand.

    What I need is to merge/combine both datasets in order to find out how many ids appear in both databases (not every id appears in both; some are unique to one dataset), and then to calculate the sum of the two values (value and value2) for the ids that appear in both.

    My first attempt was to use -merge m:m-; however, I have read that this is not recommended because its results are not reliable.

    Thanks in advance

  • #2
    If you only want to find out how many ids are repeated, then make two new files, each of which contains the id only; use -duplicates- to eliminate the duplicates and then -merge 1:1-; see
    Code:
    help drop
    help duplicates
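
    A minimal sketch of that workflow, assuming the two files are saved as dataset1.dta and dataset2.dta (placeholder names):
    Code:
    * keep one record per id from dataset 2
    use dataset2, clear
    keep id
    duplicates drop
    tempfile ids2
    save `ids2'
    
    * do the same for dataset 1, then match on id
    use dataset1, clear
    keep id
    duplicates drop
    merge 1:1 id using `ids2'
    count if _merge == 3   // number of ids present in both datasets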


    • #3
      Thank you, Rich. However, after combining the datasets, I would also like to add up value and value2 in order to determine how much each id has in total.


      • #4
        Also, I would like to count how many times each id is repeated once I merge/combine the databases. Can anyone help me with this, please?


        • #5
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(id value2)
          1443259  132445
          1443259 5674567
          1443259  756789
          1443259   96965
          7333445 5837357
          2435646  789678
          2435646 5676266
          6557789  336856
          3322445 6567848
          9900343 8969769
          8799085  245535
          8799085  643563
          end
          
          rename value2 value
          gen dataset=2
          bys id: egen count=count(id)
          tempfile dataset2
          save `dataset2'
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(id value)
          2435646 7894568
          2435646 6589044
          2435646 5673686
          6557789 6577856
          6557789 2535346
          8799085 5632546
          8799085 8678658
          8799085 9788885
          8799085 8325155
          3422345 4334356
          1443259 6453431
          1443259 2331344
          end
          
          gen dataset=1
          bys id: egen count=count(id)
          append using `dataset2'
          bys id: egen total=count(id)
          "count" gives you the totals within each data set whereas "total" gives you the combined total. So the key here is to use append and apply standard egen functions.

          Result:

          Code:
          . sort id dataset
          
          . l, sepby(id)
          
               +---------------------------------------------+
               |      id     value   dataset   count   total |
               |---------------------------------------------|
            1. | 1443259   2331344         1       2       6 |
            2. | 1443259   6453431         1       2       6 |
            3. | 1443259     96965         2       4       6 |
            4. | 1443259   5674567         2       4       6 |
            5. | 1443259    756789         2       4       6 |
            6. | 1443259    132445         2       4       6 |
               |---------------------------------------------|
            7. | 2435646   5673686         1       3       5 |
            8. | 2435646   6589044         1       3       5 |
            9. | 2435646   7894568         1       3       5 |
           10. | 2435646    789678         2       2       5 |
           11. | 2435646   5676266         2       2       5 |
               |---------------------------------------------|
           12. | 3322445   6567848         2       1       1 |
               |---------------------------------------------|
           13. | 3422345   4334356         1       1       1 |
               |---------------------------------------------|
           14. | 6557789   2535346         1       2       3 |
           15. | 6557789   6577856         1       2       3 |
           16. | 6557789    336856         2       1       3 |
               |---------------------------------------------|
           17. | 7333445   5837357         2       1       1 |
               |---------------------------------------------|
           18. | 8799085   8678658         1       4       6 |
           19. | 8799085   8325155         1       4       6 |
           20. | 8799085   9788885         1       4       6 |
           21. | 8799085   5632546         1       4       6 |
           22. | 8799085    643563         2       2       6 |
           23. | 8799085    245535         2       2       6 |
               |---------------------------------------------|
           24. | 9900343   8969769         2       1       1 |
               +---------------------------------------------+
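
          To also get the per-id sum of the values asked about in #3, one more -egen- call on the appended data should do it (sumvalue is a placeholder name):

          Code:
          * sum of value across both datasets, within each id
          bys id: egen sumvalue = total(value)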


          • #6
            Thank you very much, Andrew. The problem is that -list, sepby()- is only practical when the dataset is short. In my case, I have over a million observations. Does anyone know of an alternative?
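
            A sketch of one possible alternative (variable and file names are placeholders): since -sepby()- only changes how -list- displays the data, the appended data from #5 could instead be reduced to one row per id, keeping the combined count and the sum of the values, and saved rather than listed:

            Code:
            * one row per id: combined count and sum of the values
            collapse (count) total=value (sum) sumvalue=value, by(id)
            save id_summary, replace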
