
  • Merge Databases

    Hello everyone,
    I'm trying to combine the following datasets (this is a subsample)

    DATASET 1
    Code:
    id    value
    2435646    7894568
    2435646    6589044
    2435646    5673686
    6557789    6577856
    6557789    2535346
    8799085    5632546
    8799085    8678658
    8799085    9788885
    8799085    8325155
    3422345    4334356
    1443259    6453431
    1443259    2331344 
    DATASET 2
    Code:
    id    value2
    1443259    132445
    1443259    5674567
    1443259    756789
    1443259    96965
    7333445    5837357
    2435646    789678
    2435646    5676266
    6557789    336856
    3322445    6567848
    9900343    8969769
    8799085    245535
    8799085    643563 
    In each dataset I have two variables, and the id is the common identifier in both. For example, the id 2435646 (the first one) is repeated three times in DATASET 1 and two times in DATASET 2; the id 6557789 appears two times in DATASET 1 and one time in DATASET 2, and so on. DATASET 1 has over a million observations, while DATASET 2 has around 50 thousand.

    What I need is to merge/combine both datasets in order to find out how many ids appear in both databases (not every id appears in both; some are unique to one dataset), and then to calculate the sum of the two values (value and value2) for the ids that appear in both.

    My first attempt was to use -merge m:m-; however, I have read that this is not recommended because its results are not reliable.

    Thanks in advance

  • #2
    If you only want to find out how many ids are repeated, then make two new files, each of which contains the id only; use -duplicates- to eliminate the duplicates and then -merge 1:1-; see
    Code:
    help drop
    help duplicates
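
    A minimal sketch of that workflow, assuming the two files are saved as dataset1.dta and dataset2.dta (placeholder names):
    Code:
    * keep one record per id from dataset 2
    use dataset2, clear
    keep id
    duplicates drop
    tempfile ids2
    save `ids2'
    
    * do the same for dataset 1, then match on id
    use dataset1, clear
    keep id
    duplicates drop
    merge 1:1 id using `ids2'
    count if _merge == 3   // number of ids present in both datasets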


    • #3
      Thank you, Rich. However, after combining the datasets, I would also like to add up value and value2 in order to determine how much each id has in total.


      • #4
        Also, I would like to count how many times each id is repeated once I merge/combine the databases. Can anyone help me with this, please?


        • #5
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(id value2)
          1443259  132445
          1443259 5674567
          1443259  756789
          1443259   96965
          7333445 5837357
          2435646  789678
          2435646 5676266
          6557789  336856
          3322445 6567848
          9900343 8969769
          8799085  245535
          8799085  643563
          end
          
          rename value2 value
          gen dataset=2
          bys id: egen count=count(id)
          tempfile dataset2
          save `dataset2'
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(id value)
          2435646 7894568
          2435646 6589044
          2435646 5673686
          6557789 6577856
          6557789 2535346
          8799085 5632546
          8799085 8678658
          8799085 9788885
          8799085 8325155
          3422345 4334356
          1443259 6453431
          1443259 2331344
          end
          
          gen dataset=1
          bys id: egen count=count(id)
          append using `dataset2'
          bys id: egen total=count(id)
          "count" gives you the totals within each data set whereas "total" gives you the combined total. So the key here is to use append and apply standard egen functions.

          Result:

          Code:
          . sort id dataset
          
          . l, sepby(id)
          
               +---------------------------------------------+
               |      id     value   dataset   count   total |
               |---------------------------------------------|
            1. | 1443259   2331344         1       2       6 |
            2. | 1443259   6453431         1       2       6 |
            3. | 1443259     96965         2       4       6 |
            4. | 1443259   5674567         2       4       6 |
            5. | 1443259    756789         2       4       6 |
            6. | 1443259    132445         2       4       6 |
               |---------------------------------------------|
            7. | 2435646   5673686         1       3       5 |
            8. | 2435646   6589044         1       3       5 |
            9. | 2435646   7894568         1       3       5 |
           10. | 2435646    789678         2       2       5 |
           11. | 2435646   5676266         2       2       5 |
               |---------------------------------------------|
           12. | 3322445   6567848         2       1       1 |
               |---------------------------------------------|
           13. | 3422345   4334356         1       1       1 |
               |---------------------------------------------|
           14. | 6557789   2535346         1       2       3 |
           15. | 6557789   6577856         1       2       3 |
           16. | 6557789    336856         2       1       3 |
               |---------------------------------------------|
           17. | 7333445   5837357         2       1       1 |
               |---------------------------------------------|
           18. | 8799085   8678658         1       4       6 |
           19. | 8799085   8325155         1       4       6 |
           20. | 8799085   9788885         1       4       6 |
           21. | 8799085   5632546         1       4       6 |
           22. | 8799085    643563         2       2       6 |
           23. | 8799085    245535         2       2       6 |
               |---------------------------------------------|
           24. | 9900343   8969769         2       1       1 |
               +---------------------------------------------+
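
          To also get the per-id sum of the values asked about in #3, one more -egen- call on the appended data should do it (sumvalue is a placeholder name):

          Code:
          * sum of value across both datasets, within each id
          bys id: egen sumvalue = total(value)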


          • #6
            Thank you very much, Andrew. The problem is that -list, sepby()- is only practical when the dataset is short. In my case, I have over a million observations. Does anyone know of an alternative?
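
            A sketch of one possible alternative (variable and file names are placeholders): since -sepby()- only changes how -list- displays the data, the appended data from #5 could instead be reduced to one row per id, keeping the combined count and the sum of the values, and saved rather than listed:

            Code:
            * one row per id: combined count and sum of the values
            collapse (count) total=value (sum) sumvalue=value, by(id)
            save id_summary, replace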
