
  • Merging duplicated dataset

    Hey Hey
    I want to merge these two datasets. The first one has duplicated values that I have to keep. NPC_FIC is the firm identifier.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double NPC_FIC
    500000001
    500000001
    500000001
    500000002
    500000002
    500000002
    500000002
    500000002
    500000033
    500000033
    500000050
    500000050
    500000083
    500000104
    500000104
    500000119
    500000119
    500000119
    500000119
    500000119
    500000119
    500000119
    500000119
    500000121
    500000121
    500000156
    500000156
    500000157
    500000157
    500000157
    500000157
    500000165
    500000165
    500000165
    500000165
    500000180
    500000198
    500000198
    500000198
    500000201
    500000204
    500000204
    500000212
    500000215
    500000215
    500000215
    500000215
    500000230
    500000240
    500000240
    500000276
    500000276
    500000276
    500000279
    500000284
    500000284
    500000284
    500000284
    500000305
    500000324
    500000334
    500000341
    500000341
    500000341
    500000346
    500000346
    500000346
    500000346
    500000346
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000395
    500000395
    500000395
    500000395
    500000395
    500000395
    500000465
    500000465
    500000465
    500000470
    500000470
    500000478
    500000525
    500000543
    500000543
    500000565
    500000583
    500000583
    500000600
    500000600
    500000601
    500000633
    end

    Listed 100 out of 2881513 observations
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double NPC_FIC
    500000001
    500000002
    500000033
    500000050
    500000083
    500000104
    500000119
    500000121
    500000156
    500000157
    500000165
    500000180
    500000198
    500000201
    500000204
    500000212
    500000215
    500000230
    500000240
    500000276
    500000279
    500000284
    500000305
    500000324
    500000334
    500000341
    500000346
    500000376
    500000376
    500000376
    500000395
    500000465
    500000470
    500000478
    500000525
    500000543
    500000565
    500000583
    500000583
    500000600
    500000601
    500000633
    500000675
    500000709
    500000748
    500000757
    500000761
    500000765
    500000766
    500000774
    500000791
    500000795
    500000796
    500000800
    500000843
    500000846
    500000856
    500000859
    500000863
    500000867
    500000871
    500000891
    500000902
    500000905
    500000918
    500000937
    500000949
    500001025
    500001028
    500001045
    500001052
    500001066
    500001068
    500001080
    500001083
    500001096
    500001126
    500001170
    500001191
    500001199
    500001207
    500001222
    500001235
    500001242
    500001268
    500001272
    500001275
    500001322
    500001342
    500001352
    500001369
    500001374
    500001377
    500001403
    500001404
    500001427
    500001440
    500001453
    500001455
    500001493
    end

    Listed 100 out of 353029 observations

    Code:
    . merge m:1 NPC_FIC using "D:\chapter 2\Database\Trab_Emp\trab_Emp_2010.dta"
    variable NPC_FIC does not uniquely identify observations in the master data
    I already tried -joinby-; it did not work either, and my PC freezes.
    How can I merge these two datasets?
    Last edited by Paris Rira; 27 Jan 2023, 06:04.

  • #2
    Are you sure that the -using- dataset has exactly one observation per id?

    That said, your datasets are large, and I wouldn't be surprised if merging them took some time. My suggestion would be to try again with a reduced subset of both datasets to ensure that the merge looks correct; when you are satisfied, do it again with the full data.
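
    A quick way to check that uniqueness assumption is -isid-, which exits with an error if the variable does not uniquely identify the observations. A minimal sketch, using the path from your -merge- call:

    Code:
    use "D:\chapter 2\Database\Trab_Emp\trab_Emp_2010.dta", clear
    isid NPC_FIC  // errors out if NPC_FIC does not uniquely identify observations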



    • #3
      Originally posted by Leonardo Guizzetti:
      Are you sure that the -using- dataset has exactly one observation for all ids?
      I don't quite get that. In the first dataset one firm may appear several times, while in the second each firm identifier shows up only once.



      • #4
        Sounds like you need an m:1 merge (assuming the first dataset is the master); see
        Code:
        h merge
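
        That is, something along these lines (a sketch only; the filename is a placeholder for your own using dataset):

        Code:
        * sketch: substitute the placeholder filename with your actual using file
        merge m:1 NPC_FIC using "your_using_dataset.dta"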



        • #5
          Originally posted by Rich Goldstein:
          sounds like you need an m:1 merge (assuming first data is the master); see
          Code:
          h merge
          Code:
          h merge m:1 NPC_FIC using "D:\chapter 2\Database\Trab_Estab\trab_Es_2010"
          
          http://www.stata.com/websee.cgi?r=2&s=bds&o=w&j=a&k=merge+m+:+1+NPC_FIC+using+"D:\chapter 2\Database\Trab_Estab\trab_Es_201
          > 0 could not be opened for read by copytextfile
          
          r(601);
          What's the problem? Could you please assist me?



          • #6
            The problem is that you appear not to have an m:1 relationship between your master and using datasets, as Rich and I have alluded to. (As an aside, the r(601) error above arose because you prefixed the command with -h-, so Stata tried to look up the whole string in its help system instead of running the merge; drop the leading -h- to actually run the command.)

            Assuming the first dataset is your master dataset, and the second is using, here's what I can see.

            Code:
            // duplicates in master
            . duplicates report NPC_FIC
            
            Duplicates in terms of NPC_FIC
            
            --------------------------------------
               Copies | Observations       Surplus
            ----------+---------------------------
                    1 |           14             0
                    2 |           22            11
                    3 |           15            10
                    4 |           16            12
                    5 |           10             8
                    6 |            6             5
                    8 |            8             7
                    9 |            9             8
            --------------------------------------
            
            // duplicates in using
            . duplicates report NPC_FIC
            
            Duplicates in terms of NPC_FIC
            
            --------------------------------------
               Copies | Observations       Surplus
            ----------+---------------------------
                    1 |           95             0
                    2 |            2             1
                    3 |            3             2
            --------------------------------------
            So a -merge m:1- is of no help. Are you quite sure the using dataset is supposed to have just one observation per NPC_FIC? If so, you will need to resolve that issue, either by going back to the data publishers or by choosing how to keep only one of these duplicate records.
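
            If you decide to keep just one record per firm, something like the following would do it. This is only a sketch; which record to keep is a substantive choice that depends on your data.

            Code:
            * if the duplicate rows are identical on all variables, this is safe:
            duplicates drop
            * otherwise, keep one (arbitrary) row per firm:
            bysort NPC_FIC : keep if _n == 1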

            If you are instead OK with the odd duplicate, -joinby- is the correct tool to use. You mentioned it doesn't work, but you don't show us the command that you used.

            Code:
            joinby NPC_FIC using `using'
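
            Note that, unlike -merge-, -joinby- keeps only matched observations by default; the unmatched() option controls what happens to the rest, for example:

            Code:
            joinby NPC_FIC using `using', unmatched(master)  // also keep unmatched master observations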
            To show you what would happen when you perform this join, I made a few minor additions to illustrate the joined result. In the master dataset, I added a variable called -first- that simply indexes the replicated NPC_FIC within each firm, and similarly added a variable called -second- to the using dataset. When the datasets are joined, these two new variables identify precisely which observations from each dataset were joined together. The result is shown below for just two values of NPC_FIC, chosen to highlight one case with an m:1 relationship and another with an m:n relationship. The code that produced it is at the end.

            Code:
            . list if inlist(NPC_FIC,  500000001, 500000376), sepby(NPC_FIC)
            
                 +----------------------------+
                 |   NPC_FIC   first   second |
                 |----------------------------|
              1. | 500000001       1        1 |
              2. | 500000001       2        1 |
              3. | 500000001       3        1 |
                 |----------------------------|
             70. | 500000376       1        1 |
             71. | 500000376       1        2 |
             72. | 500000376       1        3 |
             73. | 500000376       2        1 |
             74. | 500000376       2        2 |
             75. | 500000376       2        3 |
             76. | 500000376       3        1 |
             77. | 500000376       3        2 |
             78. | 500000376       3        3 |
             79. | 500000376       4        1 |
             80. | 500000376       4        2 |
             81. | 500000376       4        3 |
             82. | 500000376       5        1 |
             83. | 500000376       5        2 |
             84. | 500000376       5        3 |
             85. | 500000376       6        1 |
             86. | 500000376       6        2 |
             87. | 500000376       6        3 |
             88. | 500000376       7        1 |
             89. | 500000376       7        2 |
             90. | 500000376       7        3 |
             91. | 500000376       8        1 |
             92. | 500000376       8        2 |
             93. | 500000376       8        3 |
             94. | 500000376       9        1 |
             95. | 500000376       9        2 |
             96. | 500000376       9        3 |
                 +----------------------------+
            The code that produced the listing:

            Code:
            clear *
            
            tempfile master using
            
            // first dataex codeblock omitted for brevity
            sort NPC_FIC
            bys NPC_FIC : gen first = _n
            save `master', replace
            duplicates report NPC_FIC
            
            clear
            // second dataex codeblock omitted for brevity
            sort NPC_FIC
            bys NPC_FIC : gen second = _n
            save `using', replace
            duplicates report NPC_FIC
            
            use `master', clear
            //merge m:1 NPC_FIC using `using'
            joinby NPC_FIC using `using'
            sort NPC_FIC first second
            
            format NPC_FIC %12.0g
            list if inlist(NPC_FIC,  500000001, 500000376), sepby(NPC_FIC)



            • #7
              Dear Leonardo, thank you so much for the comprehensive explanation and code.
              I did as you suggested; however, it did not work.
              Code:
              use "D:\presentation excel\QP Parisa new title_sort.dta",clear
              sort NPC_FIC
              bys NPC_FIC : gen first = _n
              save "C:\Users\35193\Desktop\master,2010.dta",replace
              
              
              use  "D:\chapter 2\Database\Estab\Es_2010",clear
              sort NPC_FIC
              bys NPC_FIC : gen second = _n
              save "C:\Users\35193\Desktop\usingr,2010.dta"
              
              use "C:\Users\35193\Desktop\master,2010.dta",replace
              merge m:1 NPC_FIC using "C:\Users\35193\Desktop\usingr,2010.dta"
              variable NPC_FIC does not uniquely identify observations in the using data
              
              
              joinby NPC_FIC using "C:\Users\35193\Desktop\usingr,2010.dta"
              After 10 minutes, Stata was not responding and my PC kept freezing.



              • #8
                You have 2,881,513 observations in the master file and 353,029 observations in the using file. The number of records after joining can grow tremendously if there are many many-to-many matches: an id that appears m times in the master and n times in the using contributes m*n rows to the result, as the 500000376 example above shows (9 x 3 = 27 rows). It might be that your computer cannot hold this dataset in memory and so needs to read and write to your hard drive while it works; if that's the case, it will be quite slow.

                Please try this with a reduced dataset to see if it works when the datasets are smaller. If it works at small scale, then try 10,000 records in each. As you work from smaller to larger subsets of your data, the time each run takes will give you a sense of how long the full data might take.

                Code:
                use "D:\presentation excel\QP Parisa new title_sort.dta" in 1/1000, clear
                sort NPC_FIC
                bys NPC_FIC : gen first = _n
                save "C:\Users\35193\Desktop\master,2010_1k.dta", replace
                
                use "D:\chapter 2\Database\Estab\Es_2010" in 1/1000, clear
                sort NPC_FIC
                bys NPC_FIC : gen second = _n
                save "C:\Users\35193\Desktop\usingr,2010_1k.dta", replace
                
                use "C:\Users\35193\Desktop\master,2010_1k.dta", clear
                joinby NPC_FIC using "C:\Users\35193\Desktop\usingr,2010_1k.dta"
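
                To get a sense of the scaling, you could also wrap the -joinby- call in Stata's -timer- (a minimal sketch):

                Code:
                timer clear 1
                timer on 1
                joinby NPC_FIC using "C:\Users\35193\Desktop\usingr,2010_1k.dta"
                timer off 1
                timer list 1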



                • #9
                  This worked perfectly. You are right: there are millions of observations, and my current PC cannot handle the full join. I really appreciate your help.

