
  • Merging duplicated dataset

    Hey Hey
    I want to merge these two datasets. The first one has duplicated values that I have to keep. NPC_FIC is the firm identifier.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double NPC_FIC
    500000001
    500000001
    500000001
    500000002
    500000002
    500000002
    500000002
    500000002
    500000033
    500000033
    500000050
    500000050
    500000083
    500000104
    500000104
    500000119
    500000119
    500000119
    500000119
    500000119
    500000119
    500000119
    500000119
    500000121
    500000121
    500000156
    500000156
    500000157
    500000157
    500000157
    500000157
    500000165
    500000165
    500000165
    500000165
    500000180
    500000198
    500000198
    500000198
    500000201
    500000204
    500000204
    500000212
    500000215
    500000215
    500000215
    500000215
    500000230
    500000240
    500000240
    500000276
    500000276
    500000276
    500000279
    500000284
    500000284
    500000284
    500000284
    500000305
    500000324
    500000334
    500000341
    500000341
    500000341
    500000346
    500000346
    500000346
    500000346
    500000346
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000376
    500000395
    500000395
    500000395
    500000395
    500000395
    500000395
    500000465
    500000465
    500000465
    500000470
    500000470
    500000478
    500000525
    500000543
    500000543
    500000565
    500000583
    500000583
    500000600
    500000600
    500000601
    500000633
    end

    Listed 100 out of 2881513 observations
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double NPC_FIC
    500000001
    500000002
    500000033
    500000050
    500000083
    500000104
    500000119
    500000121
    500000156
    500000157
    500000165
    500000180
    500000198
    500000201
    500000204
    500000212
    500000215
    500000230
    500000240
    500000276
    500000279
    500000284
    500000305
    500000324
    500000334
    500000341
    500000346
    500000376
    500000376
    500000376
    500000395
    500000465
    500000470
    500000478
    500000525
    500000543
    500000565
    500000583
    500000583
    500000600
    500000601
    500000633
    500000675
    500000709
    500000748
    500000757
    500000761
    500000765
    500000766
    500000774
    500000791
    500000795
    500000796
    500000800
    500000843
    500000846
    500000856
    500000859
    500000863
    500000867
    500000871
    500000891
    500000902
    500000905
    500000918
    500000937
    500000949
    500001025
    500001028
    500001045
    500001052
    500001066
    500001068
    500001080
    500001083
    500001096
    500001126
    500001170
    500001191
    500001199
    500001207
    500001222
    500001235
    500001242
    500001268
    500001272
    500001275
    500001322
    500001342
    500001352
    500001369
    500001374
    500001377
    500001403
    500001404
    500001427
    500001440
    500001453
    500001455
    500001493
    end

    Listed 100 out of 353029 observations

    Code:
    . merge m:1 NPC_FIC using "D:\chapter 2\Database\Trab_Emp\trab_Emp_2010.dta"
    variable NPC_FIC does not uniquely identify observations in the master data
    I already tried -joinby-; it did not work either, and my PC freezes.
    How can I merge these two datasets?
    Last edited by Paris Rira; 27 Jan 2023, 06:04.

  • #2
    Are you sure that the -using- dataset has exactly one observation per id?

    That said, your datasets are large, and I wouldn't be surprised if merging them took some time. My suggestion would be to try again with a reduced subset of both datasets to ensure that the merge looks correct; when you are satisfied, do it again with the full data.
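
    A quick way to check that uniqueness assumption is -isid-, which exits with an error if the variable does not uniquely identify the observations. A minimal sketch, using the path from your -merge- call:

    Code:
    use "D:\chapter 2\Database\Trab_Emp\trab_Emp_2010.dta", clear
    isid NPC_FIC  // errors out if NPC_FIC does not uniquely identify observations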



    • #3
      Originally posted by Leonardo Guizzetti:
      Are you sure that the -using- dataset has exactly one observation for all ids?
      I don't quite get that. In the first dataset one firm may appear several times, while in the second each firm identifier shows up only once.



      • #4
        Sounds like you need an m:1 merge (assuming the first dataset is the master); see
        Code:
        h merge
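
        That is, something along these lines (a sketch only; the filename is a placeholder for your own using dataset):

        Code:
        * sketch: substitute the placeholder filename with your actual using file
        merge m:1 NPC_FIC using "your_using_dataset.dta"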



        • #5
          Originally posted by Rich Goldstein:
          sounds like you need an m:1 merge (assuming first data is the master); see
          Code:
          h merge
          Code:
          h merge m:1 NPC_FIC using "D:\chapter 2\Database\Trab_Estab\trab_Es_2010"
          
          http://www.stata.com/websee.cgi?r=2&s=bds&o=w&j=a&k=merge+m+:+1+NPC_FIC+using+"D:\chapter 2\Database\Trab_Estab\trab_Es_201
          > 0 could not be opened for read by copytextfile
          
          r(601);
          What's the problem? Could you please assist me?



          • #6
            The problem is that you appear not to have an m:1 relationship between your master and using datasets, as Rich and I have alluded to. (As an aside, the r(601) error above arose because you prefixed the command with -h-, so Stata tried to look up the whole string in its help system instead of running the merge; drop the leading -h- to actually run the command.)

            Assuming the first dataset is your master dataset, and the second is using, here's what I can see.

            Code:
            // duplicates in master
            . duplicates report NPC_FIC
            
            Duplicates in terms of NPC_FIC
            
            --------------------------------------
               Copies | Observations       Surplus
            ----------+---------------------------
                    1 |           14             0
                    2 |           22            11
                    3 |           15            10
                    4 |           16            12
                    5 |           10             8
                    6 |            6             5
                    8 |            8             7
                    9 |            9             8
            --------------------------------------
            
            // duplicates in using
            . duplicates report NPC_FIC
            
            Duplicates in terms of NPC_FIC
            
            --------------------------------------
               Copies | Observations       Surplus
            ----------+---------------------------
                    1 |           95             0
                    2 |            2             1
                    3 |            3             2
            --------------------------------------
            So a -merge m:1- is of no help. Are you quite sure the using dataset is supposed to have just one observation per NPC_FIC? If so, you will need to resolve that issue, either by going back to the data publishers or by choosing how to keep only one of these duplicate records.
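
            If you decide to keep just one record per firm, something like the following would do it. This is only a sketch; which record to keep is a substantive choice that depends on your data.

            Code:
            * if the duplicate rows are identical on all variables, this is safe:
            duplicates drop
            * otherwise, keep one (arbitrary) row per firm:
            bysort NPC_FIC : keep if _n == 1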

            If you are instead OK with the odd duplicate, -joinby- is the correct tool to use. You mentioned it doesn't work, but you don't show us the command that you used.

            Code:
            joinby NPC_FIC using `using'
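
            Note that, unlike -merge-, -joinby- keeps only matched observations by default; the unmatched() option controls what happens to the rest, for example:

            Code:
            joinby NPC_FIC using `using', unmatched(master)  // also keep unmatched master observations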
            To show you what would happen when you perform this join, I made a few minor additions to illustrate the joined result. In the master dataset, I added a variable called -first- that simply indexes the replicated NPC_FIC within each firm, and similarly added a variable called -second- to the using dataset. When the datasets are joined, these two new variables identify precisely which observations from each dataset were joined together. The result is shown below for just two values of NPC_FIC, chosen to highlight one case with an m:1 relationship and another with an m:n relationship. The code that produced it is at the end.

            Code:
            . list if inlist(NPC_FIC,  500000001, 500000376), sepby(NPC_FIC)
            
                 +----------------------------+
                 |   NPC_FIC   first   second |
                 |----------------------------|
              1. | 500000001       1        1 |
              2. | 500000001       2        1 |
              3. | 500000001       3        1 |
                 |----------------------------|
             70. | 500000376       1        1 |
             71. | 500000376       1        2 |
             72. | 500000376       1        3 |
             73. | 500000376       2        1 |
             74. | 500000376       2        2 |
             75. | 500000376       2        3 |
             76. | 500000376       3        1 |
             77. | 500000376       3        2 |
             78. | 500000376       3        3 |
             79. | 500000376       4        1 |
             80. | 500000376       4        2 |
             81. | 500000376       4        3 |
             82. | 500000376       5        1 |
             83. | 500000376       5        2 |
             84. | 500000376       5        3 |
             85. | 500000376       6        1 |
             86. | 500000376       6        2 |
             87. | 500000376       6        3 |
             88. | 500000376       7        1 |
             89. | 500000376       7        2 |
             90. | 500000376       7        3 |
             91. | 500000376       8        1 |
             92. | 500000376       8        2 |
             93. | 500000376       8        3 |
             94. | 500000376       9        1 |
             95. | 500000376       9        2 |
             96. | 500000376       9        3 |
                 +----------------------------+
            The code that produced the listing:

            Code:
            clear *
            
            tempfile master using
            
            // first dataex codeblock omitted for brevity
            sort NPC_FIC
            bys NPC_FIC : gen first = _n
            save `master', replace
            duplicates report NPC_FIC
            
            clear
            // second dataex codeblock omitted for brevity
            sort NPC_FIC
            bys NPC_FIC : gen second = _n
            save `using', replace
            duplicates report NPC_FIC
            
            use `master', clear
            //merge m:1 NPC_FIC using `using'
            joinby NPC_FIC using `using'
            sort NPC_FIC first second
            
            format NPC_FIC %12.0g
            list if inlist(NPC_FIC,  500000001, 500000376), sepby(NPC_FIC)



            • #7
              Dear Leonardo, thank you so much for the comprehensive explanation and code.
              I did as you suggested; however, it did not work.
              Code:
              use "D:\presentation excel\QP Parisa new title_sort.dta",clear
              sort NPC_FIC
              bys NPC_FIC : gen first = _n
              save "C:\Users\35193\Desktop\master,2010.dta",replace
              
              
              use  "D:\chapter 2\Database\Estab\Es_2010",clear
              sort NPC_FIC
              bys NPC_FIC : gen second = _n
              save "C:\Users\35193\Desktop\usingr,2010.dta"
              
              use "C:\Users\35193\Desktop\master,2010.dta",replace
              merge m:1 NPC_FIC using "C:\Users\35193\Desktop\usingr,2010.dta"
              variable NPC_FIC does not uniquely identify observations in the using data
              
              
              joinby NPC_FIC using "C:\Users\35193\Desktop\usingr,2010.dta"
              After 10 minutes, Stata was not responding and my PC kept freezing.



              • #8
                You have 2,881,513 observations in the master file and 353,029 observations in the using file. The number of records after joining can grow tremendously if there are many many-to-many matches: an id that appears m times in the master and n times in the using contributes m*n rows to the result, as the 500000376 example above shows (9 x 3 = 27 rows). It might be that your computer cannot hold this dataset in memory and so needs to read and write to your hard drive while it works; if that's the case, it will be quite slow.

                Please try this with a reduced dataset to see if it works when the datasets are smaller. If it works at small scale, then try 10,000 records in each. As you work from smaller to larger subsets of your data, the time each run takes will give you a sense of how long the full data might take.

                Code:
                use "D:\presentation excel\QP Parisa new title_sort.dta" in 1/1000, clear
                sort NPC_FIC
                bys NPC_FIC : gen first = _n
                save "C:\Users\35193\Desktop\master,2010_1k.dta", replace
                
                use "D:\chapter 2\Database\Estab\Es_2010" in 1/1000, clear
                sort NPC_FIC
                bys NPC_FIC : gen second = _n
                save "C:\Users\35193\Desktop\usingr,2010_1k.dta", replace
                
                use "C:\Users\35193\Desktop\master,2010_1k.dta", clear
                joinby NPC_FIC using "C:\Users\35193\Desktop\usingr,2010_1k.dta"
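
                To get a sense of the scaling, you could also wrap the -joinby- call in Stata's -timer- (a minimal sketch):

                Code:
                timer clear 1
                timer on 1
                joinby NPC_FIC using "C:\Users\35193\Desktop\usingr,2010_1k.dta"
                timer off 1
                timer list 1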



                • #9
                  This worked perfectly. You are right: there are millions of observations, and my current PC cannot handle the full join. I really appreciate your help.

