Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets

    Hi,

    I would like to merge two datasets (see below).

    Dataset 1 consists of pairs
    Dataset 2 doesn't have pairs but covers a longer time period.

    I tried with merge but the new observations in dataset 2 (ie. those before 2010 and after 2015q2) didn't get matched with the ID2 variable. I basically would like each pair to cover the longer time period (as in dataset 2).

    Any tips would be appreciated.

    Dataset 1 Dataset 2 What the merge should look like
    ID1 ID2 TIME Y1 Y2 Y3 ID1 TIME Z1 Z2 Z3 ID1 ID2 TIME Y1 Y2 Y3 Z1 Z2 Z3
    A1 B 2010Q1 A1 1999Q4 A1 B 1999Q4
    A1 B 2010Q2 A1 2000Q1 A1 B 2000Q1
    A1 B 2010Q3 A1 2000Q2 A1 B 2000Q2
    A1 B 2010Q4 A1 .. A1 B ..
    A1 B A1 .. A1 B ..
    A1 B .. A1 2016Q2 A1 B 2016Q2
    A1 B 2015Q2 A2 1999Q4 A1 C 1999Q4
    A1 C 2010Q1 A2 2000Q1 A1 C 2000Q1
    A1 C 2010Q2 A2 2000Q2 A1 C 2000Q2
    A1 C 2010Q3 A2 .. A1 C ..
    A1 C 2010Q4 A2 .. A1 C ..
    A1 C A2 2016Q2 A1 C 2016Q2
    A1 C .. A2 B 1999Q4
    A1 C 2015Q2 A2 B 2000Q1
    A2 B 2010Q1 A2 B 2000Q2
    A2 B 2010Q2 A2 B ..
    A2 B 2010Q3 A2 B ..
    A2 B 2010Q4 A2 B 2016Q2
    A2 B A2 C 1999Q4
    A2 B .. A2 C 2000Q1
    A2 B 2015Q2 A2 C 2000Q2
    A2 C 2010Q1 A2 C ..
    A2 C 2010Q2 A2 C ..
    A2 C 2010Q3 A2 C 2016Q2
    A2 C 2010Q4 ..
    A2 C ..
    A2 C ..
    A2 C 2015Q2
    ..
    ..

  • #2
    Well, you have posted your data example in a way that makes it really cumbersome to try to import it into Stata and experiment on it. In the future, please use -dataex- to do that. (-ssc install dataex-, -help dataex-). Moreover, your use of ellipsis makes it difficult to perceive exactly what you have and what you are trying to accomplish.

    So I'm going to just make a somewhat wild guess that what you actually want to do is to pair each observation in data set 1 with every observation in data set 2 that has the same value of ID1. (I can't really infer that from what you've shown, but it comes up often on the forum when people are unable to get what they want from -merge-). If that's what you have in mind, then the command you need is:

    Code:
    use dataset1
    joinby ID1 using dataset2
    Please do read the FAQ for excellent advice on how to ask clear questions and how most helpfully to show useful supporting information such as example data, code that was tried, results that were obtained, etc.

    Comment


    • #3
      I think you are looking for joinby. Type help joinby for more details.
      Next time please use code tags (# in the advanced editor tools) to post your data and dataex to list sample of your data.

      Comment


      • #4
        Are you trying to stack the files or match-merge them? I think it would help greatly if you showed the command syntax you attempted previously.
        --
        Bruce Weaver
        Email: [email protected]
        Version: Stata/MP 19.5 (Windows)

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          Well, you have posted your data example in a way that makes it really cumbersome to try to import it into Stata and experiment on it. In the future, please use -dataex- to do that. (-ssc install dataex-, -help dataex-). Moreover, your use of ellipsis makes it difficult to perceive exactly what you have and what you are trying to accomplish.

          So I'm going to just make a somewhat wild guess that what you actually want to do is to pair each observation in data set 1 with every observation in data set 2 that has the same value of ID1. (I can't really infer that from what you've shown, but it comes up often on the forum when people are unable to get what they want from -merge-). If that's what you have in mind, then the command you need is:

          Code:
          use dataset1
          joinby ID1 using dataset2
          Please do read the FAQ for excellent advice on how to ask clear questions and how most helpfully to show useful supporting information such as example data, code that was tried, results that were obtained, etc.
          Sorry about that, I thought the data example I posted was easy to interpret.

          I did try
          Code:
          joinby ID1 using "dataset2"
          but it did quite not achieve what I am after. There are lots of duplicates and still can't figure out how the variables in dataset 2 (i.e. Z1, Z2, Z3) got matched. I basically want to add those 3 variables for each pair and make sure that the data for each pair has the time frame of dataset 2 (ie 1999q4-2016q2). Indeed variables Y1 Y2 Y3 will be empty before 2010q4 and after 2015q2.
          I hope it is clearer now.

          Comment


          • #6
            Originally posted by Bruce Weaver View Post
            Are you trying to stack the files or match-merge them? I think it would help greatly if you showed the command syntax you attempted previously.
            Kind of both...please see the data example.

            I tried
            Code:
            joinby ID1 using "dataset2"
            merge m:m ID1 using "dataset2"

            Comment


            • #7
              You ignore the advices in #2 and #3 so it is impossible to say much. Please follow the forum FAQ to get better suggestions and to help us to solve your problem.

              I think the following will do the trick:
              Code:
              g str1 ID2n=""
              levelsof ID2, local (levels)
                  foreach i of local levels {
                      append using data2, gen(ID2_`i')
              replace ID2n="`i'" if ID2_`i'==1
              drop ID2_`i'
              }    
              replace ID2=ID2n if ID2==""
              drop ID2n    
              sort ID1 ID2 TIME
              Last edited by Oded Mcdossi; 07 Sep 2016, 05:52.

              Comment


              • #8
                Originally posted by Oded Mcdossi View Post
                You ignore the advices in #2 and #3 so it is impossible to say much. Please follow the forum FAQ to get better suggestions and to help us to solve your problem.

                I think the following will do the trick:
                Code:
                g str1 ID2n=""
                levelsof ID2, local (levels)
                foreach i of local levels {
                append using data2, gen(ID2_`i')
                replace ID2n="`i'" if ID2_`i'==1
                drop ID2_`i'
                }
                replace ID2=ID2n if ID2==""
                drop ID2n
                sort ID1 ID2 TIME
                Thanks for your input. The problem with this is that the observations (those with the common time frame between the 2 datasets) get duplicated as follows:

                Code:
                 
                time Y1 Y2 Y2 Z1 Z2 Z3
                2006Q1 3 45 2 . . .
                2006Q1 . . . 11 24 56
                2006Q2 9 25 18 . . .
                2006Q2 . . . 20 31 58
                ..
                ..

                Comment


                • #9
                  Giovanni,

                  Several of our best responders have tried to answer your question, and it is clear that none of us is able to understand what you are trying to get as a result here. In my opinion, that is because, as I said earlier, the example you posted is very, very unclear. The use of ... really makes a problem because, evidently, nobody can figure out what is, and what is not, supposed to be in the ... places. Please post, using -dataex-, a small example of your dataset1, a small example of dataset2 that should match up with what you show of dataset1, and then a handworked complete example showing exactly what the result should look like for those examples.

                  Comment


                  • #10
                    Perhaps try
                    Code:
                    merge m:1 ID1 TIME
                    As far as I can read the example, that would give you start of the desired output.

                    Then you'd have to do some expanding to recreate the pairs outside the matched area, but without sample data I can't really help there.

                    Edit: Added "TIME" as per #11 and #12
                    Edit2: Added part on expanding, which I think will be necessary.
                    Last edited by Jesse Wursten; 07 Sep 2016, 10:33.

                    Comment


                    • #11
                      Jesse,

                      That clearly won't work because ID1 does not uniquely identify the observations in either data set.

                      Comment


                      • #12
                        Originally posted by Clyde Schechter View Post
                        Jesse,

                        That clearly won't work because ID1 does not uniquely identify the observations in either data set.
                        Sorry, I forgot to add the TIME variable. (I have edited the post)

                        Comment


                        • #13
                          While I'm reluctant to help any more, something inside me keep think about your results. You can get rid of the duplicates by using something like collapse (max) Y* Z* , by(ID1 ID2 TIME)

                          Comment


                          • #14
                            Here's my guess at what is wanted:

                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input str2 ID1 str1 ID2 str6 TIME float(Y1 Y2 Y3)
                            "A1" "B" "2010Q1"  1  2  3
                            "A1" "B" "2010Q2"  2  4  6
                            "A1" "B" "2010Q3"  3  6  9
                            "A1" "B" "2010Q4"  4  8 12
                            "A1" "B" "2015Q2"  5 10 15
                            "A1" "C" "2010Q1"  6 12 18
                            "A1" "C" "2010Q2"  7 14 21
                            "A1" "C" "2010Q3"  8 16 24
                            "A1" "C" "2010Q4"  9 18 27
                            "A1" "C" "2015Q2" 10 20 30
                            "A2" "B" "2010Q1" 11 22 33
                            "A2" "B" "2010Q2" 12 24 36
                            "A2" "B" "2010Q3" 13 26 39
                            "A2" "B" "2010Q4" 14 28 42
                            "A2" "B" "2015Q2" 15 30 45
                            "A2" "C" "2010Q1" 16 32 48
                            "A2" "C" "2010Q2" 17 34 51
                            "A2" "C" "2010Q3" 18 36 54
                            "A2" "C" "2010Q4" 19 38 57
                            "A2" "C" "2015Q2" 20 40 60
                            end
                            save "statalist_data1.dta", replace
                            
                            * save a list of unique pairs of identifiers in dataset1
                            bysort ID1 ID2: keep if _n == 1
                            keep ID1 ID2
                            save "statalist_data1b.dta", replace
                            list, sepby(ID1)
                            
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input str2 ID1 str6 TIME float(Z1 Z2 Z3)
                            "A1" "1999Q4" -1  -2  -3
                            "A1" "2000Q1" -2  -4  -6
                            "A1" "2000Q2" -3  -6  -9
                            "A1" "2016Q2" -4  -8 -12
                            "A2" "1999Q4" -5 -10 -15
                            "A2" "2000Q1" -6 -12 -18
                            "A2" "2000Q2" -7 -14 -21
                            "A2" "2016Q2" -8 -16 -24
                            end
                            save "statalist_data2.dta", replace
                            
                            * create pairwise combinations using unique ID1 ID2 codes
                            joinby ID1 using "statalist_data1b.dta"
                            list, sepby(ID1)
                            
                            * match observations by ids and time code
                            merge 1:1 ID1 ID2 TIME using "statalist_data1.dta"
                            
                            sort ID1 ID2 TIME
                            order ID1 ID2 TIME Y1 Y2 Y3 Z1 Z2 Z3
                            list, sepby(ID1)

                            Comment


                            • #15
                              Sorry everyone for the lack of clarity, I will be more careful next time.

                              I am unable to use dataex at the moment but please see below an extract of each dataset and what I would like to achieve further down. I hope this is OK.

                              id1 id2 - int
                              id2n - string
                              year quarter - float
                              y1 y2 y3 z1 z2 z3 - double

                              Code:
                               
                              DATASET 1 DATASET 2
                              id1 id2n id2 year quarter y1 y2 y3 id1 year quarter z1 z2 z3
                              1005 AAA 7005 2013 1 . . . 1005 2012 1 96 49 25
                              1005 AAA 7005 2013 2 189 1082 19212 1005 2012 2 92 9 158
                              1005 AAA 7005 2013 3 . . . 1005 2012 3 87 3 94
                              1005 AAA 7005 2013 4 697 1010 63602 1005 2012 4 47 129 201
                              1005 AAA 7005 2014 1 . . . 1005 2013 1 . . .
                              1005 AAA 7005 2014 2 219 200 50025 1005 2013 2 7 146 221
                              1005 AAA 7005 2014 3 84 56 100 1005 2013 3 . . .
                              1005 AAA 7005 2014 4 58 97 5623 1005 2013 4 23 69 92
                              1005 AAA 7005 2015 1 . . . 1005 2014 1 1 57 38
                              1005 AAA 7005 2015 2 984 31 11638 1005 2014 2 . . .
                              1005 BBB 8005 2013 1 114 328 203 1005 2014 3 143 35 97
                              1005 BBB 8005 2013 2 721 163 323 1005 2014 4 136 69 108
                              1005 BBB 8005 2013 3 447 723 12 1005 2015 1 81 40 203
                              1005 BBB 8005 2013 4 549 239 220 1005 2015 2 67 56 39
                              1005 BBB 8005 2014 1 193 29 388 1005 2015 3 187 161 98
                              1005 BBB 8005 2014 2 308 75 189 1005 2015 4 1 34 204
                              1005 BBB 8005 2014 3 82 588 414 1005 2016 1 51 107 143
                              1005 BBB 8005 2014 4 176 61 147 1005 2016 2 23 45 33
                              1005 BBB 8005 2015 1 170 809 581
                              1005 BBB 8005 2015 2 301 207 409
                              recast double y1 recast double y2 recast double y3 recast float quarter recast float year recast double z1 recast double z2 recast double z3 recast float year recast float quarter
                              id1 id2n id2 year quarter y1 y2 y3 z1 z2 z3
                              1005 AAA 7005 2012 1 . . . 96 49 25
                              1005 AAA 7005 2012 2 . . . 92 9 158
                              1005 AAA 7005 2012 3 . . . 87 3 94
                              1005 AAA 7005 2012 4 . . . 47 129 201
                              1005 AAA 7005 2013 1 . . . . .
                              1005 AAA 7005 2013 2 189 1082 19212 7 146 221
                              1005 AAA 7005 2013 3 . . . . .
                              1005 AAA 7005 2013 4 697 1010 63602 23 69 92
                              1005 AAA 7005 2014 1 . . . 1 57 38
                              1005 AAA 7005 2014 2 219 200 50025 . .
                              1005 AAA 7005 2014 3 84 56 100 143 35 97
                              1005 AAA 7005 2014 4 58 97 5623 136 69 108
                              1005 AAA 7005 2015 1 . . . 81 40 203
                              1005 AAA 7005 2015 2 984 31 11638 67 56 39
                              1005 AAA 7005 2015 3 . . . 187 161 98
                              1005 AAA 7005 2015 4 . . . 1 34 204
                              1005 AAA 7005 2016 1 . . . 51 107 143
                              1005 AAA 7005 2016 2 . . . 23 45 33
                              1005 BBB 8005 2012 1 . . . 96 49 25
                              1005 BBB 8005 2012 2 . . . 92 9 158
                              1005 BBB 8005 2012 3 . . . 87 3 94
                              1005 BBB 8005 2012 4 . . . 47 129 201
                              1005 BBB 8005 2013 1 114 328 203 . .
                              1005 BBB 8005 2013 2 721 163 323 7 146 221
                              1005 BBB 8005 2013 3 447 723 12 . .
                              1005 BBB 8005 2013 4 549 239 220 23 69 92
                              1005 BBB 8005 2014 1 193 29 388 1 57 38
                              1005 BBB 8005 2014 2 308 75 189 . .
                              1005 BBB 8005 2014 3 82 588 414 143 35 97
                              1005 BBB 8005 2014 4 176 61 147 136 69 108
                              1005 BBB 8005 2015 1 170 809 581 81 40 203
                              1005 BBB 8005 2015 2 301 207 409 67 56 39
                              1005 BBB 8005 2015 3 . . . 187 161 98
                              1005 BBB 8005 2015 4 . . . 1 34 204
                              1005 BBB 8005 2016 1 . . . 51 107 143
                              1005 BBB 8005 2016 2 . . . 23 45 33

                              Comment

                              Working...
                              X