Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge 3 data sets with different identifiers, one string identifier

    Good day all,

    I have 3 datasets A,B and C. Dataset B and C I want to merge certain variables to dataset A.

    Dataset B I can merge with no issues by using id and year.

    Dataset C has a unique identifier which is a string variable. I thought of encoding this variable in dataset A and C but the code is not the same in each dataset.

    Does anyone know how I can best resolve this issue?

    Kind regards,

    Joshua

  • #2
    I would merge A and B, and then append C to the resulting database. Then create a new overall ID per observation.

    Comment


    • #3
      Hi Marijn,

      Thank you for your reply but that is the problem that I am describing.

      I created a new unique id "newid" in dataset C consisting out of id and id2 (did the same for dataset A) in order to merge per unique observation.

      Stata will not merge "newid" because it is a string variable. Normally I would resolve this by encoding a string variable, but because I have two different dataset which I will need to encode separately the assigned codes will not match the correct observation when I do the merge.

      Does anyone know how I can best merge these datasets?

      Code I've written now is;

      Code:
      merge 1:1 id year using "C:\dataset B.dta", keepusing(code)
      rename _merge _mergeI
      
      sort newid year 
      xtset newid fyear 
      
      merge 1:1 newid year using "C:\dataset C.dta", keepusing(Var)
      rename _merge _mergeII
      Dataset A
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int(id year) str5 id2 str11 newid
      1004 1994 "AIR"   "001004AIR"  
      1004 1995 "AIR"   "001004AIR"  
      1004 1996 "AIR"   "001004AIR"  
      1004 1997 "AIR"   "001004AIR"  
      1004 1998 "AIR"   "001004AIR"  
      1004 1999 "AIR"   "001004AIR"  
      1004 2000 "AIR"   "001004AIR"  
      1004 2001 "AIR"   "001004AIR"  
      1004 2002 "AIR"   "001004AIR"  
      1004 2003 "AIR"   "001004AIR"  
      1004 2004 "AIR"   "001004AIR"  
      1004 2005 "AIR"   "001004AIR"  
      1004 2006 "AIR"   "001004AIR"  
      1004 2007 "AIR"   "001004AIR"  
      1004 2008 "AIR"   "001004AIR"  
      1004 2009 "AIR"   "001004AIR"  
      1004 2010 "AIR"   "001004AIR"  
      1004 2011 "AIR"   "001004AIR"  
      1004 2012 "AIR"   "001004AIR"  
      1004 2013 "AIR"   "001004AIR"  
      1004 2014 "AIR"   "001004AIR"  
      1004 2015 "AIR"   "001004AIR"  
      1009 1994 "ABSI"  "001009ABSI" 
      1010 1994 "4165A" "0010104165A"
      1010 1995 "4165A" "0010104165A"
      1010 1996 "4165A" "0010104165A"
      1010 1997 "4165A" "0010104165A"
      1010 1998 "4165A" "0010104165A"
      1010 1999 "4165A" "0010104165A"
      1010 2000 "4165A" "0010104165A"
      1010 2001 "4165A" "0010104165A"
      1010 2002 "4165A" "0010104165A"
      1010 2003 "4165A" "0010104165A"
      1011 1994 "ACSE"  "001011ACSE" 
      1013 1994 "ADCT"  "001013ADCT" 
      1013 1995 "ADCT"  "001013ADCT" 
      1013 1996 "ADCT"  "001013ADCT" 
      1013 1997 "ADCT"  "001013ADCT" 
      1013 1998 "ADCT"  "001013ADCT" 
      1013 1999 "ADCT"  "001013ADCT" 
      1013 2000 "ADCT"  "001013ADCT" 
      1013 2001 "ADCT"  "001013ADCT" 
      1013 2002 "ADCT"  "001013ADCT" 
      1013 2003 "ADCT"  "001013ADCT" 
      1013 2004 "ADCT"  "001013ADCT" 
      1013 2005 "ADCT"  "001013ADCT" 
      1013 2006 "ADCT"  "001013ADCT" 
      1013 2007 "ADCT"  "001013ADCT" 
      1013 2008 "ADCT"  "001013ADCT" 
      1013 2009 "ADCT"  "001013ADCT" 
      1013 2010 "ADCT"  "001013ADCT" 
      1017 1994 "AELNA" "001017AELNA"
      1019 1994 "AFAP"  "001019AFAP" 
      1019 1995 "AFAP"  "001019AFAP" 
      1019 1996 "AFAP"  "001019AFAP" 
      1019 1997 "AFAP"  "001019AFAP" 
      1019 1998 "AFAP"  "001019AFAP" 
      1019 1999 "AFAP"  "001019AFAP" 
      1019 2000 "AFAP"  "001019AFAP" 
      1019 2001 "AFAP"  "001019AFAP" 
      1021 1994 "IWKS"  "001021IWKS" 
      1021 1995 "IWKS"  "001021IWKS" 
      1021 1996 "IWKS"  "001021IWKS" 
      1021 1997 "IWKS"  "001021IWKS" 
      1021 1998 "IWKS"  "001021IWKS" 
      1021 1999 "IWKS"  "001021IWKS" 
      1021 2000 "IWKS"  "001021IWKS" 
      1021 2001 "IWKS"  "001021IWKS" 
      1021 2002 "IWKS"  "001021IWKS" 
      1021 2003 "IWKS"  "001021IWKS" 
      1021 2004 "IWKS"  "001021IWKS" 
      1021 2005 "IWKS"  "001021IWKS" 
      1021 2006 "IWKS"  "001021IWKS" 
      1021 2007 "IWKS"  "001021IWKS" 
      1021 2008 "IWKS"  "001021IWKS" 
      1025 1994 "4225B" "0010254225B"
      1025 1995 "4225B" "0010254225B"
      1034 1994 "ALO.2" "001034ALO.2"
      1034 1995 "ALO.2" "001034ALO.2"
      1034 1996 "ALO.2" "001034ALO.2"
      1034 1997 "ALO.2" "001034ALO.2"
      end
      Dataset B
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int(id year) str4 code
      1004 1994 "BBB" 
      1004 1995 "BBB-"
      1004 1996 "BBB-"
      1004 1997 "BBB-"
      1004 1998 "BBB" 
      1004 1999 "BBB" 
      1004 2000 "BBB" 
      1004 2001 "BBB" 
      1004 2002 "BBB-"
      1004 2003 "BB-" 
      1004 2004 "BB-" 
      1004 2005 "BB-" 
      1004 2006 "BB-" 
      1004 2007 "BB"  
      1004 2008 "BB"  
      1004 2009 "BB"  
      1004 2010 "BB"  
      1004 2011 "BB"  
      1004 2012 "BB"  
      1004 2013 "BB"  
      1004 2014 "BB"  
      1004 2015 "BB+" 
      1004 2016 "BB+" 
      1004 2017 "BB+" 
      1009 1994 ""    
      1009 1995 ""    
      1009 1996 ""    
      1009 2002 ""    
      1009 2003 ""    
      1009 2004 ""    
      1009 2005 ""    
      1009 2006 ""    
      1009 2007 ""    
      1010 1994 "B+"  
      1010 1995 "B+"  
      1010 1996 "B+"  
      1010 1997 "B+"  
      1010 1998 "B+"  
      1010 1999 "B+"  
      1010 2000 "B+"  
      1010 2001 "B+"  
      1010 2002 "B+"  
      1010 2003 "B+"  
      1010 2004 "B+"  
      1010 2005 "B+"  
      1010 2006 "B+"  
      1010 2007 "B+"  
      1010 2008 "B+"  
      1010 2009 "B+"  
      1010 2010 "B+"  
      1011 1994 ""    
      1011 1995 ""    
      1013 1994 ""    
      1013 1995 ""    
      1013 1996 ""    
      1013 1997 ""    
      1013 1998 ""    
      1013 1999 ""    
      1013 2000 ""    
      1013 2001 ""    
      1013 2002 ""    
      1013 2003 ""    
      1013 2004 ""    
      1013 2005 ""    
      1013 2006 ""    
      1013 2007 ""    
      1013 2008 ""    
      1013 2009 ""    
      1013 2010 ""    
      1013 2011 ""    
      end
      Dataset C
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int id str4 id2 str10 newid int year double Var
      1004 "AIR"  "001004AIR"  1988  .795486
      1004 "AIR"  "001004AIR"  1989  .088383
      1004 "AIR"  "001004AIR"  1990 1.237442
      1004 "AIR"  "001004AIR"  1991 1.237437
      1004 "AIR"  "001004AIR"  1992 .7954952
      1004 "AIR"  "001004AIR"  1993 .4419421
      1004 "AIR"  "001004AIR"  1994  .176777
      1004 "AIR"  "001004AIR"  1995 1.149049
      1004 "AIR"  "001004AIR"  1996  1.06066
      1004 "AIR"  "001004AIR"  1997 3.800684
      1004 "AIR"  "001004AIR"  1998 2.431041
      1004 "AIR"  "001004AIR"  1999 1.900349
      1004 "AIR"  "001004AIR"  2000 .5303301
      1004 "AIR"  "001004AIR"  2001 .8396893
      1004 "AIR"  "001004AIR"  2002 1.958686
      1004 "AIR"  "001004AIR"  2003 .2192031
      1004 "AIR"  "001004AIR"  2004 .0636396
      1004 "AIR"  "001004AIR"  2005 .4949747
      1004 "AIR"  "001004AIR"  2006 1.046518
      1004 "AIR"  "001004AIR"  2007 1.308148
      1004 "AIR"  "001004AIR"  2008 2.595082
      1004 "AIR"  "001004AIR"  2009 .4242641
      1004 "AIR"  "001004AIR"  2010 1.513209
      1004 "AIR"  "001004AIR"  2011 .3535534
      1004 "AIR"  "001004AIR"  2012 .6434672
      1004 "AIR"  "001004AIR"  2013 .3747666
      1004 "AIR"  "001004AIR"  2014 .9616652
      1004 "AIR"  "001004AIR"  2015  .622254
      1004 "AIR"  "001004AIR"  2016 3.790092
      1004 "AIR"  "001004AIR"  2017 .3181981
      1009 "ABSI" "001009ABSI" 1988        .
      1009 "ABSI" "001009ABSI" 1989 .1767767
      1009 "ABSI" "001009ABSI" 1990 .1767767
      1009 "ABSI" "001009ABSI" 1991 .3535534
      1009 "ABSI" "001009ABSI" 1992 1.237437
      1009 "ABSI" "001009ABSI" 1993 .1767767
      1009 "ABSI" "001009ABSI" 1994 .7071068
      1009 "ABSI" "001009ABSI" 1995 .5303301
      1009 "ABSI" "001009ABSI" 1996 .9722718
      1009 "ABSI" "001009ABSI" 1997        0
      1009 "ABSI" "001009ABSI" 1998        0
      1009 "ABSI" "001009ABSI" 2000        0
      1009 "ABSI" "001009ABSI" 2001        0
      1009 "ABSI" "001009ABSI" 2002        0
      1009 "ABSI" "001009ABSI" 2003        0
      1009 "ABSI" "001009ABSI" 2004        0
      1009 "ABSI" "001009ABSI" 2005 .0211425
      1009 "ABSI" "001009ABSI" 2006  .006364
      1009 "ABSI" "001009ABSI" 2007 .0070004
      1011 "ACSE" "001011ACSE" 1988  .066468
      1011 "ACSE" "001011ACSE" 1989 .1103087
      1011 "ACSE" "001011ACSE" 1990 .4419417
      1011 "ACSE" "001011ACSE" 1991 .0438406
      1011 "ACSE" "001011ACSE" 1992 .1767767
      1011 "ACSE" "001011ACSE" 1993 3.005204
      1011 "ACSE" "001011ACSE" 1994 .5303301
      1011 "ACSE" "001011ACSE" 1995 .7954952
      1013 "ADCT" "001013ADCT" 1988 .7954952
      1013 "ADCT" "001013ADCT" 1989 .9722718
      1013 "ADCT" "001013ADCT" 1990 2.828427
      1013 "ADCT" "001013ADCT" 1991 .8838835
      1013 "ADCT" "001013ADCT" 1992 .4419417
      1013 "ADCT" "001013ADCT" 1993 1.856155
      1013 "ADCT" "001013ADCT" 1994 4.419417
      1013 "ADCT" "001013ADCT" 1995 1.414214
      1013 "ADCT" "001013ADCT" 1996 6.363961
      1013 "ADCT" "001013ADCT" 1997 6.275573
      1013 "ADCT" "001013ADCT" 1998 .4419417
      1013 "ADCT" "001013ADCT" 1999 3.579728
      1013 "ADCT" "001013ADCT" 2000 .7071068
      1013 "ADCT" "001013ADCT" 2001 .1202082
      1013 "ADCT" "001013ADCT" 2002 .1131371
      1013 "ADCT" "001013ADCT" 2003 .2262742
      1013 "ADCT" "001013ADCT" 2004 .0707107
      1013 "ADCT" "001013ADCT" 2005 .1979899
      1013 "ADCT" "001013ADCT" 2006 .7495332
      1013 "ADCT" "001013ADCT" 2007 .3676955
      1013 "ADCT" "001013ADCT" 2008 .6434672
      1013 "ADCT" "001013ADCT" 2009 .2333452
      1013 "ADCT" "001013ADCT" 2010 .2545584
      1013 "ADCT" "001013ADCT" 2011 .0141421
      end

      Comment


      • #4
        I don't get your problem. When I save those three data sets and just run:

        Code:
        use A, clear
        merge 1:1 id year using B
        drop _merge
        merge 1:1 id year using C
        everything runs fine. What's the problem?

        Comment


        • #5
          The datasets are much bigger so when I use id in dataset C to merge to dataset A it says that it is not a unique identifier. Therefore I need to create a unique identifier, but this new unique identifier is a string variable which seems to be a problem for the merge.

          Comment


          • #6
            If I understand correctly, what you want to do is encode the string ID in C, export the label, and then use the same label to encode the string ID in A.

            Code:
            use C, clear
            encode stringID, gen(numID) label(IDlabel)
            label save IDlabel using labels, replace
            
            use A, clear
            do labels
            encode stringID, gen(numID) label(IDlabel)
            hth,
            Jeph

            Comment


            • #7
              This thread just keeps getting more confusing. If you -encode- id2, you will get a new id variable, call it id3, whose values start at 1 and increase in parallel with the alphabetic order of variable id2. Those will not match the id's in datasets A and B in any sensible way, unless that is how the id variables in A and B were created in the first place.

              But I think that's a red herring anyway. Stata's error message is just telling you that id and year don't jointly identify unique observations in data set C. So just change -merge 1:1- into -merge 1:m- and then each observation in A (or A merged with B) will be paired up with all observations in C that have the same values for id and year--which is a very sensible way to -merge- most data sets. (Note that nothing you do with -encode- will overcome this error message anyway.)

              Comment


              • #8
                Thanks boht for your help. I've managed by merging C to B with a 1:m merge and then B to A with a 1:1 merge. All is fine now.

                Comment

                Working...
                X