Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two tables with different spelling of obervations

    Hi all,

    I want to merge two tables which origin from two different databases. Therefore, the spelling of the companies which are used as the identifier in the variable "CompanyYear" is not equal.
    E.g. "3D Systems Corporation" is spelled "3D SYSTEMS CORP." and "3D SYSTEMS CORPORATION" in Table 2 while in Table 1 in is spelled "3D SYSTEMS CORP" (note the dot after "CORP" makes a difference).
    The problem is now that when I try to merge the two tables, the number of matched results is very low given the different spelling, although in fact both tables refer to the same company.

    Is there a possibility to solve this problem by e.g. making the comparison criteria less rigorous? I know there is a comparable function in Excel with the VLOOKUP function where you can select "true" or "false" to make the comparison criteria less rigorous.

    Table 1 (Master table)
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str53 CompanyYear
    "3COM CORP2000"      
    "3COM CORP2001"      
    "3COM CORP2002"      
    "3COM CORP2003"      
    "3COM CORP2004"      
    "3COM CORP2005"      
    "3COM CORP2006"      
    "3COM CORP2007"      
    "3D SYSTEMS CORP2007"
    "3D SYSTEMS CORP2008"
    "3D SYSTEMS CORP2009"
    "3D SYSTEMS CORP2010"
    "3D SYSTEMS CORP2011"
    "3D SYSTEMS CORP2012"
    "3D SYSTEMS CORP2013"
    "3D SYSTEMS CORP2014"
    "3D SYSTEMS CORP2015"
    "3D SYSTEMS CORP2016"
    "3M CO2003"          
    "3M CO2004"          
    end
    Table 2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str53 CompanyYear
    "3COM CORP.2007"            
    "3COM CORP.2008"            
    "3COM CORP.2009"            
    "3D SYSTEMS CORP.2018"      
    "3D SYSTEMS CORPORATION2011"
    "3D SYSTEMS CORPORATION2012"
    "3D SYSTEMS CORPORATION2013"
    "3D SYSTEMS CORPORATION2014"
    "3D SYSTEMS CORPORATION2015"
    "3D SYSTEMS CORPORATION2016"
    "3D SYSTEMS CORPORATION2017"
    "3D SYSTEMS CORPORATION2019"
    "3M CO2007"                 
    "3M CO2008"                 
    "3M CO2009"                 
    "3M CO.2018"                
    "3M COMPANY2010"            
    "3M COMPANY2011"            
    "3M COMPANY2012"            
    "3M COMPANY2013"            
    end
    Thanks for your comments.

    Best

    Matthias

  • #2
    Hello Matthias. I am curious about why you have the company name and year combined in your ID variable. Would it not be more convenient to have two separate variables, one for the company name and another for year?

    Also, looking at the company names in the sample data you provided (thank you for that, by the way), if you take the first 5 characters of the CompanyYear variable, you'll have unique company identifiers that work properly. Is that the case generally in your larger data set?

    For the data you showed, does the following give you the result you want?

    Code:
    * Let temp be the path to a temp folder--modify as needed
    local temp "C:/Temp"
    
    * Table 1.
    clear *
    input str53 CompanyYear
    "3COM CORP2000"      
    "3COM CORP2001"      
    "3COM CORP2002"      
    "3COM CORP2003"      
    "3COM CORP2004"      
    "3COM CORP2005"      
    "3COM CORP2006"      
    "3COM CORP2007"      
    "3D SYSTEMS CORP2007"
    "3D SYSTEMS CORP2008"
    "3D SYSTEMS CORP2009"
    "3D SYSTEMS CORP2010"
    "3D SYSTEMS CORP2011"
    "3D SYSTEMS CORP2012"
    "3D SYSTEMS CORP2013"
    "3D SYSTEMS CORP2014"
    "3D SYSTEMS CORP2015"
    "3D SYSTEMS CORP2016"
    "3M CO2003"          
    "3M CO2004"          
    end
    
    * Use first 5 characters of CompanyYear as new variable company
    generate str5 company = substr(CompanyYear,1,5)
    * Use last 4 characters of CompanyYear as new variable year
    generate str4 year = substr(CompanyYear,strlen(CompanyYear)-3,4)
    destring year, replace
    sort company year
    drop CompanyYear
    save `temp'/table1, replace
    
    * Create Table 2 and save as table2.dta
    clear *
    input str53 CompanyYear
    "3COM CORP.2007"            
    "3COM CORP.2008"            
    "3COM CORP.2009"            
    "3D SYSTEMS CORP.2018"      
    "3D SYSTEMS CORPORATION2011"
    "3D SYSTEMS CORPORATION2012"
    "3D SYSTEMS CORPORATION2013"
    "3D SYSTEMS CORPORATION2014"
    "3D SYSTEMS CORPORATION2015"
    "3D SYSTEMS CORPORATION2016"
    "3D SYSTEMS CORPORATION2017"
    "3D SYSTEMS CORPORATION2019"
    "3M CO2007"                 
    "3M CO2008"                 
    "3M CO2009"                 
    "3M CO.2018"                
    "3M COMPANY2010"            
    "3M COMPANY2011"            
    "3M COMPANY2012"            
    "3M COMPANY2013"            
    end
    
    * Use first 5 characters of CompanyYear as new variable company
    generate str5 company = substr(CompanyYear,1,5)
    * Use last 4 characters of CompanyYear as new variable year
    generate str4 year = substr(CompanyYear,strlen(CompanyYear)-3,4)
    destring year, replace
    sort company year
    drop CompanyYear
    save `temp'/table2, replace
    
    * Now merge the two tables
    clear
    use `temp'/table1
    merge 1:1 company year using `temp'/table2
    sort company year
    
    list, clean
    tabulate _merge
    Output from the last two commands above:

    Code:
    . list, clean
    
           company   year            _merge  
      1.     3COM    2000   master only (1)  
      2.     3COM    2001   master only (1)  
      3.     3COM    2002   master only (1)  
      4.     3COM    2003   master only (1)  
      5.     3COM    2004   master only (1)  
      6.     3COM    2005   master only (1)  
      7.     3COM    2006   master only (1)  
      8.     3COM    2007       matched (3)  
      9.     3COM    2008    using only (2)  
     10.     3COM    2009    using only (2)  
     11.     3D SY   2007   master only (1)  
     12.     3D SY   2008   master only (1)  
     13.     3D SY   2009   master only (1)  
     14.     3D SY   2010   master only (1)  
     15.     3D SY   2011       matched (3)  
     16.     3D SY   2012       matched (3)  
     17.     3D SY   2013       matched (3)  
     18.     3D SY   2014       matched (3)  
     19.     3D SY   2015       matched (3)  
     20.     3D SY   2016       matched (3)  
     21.     3D SY   2017    using only (2)  
     22.     3D SY   2018    using only (2)  
     23.     3D SY   2019    using only (2)  
     24.     3M CO   2003   master only (1)  
     25.     3M CO   2004   master only (1)  
     26.     3M CO   2007    using only (2)  
     27.     3M CO   2008    using only (2)  
     28.     3M CO   2009    using only (2)  
     29.     3M CO   2010    using only (2)  
     30.     3M CO   2011    using only (2)  
     31.     3M CO   2012    using only (2)  
     32.     3M CO   2013    using only (2)  
     33.     3M CO   2018    using only (2)  
    
    . tabulate _merge
    
                     _merge |      Freq.     Percent        Cum.
    ------------------------+-----------------------------------
            master only (1) |         13       39.39       39.39
             using only (2) |         13       39.39       78.79
                matched (3) |          7       21.21      100.00
    ------------------------+-----------------------------------
                      Total |         33      100.00

    --
    Bruce Weaver
    Email: [email protected]
    Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
    Version: Stata/MP 18.0 (Windows)

    Comment

    Working...
    X