Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge if variable names are similar in master and using data

    Hello stata users,

    I am trying to merge two datasets (master and university datas) and I would like to find out if there would be any algorithm method that makes them merged if the key variables are similar.
    For instance, let's say in my university data, there is university titled "Johns Hopkins U." and in my master data, it is titled "Johns Hop. U.".
    What I would like to do is to merge them if these variables have similar (or almost same) name.

    My university data looks like :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str94 uni_name int year long funding
    "A. T. Still U."                           1971 1
    "A. T. Still U."                           1972 1
    "A. T. Still U."                           1973 1
    "A. T. Still U."                           1974 1
    "A. T. Still U."                           1975 1
    "A. T. Still U."                           1976 1
    "A. T. Still U."                           1977 1
    "A. T. Still U."                           1978 1
    "A. T. Still U."                           1979 1
    "A. T. Still U."                           1980 1
    "A. T. Still U."                           1981 1
    "A. T. Still U."                           1982 1
    "A. T. Still U."                           1983 1
    "A. T. Still U."                           1984 1
    "A. T. Still U."                           1985 1
    "A. T. Still U."                           1986 1
    "A. T. Still U."                           1987 1
    "A. T. Still U."                           1988 1
    "A. T. Still U."                           1989 1
    "A. T. Still U."                           1990 1
    "A. T. Still U."                           1991 1
    "A. T. Still U."                           1992 1
    "A. T. Still U."                           1993 1
    "A. T. Still U."                           1994 1
    "A. T. Still U."                           1995 1
    "A. T. Still U."                           1996 1
    "A. T. Still U."                           1997 1
    "A. T. Still U."                           1998 1
    "A. T. Still U."                           1999 1
    "A. T. Still U."                           2000 1
    "A. T. Still U."                           2001 1
    "A. T. Still U."                           2002 1
    "A. T. Still U."                           2003 1
    "A. T. Still U."                           2004 1
    "A. T. Still U."                           2005 1
    "A. T. Still U."                           2006 1
    "A. T. Still U."                           2007 1
    "A. T. Still U."                           2008 1
    "A. T. Still U."                           2009 1
    "A. T. Still U."                           2010 1
    "A. T. Still U."                           2011 1
    "A. T. Still U."                           2012 1
    "A. T. Still U."                           2013 1
    "A. T. Still U."                           2014 0
    "A. T. Still U."                           2015 0
    "A. T. Still U."                           2016 0
    "A. T. Still U."                           2017 1
    "A. T. Still U."                           2018 1
    "A. T. Still U."                           2019 0
    "AIB C. of Business"                       1971 0
    "AIB C. of Business"                       1972 0
    "AIB C. of Business"                       1973 0
    "AIB C. of Business"                       1974 0
    "AIB C. of Business"                       1975 0
    "AIB C. of Business"                       1976 0
    "AIB C. of Business"                       1977 0
    "AIB C. of Business"                       1978 0
    "AIB C. of Business"                       1979 0
    "AIB C. of Business"                       1980 0
    "AIB C. of Business"                       1981 0
    "AIB C. of Business"                       1982 0
    "AIB C. of Business"                       1983 0
    "AIB C. of Business"                       1984 0
    "AIB C. of Business"                       1985 0
    "AIB C. of Business"                       1986 0
    "AIB C. of Business"                       1987 0
    "AIB C. of Business"                       1988 0
    "AIB C. of Business"                       1989 0
    "AIB C. of Business"                       1990 0
    "AIB C. of Business"                       1991 0
    "AIB C. of Business"                       1992 0
    "AIB C. of Business"                       1993 0
    "AIB C. of Business"                       1994 0
    "AIB C. of Business"                       1995 0
    "AIB C. of Business"                       1996 0
    "AIB C. of Business"                       1997 0
    "AIB C. of Business"                       1998 0
    "AIB C. of Business"                       1999 0
    "AIB C. of Business"                       2000 0
    "AIB C. of Business"                       2001 0
    "AIB C. of Business"                       2002 0
    "AIB C. of Business"                       2003 0
    "AIB C. of Business"                       2004 0
    "AIB C. of Business"                       2005 0
    "AIB C. of Business"                       2006 0
    "AIB C. of Business"                       2007 0
    "AIB C. of Business"                       2008 0
    "AIB C. of Business"                       2009 0
    "AIB C. of Business"                       2010 1
    "AIB C. of Business"                       2011 1
    "AIB C. of Business"                       2012 1
    "AIB C. of Business"                       2013 1
    "AIB C. of Business"                       2014 1
    "AIB C. of Business"                       2015 0
    "AIB C. of Business"                       2016 0
    "AIB C. of Business"                       2017 0
    "AIB C. of Business"                       2018 0
    "AIB C. of Business"                       2019 0
    "AUI National Radio Astronomy Observatory" 1971 1
    "AUI National Radio Astronomy Observatory" 1972 1
    end

    And my master data looks like:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float uni_id str244 org_name double year
       . ""                                                                 1974
       . ""                                                                 1975
       . ""                                                                 1976
     230 "Regents of the University of California and BP Amoco Corporation" 1981
     534 "The University Hospital"                                          1975
     346 "Regents of the University of Minnesota"                           1977
    1224 "University of Health Sciences/The Chicago Medical School"         1978
     549 "Long Island University"                                           1976
       . ""                                                                 1963
       . ""                                                                 1964
       . ""                                                                 1965
     479 "Virginia State University"                                        1950
     912 "College of William and Mary"                                      1960
    1216 "The University of Chicago Development Corporation"                1991
     173 "Texas Wesleyan University, Inc."                                  1992
       . ""                                                                 1988
       . ""                                                                 1989
       . ""                                                                 1990
       . ""                                                                 1991
       . ""                                                                 1992
       . ""                                                                 1993
       . ""                                                                 1994
       . ""                                                                 1995
       . ""                                                                 1996
      66 "Polytechnic University"                                           1980
      28 "Nova University"                                                  1967
     240 "The University of California Los Angeles"                         1986
     957 "Erskine College"                                                  1994
     419 "The University of South Carolina"                                 2001
     794 "Clemson University Research Foundation"                           2006
     794 "Clemson University Research Foundation"                           2008
       . ""                                                                 1999
       . ""                                                                 2000
       . ""                                                                 2001
       . ""                                                                 2002
       . ""                                                                 2003
       . ""                                                                 2004
       . ""                                                                 2005
       . ""                                                                 2006
       . ""                                                                 2007
       . ""                                                                 2008
       . ""                                                                 2009
    1220 "Northerwestern University"                                        1950
     245 "The University of Chicago"                                        1982
     544 "Rosalind Franklin University of Medicine and Science"             1992
    1224 "University of Health Sciences/The Chicago Medical School"         1995
    1032 "Loyola University of Chicago"                                     1998
    1224 "University of Health Sciences/The Chicago Medical School"         2003
     837 "Northwestern University Medical School"                           2007
    1032 "Loyola University of Chicago"                                     2015
       . ""                                                                 1972
       . ""                                                                 1973
       . ""                                                                 1974
      38 "The Ohio State University Research Foundation"                    1967
     934 "USB Corporation"                                                  1968
     602 "Hoover Universal, Inc."                                           1986
     509 "California Institute of Technology"                               1972
     146 "Research Foundation of the State University of New York"          1983
     327 "Trustees of Boston University"                                    1973
     611 "The Reents of the University of California"                       1990
     941 "Duke University"                                                  1952
     547 "Long Island University"                                           1962
     411 "Board of Regents for Education of the State of Rhode Island"      1965
    1123 "All American University, Incorporated"                            1973
       . ""                                                                 1950
       . ""                                                                 1951
       . ""                                                                 1952
       . ""                                                                 1953
       . ""                                                                 1954
       . ""                                                                 1955
       . ""                                                                 1956
       . ""                                                                 1957
       . ""                                                                 1958
       . ""                                                                 1959
       . ""                                                                 1960
       . ""                                                                 1961
       . ""                                                                 1962
       . ""                                                                 1963
       . ""                                                                 1964
       . ""                                                                 1965
       . ""                                                                 1966
       . ""                                                                 1968
       . ""                                                                 1969
       . ""                                                                 1970
       . ""                                                                 1971
       . ""                                                                 1976
       . ""                                                                 1977
       . ""                                                                 1978
       7 "North Dakota State University"                                    1984
       . ""                                                                 1960
       . ""                                                                 1961
       . ""                                                                 1963
       . ""                                                                 1964
       . ""                                                                 1965
       . ""                                                                 1966
     513 "Wright State University"                                          1962
     246 "University of Cincinnati"                                         1964
     514 "Wright State University"                                          1966
     956 "Emory University"                                                 1967
     254 "The University of Dayton"                                         1970
    end



    I tried to find prior posts in the FAQ, but I haven't yet found the solution.
    Would there be any algorithm code for this ?

    Thanks !

  • #2
    Try the user-written -matchit- command?

    Code:
    net describe matchit, from(http://fmwww.bc.edu/RePEc/bocode/m)

    Comment


    • #3
      Thanks for your advice !
      Actually I have been running the code -matchit- for more than 6 hours, but is it normal that it takes such a long time ?
      I am using Stata 17 SE with master 2800000 observations..

      Comment


      • #4
        2,800,000 observations is a lot of observations.

        Perhaps Julio Raffo the author of matchit will see this and can advise us if anything is to be gained by
        1. first creating copies of your university data and your master data, each containing just one copy of each value of the university name
        2. then applying matchit to those two much-smaller datasets
        3. then using those results to link the two original datasets
        Note that in the output of help matchit we are told about the txtmaster and txtusing options

        Code:
            txtmaster is a string varname from the current file (masterfile) which will be matched to the
                string variable from the usingfile declared in txtusing().  Duplicated values are allowed,
                although at the cost of losing some computational efficiency.
        
            txtusing(varname) declares the string varname from the usingfile which will be matched to
                txtmaster.  Duplicated values are allowed, although at the cost of losing some
                computational efficiency.
        You need all the computational efficiency you can get, it seems to me.
        Last edited by William Lisowski; 16 Aug 2022, 08:19.

        Comment


        • #5
          Thanks William for your advice !
          I dropped all the duplicates and started with much smaller datasets with this command:
          Code:
          matchit year org_name using ${inter}NCSES_uni_funding.dta, idu(u_year) txtu(uni_name)

          Comment


          • #6
            (The following crossed with William Lisowski's), but I think is nevertheless of use.)

            When you have a problem like this, a standard thing to do is run a smaller version of it and see how long it takes. So, you might take samples of (say) 1000 observations from each of your files and see how long it takes per pair of observations from each file.

            Now, thinking from first principles: To match up files like this requires (to my knowledge) a brute force comparison of N1 X N2 observations. If your second file had only 10,000 observations, -matchit- would require examining 2.8e10 pairs That's a lot!

            I had some experience helping someone with a problem like yours last year, and we discovered that it would just take too long to do it in Stata. There was existing Python code for this kind of string matching problem, and for an experienced Python programmer (I'm not at all), it was easy to glue something together. The code used was for the Jaro-Winkler string matching algorithm (about which I know nothing), for which there is a Python home at https://pypi.org/project/jaro-winkler/. The bad news was that even for two files with 60K and 300K records, the run time was in the range of several days, if I recall correctly. I would guess that your problem will be too big even for a Python program, but perhaps someone here more directly experienced will have happier news.

            Comment


            • #7
              Thank you Mike for the information.
              That's quite depressive news which means that I could not run any other commands before completing this matching process..
              Hope anyone who is an expert on this could help !

              Comment


              • #8
                It is not clear to me if in removing duplicates you reduced duplicates to one per year or to one total (with an arbitrary year).

                For the benefit of anyone who finds this topic at a later date, let me be a little more explicit about my recommendation. But let me start by saying that Mike Lacy's experience leads me to pessimism about being able to reduce the problem posed in post #1 and #3 far enough to be able to effectively use matchit.

                My recommendation would be to create reduced copies of each dataset with just two variables.
                Code:
                use master, clear
                keep org_name
                duplicates drop
                generate org_id = _n
                save master_m, replace
                
                use university, clear
                keep uni_name
                duplicates drop
                generate uni_id = _n
                save university_m, replace
                
                use master_m, clear
                matchit org_id org_name using university_m, idu(uni_id) txtu(uni_name)
                And then use the results from this to add the appropriate uni_name to each of the observations in master creating a new datset master_uni.

                After doing that you can use a standard merge on your two datasets.
                Code:
                use master_uni, clear
                merge 1:1 uni_name year using university

                Comment


                • #9
                  Originally posted by William Lisowski View Post
                  2,800,000 observations is a lot of observations.

                  Perhaps Julio Raffo the author of matchit will see this and can advise us if anything is to be gained by
                  1. first creating copies of your university data and your master data, each containing just one copy of each value of the university name
                  2. then applying matchit to those two much-smaller datasets
                  3. then using those results to link the two original datasets
                  Note that in the output of help matchit we are told about the txtmaster and txtusing options

                  Code:
                  txtmaster is a string varname from the current file (masterfile) which will be matched to the
                  string variable from the usingfile declared in txtusing(). Duplicated values are allowed,
                  although at the cost of losing some computational efficiency.
                  
                  txtusing(varname) declares the string varname from the usingfile which will be matched to
                  txtmaster. Duplicated values are allowed, although at the cost of losing some
                  computational efficiency.
                  You need all the computational efficiency you can get, it seems to me.
                  Sorry to join late the discussion, but yes, -matchit- speed will definitely benefit from removing duplicated (name) entries in both files. -Matchit- just compares names against names, so any redundancy will affect the performance. As described somewhere in the thread, you can always reconnect -matchit-'s results using the de-duplicated files to the original ones.

                  Comment


                  • #10
                    If I say that my idmaster and idusing have no relationship at all, does it also affect the result ?
                    I just wanted to match two variables using txtmaster and textusing which are the only common (connection) between the two datasets.
                    Therefore I just used -year- for both idmaster and idusing.
                    Is it possible to match the datasets by using only txt variables ?


                    fyi, below is the output:
                    (the output is strange, they don't even match even if the two txt variables are obviously similar)
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input int year str159 org_name int u_year str94 uni_name
                    1982 "100%25 California State University Fresno Foundation" 2016 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 2007 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1973 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1992 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 2017 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1995 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 2014 "Consortium of California State U."   
                    1982 "100%25 California State University Fresno Foundation" 1994 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2012 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1975 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1982 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 2011 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1984 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1984 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 2003 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1972 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1998 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1992 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 2015 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 2008 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1986 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2001 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1996 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1987 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1999 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1985 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1979 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1994 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 2002 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1978 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1995 "Consortium of California State U."   
                    1982 "100%25 California State University Fresno Foundation" 2019 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1973 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 2016 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 2006 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1989 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2012 "Consortium of California State U."   
                    1982 "100%25 California State University Fresno Foundation" 1972 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 2000 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1983 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 2003 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 2004 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 2017 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 1990 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1987 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1975 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 2005 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 2008 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1978 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1997 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1971 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1981 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2000 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1979 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 2012 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2006 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1993 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 2012 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1986 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 1991 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1985 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 2016 "Consortium of California State U."   
                    1982 "100%25 California State University Fresno Foundation" 1976 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 2012 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 2013 "Consortium of California State U."   
                    1982 "100%25 California State University Fresno Foundation" 2008 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 1992 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 2010 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1994 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 2010 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2011 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1974 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 1974 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 2015 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1976 "Consortium of California State U."   
                    1982 "100%25 California State University Fresno Foundation" 2015 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1975 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1976 "California State U., Monterey Bay"   
                    1982 "100%25 California State University Fresno Foundation" 1998 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1994 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1989 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1996 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 1977 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 1995 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 1975 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1997 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 2005 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 1986 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1997 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 1998 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2018 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 2003 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 1998 "California State U., Chico"          
                    1982 "100%25 California State University Fresno Foundation" 2002 "California State U., Fresno"         
                    1982 "100%25 California State University Fresno Foundation" 1986 "California State U., Fullerton"      
                    1982 "100%25 California State University Fresno Foundation" 1986 "Southern Union State Community C."   
                    1982 "100%25 California State University Fresno Foundation" 1992 "Consortium of California State U."   
                    1982 "100%25 California State University Fresno Foundation" 2009 "U. California, Cooperative Extension"
                    1982 "100%25 California State University Fresno Foundation" 2003 "California State U., Stanislaus"     
                    1982 "100%25 California State University Fresno Foundation" 1992 "California State U., Chico"          
                    end
                    Is there someone who can help me with this problem?




                    Last edited by Anne-Claire Jo; 19 Aug 2022, 20:41.

                    Comment


                    • #11
                      To be frank, I think the best way through the problem is to sit down with a copy of unique values for school name from both datasets and clean the raw text so that they are the same. That will make the resulting merging smoother. You'll probably even get through it faster than waiting for matchit.

                      Comment


                      • #12
                        Here is a demonstration of the use of matchit, following the guidance from post #8 to reduce both datasets to a single copy of each text value. The example data I use was created from your data, having reduced it to a single copy of each text value, and having added additional values for the organization data.
                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str58 org_name
                        "100%25 California State University Fresno Foundation"      
                        "100%25 California State University Fullerton Foundation"  
                        "100%25 California State University Chico Foundation"      
                        "100%25 California State University Monterey Bay Foundation"
                        "100%25 California State University Stanislaus Foundation"  
                        "100%25 Harvard University Foundation"  
                        end
                        generate org_id = _n
                        clonevar org_name_m = org_name
                        // remove first seven characters
                        replace org_name_m = substr(org_name_m,8,.)
                        // change all to upper case
                        replace org_name_m = strupper(org_name_m)
                        // what do we have
                        list org_name_m, clean
                        save "~/Downloads/org.dta", replace
                        
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str36 uni_name
                        "California State U., Stanislaus"    
                        "California State U., Monterey Bay"  
                        "Southern Union State Community C."  
                        "California State U., Chico"          
                        "Consortium of California State U."  
                        "U. California, Cooperative Extension"
                        "California State U., Fullerton"      
                        "California State U., Fresno"        
                        end
                        generate uni_id = _n
                        clonevar uni_name_m = uni_name
                        // expand "U." to "University"
                        replace uni_name_m = subinstr(uni_name_m,"U. ","University ",1)
                        replace uni_name_m = subinstr(uni_name_m," U."," University",1)
                        // expand "C." to "College"
                        replace uni_name_m = subinstr(uni_name_m,"C. ","College ",1)
                        replace uni_name_m = subinstr(uni_name_m," C."," College",1)
                        // change all to upper case
                        replace uni_name_m = strupper(uni_name_m)
                        // what do we have
                        list uni_name_m, clean
                        save "~/Downloads/uni", replace
                        
                        matchit uni_id uni_name_m using "~/Downloads/org.dta", idusing(org_id) txtusing(org_name_m)
                        describe
                        gsort org_id org_name_m -similscore
                        by org_id org_name_m: list similscore uni_name_m, clean abbreviate(20)
                        Code:
                        . matchit uni_id uni_name_m using "~/Downloads/org.dta", idusing(org_id) txtusing(org_name_m)
                        Matching current dataset with ~/Downloads/org.dta
                        Similarity function: bigram
                        Loading USING file: ~/Downloads/org.dta
                        Indexing USING file.
                        0%
                        40%
                        60%
                        60%
                        80%
                        Done!
                        Computing results
                                Percent completed ...   (search space saved by index so far)
                                20%               ...   (0%)
                                60%               ...   (0%)
                                60%               ...   (0%)
                                80%               ...   (0%)
                                Done!
                        Total search space saved by index: 0%
                        
                        . describe
                        
                        Contains data
                         Observations:            40                  
                            Variables:             5                  
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        Variable      Storage   Display    Value
                            name         type    format    label      Variable label
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        uni_id          byte    %10.0g                
                        uni_name_m      str44   %44s                  
                        org_id          byte    %10.0g                
                        org_name_m      str51   %51s                  
                        similscore      double  %10.0g                
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        Sorted by: 
                             Note: Dataset has changed since last saved.
                        
                        . gsort org_id org_name_m -similscore
                        
                        . by org_id org_name_m: list similscore uni_name_m, clean abbreviate(20)
                        
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        -> org_id = 1, org_name_m = CALIFORNIA STATE UNIVERSITY FRESNO FOU..
                        
                               similscore                                     uni_name_m  
                          1.     .8618575            CALIFORNIA STATE UNIVERSITY, FRESNO  
                          2.     .7438395         CALIFORNIA STATE UNIVERSITY, FULLERTON  
                          3.    .72231512      CALIFORNIA STATE UNIVERSITY, MONTEREY BAY  
                          4.    .71306774             CALIFORNIA STATE UNIVERSITY, CHICO  
                          5.    .69751845      CONSORTIUM OF CALIFORNIA STATE UNIVERSITY  
                          6.    .69282032        CALIFORNIA STATE UNIVERSITY, STANISLAUS  
                          7.    .62387947   UNIVERSITY CALIFORNIA, COOPERATIVE EXTENSION  
                        
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        -> org_id = 2, org_name_m = CALIFORNIA STATE UNIVERSITY FULLERTON ..
                        
                               similscore                                     uni_name_m  
                          1.    .87982405         CALIFORNIA STATE UNIVERSITY, FULLERTON  
                          2.    .72554232            CALIFORNIA STATE UNIVERSITY, FRESNO  
                          3.    .71736431      CALIFORNIA STATE UNIVERSITY, MONTEREY BAY  
                          4.    .71418458      CONSORTIUM OF CALIFORNIA STATE UNIVERSITY  
                          5.    .69254985             CALIFORNIA STATE UNIVERSITY, CHICO  
                          6.    .66996452        CALIFORNIA STATE UNIVERSITY, STANISLAUS  
                          7.    .63878611   UNIVERSITY CALIFORNIA, COOPERATIVE EXTENSION  
                          8.    .56028295         SOUTHERN UNION STATE COMMUNITY COLLEGE  
                        
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        -> org_id = 3, org_name_m = CALIFORNIA STATE UNIVERSITY CHICO FOUN..
                        
                               similscore                                     uni_name_m  
                          1.    .85208592             CALIFORNIA STATE UNIVERSITY, CHICO  
                          2.     .7599606      CONSORTIUM OF CALIFORNIA STATE UNIVERSITY  
                          3.    .74681604            CALIFORNIA STATE UNIVERSITY, FRESNO  
                          4.    .74353583         CALIFORNIA STATE UNIVERSITY, FULLERTON  
                          5.    .72261012      CALIFORNIA STATE UNIVERSITY, MONTEREY BAY  
                          6.    .71290623        CALIFORNIA STATE UNIVERSITY, STANISLAUS  
                          7.    .69861079   UNIVERSITY CALIFORNIA, COOPERATIVE EXTENSION  
                          8.    .57190516         SOUTHERN UNION STATE COMMUNITY COLLEGE  
                        
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        -> org_id = 4, org_name_m = CALIFORNIA STATE UNIVERSITY MONTEREY B..
                        
                               similscore                                     uni_name_m  
                          1.    .88113422      CALIFORNIA STATE UNIVERSITY, MONTEREY BAY  
                          2.    .70932072         CALIFORNIA STATE UNIVERSITY, FULLERTON  
                          3.    .69721669            CALIFORNIA STATE UNIVERSITY, FRESNO  
                          4.    .66670068             CALIFORNIA STATE UNIVERSITY, CHICO  
                          5.    .66669372      CONSORTIUM OF CALIFORNIA STATE UNIVERSITY  
                          6.    .64465837   UNIVERSITY CALIFORNIA, COOPERATIVE EXTENSION  
                          7.    .64231723        CALIFORNIA STATE UNIVERSITY, STANISLAUS  
                          8.    .50559183         SOUTHERN UNION STATE COMMUNITY COLLEGE  
                        
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        -> org_id = 5, org_name_m = CALIFORNIA STATE UNIVERSITY STANISLAUS..
                        
                               similscore                                     uni_name_m  
                          1.    .88779605        CALIFORNIA STATE UNIVERSITY, STANISLAUS  
                          2.    .75906419            CALIFORNIA STATE UNIVERSITY, FRESNO  
                          3.    .74972279         CALIFORNIA STATE UNIVERSITY, FULLERTON  
                          4.    .74902534             CALIFORNIA STATE UNIVERSITY, CHICO  
                          5.    .72595401      CALIFORNIA STATE UNIVERSITY, MONTEREY BAY  
                          6.    .72371288      CONSORTIUM OF CALIFORNIA STATE UNIVERSITY  
                          7.    .59751552   UNIVERSITY CALIFORNIA, COOPERATIVE EXTENSION  
                          8.    .52068744         SOUTHERN UNION STATE COMMUNITY COLLEGE  
                        
                        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                        -> org_id = 6, org_name_m = HARVARD UNIVERSITY FOUNDATION
                        
                               similscore                                     uni_name_m  
                          1.    .50056786   UNIVERSITY CALIFORNIA, COOPERATIVE EXTENSION

                        Comment


                        • #13
                          Thank you William for your help !
                          It works well

                          Comment


                          • #14
                            I would like to keep only the uni_name_m that has highest similscore in the data for each org_name_m, however, even though I have tried -bysort- with several methods, it is not generating what I really wanted..
                            Can I get more helps on this, please ?

                            Comment


                            • #15
                              Code:
                              bysort org_name_m (similscore): keep if _n==1

                              Comment

                              Working...
                              X