
  • Fuzzy match - names only

    Hi,

    I am trying to fuzzy match 2 datasets by name only. I do not have a numeric ID to match the 2 databases. I have been trying to use "matchit". The results I'm currently getting are not convincing. Is there any way to use this SSC command without "ID1", which is the numeric ID?

    Here is the code I have been running:
    I have created 1 unique number per name in each dataset. These numbers have nothing in common across the datasets, which is why I do not want to use them.
    Code:
    matchit mgrno mgrname using HFnames_MorningStar1.dta, idu(obs1) txtu(Name) sim(token) t(0) override
    Here is what the 2 datasets look like:
    [Attached images: 13F_TR.GIF, morningStar.GIF]

    I am just trying to fuzzy match the 2 datasets by mgrname and Name. Can anyone help? Thanks!

  • #2
    The output of help matchit warns us that matchit is case-sensitive. In your post, your first dataset is all upper-case; your second is not. If you have not already done so, you should work with a copy of your second dataset in which you have used strupper() to convert the name to all upper-case.
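The case conversion suggested above can be sketched as follows. This is a minimal illustration on made-up names, not the poster's data; the variable Name_upper and the two example rows are hypothetical.

```stata
* Hypothetical toy rows standing in for the mixed-case using dataset
clear
input str40 Name
"Acme Capital Llp"
"blue ridge partners inc"
end

* Keep the original and store an upper-case copy for matching
generate Name_upper = strupper(Name)
list
```

The upper-case copy would then be the variable passed to txtu(), so that both sides of the match use the same case.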



    • #3
      Thanks for this advice. Could you also help me with the fact that I only want to fuzzy match on names and not on an additional variable (ID)? Thanks.



      • #4
        The output of help matchit tells us that the string variables (mgrname and Name in your command) are used for matching, and the numeric variables (mgrno and obs1) are used for identification. They are not used for matching.



        • #5
          That's a part of the help file that I don't really understand. What does identification mean here?

          I just want to fuzzy match "Name" with "mgrname", I do not have any other relevant (identification) variables. How do I reflect that in the code? Thanks

          I found a PDF that talks about it but does not explain how to use identification variables either:
          https://www.stata.com/meeting/switze...tzerland16.pdf



          • #6
            The code you show in post #1 reflects what you want. The ID is not used for matching. The ID assists users who want to match IDs using text descriptions, but since your data do not have IDs, you have to invent IDs because they are not optional.

            In post #1 you said the results were not convincing. Your unconvincing results are not related to the presence of the ID. Unconvincing results would certainly be related to the master data being all upper-case and the using data being mixed case.



            • #7
              Can you confirm that I should create an ID in each dataset? How can I specify in the code that I do not want to use this ID for matching? If it is there, I am assuming it will interfere somewhere in the process. Can you explain in greater detail how to handle this aspect?

              I have noted your comment about upper and lower case. I have corrected that in the new version of my code. I'm still not getting matches. Please help. Thanks.



              • #8
                Hi François,

                Thanks for your question. Yes, matchit in the two-dataset syntax requires two variables (one numeric and one string) in the master dataset and two variables (one numeric and one string) in the using dataset. The original idea was that you would have a unique identifier for your names (like a social security number, gvkey, etc.) in order to link these with other variables later on. There are also some performance gains from having those numeric IDs.
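To illustrate what "link these with other variables later on" can look like, here is a minimal sketch. It does not call matchit; instead it uses a hand-typed toy dataset shaped like matchit output (both IDs, both names, and a similscore column). The variable aum, all names, and all score values are made up for illustration.

```stata
* Toy stand-in for the using dataset, with a hypothetical extra variable (aum)
clear
input long myid str30 Name double aum
1 "ACME CAPITAL LLP" 120
2 "BLUE RIDGE PARTNERS INC" 45
end
tempfile using_ids
save `using_ids'

* Toy stand-in for matchit results; the scores are invented, not computed
clear
input long mgrno str30 mgrname long myid str30 Name double similscore
1001 "ACME CAPITAL" 1 "ACME CAPITAL LLP" 0.82
1002 "BLUE RIDGE PARTNERS" 2 "BLUE RIDGE PARTNERS INC" 0.88
end

* The ID plays no part in the score; it only lets us pull in other variables
merge m:1 myid using `using_ids', keepusing(aum) keep(match) nogenerate
list mgrno mgrname Name similscore aum
```

In this sketch the numeric ID is purely a merge key, which is why an invented ID works just as well as a real one.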

                In any case, if you do not have such an ID in your data, you can easily create one by using the group() function of the egen command. Here is an example:

                Code:
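                use HFnames_MorningStar1.dta, clear
                
                * create a numeric ID: one value per distinct Name
                egen myid=group(Name)
                
                save HFnames_MorningStar1_ids.dta
                
                use YourOtherDataset.dta, clear
                
                * the IDs (mgrno, myid) only identify rows; matching uses mgrname and Name
                matchit mgrno mgrname using HFnames_MorningStar1_ids.dta, idu(myid) txtu(Name) sim(token) t(0) override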
                Additionally, looking at your examples, I suggest using the following options: sim(bigram) w(logs). sim(token) will not catch misspellings, and without weights, less informative text (such as "LLP", "INC", etc.) has too much impact on the score.
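The point about misspellings can be illustrated with a small sketch. This is not matchit's implementation, only a hand-rolled count of how many two-character windows of one word also occur in a misspelled version of it. The two words are hypothetical examples.

```stata
* Two spellings that are different whole-word tokens
local a "PARTNERS"
local b "PARTNRES"

* Slide a two-character window over `a' and look each piece up in `b'
local n = strlen("`a'") - 1
local shared 0
forvalues i = 1/`n' {
    local gram = substr("`a'", `i', 2)
    if strpos("`b'", "`gram'") > 0 {
        local ++shared
    }
}
display "bigrams of `a' also found in `b': `shared' of `n'"
```

Compared as whole tokens, the two words simply differ, while a good share of their bigrams still overlap. That overlap is what lets a bigram-based score tolerate typos.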

                Best,

                J.



                • #9
                  [Attached images: image_27708.png, using file.PNG, after matching.PNG]

                  Dear Stata helpers,


                  I need to match two datasets by company name using the "matchit" command; however, the matching does not come out correctly.

                  This is the command that I used:


                  matchit gvkey CONAME using "lobbying_data_edited_ID", idu(myid) txtu(registrant_raw) sim(token) t(0) override


                  I attached the photos for each dataset as follows:

                  1. Master file photo
                  2. using File photo
                  3. after matching photo

                  Could anyone familiar with the matchit command tell me what I did wrong?


                  Many thanks
                  Last edited by Mohamed Mahmoud; 17 Jun 2022, 09:32.



                  • #10
                    Hi MM,

                    You are using the token option which is basically comparing strings separated by blanks, where one of the strings is "dirty" with punctuation marks, etc. You have two main options to improve the similarity scores:

                    1. Clean the registrant_raw variable (e.g. removing punctuation marks and other less informative symbols).
                    2. Use the bigram option instead of token.

                    Note that you can do both to get a better similarity score. Moreover, I suggest using weights, so less informative strings (e.g. "INC") do not artificially increase the similarity.
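One possible way to carry out the cleaning in option 1 is sketched below, using only built-in Stata string functions (ustrregexra() requires Stata 14 or later). The example rows and the variable registrant_clean are hypothetical, and which characters to strip is a judgment call that depends on the data.

```stata
* Hypothetical toy rows standing in for the "dirty" using dataset
clear
input str60 registrant_raw
"Smith & Jones, L.L.P."
"  acme   holdings (inc.) "
end

* Upper-case, so case differences do not lower the score
generate registrant_clean = strupper(registrant_raw)
* Drop periods outright so "L.L.P." becomes "LLP" rather than "L L P"
replace registrant_clean = subinstr(registrant_clean, ".", "", .)
* Replace anything that is not a letter, digit, or space with a space
replace registrant_clean = ustrregexra(registrant_clean, "[^A-Z0-9 ]", " ")
* Collapse repeated blanks and trim both ends
replace registrant_clean = strtrim(stritrim(registrant_clean))
list
```

The cleaned variable would then replace registrant_raw in txtu(). The same steps should be applied to the master name variable so that both sides are normalized identically.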

                    Best,

                    J.



                    • #11
                      Dear Julio Raffo,

                      Could you please let me know how to remove the punctuation marks and other less informative symbols from the registrant_raw variable? What is the command for this?

                      Many thanks,

                      Best,
                      MM

