
  • Using -reclink- to link big dataset

    Dear Statalist,

    Below is an email I sent to Michael Blasnik, author of -reclink- inquiring about a problem I am having. Would you please share any advice you might have regarding my issue?

    Thanks,
    Dom

    I am working on a project at the Minnesota Population Center which is attempting to link various people to their information in the 1940 U.S. census. I am attempting to do this using -reclink-. I have first and last names, parents' first and last names, as well as state of birth, which I am trying to match on.

    I am using match weights and non-match weights (although, for some reason, I cannot use both at the same time). I am or-blocking on the first letter of the last name and and-blocking (using the -required- option) on the state of birth.


    The strange thing, though, is that people are getting linked to completely wrong observations. Even though I or-block on the first letter of the last name, people with different first letters of their last names are getting linked, and the same happens with and-blocking on state of birth. In fact, everyone is getting linked to a person in the State of Alabama, which is the first state sorted in the census (the using dataset). I am guessing that the algorithm goes line by line through the using dataset, and since the census has millions of observations it is stopping right in Alabama.

    Any advice on what I could do to obtain successful matches?


  • #2
    I think that your prospects of getting concrete advice would be greatly enhanced if you showed the command(s) you are using to do this. You may be omitting or mis-specifying some options, or making other errors--we can only guess. It would probably also help if you showed a sample of the results you are getting--there may be some pattern to the nature of the incorrect matches that might shed light on it (besides the fact that everyone is being matched to somebody in Alabama.)



    • #3
      Dear Clyde,

      Thank you for the suggestion. Please see the code below. I have separated the census file and the senators file (which I am trying to link) by birth year and gender so that the job is more computationally manageable, given how big the census file is. lastblock is a variable containing only the first letter of the last name. I am not sure how to show the results I am getting, as the output is a huge dataset.

      foreach x of numlist 1930/1940 {
          use "/pkg/ancestryprojects/Warren_1940_Linking/Senators/msenators`x'.dta", clear

          reclink fname1 lname momfname1 momlname dadfname1 dadlname lastblock stateofbirth ///
              using "/pkg/ancestryprojects/Warren_1940_Linking/Data/By birthyear/us1940_mborn_`x'.dta", ///
              idmaster(id) idusing(id) gen(linkscore) required(stateofbirth lastblock) orblock(stateofbirth) wnomatch(15 20 5 10 5 10 25 25) ///
              exactstr(lname) minbigram(0.7)

          saveold "/pkg/ancestryprojects/Warren_1940_Linking/Data/Linked/mlinked`x'.dta", replace

          keep serial pernum histid id linkscore _merge

          saveold "/pkg/ancestryprojects/Warren_1940_Linking/Data/Linked/Linked_ID/mlinked`x'.dta", replace
      }
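A huge output file can still be summarised in a few lines for posting. The following is only a sketch: it assumes the loop has already finished for 1930, and it uses just the variables named in the -keep- line. Showing 20 rows is an arbitrary choice.

```stata
* Sketch: summarise one linked file instead of posting the whole dataset
use "/pkg/ancestryprojects/Warren_1940_Linking/Data/Linked/Linked_ID/mlinked1930.dta", clear

tabulate _merge                    // how many master records received a link
summarize linkscore, detail        // distribution of the link scores

* Show at most the first 20 rows (assumes the file is not empty)
list id histid linkscore _merge in 1/`=min(_N, 20)', noobs
```

The -tabulate- and -summarize- output is usually short enough to paste into a post as plain text.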



      • #4
        Well, I haven't used -reclink- in a long time, so I don't know if this will be helpful, but give some consideration to these points:

        1. It makes no sense to specify only a single variable in the -orblock()- option, particularly when that variable is already included in the -required()- option. So I would just get rid of the -orblock()- option here. It may be that the inclusion of stateofbirth in both of these options is getting -reclink- confused somehow. Omitting the -orblock()- option, since you have more than four variables in the varlist, will cause -reclink- to require observations to match on at least one of the variables in the variable list. Since the -required()- option already requires them to match on both state of birth and lastblock, this will have no effect.

        2. -minbigram(0.7)- allows for pretty loose matching. Since you seem to be over-matching, I would raise this to something higher. Try .85 or even .9 first.

        3. I doubt that requiring -exactstr(lname)- makes sense. Spelling variants in surnames are quite common. I would venture a guess that easily 25% of databases that purport to contain information about me, perhaps even more, have some incorrect spelling of my last name. To treat those variations as if they were the same as completely unrelated names seems unlikely to help you find good matches.
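For concreteness, a sketch of the call from #3 with these three changes applied, restricted to a single birth year. It only drops or adjusts options that already appear in #3. The value 0.85 is simply the first one suggested in point 2, not a tested setting, and the whole block is untested against the actual data.

```stata
* Sketch only: the -reclink- call from #3 with the changes in points 1-3
local x 1930    // one birth year at a time while testing

use "/pkg/ancestryprojects/Warren_1940_Linking/Senators/msenators`x'.dta", clear

* Changes relative to #3: no orblock(), no exactstr(lname), minbigram raised to 0.85
reclink fname1 lname momfname1 momlname dadfname1 dadlname lastblock stateofbirth ///
    using "/pkg/ancestryprojects/Warren_1940_Linking/Data/By birthyear/us1940_mborn_`x'.dta", ///
    idmaster(id) idusing(id) gen(linkscore) ///
    required(stateofbirth lastblock) ///
    wnomatch(15 20 5 10 5 10 25 25) ///
    minbigram(0.85)
```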

        Others may have additional suggestions for you.

        Ultimately, if you are not successful with -reclink-, you could try Julio Raffo's newer program -matchit- (available from SSC) instead. It offers a wider variety of string distance metrics, I believe it generally runs faster, and has a number of other useful features.



        • #5
          Thank you for the response. I have modified the reclink command according to your suggestion and am currently waiting for results to come out.

          In the meantime I am investigating -matchit-. It seems that this command only works when matching on a single variable. Is this correct?



          • #6
            Yes, but you can apply it repeatedly to score the matches on each variable and then select best matches based on those scores.



            • #7
              This is not specific to -reclink-, but I have found that it helps to build programs in stages. I would suggest that you start by matching one file (e.g., msenators1930.dta) at a time. If each of these files is large, take a random sample of the observations and test out your code on that. Once you get this smaller example working, you can build back up to the larger problem.
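A minimal sketch of drawing such a test sample, using the 1930 senators file from #3. The seed, the 10% rate, and the output file name msenators1930_test.dta are all arbitrary choices made for illustration.

```stata
* Sketch: draw a reproducible random sample of one file for testing
local x 1930
use "/pkg/ancestryprojects/Warren_1940_Linking/Senators/msenators`x'.dta", clear
local n_before = _N

set seed 20240101     // arbitrary seed, so the same sample is drawn on every run
sample 10             // keep a 10% random sample of the observations in memory

display "kept " _N " of `n_before' observations"

* Hypothetical file name for the test extract
save "msenators`x'_test.dta", replace
```

One caveat: if the census (using) file is sampled in the same way, many true matches will be dropped by chance, so a sampled run is mainly useful for checking that the command behaves as expected, not for judging the match rate.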

              Best,
              Devra
              Devra Golbe
              Professor Emerita, Dept. of Economics
              Hunter College, CUNY



              • #8
                As Clyde says, you can always run -matchit- in two (or more) iterations. In your case, one alternative is to concatenate the last and given names in the two datasets and apply -matchit- in the file syntax, with txtid and txtu set to this new variable. In the second step (after merging the results with the two original datasets), you can apply -matchit- using the columns syntax to as many pairs of variables as you want, such as the parents' surnames.

                If you are concerned with the number of false positives, remember to increase the threshold which is by default .5. Also, using weights (usually) improves the quality of results.
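A rough sketch of the first step for a single birth year, assuming -matchit- is installed from SSC and that both files carry the name variables lname and fname1 used in #3. The names fullname, fullname_census, id_census and census_names_1930.dta are hypothetical, and the threshold of 0.7 and the log weights are illustrative choices, not recommendations.

```stata
* Sketch: first step, file syntax of -matchit- on concatenated names
local x 1930

* Census side: one combined name string, keeping only what the match needs
use "/pkg/ancestryprojects/Warren_1940_Linking/Data/By birthyear/us1940_mborn_`x'.dta", clear
generate fullname_census = lname + " " + fname1
rename id id_census                 // give the using id its own name
keep id_census fullname_census
save "census_names_`x'.dta", replace

* Senators side: same construction
use "/pkg/ancestryprojects/Warren_1940_Linking/Senators/msenators`x'.dta", clear
generate fullname = lname + " " + fname1

* Candidate pairs scoring below the threshold are not kept
matchit id fullname using "census_names_`x'.dta", ///
    idusing(id_census) txtusing(fullname_census) ///
    threshold(0.7) weights(log)

* Best-scoring candidate pairs first
gsort -similscore
list in 1/`=min(_N, 20)'
```

The result in memory is a list of candidate pairs (both ids, both name strings, and the similarity score), which is what the second step starts from.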

                Best,

                J.



                • #9
                  Hi everyone,

                  Thanks for the thoughtful responses. I will try -matchit-, but I wanted to give -reclink- a last shot. Unfortunately, even after tightening the -minbigram()- option, removing -orblock()-, etc., the results still contain false positives. In fact, I get the exact same matches each time, regardless of these changes. This seems quite strange to me, as I fail to see how changing the options so drastically can produce identical results.
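When results do not change at all, a few mechanical checks may help narrow things down. The sketch below makes two assumptions that should be verified: that the id variables are meant to identify observations uniquely in each input file, and that -reclink- stores the using-file copies of the matching variables under a U prefix (Ustateofbirth, Ulastblock) in the linked file.

```stata
* Sketch: sanity checks on the inputs and on one linked file
local x 1930

* 1. Do the id variables uniquely identify observations? -isid- errors out if not.
use "/pkg/ancestryprojects/Warren_1940_Linking/Senators/msenators`x'.dta", clear
isid id
use "/pkg/ancestryprojects/Warren_1940_Linking/Data/By birthyear/us1940_mborn_`x'.dta", clear
isid id

* 2. Do linked pairs actually violate the required() variables?
*    (U-prefixed names are an assumption about the output layout.)
use "/pkg/ancestryprojects/Warren_1940_Linking/Data/Linked/mlinked`x'.dta", clear
count if _merge == 3
count if _merge == 3 & stateofbirth != Ustateofbirth
count if _merge == 3 & lastblock != Ulastblock

* 3. How concentrated are the links in a single state?
tabulate Ustateofbirth if _merge == 3
```

If the second and third counts are zero, the links respect the blocking and the problem lies elsewhere. If they are not, posting these counts would make the issue much easier to diagnose.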



                  • #10
                    Hi again,

                    First, let me amend a mistake in my previous entry. Where it says "txtid and txtu" it should say "txtu".

                    Second, I have only used -reclink- to compare it with -matchit-, so I can provide little help on this. However, there are other solutions out there beyond -reclink- and -matchit-. For instance, wbuchanan has developed a Stata wrapper for phonetic string comparison (and encoding). See more here: http://www.statalist.org/forums/foru...larity-metrics



                    • #11
                      Thank you for the suggestion Julio.

                      I have had mild success with -matchit- by only using the concatenated names! This is quite nice as I am achieving about a 40% match rate while I was at 0% earlier. I am working on tweaking the code to improve this.

                      I am however struggling to understand what you mean by the second step in your original post.

                      Let us say I have the matched data. I merge it back to the original census dataset using id. Then I run -matchit- on the columns for mother and father? What exactly does that do? Would this be a way to eliminate false positives or mismatches?

                      Thank you so very much to everyone. I really appreciate you taking the time to help me with this.



                      • #12
                        There are many alternative ways to approach the second step. As a result of using -matchit- with the file syntax (first step), you will obtain a list of potential candidates containing the id and name from both the census and the senators datasets. Using the id of each dataset, you can bring in any relevant information that can be used to improve your results. The usual suspects are birth dates (or age), addresses, etc. I think you mentioned that you have the parents' surnames, which you could use to disentangle cases where there is some ambiguity. Let's say you have many John Smiths in your list of potential candidates; if they differ on their mothers' maiden names, then these are likely false positives. The same applies to having different ages and, to some extent, to having different addresses (although there might be mobility as well).

                        The comparisons in the second step can be done with simple Stata manipulation or using -matchit- in the columns syntax. In the latter case, you get a similarity score between two variables in your dataset (e.g., the census and senators mothers' surnames).
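A sketch of what the second step might look like. It assumes that the candidate list from the first step is in memory with hypothetical variable names id (senators), id_census, fullname, fullname_census and similscore, and that momlname and dadlname exist in both original files, as in #3. The generated names sim_mom and sim_dad are also hypothetical.

```stata
* Sketch: second step, columns syntax of -matchit- on the parents' surnames
* Assumes the candidate list from the first step is currently in memory.
local x 1930

* Bring in the parents' surnames from the senators file
merge m:1 id using "/pkg/ancestryprojects/Warren_1940_Linking/Senators/msenators`x'.dta", ///
    keepusing(momlname dadlname) keep(match) nogenerate
rename (id momlname dadlname) (id_sen momlname_sen dadlname_sen)

* Bring in the parents' surnames from the census file
rename id_census id
merge m:1 id using "/pkg/ancestryprojects/Warren_1940_Linking/Data/By birthyear/us1940_mborn_`x'.dta", ///
    keepusing(momlname dadlname) keep(match) nogenerate
rename (id momlname dadlname) (id_census momlname_cen dadlname_cen)

* One similarity score per pair of columns
matchit momlname_sen momlname_cen, generate(sim_mom)
matchit dadlname_sen dadlname_cen, generate(sim_dad)

list fullname fullname_census similscore sim_mom sim_dad in 1/`=min(_N, 20)'
```

Candidate pairs with a high name score but low sim_mom or sim_dad are the likely false positives. Where to draw the cutoff is a judgment call that depends on the data.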

