  • Require() and Reclink

    Hello All,

    I have what I assume is a simple problem, but I cannot work out the cause or find helpful answers in the forum.

    I am trying to match a cohort of individuals who were arrested to a file of individuals who were prosecuted. A standard number (status_no) should identify each case, but it is often duplicated. The best way to match is on status_no and last name, but because these are two different data sets, the spellings of last names often differ, so a simple merge is not perfect.

    I am trying to use the required() option of reclink to force an exact match on status_no while fuzzy matching on last name.

    reclink lastname status_no using "prosecution.dta", idmaster(id_arrest) idusing(id_case) required(status_no) _merge(testmergge) gen(testoutput) minscore(.85)


    240578 perfect matches found

    Added: id= identifier from prosecution.dta testoutput = matching score
    Observations: Master N = 241113 prosecution.dta N= 243817
    Unique Master Cases: matched = 241113 (exact = 240578), unmatched = 0)

    However, when I check the result by comparing status_no in the master to Ustatus_no, the two are completely different. I am clearly misunderstanding something about the required() option; any help would be much appreciated!

    And NB: I did try using status_no both as a string and as a numeric variable.

    Thank you!

  • #2
    The syntax looks right, and it is perplexing that you are finding different values for status_no. I have only been able to come up with one theory, and it is triggered by your statement that you tried status_no both as a string and as a numeric variable. I'm wondering whether somebody (you or the people who gave you these data sets) has -encoded- status_no. If that is the case, what look like values of 100476, 30975, or whatever might actually, inside Stata, be 1, 2, ... etc., with value labels attached that make them look like the numbers you see. Stata would then be matching on the 1, 2, ..., not on the values you see when you -list-, -browse-, or -display- status_no. The value labels could be quite different in the two data sets, so that the actual number 1 might correspond to 198732 in one data set and to 415996 in the other. (Obviously, I'm just making up numbers here to illustrate the point.)

    The best way to tell if something like this is going on is to -describe status_no- in both data sets and see if they have value labels. If they do, use -label list- to see what those value labels are, and if they are different (as they most likely will be), then that is almost certainly the cause of your difficulty. The solution is then to -decode- the variable in both sets, making string variables out of them, and then match on the string versions.
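
    As a rough sketch of that check, to be run from a do-file: the file name "arrests.dta" and the new variable name status_no_str are made up for illustration (only "prosecution.dta" appears in the original post), and -decode- will stop with an error if status_no has no value label attached, which would itself rule this theory out.

    ```stata
    * Hypothetical master file name; substitute your own.
    use "arrests.dta", clear
    describe status_no        // a non-empty "value label" column suggests -encode- was used
    label list                // shows which integer each label stands for
    decode status_no, generate(status_no_str)   // string copy of the labels
    save "arrests_decoded.dta", replace         // hypothetical output name

    * Repeat for the using file so both files carry the same string variable.
    use "prosecution.dta", clear
    describe status_no
    label list
    decode status_no, generate(status_no_str)
    save "prosecution_decoded.dta", replace     // hypothetical output name
    ```

    The match would then presumably be rerun with status_no_str in place of status_no, both in the variable list and in required().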

    If that isn't it, then I think in order to get helpful advice you will need to run -dataex- on both of these data sets and post that output here so that people can see and experiment with actual data. (If you are not familiar with -dataex-, please read Forum FAQ #12.)
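
    For instance, something along these lines in each data set would probably be enough. The choice of 20 observations is arbitrary, and the variable names are taken from the command in #1; in the prosecution file, id_case would replace id_arrest.

    ```stata
    * Small data example from the master file (variable names as in post #1).
    * If -dataex- is not already installed, it is available from SSC.
    dataex id_arrest lastname status_no in 1/20
    ```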



    • #3
      Thank you Clyde. I was able to find a fix: when I merged the larger file onto my small cohort, I got the error I described. But when I reversed it (started with the large data set and used reclink to match to the smaller cohort), it was a perfect match, as one would expect.

      I am away from my office computer for a few days, but when I return I will run describe and check for value labels. I can also provide the dataex output.

      But for now, problem solved...albeit in a still baffling way!



      • #4
        Dear Julia, did you ever solve this problem? I just faced it and fixed the problem by matching the storage type of the variables in req() using recast.
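
        A minimal sketch of what that could look like, assuming status_no is a string in both files and only the width differs (str20 is an assumed width and "prosecution_recast.dta" a made-up file name). -recast- cannot turn a numeric variable into a string, so for numeric variables the target would be a numeric type such as long or double instead.

        ```stata
        * Check the storage type of status_no in each file first.
        use "prosecution.dta", clear
        describe status_no                  // note the storage type, e.g. str10 or long
        recast str20 status_no              // assumed width; use the same type in both files
        save "prosecution_recast.dta", replace   // hypothetical output name
        ```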
        Best,
        Daniel
        Last edited by Daniel Colombo; 20 May 2022, 07:35.
