Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining two large datasets using Reclink Command

    Hi,

    I have two large datasets of diabetes patients receiving care, each with 600,000(master data) and 700,000 (using data) observations to merge. Command merge can't be used here because I do not have a unique ID in using data to uniquely identify the observations in master data. Thus, I have few variables to match that includes DOB, age, sex, ethnicity, facility and date of diagnosis. The reason to merge here is to include smoking variable from the using dataset into the master dataset. Master data is a longitudinal data while using data is a registry data where the values of the variables are entered into the system only once.

    I have tried using reclink command, but it only came out with 12,000 perfect matches and going through 647615 observation to assess fuzzy matches, each .=5% complete that ran whole night without a single dot. Can this command handle these large dataset and need to wait longer or I'm actually doing it wrong?

    here is my command:
    reclink demo_facility demo_dob demo_age demo_sex demo_ethnic diab_diagdate using ndr_general_1_1_smoking_20170313, idm(idmaster) idu(idusing) wmatch(2 5 5 2 8 4) _merge(_merge) orblock(demo_facility demo_dob demo_age demo_sex demo_ethnic diab_diagdate) gen(myscore) minscore(0.8)

    I have also tried match it, and it says that I have specified too many variables. and wasn't quite understand on how to use that command.

    I have actually read through all posts on STATA reclink matchit that I could possible find but still it seems like I really have no clue on how to proceed. I would really appreciate your comments and plus I am a very beginner in STATA.

    Thanks

    Eliana

  • #2
    -reclink- will be very slow with large data sets like this. But I'm not sure it's really the right tool for this job. -reclink-'s main virtue is its ability to do fuzzy matching of things like names that might be misspelled, or addresses that might be written with different kinds of abbreviations and omissions, etc. The variables you mention, sex, ethnicity, facility, date of birth, and date of diagnosis sound like they would be exact matches. -age- would be entirely redundant in the presence of dates of birth, so probably can be ignored. So maybe what you need is a -joinby- of the data sets on these variables.

    For which variables are you anticipating inexact matches? And what kind of inexactness are you looking for? There may be a different approach that will work better.

    Comment


    • #3
      Hi Clyde,

      I'm using registry data. The master dataset is a longitudinal data, with some of the observations have follow ups values while some might not and it is the subset of the using data. I'm trying to combine these two dataset because in using data there is smoking variable that needs to be merge with master data. However I do not have ID variable in the using data in which if I did, command merge would be simple. Because using these variables as mentioned have resulted me the answer of these variables do not uniquely identified the bservations. Therefore, I need to combine these two datasets and the match variables would be other than ID which makes it quite difficult (repeat measurements in master data).

      hope this clear things up. Appreciate further comments and thanks in advance.

      Eliana

      Comment


      • #4
        Well, it still sounds like there is nothing in these data sets that would be suitable for a fuzzy match. Your problem is that there simply isn't enough data to identify distinct individuals. All you have to go on is sex, ethnicity, facility, date of birth, and date of diagnosis. That won't get you down to individuals, although it probably gets you to small groups of individuals. You can match those to your master data using the -joinby- command: -merge- is not the correct tool here. That will create a data set where all pairs of observations in the two data sets that agree on sex, ethnicity, facility, date of birth, and date of diagnosis will be paired with each other. If you are very lucky, you might be able to review those combinations of sex, ethnicity, facility, date of birth, and date of diagnosis that don't identify a single person and then scrutinize them and in some way pick the "right" one. But that depends on it working out so that there aren't too many such groups, and those groups are not too large, and those groups contain other information that informally help you pick out the correct match even though you can't specify an algorithm for doing so.

        But I still see no role for -reclink- or -matchit- here. Those are for dealing with fuzzy matching, i.e. things like names with variable spellings. And you don't seem to have anything like that.

        To be honest, I am pessimistic that this can be done at all.

        Comment


        • #5
          Hi Clyde,

          Yes, it seems like -reclink- and -matchit- are not the solutions here and so does merge. I will try on -joinby- and see how it goes from there. Thanks a lot for the comments.

          Eliana

          Comment


          • #6
            Hi Clyde,

            I tried -joinby- and seems it worked in combining both the datasets.

            " If you are very lucky, you might be able to review those combinations of sex, ethnicity, facility, date of birth, and date of diagnosis that don't identify a single person and then scrutinize them and in some way pick the "right" one. But that depends on it working out so that there aren't too many such groups, and those groups are not too large, and those groups contain other information that informally help you pick out the correct match even though you can't specify an algorithm for doing so."

            With regards to this, I suppose there is no way for me to check if I have joined in the correct individuals right? Since I do not have IDs in one of the dataset which the only variable that can uniquely identify each observation.

            Anyway, thanks a lot Clyde for the help.

            Eliana

            Comment


            • #7
              I suppose there is no way for me to check if I have joined in the correct individuals right?
              I don't see any way to do this unless there are other variables in the two data sets that could identify (or at least narrow down the possible identities) the individuals.

              Comment


              • #8
                Yes. Indeed. I will look into that. Thanks again Clyde.

                Eliana

                Comment


                • #9
                  I think I'm joining the discussion too late. But I just want to share that there is a way to use -matchit- as a kind of fuzzy joinby on several variables at the same time. The easiest way would be to create a string variable concatenating all the desired key variables. Just be sure to concatenate them with spaces as the separator and that the variables do not include spaces (if you have numeric or date variables, convert them to text using the same pattern in both datasets). Then use -matchit- with the sim(token) option. You will get all possible candidates (based on the selected threshold) and a score which will reflect the amount of variables (tokens) matching.

                  Another option is to create your own similarity function in MATA and define it as simf_yourfunction. -matchit- should pick it up when you set the option sim(yourfunction)

                  Comment


                  • #10
                    Hi Julio. Thanks for your feedback. I have solved the problem, but your feedback is very much appreciated.

                    Eliana

                    Comment


                    • #11
                      Hi everyone,

                      I have an issue related to the question mentioned in this forum. I have various datasets that I want to merge using the individual's names and other variables (sex, birthdate, etc..) using reclink command and its option require. These data came from different sources and there is'nt an exact id to complete my task. The problem is that these datasets are very large (2 million from one side and 4 million on the other). I have access to a server but still reclink runs too slowly. I would thank any suggestion?
                      Last edited by Sebastian Montano Correa; 10 Jul 2017, 08:28.

                      Comment


                      • #12
                        -matchit- (available from SSC, by Julio Raffo) may be faster than -reclink-. But when you are trying to compare 2 million observations in one data set with each of 4 million observations in another, that is a huge task and is going to take a long time no matter how you approach it.

                        One thing that will help is if you first break each of the data sets down into subsets defined by sex, birthdate, and whatever else is part of the "etc." that you want to match on. If you then fuzzy-match the name variables separately in each subset, the task will become much quicker. Let's say, for a simple example of the principle, that you want to match on sex and name. And suppose that each of your data sets is 50% male and 50% female. If you just match on name first, you have to compare each of 2 million observations with each of 4 million, observations, a total of 8*1012. comparisons. But if you first break each data set up into male and female subsets and do name-match only of males with males and females with females, the number of comparisons needed is 2*(1 million * 2 million) = 4*1012, which is only half as many. With additional variables besides sex to partition the data set, you can greatly reduce the total number of comparisons needed. (This is what -reclink-'s require option does for you automatically; with -matchit- you have to do this yourself.)

                        Comment


                        • #13
                          Dear Statalisters, I ran my command within a Linux server using the recommendations you gave me. Thank you very much for your help. Some initial datasets were merged. However, a new problem came up after two weeks of running (at the end of the rutine!!):

                          Going through 3330191 observation to assess fuzzy matches, each .=5% complete
                          ....................
                          file /tmp/St44764.000007 not found
                          unable to restore data due to insufficient memory; try discard first
                          r(601);

                          I asked the server“s personel and they said that eventually the server deletes temp files. Could you help me to figure out what to do? May be changing the temp folder path within a Stata session?

                          Comment

                          Working...
                          X