
  • Distributing reclink to multiple computers

    It just takes a lot of time to run reclink on a large dataset.

    But I have three quite good computers.

    I thought I might speed things up by dividing the data into three pieces and running each piece on a separate computer.

    But I can't be sure that this will speed things up, because I don't know how reclink actually works internally.

    Which dataset should I divide by three to speed this up? Master data or using data?

    If there are 100K company names, does reclink take roughly half the time it takes with 200K company names?

  • #2
    Without looking deeply inside the user-written program -reclink-, I don't know, but my a priori guess would be that the time -reclink- requires is proportional to the product of the size of the master data and the using data, and that reducing either size should have comparable effects. Rather than just speculate, though, a standard and easy thing to do in a situation like yours is to try an experiment with a sample version of your files, and see what happens. Perhaps there is some reason why you can't try an experiment, but if you can, you might try using all of your master file and (say) 1% of the observations in your using file, then try with 2%, etc. In doing that, Stata's timing features can be convenient:
    Code:
    timer clear 1
    timer on 1
    reclink  ......
    timer off 1
    timer list 1
    Fancier variations of this are possible, but this is an easy place to start, and should require only 5 min. or so.

    Because I'd presume that you want all possible comparisons to be made, you'd want to use all of the observations in one of the files, and a part of the observations in the other file, however you eventually split apart the files.
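
    The experiment described above could be sketched as a loop along the following lines. This is only a rough outline under stated assumptions: the file names (master_names, using_names, using_sample), the variables (company_name, city, id_master, id_using, matchscore), and the 1/2/4 percent steps are all hypothetical, and the -reclink- options shown are just the required ones, so your real call will differ.

    ```stata
    * Sketch: time -reclink- on growing random samples of the using file.
    * All file and variable names below are hypothetical placeholders.
    set seed 12345
    foreach pct of numlist 1 2 4 {
        use using_names, clear
        sample `pct'                        // keep a random `pct'% of the using file
        save using_sample, replace

        use master_names, clear             // the full master file every time
        timer clear 1
        timer on 1
        reclink company_name city using using_sample.dta, ///
            idmaster(id_master) idusing(id_using) gen(matchscore)
        timer off 1

        quietly timer list 1                // leaves elapsed seconds in r(t1)
        display "using sample `pct'%: " r(t1) " seconds"
    }
    ```

    If the reported times roughly double each time the sample doubles, that would be consistent with the guess that run time scales with the product of the two file sizes, and splitting one file three ways should then cut the time per machine to about a third.
    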



    • #3
      I can't follow what precise proposals are embedded in #1, although I strongly agree with Mike Lacy that such problems are intrinsically not linear in dataset size.

      But if either dataset is split, how do you know that the best match is not in some other part of the dataset not being examined in this run? How are you going to post-process results? If these are naive questions, do rebut or ignore them as seems appropriate.



      • #4
        Nick's point about possibly missing the *best* match is of course correct, but there might be some ways to approach this situation. For example, keep all the really high-scoring matches (which in my limited experience are generally the correct ones), and then go back and re-examine observations that only had low to medium scoring matches.
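
        One way this post-processing might look, as a minimal sketch: it assumes the using file was split three ways, the full master file was run against each piece, and each machine saved its -reclink- output as matches_part1 to matches_part3 with the variables id_master, id_using, and matchscore. All of those names, and the 0.9 cutoff, are hypothetical.

        ```stata
        * Sketch: combine per-machine results and keep the best match per master record.
        * File names, variable names, and the 0.9 cutoff are hypothetical.
        use matches_part1, clear
        append using matches_part2 matches_part3

        * Master records with no match in a given piece have a missing score.
        * Drop those rows first, because Stata sorts missing values last.
        drop if missing(matchscore)

        * Within each master id, sort by score and keep the highest-scoring row.
        bysort id_master (matchscore): keep if _n == _N

        * Flag the low to medium scores for a second look.
        generate byte review = matchscore < 0.9
        ```

        Because every master record is compared against all three pieces, the row kept here is the best match across the whole using file, which addresses the concern in #3. Ties in score are broken arbitrarily, and master records that matched nothing in any piece drop out of this file entirely, so they would need to be recovered from the original master file.
        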
