  • checking string similarity within the same variable

    Dear all,

    This message was initially posted in the discussion thread
    https://www.statalist.org/forums/forum/general-stata-discussion/general/1307980-matchit-command-to-match-two-datasets-based-on-similar-text-pattern,
    but I was advised to start a new thread with a title that better matches my question, so here we go!

    In most of the string similarity discussions on Statalist, users are trying to find similarities between two variables. I, however, would like to get a similarity score for observations within the same string variable. My data set contains more than 10,000 person records, and most likely hundreds of people appear in it multiple times with slightly differently spelled names.

    Do you have any experience with checking for string similarity within the same variable, and may I ask which package you decided to use in the end?

    Thank you for sharing your experience!

    Best wishes,

    Moniek

  • #2
    Code:
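    * Pair every make with every other make, then score each pair
    sysuse auto, clear
    keep make
    * Save a renamed copy so the two sides of each pair have distinct names
    preserve
    rename make make2
    tempfile make2
    save `make2'
    restore
    * Form all pairwise combinations, then keep each unordered pair once
    cross using `make2'
    drop if make2>=make
    * matchit is user-written (ssc install matchit)
    matchit make make2
    list if similscore>0.5, clean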
    Result:

    Code:
    . list if similscore>0.5, clean
    
            make                make2               similsc~e  
     106.   Buick Regal         Buick Opel          .52704628  
     116.   Buick Riviera       Buick Regal         .54772256  
     165.   Cad. Seville        Cad. Deville        .81818182  
     197.   Chev. Impala        Chev. Chevette      .55337157  
     213.   Chev. Malibu        Chev. Chevette      .55337157  
     214.   Chev. Malibu        Chev. Impala        .54545455  
     230.   Chev. Monte Carlo   Chev. Chevette      .57353933  
     248.   Chev. Monza         Chev. Chevette        .580381  
     250.   Chev. Monza         Chev. Malibu        .57207755  
     251.   Chev. Monza         Chev. Monte Carlo   .63245553  
     267.   Chev. Nova          Chev. Chevette      .61177529  
     268.   Chev. Nova          Chev. Impala        .50251891  
     269.   Chev. Nova          Chev. Malibu        .50251891  
     271.   Chev. Nova          Chev. Monza         .52704628  
     344.   Dodge Magnum        Dodge Colt          .50251891  
     433.   Ford Mustang        Ford Fiesta         .57207755  
     652.   Merc. Marquis       Merc. Cougar        .52223297  
     692.   Merc. Monarch       Merc. Cougar        .56407607  
     693.   Merc. Monarch       Merc. Marquis        .6172134  
     732.   Merc. XR-7          Merc. Bobcat        .50251891  
     733.   Merc. XR-7          Merc. Cougar        .50251891  
     735.   Merc. XR-7          Merc. Monarch       .53452248  
     778.   Merc. Zephyr        Merc. XR-7          .50251891  
     913.   Olds Cutlass        Olds Cutl Supr      .66899361  
    1005.   Olds Omega          Olds 98             .54433105  
    1318.   Plym. Sapporo       Plym. Arrow         .54772256  
    1413.   Pont. Catalina      Linc. Continental   .55815631  
    1670.   Pont. Phoenix       Pont. Grand Prix    .52174919  
    1730.   Pont. Sunbird       Pont. Firebird      .67082039  
    1797.   Datsun 210          Datsun 200          .77777778  
    1819.   Datsun 510          Datsun 200          .66666667  
    1820.   Datsun 510          Datsun 210          .77777778  
    1842.   Datsun 810          Datsun 200          .66666667  
    1843.   Datsun 810          Datsun 210          .77777778  
    1844.   Datsun 810          Datsun 510          .77777778  
    2284.   Toyota Corolla      Toyota Celica       .56044854  
    2350.   Toyota Corona       Toyota Celica       .58333333  
    2351.   Toyota Corona       Toyota Corolla      .80064077
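
    A quick way to see how this approach scales is to count the pairs it has to build. The sketch below is only arithmetic on the observation count: cross produces N^2 rows, and dropping make2>=make leaves the N(N-1)/2 unordered pairs. The 10,000 figure is taken from the question; nothing else is assumed.

    ```stata
    * Count the rows that cross creates and the pairs that survive the drop
    sysuse auto, clear
    display _N^2                      // rows after cross: 5476
    display comb(_N, 2)               // unordered pairs kept: 2701
    * The same arithmetic for a 10,000-record data set
    display %12.0fc 10000^2           // 100,000,000 rows after cross
    display %12.0fc comb(10000, 2)    // 49,995,000 unordered pairs
    ```

    So the auto example stays tiny, while 10,000 records require on the order of a hundred million rows before any scoring starts.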



    • #3
      Thank you very much, Andrew, for suggesting cross to me.

      However, as my dataset contains over 10,000 patients, Stata could not create all unique pairs within a reasonable time frame (after 2 hours, still nothing).

      In the end, I checked for duplicates within each town. This splits the data into multiple smaller subsets, which greatly reduces the number of unique pairs that Stata needs to create. With this workaround, Stata was able to run cross on all my data.
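
      One way to express this per-town pairing without looping over subsets by hand is joinby, which forms all pairwise combinations within groups. This is a minimal sketch, not the code actually used: the variable names id, name, and town and the five toy records are hypothetical, and matchit is user-written (from SSC; it may also need the freqindex package).

      ```stata
      * Hypothetical toy data: a numeric id, a name, and a town to block on
      clear
      input int id str20 name str10 town
      1 "Moniek Jansen"  "Utrecht"
      2 "Monique Jansen" "Utrecht"
      3 "Jan de Vries"   "Leiden"
      4 "Jan de Vris"    "Leiden"
      5 "Piet Bakker"    "Leiden"
      end

      * Save a renamed copy so both sides of each pair have distinct names
      preserve
      rename (id name) (id2 name2)
      tempfile copy
      save `copy'
      restore

      * Pair records only within the same town, then keep each pair once
      joinby town using `copy'
      drop if id2 >= id

      * Score the remaining pairs
      matchit name name2
      list if similscore > 0.5, clean
      ```

      With the toy data, the two Utrecht records give one pair and the three Leiden records give three, so four pairs are scored instead of the ten that an unrestricted cross would leave.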

      I am wondering, though, about other people with large datasets who have no other variable such as town to filter on and so cannot split their data into smaller subsets: is there a way to still use cross without Stata taking multiple hours or days to run?

      Thank you for sharing your ideas!

      Best wishes,

      Moniek



      • #4
        I guess that this has to do with available memory and processor speed. You can always break the task into several crosses. Replicating your task with 10,000 observations, my computer refuses to provide the memory if I do it all at once, but when I break it up into tenths, it takes slightly under 4 minutes.

        Code:
        *CREATE DATA SET
        clear
        set obs 10000
        gen make = string(_n)
        tempfile data
        save `data'
        *TURN TIMER ON
        timer on 1
        *BREAK IT UP INTO TENTHS AND CROSS
        forval i = 1/10 {
            keep if inrange(_n, `i'000-999, `i'000)
            rename make make2
            tempfile ds`i'
            save `ds`i''
            use `data', clear
        }

        forval i = 1/10 {
            cross using `ds`i''
            tempfile cds`i'
            save `cds`i''
            use `data', clear
        }

        *APPEND DATA SETS
        use `cds1', clear
        forval i = 2/10 {
            append using `cds`i''
        }
        *TURN OFF TIMER
        timer off 1

        Result:

        Code:
        . timer list 1
           1:    234.20 /        1 =     234.2040
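
        For data without a blocking variable such as town, another option that may be worth trying is the indexed form of matchit described in its help file, which compares a master dataset against a using file and keeps only pairs at or above a similarity threshold rather than building every pair first. The sketch below is untested on large data and rests on assumptions: the variables id and name, the toy records, and the scratch file names_copy.dta are all hypothetical, and matchit is user-written (from SSC; it may also need the freqindex package).

        ```stata
        * Hypothetical toy data: a numeric id and a name to compare
        clear
        input int id str20 name
        1 "Moniek Jansen"
        2 "Monique Jansen"
        3 "Jan de Vries"
        4 "Jan de Vris"
        5 "Piet Bakker"
        end

        * Save a renamed copy of the data to match against itself
        * (names_copy.dta is a scratch file in the working directory)
        preserve
        rename (id name) (id2 name2)
        save names_copy, replace
        restore

        * Indexed matching: only pairs scoring at least 0.5 are returned
        matchit id name using names_copy.dta, idusing(id2) txtusing(name2) threshold(0.5)

        * Drop self-matches and mirrored pairs
        drop if id2 >= id
        list, clean
        ```

        Whether this is faster than splitting the cross into chunks will depend on the data, so timing both on a sample first seems sensible.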

