  • Jaccard Similarity programming

    Hello Statalisters,

    I've been wrangling with this problem for a while and have not managed to write workable code. I am trying to calculate a Jaccard similarity measure using the data below. What I have is a patent number and the technological classes to which each patent is assigned (a patent can have multiple classes). What I want to do is measure the similarity between pairs of classes, based on the sets of patents that cite them.

    A Jaccard similarity is the number of elements shared between two sets divided by the number of elements in either set (shared and unshared).
    For my purposes, this would be the number of patents that cite both classes divided by the number of patents that cite either class.

    Code:
    input patno str3 class
    1 aa
    1 bb
    2 cc
    2 dd
    3 aa
    3 bb
    4 aa
    4 cc
    end
    For example, for classes aa and bb, the Jaccard similarity would be 0.67: the numerator is 2 (patents 1 and 3 cite both aa and bb) and the denominator is 3 (patents 1, 3, and 4 cite aa, bb, or both).

    Could anyone suggest code that would allow me to calculate Jaccard similarity measures for a larger dataset?

    Thanks in advance,
    Ed

  • #2
    This is a bit of a kludge, and my instinct is that there is a more elegant way to do this that I'm missing. But, for what it's worth:

    Code:
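    * Save the patent-class data so that each pair can reload it later
    tempfile original_data
    save `original_data'
    
    * Build every unordered pair of distinct classes
    keep class
    duplicates drop
    tempfile copy
    rename class class2
    save `copy'
    rename class2 class1
    cross using `copy'
    keep if class1 < class2
    * Carry the temporary file's path in the data so the program can find it
    gen source = `"`original_data'"'
    
    capture program drop one_pair
    program define one_pair
        * runby calls this program once per (class1, class2) pair
        local class1 = class1[1]
        local class2 = class2[1]
        local original_data = source[1]
        * Load only the rows that mention either class of the pair
        use if inlist(class, `"`class1'"', `"`class2'"') using `original_data', clear
        sort patno class
        * Flag, for each patent, whether it cites each class
        by patno: egen byte mention1 = max(class == `"`class1'"')
        by patno: egen byte mention2 = max(class == `"`class2'"')
        * Keep one row per patent; _N is now the number of patents citing either class
        by patno: keep if _n == 1
        egen numerator = total(mention1 & mention2)
        gen jaccard = numerator/_N
        * Every remaining row carries the same jaccard value for this pair
        keep jaccard
        gen class1 = `"`class1'"'
        gen class2 = `"`class2'"'
        exit
    end
    
    runby one_pair, by(class1 class2) status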
    -runby- is written by Robert Picard and me, and is available from SSC.
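
    As a sanity check on the aa/bb example from the question, the hand count can be reproduced directly for that one pair. This is only a minimal sketch, separate from the -runby- approach above: it assumes the example data exactly as posted in #1, and the variable names has_aa, has_bb, and one_per_patent are made up for illustration.

    Code:
    clear
    input patno str3 class
    1 aa
    1 bb
    2 cc
    2 dd
    3 aa
    3 bb
    4 aa
    4 cc
    end
    
    * Flag, on every row of a patent, whether that patent cites each class
    bysort patno: egen byte has_aa = max(class == "aa")
    bysort patno: egen byte has_bb = max(class == "bb")
    
    * Mark one row per patent so that each patent is counted once
    egen byte one_per_patent = tag(patno)
    
    * Numerator: patents citing both classes
    count if one_per_patent & has_aa & has_bb
    local both = r(N)
    * Denominator: patents citing either class
    count if one_per_patent & (has_aa | has_bb)
    local either = r(N)
    
    display "Jaccard(aa, bb) = " `both'/`either'
    With these eight observations the two counts are 2 and 3, so the displayed ratio is 2/3, or about 0.67, which should match the jaccard value that -runby- returns for the aa/bb pair.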


    • #3
      Hi Clyde,

      thank you very much for this! It's just what I needed.

      Ed
