
  • Counting occurrences of pairwise combinations

    Hello,

    I have a string variable that is a list of words.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input strL items
    "d;c;a;"
    "a;b;c;d;"                         
    "d;a;f;e;h;"                
    "n;e;o;c;d"                    
    end

    I would like to count how many times each possible pairwise combination of two words occurs in the whole dataset, and then generate a variable that sums those counts over all the pairwise combinations in each observation.

    So, for example, for the first observation I want to generate a variable that is the sum of the total occurrences of "d" and "c" together, which is 3, plus the total occurrences of "d" and "a" together, which is 3, plus the total occurrences of "c" and "a" together, which is 2, giving 8.

    Thank you for any hint you can provide!



  • #2
    This will work if the data set isn't too large (in terms of the total number and length of the tokens contained in all observations of items). If there are too many, the list will exceed the maximum length of a macro, and in that case some other approach would be needed.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input strL items
    "d;c;a;"
    "a;b;c;d;"                        
    "d;a;f;e;h;"                
    "n;e;o;c;d"                    
    end
    
    levelsof items, local(words) clean
    local words: subinstr local words ";" " ", all
    local words: list uniq words
    local words: list sort words
    display `"`words'"'
    
    local word_count: word count `words'
    
    gen long co_occurrences = 0
    gen byte both = 0
    * note: strpos() matches substrings, so this assumes no word is contained in another word
    forvalues i = 1/`word_count' {
        forvalues j = `=`i'+1'/`word_count' {
            replace both = strpos(items, `"`:word `i' of `words''"') ///
                & strpos(items, `"`:word `j' of `words''"')
            summ both, meanonly
            replace co_occurrences = co_occurrences + r(sum) if both
        }
    }
    
    list, noobs clean
    Added: This code assumes that:
    1. The tokens occurring in each observation of items are distinct, i.e. there is no value of items like "a; b; b; c", OR
    2. If such repeated tokens do occur, an observation like that should count as only a single occurrence of b and c together.

    Assumption 1 is true in your example data.
    Last edited by Clyde Schechter; 14 Oct 2017, 12:17.
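
    A side note on the matching step in the code above: strpos() tests for a substring, so it is safe for the single-letter tokens in the example but could over-count if one word is contained in another (say "cat" and "catalog"). A minimal sketch of a delimiter-aware test follows, assuming ";" is the only separator and tokens carry no stray spaces. The toy data and the variable names naive and exact are made up for illustration.

    ```stata
    clear
    input str20 items
    "cat;dog;"
    "catalog;dog;"
    end

    * substring test: also flags "cat" inside "catalog"
    gen byte naive = strpos(items, "cat") > 0

    * delimiter-aware test: wrap the list and the word in ";" so only whole tokens match
    gen byte exact = strpos(";" + items + ";", ";cat;") > 0

    list, noobs clean
    ```

    Here naive is 1 in both observations while exact is 1 only in the first. The same wrapping could be applied to both strpos() calls inside the loop if the real words are longer than one character.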



    • #3
      Here's another way to get there using joinby to form all pairwise combinations.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input strL items
      "d;c;a;"
      "a;b;c;d;"                        
      "d;a;f;e;h;"                
      "n;e;o;c;d"                    
      end
      gen long obs = _n
      tempfile master
      save "`master'"
      
      * separate words and reshape to a long layout
      split items, gen(w) parse(;)
      reshape long w, i(obs) j(n)
      keep obs w
      drop if mi(w)
      tempfile hold
      save "`hold'"
      
      * form all pairwise combinations within each obs; keep each unordered pair once and drop self pairs
      rename w w0
      joinby obs using "`hold'"
      keep if w0 < w
      
      * the count of pairs across all observations
      bysort w0 w: gen N = _N
      collapse (sum) N, by(obs)
      
      * recombine with original observations
      merge 1:1 obs using "`master'"
      sort obs
      list
      and the results:
      Code:
      . list
      
           +-------------------------------------+
           | obs    N        items        _merge |
           |-------------------------------------|
        1. |   1    8       d;c;a;   matched (3) |
        2. |   2   11     a;b;c;d;   matched (3) |
        3. |   3   13   d;a;f;e;h;   matched (3) |
        4. |   4   13    n;e;o;c;d   matched (3) |
           +-------------------------------------+
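
      To see where those totals come from, the pair counts can be inspected before they are summed. The sketch below reuses the steps above but swaps the bysort/collapse step for contract, which leaves one row per pair. The inlist() filter is only there to show the three pairs from observation 1.

      ```stata
      clear
      input strL items
      "d;c;a;"
      "a;b;c;d;"
      "d;a;f;e;h;"
      "n;e;o;c;d"
      end
      gen long obs = _n

      * one row per (obs, word), as in the code above
      split items, gen(w) parse(;)
      reshape long w, i(obs) j(n)
      keep obs w
      drop if mi(w)
      tempfile hold
      save "`hold'"

      * all pairs within each obs, each unordered pair kept once
      rename w w0
      joinby obs using "`hold'"
      keep if w0 < w

      * one row per pair; N = number of observations containing that pair
      contract w0 w, freq(N)
      list if inlist(w0 + w, "ac", "ad", "cd"), noobs clean
      ```

      For the example data this should list a-c with N = 2, a-d with N = 3 and c-d with N = 3, which add up to the 8 shown for observation 1. Like the code above, it assumes tokens are not repeated within an observation.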



      • #4
        Thank you Robert! It takes ages but it does the job!
