
  • Measuring similarity and dissimilarity indices between each observation and the reference row

    Hi,

    I have a dataset with 16,000 observations and 68 variables. I want to measure the similarity (correlation) and dissimilarity (L2squared distance) of each row compared with the last row (userid=98), which is my reference row. My goal is to have two new variables (similarity and dissimilarity) that show how each user's profile compares with the reference row. I found this link, but I have never worked with matrices in Stata and don't know how to approach this problem. I would really appreciate your help.

    Here is a sample of my data:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long userid float(v1 v2 v3 v4)
    24      3.12       2.5      2.88      3.12
    40 3.4424145 2.8373425  3.281276  3.370227
    51      4.12      3.12      3.88         4
    67      4.12      3.12      4.88      3.88
    76  3.685956  3.127154  3.620283  3.439283
    84  3.352679 3.2907455 3.2907455 3.2711875
    95 3.7548585  3.990283  3.757191  3.615336
    97      2.88      2.88         3      3.12
    98  3.235533  2.831092 3.1384676 2.9281576
    end


  • #2
    I figured out the issue. My data is too big: my version of Stata only allows matrices of dimension up to 800. Do you know how I can split my dataset into 20 datasets? The only way I can think of is to keep 800 observations, save them as part1, reopen the main data, merge with part1, drop if _merge==3, keep the next 800 observations, and so on, repeating this process 20 times. Is there a faster way to split the dataset?

    Thanks
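
    One possible shortcut for the split described above, as a minimal sketch rather than a definitive recipe: number the rows in blocks of 800 and save each block in a loop. The toy data and the names chunk, nchunks and part1 to part20 are assumptions for illustration. Each part would still need the reference row added before any comparison with it.

    ```stata
    * Toy stand-in for the real data: 16,000 rows with an id only
    clear
    set obs 16000
    generate long userid = _n

    * Block number: rows 1-800 get 1, rows 801-1600 get 2, and so on
    generate long chunk = ceil(_n / 800)
    summarize chunk, meanonly
    local nchunks = r(max)

    * Save each block to its own file in the working directory
    forvalues c = 1/`nchunks' {
        preserve
        keep if chunk == `c'
        drop chunk
        save part`c', replace
        restore
    }
    ```

    Because of preserve and restore, the full dataset is back in memory after the loop, so no merging or dropping is needed.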



    • #3
      Couldn't you do something like the following?
      Code:
      quietly generate double L2squared = .
      quietly generate double corr = .
      
      tempname M
      forvalues row = 1/`=_N-1' {
      
           matrix dissimilarity `M' = v1-v4 if inlist(_n, `row', `=_N'), observation L2squared
           quietly replace L2squared = `M'[2, 1] in `row'
      
           matrix dissimilarity `M' = v1-v4 if inlist(_n, `row', `=_N'), observation correlation
           quietly replace corr = `M'[2, 1] in `row'
      }
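
      As a cross-check on the L2squared values from the loop above, the squared Euclidean distance to the last row can also be computed directly from the variables, with no matrices involved. This is only a sketch: it uses the first three rows and the reference row of the example data from post #1, and L2sq_check is a hypothetical variable name.

      ```stata
      * Subset of the example data; the reference row (userid 98) is last
      clear
      input long userid float(v1 v2 v3 v4)
      24      3.12       2.5      2.88      3.12
      40 3.4424145 2.8373425  3.281276  3.370227
      51      4.12      3.12      3.88         4
      98  3.235533  2.831092 3.1384676 2.9281576
      end

      * Sum of squared differences between each row and the last row
      generate double L2sq_check = 0
      foreach v of varlist v1-v4 {
          quietly replace L2sq_check = L2sq_check + (`v' - `v'[_N])^2
      }

      * Leave the reference row itself missing, as the loop above does
      quietly replace L2sq_check = . if _n == _N
      list userid L2sq_check
      ```

      Small differences in the last digits are possible because the variables are stored as float.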



      • #4
        Thanks, Joseph. I don't quite understand your code, but here is how I did it for part of my data, and my result is not the same as the result I get from your code. Have I made a mistake?
        I replaced the userid of the reference row with -1 so that it sorts to the first row.
        Code:
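        set matsize 800
        foreach i of numlist 1/4 {
            * Load part i, replacing the data in memory
            use profile`i', clear
            drop int* val* sty*
            * The reference row has userid == -1, so sorting puts it first
            sort userid
            * Default measure is L2 (Euclidean distance)
            matrix dissimilarity Dis = v1 v2 v3 v4
            svmat Dis
            matrix dissimilarity Corr1 = v1 v2 v3 v4, corr
            svmat Corr1
        }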
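
        To show where the distances to the reference row end up with this approach, here is a small sketch on part of the example data from post #1. It assumes, as described above, that the reference row's userid has been recoded to -1 so that it sorts first. After svmat, the first new variable (Dis1) holds column 1 of the matrix, that is, each row's distance to the reference row.

        ```stata
        * Subset of the example data, reference row recoded to userid -1
        clear
        input long userid float(v1 v2 v3 v4)
        24      3.12       2.5      2.88      3.12
        40 3.4424145 2.8373425  3.281276  3.370227
        51      4.12      3.12      3.88         4
        -1  3.235533  2.831092 3.1384676 2.9281576
        end
        sort userid

        * 4 x 4 matrix of pairwise L2 (Euclidean) distances, the default
        matrix dissimilarity Dis = v1 v2 v3 v4

        * svmat adds one variable per matrix column: Dis1, Dis2, Dis3, Dis4
        svmat Dis

        * Column 1 is the reference row, so only Dis1 is needed
        keep userid Dis1
        list
        ```

        The full matrix has one row and one column per observation, which is why this route runs into the matrix size limit on larger files.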



        • #5
          Joseph's (neatly and nicely done) code loops over all but one row, comparing each of those rows with that other row, the reference row.

          To see what the inlist() call does, compare some different examples anyone can run.

          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . list mpg weight if inlist(_n, 1, _N)
          
               +--------------+
               | mpg   weight |
               |--------------|
            1. |  22    2,930 |
           74. |  17    3,170 |
               +--------------+
          
          . list mpg weight if inlist(_n, 2, _N)
          
               +--------------+
               | mpg   weight |
               |--------------|
            2. |  17    3,350 |
           74. |  17    3,170 |
               +--------------+
          
          . di _N
          74
          You could write

          Code:
          ... if _n == `row' | _n == _N
          to get the same effect.



          • #6
            Thanks Nick. I got the inlist part. The part I didn't completely understand is

            Code:
             
             tempname M     
             quietly replace corr = `M'[2, 1] in `row'
            The reason I didn't get the same result from my code as from Joseph's is that my code calculated L2, not L2squared. I changed both to L2 and got exactly the same numbers.

            Thanks to both of you for the help. I really appreciate it.
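
            For the two lines quoted above, a minimal sketch on the auto data may help. tempname M asks Stata for a temporary matrix name, stored in the local macro M, that cannot clash with existing matrices. With only two observations selected, the result is a 2 x 2 matrix, and the element in row 2, column 1 is the measure between those two observations. The second temporary name, M2, is an assumption added here to show that L2squared is simply the square of L2.

            ```stata
            sysuse auto, clear

            * Temporary matrix names, held in the local macros M and M2
            tempname M M2

            * Rows 1 and 74 only, so each result is a 2 x 2 matrix
            matrix dissimilarity `M' = mpg weight if inlist(_n, 1, _N), L2squared
            matrix dissimilarity `M2' = mpg weight if inlist(_n, 1, _N), L2
            matrix list `M'

            * Element [2, 1] is the measure between the two selected rows
            display "L2squared:    " `M'[2, 1]
            display "L2, squared:  " `M2'[2, 1]^2
            ```

            In the loop from post #3, that single element is what gets copied into the current row of the new variable.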
