Measuring similarity and dissimilarity indices between each observation and the reference row

Monica Muller

Join Date: Jul 2014

Posts: 226
#1

10 Apr 2017, 17:38

Hi,

I have a dataset with 16,000 observations and 68 variables. I want to measure the similarity (correlation) and dissimilarity (L2squared distance) of each row compared with the last row (userid=98), which is my reference row. My goal is to have two new variables (similarity and dissimilarity) that show how each user's profile compares with the reference row. I found this link, but I have never worked with matrices in Stata and don't know how I should approach this problem. I would really appreciate your help.

Here is a sample of my data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long userid float(v1 v2 v3 v4)
24 3.12 2.5 2.88 3.12
40 3.4424145 2.8373425 3.281276 3.370227
51 4.12 3.12 3.88 4
67 4.12 3.12 4.88 3.88
76 3.685956 3.127154 3.620283 3.439283
84 3.352679 3.2907455 3.2907455 3.2711875
95 3.7548585 3.990283 3.757191 3.615336
97 2.88 2.88 3 3.12
98 3.235533 2.831092 3.1384676 2.9281576
end
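
For reference, the two measures named above can be written out as follows. This is only a sketch based on the usual definitions of squared Euclidean distance and of a correlation computed across the variables of two observations. The notation is introduced here and is not from the thread: $x_{ia}$ is the value of variable $a$ in row $i$, $r$ is the reference row, and $p$ is the number of variables compared ($p = 4$ in the sample above).

```latex
\[
\text{L2squared}(i, r) = \sum_{a=1}^{p} (x_{ia} - x_{ra})^2
\]
\[
\text{corr}(i, r) =
\frac{\sum_{a=1}^{p} (x_{ia} - \bar{x}_{i})(x_{ra} - \bar{x}_{r})}
     {\sqrt{\sum_{a=1}^{p} (x_{ia} - \bar{x}_{i})^2 \, \sum_{a=1}^{p} (x_{ra} - \bar{x}_{r})^2}},
\qquad
\bar{x}_{i} = \frac{1}{p} \sum_{a=1}^{p} x_{ia}
\]
```

A row identical to the reference row has an L2squared of 0, and larger values mean the two profiles are further apart.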
Monica Muller

Join Date: Jul 2014

Posts: 226
#2

10 Apr 2017, 21:39

I figured out the issue. My data are too big: my version of Stata only allows matrices of dimension up to 800. Do you know how I can split my dataset into 20 datasets? The only way I can think of is to keep 800 observations, save them as part1, reopen the main data, merge with part1, drop if _merge==3, keep the next 800 observations, and so on, repeating the process 20 times. Is there a faster way to split the dataset?

Thanks
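
One possible way to do the split without repeated merging is to number the chunks with a new variable and save each chunk inside a loop. The following is a minimal sketch, not a tested solution for the real data. It assumes the reference row is identified by userid == 98 and that a copy of it should travel with every chunk, so each chunk holds 799 other rows plus the reference row. The names isref, chunk, and part1, part2, ... are hypothetical.

```stata
* Sketch: split the data into pieces that fit within an 800-row matrix limit.
* Assumption: userid == 98 is the reference row and is kept in every piece.
local size = 799

* Flag the reference row and sort it to the end of the data.
generate byte isref = (userid == 98)
sort isref userid

* Number the remaining rows in blocks of `size' observations.
generate long chunk = ceil(_n / `size') if !isref
summarize chunk, meanonly
local nchunks = r(max)

* Save each block together with the reference row (hypothetical file names).
forvalues c = 1/`nchunks' {
    preserve
    keep if chunk == `c' | isref
    save part`c', replace
    restore
}
```

With 16,000 observations this produces about 21 files rather than 20, because one slot in each file is reserved for the reference row.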

Joseph Coveney

Join Date: Apr 2014
Posts: 4420

10 Apr 2017, 23:08

Couldn't you do something like the following?

Code:

quietly generate double L2squared = .
quietly generate double corr = .

tempname M
forvalues row = 1/`=_N-1' {

     matrix dissimilarity `M' = v1-v4 if inlist(_n, `row', `=_N'), observation L2squared
     quietly replace L2squared = `M'[2, 1] in `row'

     matrix dissimilarity `M' = v1-v4 if inlist(_n, `row', `=_N'), observation correlation
     quietly replace corr = `M'[2, 1] in `row'
}
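
As a cross-check on the loop above, the squared distance to the reference row can also be built up variable by variable, with no matrix involved. This is a sketch under the assumption that the reference row is the last observation, as in the sample data; the variable name L2sq_check is hypothetical.

```stata
* Cross-check sketch: squared Euclidean distance to the last row,
* accumulated one variable at a time instead of through a matrix.
quietly generate double L2sq_check = 0
foreach v of varlist v1-v4 {
    * `v'[_N] is the value of this variable in the last (reference) row.
    quietly replace L2sq_check = L2sq_check + (`v' - `v'[_N])^2
}

* The loop above leaves the reference row missing, so do the same here.
quietly replace L2sq_check = . if _n == _N
```

If both approaches are run on the same data, L2sq_check and L2squared should agree up to rounding.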


Monica Muller

Join Date: Jul 2014

Posts: 226
#4

11 Apr 2017, 00:19

Thanks, Joseph. I don't quite understand your code, but here is how I did it for part of my data, and my result is not the same as the result I get from your code. Have I made a mistake?
I replaced the userid of the reference row with -1 so that it sorts to the first row.

Code:

set matsize 800
foreach i of numlist 1/4 {
    use profile`i', clear
    drop int* val* sty*
    sort userid
    matrix dissimilarity Dis = v1 v2 v3 v4
    svmat Dis
    matrix dissimilarity Corr1 = v1 v2 v3 v4, corr
    svmat Corr1
}

Nick Cox

Join Date: Mar 2014
Posts: 35721

11 Apr 2017, 01:08

Joseph's (neatly and nicely done) code loops over all but one row, comparing each of those rows with that other row, the reference row.

To see what the inlist() call does, compare some different examples anyone can run.

Code:

. sysuse auto, clear
(1978 Automobile Data)

. list mpg weight if inlist(_n, 1, _N)

     +--------------+
     | mpg   weight |
     |--------------|
  1. |  22    2,930 |
 74. |  17    3,170 |
     +--------------+

. list mpg weight if inlist(_n, 2, _N)

     +--------------+
     | mpg   weight |
     |--------------|
  2. |  17    3,350 |
 74. |  17    3,170 |
     +--------------+

. di _N
74

You could write

Code:

... if _n == `row' | _n == _N

to get the same effect.


Monica Muller

Join Date: Jul 2014

Posts: 226
#6

11 Apr 2017, 09:44

Thanks Nick. I got the inlist part. The part I didn't completely understand is

Code:

tempname M
quietly replace corr = `M'[2, 1] in `row'

And the reason why I didn't get the same result from my code compared to Joseph's is that my code calculated L2 not L2squared. I changed both to L2 and I got the exact same numbers.
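
The two distance measures differ only by a square root, which is why switching both runs to the same measure reconciles the numbers. As an illustration, here is the relation together with a rounded calculation worked by hand from the sample rows in the first post, comparing userid 24 with userid 98. The notation is introduced here for illustration: $x_{ia}$ is the value of variable $a$ in row $i$, $r$ is the reference row, and $p$ is the number of variables.

```latex
\[
\text{L2}(i, r) = \sqrt{\sum_{a=1}^{p} (x_{ia} - x_{ra})^2} = \sqrt{\text{L2squared}(i, r)}
\]
\[
\text{L2squared}(24, 98) \approx (-0.1155)^2 + (-0.3311)^2 + (-0.2585)^2 + (0.1918)^2 \approx 0.2266,
\qquad
\text{L2}(24, 98) \approx \sqrt{0.2266} \approx 0.4760
\]
```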

Thanks to both of you for the help. I really appreciate it.
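
On the two lines quoted above: tempname M asks Stata for a temporary name and stores it in the local macro M, and `M'[2, 1] picks out the element in row 2, column 1 of the matrix saved under that name. Because only two observations pass the if condition in each pass of the loop, the matrix is 2 x 2, and its off-diagonal element is the measure between the current row and the reference row. The following sketch shows this with the auto data from Nick's example rather than the original data.

```stata
* Sketch using the auto data; run it as one block (for example from a
* do-file) so that the temporary name stays defined throughout.
sysuse auto, clear

* tempname reserves a name that cannot clash with an existing matrix;
* Stata drops the matrix automatically when the do-file or program ends.
tempname M

* Only observations 1 and 74 pass the if condition, so the result is 2 x 2.
matrix dissimilarity `M' = mpg weight if inlist(_n, 1, _N), L2squared
matrix list `M'

* Row 2, column 1 holds the dissimilarity between the two retained rows:
* (22 - 17)^2 + (2930 - 3170)^2 = 57625 for these two cars.
display `M'[2, 1]
```

In the loop, quietly replace corr = `M'[2, 1] in `row' copies that single element into the new variable for the row currently being compared.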
