Different values of Spearman rank coefficient with same data

Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#1

29 Jan 2017, 10:43

Hi Stata users, I need your help with something that is driving me crazy!

I'm calculating the Spearman rank coefficient of a variable (occupation ranked by job satisfaction) in different years. To do this, I am creating several datasets (one for each wave that I have) in the following way:

cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
use "Ess_merged.dta", clear
preserve
drop if essround != 3
bys isco88h9_ct: egen perc_sat = mean((job_sat>7) & !mi(job_sat)) // isco88h9_ct = occupations
collapse perc_sat, by( isco88h9_ct)
egen r_d6 = rank (perc_sat), unique
sort r_d6
save d6, replace
restore

After this, I am calculating the Spearman coefficient for each pair of datasets:

cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
use "d6.dta", clear
merge 1:1 isco88h9_ct using d10
spearman r_d6 r_d10

Everything seems to work, but what is really strange to me is that if I run the same commands more than once, the Spearman coefficient is not exactly the same but varies a bit (for instance, 0.1113 the first time and 0.1182 the second). Do you have any clue why? I use exactly the same procedure to create the yearly datasets, so why does Stata give me different values? The problem must lie in the creation of the datasets, because if I do not re-run the first block of code above, the Spearman coefficient is always the same.

Could you help me please? Thanks a lot, G.P.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

29 Jan 2017, 10:48

I think your problem comes from

Code:

egen r_d6 = rank (perc_sat), unique

This command, according to -help egen-, breaks ties arbitrarily. In particular, it does so irreproducibly. So when you re-run the code, the tied values of perc_sat can receive their ranks in r_d6 in a different order.
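
To make the tie-breaking behaviour concrete, here is a minimal sketch on toy data. The values and the variable names x, r_avg and r_unique are invented for illustration and do not come from the thread. It contrasts the default behaviour of egen's rank() function, which gives tied values their mean rank, with the unique option, which forces distinct ranks.

```stata
* Toy data with one tie (illustrative values, not from the thread)
clear
input float x
1
2
2
3
end

egen r_avg = rank(x)              // default: tied values share their mean rank
egen r_unique = rank(x), unique   // ties broken arbitrarily, all ranks distinct
list x r_avg r_unique
```

The two observations with x == 2 both get 2.5 in r_avg. In r_unique they get 2 and 3, and which observation gets which is arbitrary.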
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#3

29 Jan 2017, 11:06

Dear Clyde, thank you very much for your answer.
So, if I understood correctly, the problem is that when two categories of my variable have the same value of perc_sat, egen puts them in a different order in the ranking from one run to the next, right? The fact that the range of values is quite limited in the end (when I run the two blocks of commands several times, the same 5-6 Spearman values keep coming up) means that I should not have a lot of ties anyway, correct?

Is there any way to fix this problem, so as to obtain a single Spearman value once and for all?

Thanks a lot, G.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

29 Jan 2017, 11:20

"So, if I understood correctly, the problem is that when two categories of my variable have the same value of perc_sat, egen puts them in a different order in the ranking from one run to the next, right?"

Correct.

"The fact that the range of values is quite limited in the end (when I run the two blocks of commands several times, the same 5-6 Spearman values keep coming up) means that I should not have a lot of ties anyway, correct?"

That depends on what you mean by "not a lot of ties." The limited variation you are observing suggests that there are few if any large blocks of ties: that is, most of your tied observations are probably only tied with one or two other observations. But you could have a data set with thousands of observations, each of which is tied with one, but only one, of the others. That would be a data set with many ties, but the correlation calculated your way would show only a little bit of variation.
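
A rough simulation of that last scenario, as a sketch only: the data are synthetic, and the names x, y and rx, the seed, and the noise level are all assumptions made for illustration. Every value of x occurs exactly twice, so there are many ties but only in pairs.

```stata
clear
set seed 12345
set obs 200
generate x = ceil(_n/2)           // 100 distinct values, each occurring exactly twice
generate y = x + rnormal(0, 20)   // noisy companion variable

forvalues i = 1/5 {
    capture drop rx
    egen rx = rank(x), unique     // order within each tied pair is arbitrary
    quietly spearman rx y
    display "run `i': rho = " %9.6f r(rho)
}

quietly spearman x y              // ties left tied: same answer on every run
display "ties kept: rho = " %9.6f r(rho)
```

The five values printed by the loop may differ slightly from one another, whereas the final line does not depend on any tie-breaking.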

"Is there any way to fix this problem, so as to obtain a single Spearman value once and for all?"

I would not use the rank in the first place. After all, the -spearman- command calculates the ranks internally, and it handles ties by leaving them tied. So I would just do -spearman perc_sat d_10-. (Of course, I don't know how d_10 was created and whether it has the same problem as d_6. If it, too, was calculated with -egen rank(), unique-, then you will still have irreproducibility difficulties. So if that is the case, instead of using d_10 in the -spearman- command I would use whatever variable d_10 is the rank of.)
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#5

29 Jan 2017, 11:32

Dear Clyde,
yes, the variable d_10 is created in exactly the same way as d_06 (only, of course, using the observations from the 2010 wave instead of 2006).

If, instead of d_10, I use in the -spearman- command the variable that d_10 is the rank of, namely isco88h9_ct, wouldn't I obtain meaningless Spearman values, because it would compare a variable ranked by job satisfaction (d_06) with a variable not ranked at all (isco88h9_ct)?

Thanks a lot, G.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#6

29 Jan 2017, 11:51

"If, instead of d_10, I use in the -spearman- command the variable that d_10 is the rank of, namely isco88h9_ct, wouldn't I obtain meaningless Spearman values, because it would compare a variable ranked by job satisfaction (d_06) with a variable not ranked at all (isco88h9_ct)?"

I don't follow you. You don't show the code calculating d_10, so I can't really comment on it, but if it is analogous to what you did for d_6, it is the rank of perc_sat, not the rank of isco88h9_ct, that is being calculated.

But look, whether you do

Code:

egen r1 = rank(v1)
egen r2 = rank(v2)
spearman r1 r2

OR

Code:

spearman v1 v2

you will get the same result.* The two will be equally meaningful or equally meaningless, depending on what the variables mean in your context.

*In fact, if you read down the code of spearman.ado, you will see that what -spearman v1 v2- does is calculate the ranks of v1 and v2, and then calculate their Pearson correlation coefficient. If you use r1 and r2, those variables are already equal to their own ranks, so the pre-calculation of r1 and r2 changes nothing. The reason you ran into problems is you chose to calculate the ranks in an alternative way that breaks ties arbitrarily.
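
The equivalence described in this footnote can be checked with a short sketch. It uses the auto dataset shipped with Stata rather than the poster's data, and the names r_price, r_mpg, rho_spearman and rho_ranks are chosen here purely for illustration.

```stata
sysuse auto, clear

* Spearman correlation computed directly on the raw variables
quietly spearman price mpg
scalar rho_spearman = r(rho)

* Pearson correlation of the default (mean) ranks, with ties left tied
egen r_price = rank(price)
egen r_mpg = rank(mpg)
quietly correlate r_price r_mpg
scalar rho_ranks = r(rho)

display "spearman:           " %9.6f rho_spearman
display "corr of mean ranks: " %9.6f rho_ranks
```

The two displayed values should agree, because rank() without the unique option leaves ties tied in the same way that -spearman- does.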
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#7

29 Jan 2017, 12:06

Dear Clyde,
I tried to follow what you said, and I modified the code as follows:

cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
use "Ess_merged.dta", clear
preserve
drop if country_n != 1
drop if essround != 3
bys isco88h9_ct : egen perc_sat_d6 =mean((job_sat>7)& !mi(job_sat))
collapse perc_sat_d6, by( isco88h9_ct)
sort perc_sat_d6
save d6, replace
restore

then:

cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
use "d6.dta", clear
merge 1:1 isco88h9_ct using d10
spearman perc_sat_d6 perc_sat_d10

Is this what you suggested? I ran the Spearman command several times and the result stays the same, so thank you very much for your suggestion.

I wish you all the best, G.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#8

29 Jan 2017, 13:12

Yes, that is what I meant. Glad it's working for you. All the best to you, too.
Nick Cox

Join Date: Mar 2014

Posts: 35755
#9

29 Jan 2017, 17:09

The unique option of egen's rank function was added for graphical purposes.

See http://www.stata.com/manuals14/degen.pdf and (more emphatically) the original discussion from 1999 http://www.stata.com/products/stb/journals/stb51.pdf p.6

I can't imagine that it's a good idea for any statistical purpose, as the result is arbitrary and not reproducible.