  • Different values of Spearman rank coefficient with same data

    Hi Stata users, I need your help with something that is driving me crazy!

    I'm calculating the Spearman rank coefficient of a variable (occupation ranked by job satisfaction) in different years. To do this, I am creating several datasets (one for each wave that I have) in the following way:

    cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
    use "Ess_merged.dta", clear
    preserve
    drop if essround != 3
    bys isco88h9_ct : egen perc_sat =mean((job_sat>7)& !mi(job_sat)) // isco88h9_ct = occupations
    collapse perc_sat, by( isco88h9_ct)
    egen r_d6 = rank (perc_sat), unique
    sort r_d6
    save d6, replace
    restore

    After this, I calculate the Spearman for each pair of datasets:

    cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
    use "d6.dta", clear
    merge 1:1 isco88h9_ct using d10
    spearman r_d6 r_d10

    Everything seems to work, but what is really strange to me is that if I run the same commands more than once, the Spearman coefficient is not exactly the same but varies a bit (for instance, 0.1113 the first time and 0.1182 the second). Do you have any clue why? I use exactly the same procedure to create the yearly datasets, so why does Stata give me different values? The problem lies in the creation of the datasets, because if I do not rerun the first block of code above, the Spearman is always the same.

    Could you help me please? Thanks a lot, G.P.

  • #2
    I think your problem comes from
    Code:
    egen r_d6 = rank (perc_sat), unique
    This command, according to -help egen-, breaks ties arbitrarily. In particular, it does so irreproducibly. So each time you re-run the code, r_d6 comes out with a different ordering of the ranks among the tied values of perc_sat.
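
    A minimal sketch of the difference, using a made-up six-observation dataset (the variable names x, r_default, and r_unique are illustrative and not from the original post). By default, tied values share their average rank; with the unique option each observation gets its own integer, so which member of a tied pair gets the lower rank is arbitrary:

    ```stata
    clear
    set obs 6
    generate x = ceil(_n/2)            // toy data: 1 1 2 2 3 3, i.e. three tied pairs

    egen r_default = rank(x)           // ties share the average rank: 1.5 1.5 3.5 3.5 5.5 5.5
    egen r_unique  = rank(x), unique   // distinct integers 1..6; order within each tie is arbitrary

    list x r_default r_unique
    ```

    In this toy example only r_unique can differ within a tied pair from one run to the next; r_default is fully determined by x.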


    • #3
      Dear Clyde, thank you very much for your answer.
      So, if I understood correctly, the problem is that when two categories of my variable have the same value of perc_sat, egen puts them in a different order in the ranking from one run to the next, right? The fact that the range of values is quite limited (when I run the two blocks of commands several times, the Spearman always takes one of the same 5-6 values) means that I should not have a lot of ties anyway, correct?

      Is there any way to fix this problem, so as to obtain a single Spearman value once and for all?

      Thanks a lot, G.


      • #4
        So, if I understood correctly, the problem is that when two categories of my variable have the same value of perc_sat, egen puts them in a different order in the ranking from one run to the next, right?
        Correct.

        The fact that the range of values is quite limited (when I run the two blocks of commands several times, the Spearman always takes one of the same 5-6 values) means that I should not have a lot of ties anyway, correct?
        That depends on what you mean by "not a lot of ties." The limited variation you are observing suggests that there are few if any large blocks of ties: that is, most of your tied observations are probably only tied with one or two other observations. But you could have a data set with thousands of observations, each of which is tied with one, but only one, of the others. That would be a data set with many ties, but the correlation calculated your way would show only a little bit of variation.


        Is there any way to fix this problem, so as to obtain a single Spearman value once and for all?
        I would not use the rank in the first place. After all, the -spearman- command calculates the ranks internally, and it handles ties by leaving them tied. So I would just do -spearman perc_sat d_10-. (Of course, I don't know how d_10 was created and whether it has the same problem as d_6. If it, too, was calculated with -egen rank(), unique-, then you will still have irreproducibility difficulties. So if that is the case, instead of using d_10 in the -spearman- command I would use whatever variable d_10 is the rank of.)


        • #5
          Dear Clyde,
          yes, the variable d_10 is created exactly as d_06 was (only, of course, using the observations from the 2010 wave instead of 2006).

          If instead of d_10 I use in the -spearman- command the variable that d_10 is the rank of, namely isco88h9_ct, wouldn't I obtain meaningless values of Spearman, because it would compare a variable ranked by job satisfaction (d_06) with a variable not ranked at all (isco88h9_ct)?

          Thanks a lot, G.


          • #6
            If instead of d_10 I use in the -spearman- command the variable that d_10 is the rank of, namely isco88h9_ct, wouldn't I obtain meaningless values of Spearman, because it would compare a variable ranked by job satisfaction (d_06) with a variable not ranked at all (isco88h9_ct)?
            I don't follow you. You don't show the code calculating d_10, so I can't really comment on it, but if it is analogous to what you did for d_6, it is the rank of perc_sat, not the rank of isco88h9_ct, that is being calculated.

            But look, whether you do

            Code:
            egen r1 = rank(v1)
            egen r2 = rank(v2)
            spearman r1 r2
            OR

            Code:
            spearman v1 v2
            you will get the same result.* The two will be equally meaningful or equally meaningless, depending on what the variables mean in your context.

            *In fact, if you read down the code of spearman.ado, you will see that what -spearman v1 v2- does is calculate the ranks of v1 and v2, and then calculate their Pearson correlation coefficient. If you use r1 and r2, those variables are already equal to their own ranks, so the pre-calculation of r1 and r2 changes nothing. The reason you ran into problems is you chose to calculate the ranks in an alternative way that breaks ties arbitrarily.
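
            The equivalence can be checked with a short sketch on Stata's shipped auto dataset (the choice of mpg and weight, and the names r_mpg and r_weight, are only for illustration). All three commands should report the same coefficient, because the default egen rank() leaves ties tied:

            ```stata
            sysuse auto, clear

            egen r_mpg    = rank(mpg)       // default: tied values share the average rank
            egen r_weight = rank(weight)

            spearman mpg weight             // ranks computed internally
            spearman r_mpg r_weight         // ranks supplied by hand
            correlate r_mpg r_weight        // Pearson correlation of the ranks
            ```

            Adding the unique option to the two egen lines is what would break this agreement whenever the data contain ties.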


            • #7
              Dear Clyde,
              I tried to follow what you said, and I modified the code as follows:

              cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
              use "Ess_merged.dta", clear
              preserve
              drop if country_n != 1
              drop if essround != 3
              bys isco88h9_ct : egen perc_sat_d6 =mean((job_sat>7)& !mi(job_sat))
              collapse perc_sat_d6, by( isco88h9_ct)
              sort perc_sat_d6
              save d6, replace
              restore

              then:

              cd "C:\Users\Randone\Desktop\PHD\Upgrading\analisi\ES S\analisi"
              use "d6.dta", clear
              merge 1:1 isco88h9_ct using d10
              spearman perc_sat_d6 perc_sat_d10

              Is this what you suggested? I ran the Spearman several times and it stays the same, so thank you very much for your suggestion.

              I wish you all the best, G.


              • #8
                Yes, that is what I meant. Glad it's working for you. All the best to you, too.


                • #9
                  The unique option of egen's rank function was added for graphical purposes.

                  See http://www.stata.com/manuals14/degen.pdf and (more emphatically) the original discussion from 1999 http://www.stata.com/products/stb/journals/stb51.pdf p.6

                  I can't imagine that it's a good idea for any statistical purpose, as the result is arbitrary and not reproducible.
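
                  For the graphical use case, if distinct ranks are wanted and they must come out the same on every run, one possible sketch is to break ties with an explicit key instead of relying on the unique option. The toy data and the names code and r_unique below are hypothetical; the approach assumes the key uniquely identifies observations:

                  ```stata
                  clear
                  input code perc_sat
                  1 .40
                  2 .25
                  3 .40
                  4 .10
                  end

                  isid code                  // stops with an error unless code identifies each observation
                  sort perc_sat code         // ties on perc_sat are resolved by code, so the order is fully determined
                  generate r_unique = _n     // rank = position in the sorted data (missing perc_sat would sort last)
                  list
                  ```

                  This only makes the tie-breaking reproducible; it is still arbitrary in the statistical sense, so the caution above about using such ranks in a correlation applies unchanged.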
