Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sort by correlation coefficient

    I am trying to find out which variable produces the highest correlation coefficient with var1 within subjects.

    I have 10 variables (var1-var10) for 10 days (day) for 10 individuals (id) in long format. However, I recreated an example with 3 variables, for 5 days for 3 individuals in dataex:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(id day) float(var1 var2 var3)
    1 1   .7806329   .2003541   .9929246
    1 2   .8852594   .7479552   .3049926
    1 3    .573707  .22255337   .8362682
    1 4   .2431343   .9553812   .1578373
    1 5 .008077622   .8259419   .8112878
    2 1   .4933739   .1188735  .36237845
    2 2   .7138895  .55584866    .884728
    2 3   .4031537   .3037445 .008933238
    2 4   .1250264  .53452736   .7117536
    2 5   .7928984   .3798434   .6441231
    3 1  .25131217 .063004404   .9543635
    3 2   .6858981   .6998435   .8498245
    3 3   .9965313    .483217   .4657771
    3 4   .9135284   .8579231  .13088849
    3 5  .29056227  .25628337   .5243536
    end
    I want to find out, for each id, which of variables var2-var10 has the highest Pearson's correlation with var1.

    The best I can do is:

    Code:
    local varlist "var2 var3"
    
    foreach i in 1 2 3 {
        display _newline
        foreach a in `varlist' {
            qui pwcorr var1 `a' if id==`i'
            local `i'_`a'_r = r(rho)
            if r(rho) !=. & r(rho)!=0 { //in the actual dataset, var1 may not be available or may not vary. This is how I exclude them from the list. 
                display "`i'_`a'_r:" ``i'_`a'_r'
            }
        }    
    }
    which outputs:

    Code:
    1_var2_r:-.54861605
    1_var3_r:.04435157
    
    
    2_var2_r:-.05894353
    2_var3_r:.25521847
    
    
    3_var2_r:.79714705
    3_var3_r:-.61136201
    The next step is to sort these by absolute correlation coefficient within id so that I know which of var2-var3 correlates highest with var1 within each id. For this exercise, I do not care for the significance.

    Just to remind you that the example data is much smaller (3 x 5 x 3) than the actual data (10 x 10 x 10), so in the latter case it makes more sense to sort the correlation coefficients.

    Thanks so much.


  • #2
    Thanks for the data example. You are really limited with pwcorr. The following uses Ben Jann's estout command from SSC.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(id day) float(var1 var2 var3)
    1 1   .7806329   .2003541   .9929246
    1 2   .8852594   .7479552   .3049926
    1 3    .573707  .22255337   .8362682
    1 4   .2431343   .9553812   .1578373
    1 5 .008077622   .8259419   .8112878
    2 1   .4933739   .1188735  .36237845
    2 2   .7138895  .55584866    .884728
    2 3   .4031537   .3037445 .008933238
    2 4   .1250264  .53452736   .7117536
    2 5   .7928984   .3798434   .6441231
    3 1  .25131217 .063004404   .9543635
    3 2   .6858981   .6998435   .8498245
    3 3   .9965313    .483217   .4657771
    3 4   .9135284   .8579231  .13088849
    3 5  .29056227  .25628337   .5243536
    end
    
    *GENERATE CONSECUTIVE ID VAR
    egen id2= group(id)
    levelsof id2, local(ids)
    foreach id in `ids'{
    estpost corr var* if id==`id', matrix
    mat corr`id'= e(b)
    }
    mat corr= corr1\corr2\corr3
    svmat corr
    gen idcorr=cond(!missing(corr2), _n, .)
    So you want to be careful to pick the relevant variables. Below, corr2 and corr3. You link the correlation id and the real id via id and id2.


    Code:
    l id var* idcorr corr*
    
         +---------------------------------------------------------------------------------------------------------+
         | id       var1       var2       var3   idcorr   corr1       corr2      corr3   corr4       corr5   corr6 |
         |---------------------------------------------------------------------------------------------------------|
      1. |  1   .7806329   .2003541   .9929246        1       1   -.5486162   .0443516       1   -.7691134       1 |
      2. |  1   .8852594   .7479552   .3049926        2       1   -.0589435   .2552184       1    .7058921       1 |
      3. |  1    .573707   .2225534   .8362682        3       1     .797147   -.611362       1   -.6015242       1 |
      4. |  1   .2431343   .9553812   .1578373        .       .           .          .       .           .       . |
      5. |  1   .0080776   .8259419   .8112878        .       .           .          .       .           .       . |
         |---------------------------------------------------------------------------------------------------------|
      6. |  2   .4933739   .1188735   .3623784        .       .           .          .       .           .       . |
      7. |  2   .7138895   .5558487    .884728        .       .           .          .       .           .       . |
      8. |  2   .4031537   .3037445   .0089332        .       .           .          .       .           .       . |
      9. |  2   .1250264   .5345274   .7117536        .       .           .          .       .           .       . |
     10. |  2   .7928984   .3798434   .6441231        .       .           .          .       .           .       . |
         |---------------------------------------------------------------------------------------------------------|
     11. |  3   .2513122   .0630044   .9543635        .       .           .          .       .           .       . |
     12. |  3   .6858981   .6998435   .8498245        .       .           .          .       .           .       . |
     13. |  3   .9965313    .483217   .4657771        .       .           .          .       .           .       . |
     14. |  3   .9135284   .8579231   .1308885        .       .           .          .       .           .       . |
     15. |  3   .2905623   .2562834   .5243536        .       .           .          .       .           .       . |
         +---------------------------------------------------------------------------------------------------------+
    Last edited by Andrew Musau; 24 Sep 2019, 07:48.

    Comment


    • #3
      Thank you so much for your answer.

      It works on the data example. However, when I try it on the actually data, I get an error: estimates post: matrix has missing values.

      In my code, I had a similar problem, and I dealt with it by using:

      Code:
       if r(rho) !=. & r(rho)!=0 { //in the actual dataset, var1 may not be available or may not vary. This is how I exclude them from the list.              display "`i'_`a'_r:" ``i'_`a'_r'         }
      However, this obviously doesn't work with matrices. I was able to reproduce the error by removing between-day variation in var1 for id 1 in the example data:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte(id day) float(var1 var2 var3)
      1 1   .5   .2003541   .9929246
      1 2   .5   .7479552   .3049926
      1 3    .5  .22255337   .8362682
      1 4   .5   .9553812   .1578373
      1 5 .5   .8259419   .8112878
      2 1   .4933739   .1188735  .36237845
      2 2   .7138895  .55584866    .884728
      2 3   .4031537   .3037445 .008933238
      2 4   .1250264  .53452736   .7117536
      2 5   .7928984   .3798434   .6441231
      3 1  .25131217 .063004404   .9543635
      3 2   .6858981   .6998435   .8498245
      3 3   .9965313    .483217   .4657771
      3 4   .9135284   .8579231  .13088849
      3 5  .29056227  .25628337   .5243536
      end
      
      *GENERATE CONSECUTIVE ID VAR
      egen id2= group(id)
      levelsof id2, local(ids)
      foreach id in `ids'{
      estpost corr var* if id==`id', matrix
      mat corr`id'= e(b)
      }
      mat corr= corr1\corr2\corr3
      svmat corr
      gen idcorr=cond(!missing(corr2), _n, .)
      How can your solution ignore or handle missing values?

      Thanks!

      Comment


      • #4
        estpost corr has a -listwise- option that implements listwise deletion of missing values, so you can do something as follows:

        Code:
        preserve
        bys id (var1):gen tag= var1[1]==var1[_N]
        replace var1=. if tag
        <IN CODE: estpost corr var* if id==`id', matrix listwise>
        restore
        You can replace other constant variables to missing similar to the method above. By the way, there is a way to use -by- in the code if you can figure out how to identify the observations.

        Code:
        bysort id: eststo: estpost correlate var*, listwise
        esttab est*
        mat corr= r(coefs)
        svmat corr

        Comment


        • #5
          Thanks Andrew.

          So, if I understood you correctly, that would give us the following code:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input byte(id day) float(var1 var2 var3)
          1 1   .5   .2003541   .9929246
          1 2   .5   .7479552   .3049926
          1 3    .5  .22255337   .8362682
          1 4   .5   .9553812   .1578373
          1 5 .5   .8259419   .8112878
          2 1   .4933739   .1188735  .36237845
          2 2   .7138895  .55584866    .884728
          2 3   .4031537   .3037445 .008933238
          2 4   .1250264  .53452736   .7117536
          2 5   .7928984   .3798434   .6441231
          3 1  .25131217 .063004404   .9543635
          3 2   .6858981   .6998435   .8498245
          3 3   .9965313    .483217   .4657771
          3 4   .9135284   .8579231  .13088849
          3 5  .29056227  .25628337   .5243536
          end
          
          *GENERATE CONSECUTIVE ID VAR
          bys id (var1):gen tag= var1[1]==var1[_N]
          replace var1=. if tag
          egen id2= group(id)
          levelsof id2, local(ids)
          foreach id in `ids'{
              estpost corr var* if id==`id', matrix listwise
              mat corr`id'= e(b)
          }
          mat corr= corr1\corr2\corr3
          svmat corr
          gen idcorr=cond(!missing(corr2), _n, .)
          However, this produces the 'correlation' error no observations for id 1.

          I did something nasty, which works to avoid the error above:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input byte(id day) float(var1 var2 var3)
          1 1   .5   .2003541   .9929246
          1 2   .5   .7479552   .3049926
          1 3    .5  .22255337   .8362682
          1 4   .5   .9553812   .1578373
          1 5 .5   .8259419   .8112878
          2 1   .4933739   .1188735  .36237845
          2 2   .7138895  .55584866    .884728
          2 3   .4031537   .3037445 .008933238
          2 4   .1250264  .53452736   .7117536
          2 5   .7928984   .3798434   .6441231
          3 1  .25131217 .063004404   .9543635
          3 2   .6858981   .6998435   .8498245
          3 3   .9965313    .483217   .4657771
          3 4   .9135284   .8579231  .13088849
          3 5  .29056227  .25628337   .5243536
          end
          
          *GENERATE CONSECUTIVE ID VAR
          bys id (var1):gen tag= var1[1]==var1[_N]
          egen id2= group(id)
          levelsof id2, local(ids)
          foreach id in `ids'{
              if id==`id' & tag!=1 {
                  estpost corr var* if id==`id', matrix listwise
                  mat corr`id'= e(b)
              }
          }
          mat corr= corr1\corr2\corr3
          svmat corr
          gen idcorr=cond(!missing(corr2), _n, .)
          However, this produces a new error corr1 not found, obviously from the mat corr=corr1\corr\corr3 line.

          I can easily avoid this error by removing corr1 manually. Not very nice, and not practical or scalable either, but it is the quickest solution for now.

          Thanks so much for your help. This was a great exercise for me.

          Comment


          • #6
            Two things that you can do

            1. Create an id variable only for relevant groups
            2. Automate the definition of the matrix

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte(id day) float(var1 var2 var3)
            1 1   .5   .2003541   .9929246
            1 2   .5   .7479552   .3049926
            1 3    .5  .22255337   .8362682
            1 4   .5   .9553812   .1578373
            1 5 .5   .8259419   .8112878
            2 1   .4933739   .1188735  .36237845
            2 2   .7138895  .55584866    .884728
            2 3   .4031537   .3037445 .008933238
            2 4   .1250264  .53452736   .7117536
            2 5   .7928984   .3798434   .6441231
            3 1  .25131217 .063004404   .9543635
            3 2   .6858981   .6998435   .8498245
            3 3   .9965313    .483217   .4657771
            3 4   .9135284   .8579231  .13088849
            3 5  .29056227  .25628337   .5243536
            end
            
            *GENERATE CONSECUTIVE ID VAR
            bys id (var1):gen tag= var1[1]==var1[_N]
            egen id2= group(id) if !tag
            levelsof id2, local(ids)
            local n: word count `ids'
            foreach id in `ids'{
               estpost corr var* if id2==`id', matrix listwise
               mat corr`id'= e(b)
            }
            local corr= "corr1"
            forval i=2/`n'{
               local corr `corr'\corr`i'
            }
            
            mat corr= `corr'
            svmat corr
            gen idcorr=cond(!missing(corr2), _n, .)

            Comment

            Working...
            X