  • What test should I use to see whether two variables are significantly different from each other?

    Hi,

    Apologies for the rather basic question.

    I have one sample of participants. I have four different measures to calculate the percentage risk of developing a disease. All four measures are designed to calculate exactly the same thing; however, they have produced different results for my sample. I want to run a test to see whether the results of each measure are the same or significantly different from each other. The four measures are continuous variables, each expressed as a percentage.

    I hope that makes sense.

  • #2
    This is a very interesting question. Normally, as you know, we have participants nested within two groups. We have the t-test or the t-test for proportions to see if the means (or proportions) of the groups differ. If we have more than two groups, we can use ANOVA to simultaneously test if any one group's mean differs from the rest.

    Here, I assume you actually have tests nested within participants. Despite that, my first inclination would still be (repeated measures) ANOVA. If this is the right approach, I think you would need to reshape your data such that each person has 4 observations, then run an ANOVA. Using some fake variable names:

    Code:
    preserve
    rename test_a prevalence1
    ...
    rename test_d prevalence4
    
    keep id prevalence?
    reshape long prevalence, i(id) j(testnum)
    anova prevalence id testnum, repeated(testnum)
    restore
    -preserve- and -restore- save the original data and bring it back afterwards, so don't worry that you're throwing away a bunch of variables. You need to rename each test to a common stub name ending in a number for reshape to work properly. The new variable testnum denotes the number of the test (i.e. whether it is the first, second, third, or fourth test); reshape will strip the number from the end of each prevalence variable and assign it to -testnum-. The last command runs the ANOVA. You effectively have 4 repeated measures on each person. That's the approach that I think I would run, but I'm by no means certain it's correct.

    If you have an error message, post it in the forum. I don't typically use ANOVA, so I may have botched the syntax! In fact, I have an idea of how I botched it, so note the fact that I removed the underscore (_) from the variables starting with prevalence.
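
    As a self-contained illustration of the reshape and ANOVA steps described above, here is a minimal sketch on a tiny made-up dataset. The names id, score1 to score4 and testnum, and all of the values, are invented for illustration. The anova line assumes that the within-person factor has to appear in the model as well as in repeated().

    ```stata
    * Made-up data: 4 people, each with 4 risk scores (wide layout)
    clear
    input byte id float(score1 score2 score3 score4)
    1  3.0  1.6  1.5  0.9
    2 14.1  4.4  1.9  1.1
    3  9.1  5.2  3.9  2.4
    4 13.6  4.8  6.4  2.7
    end

    * reshape needs id to identify each row uniquely
    isid id

    * One row per person per test; testnum takes the suffix 1..4
    reshape long score, i(id) j(testnum)
    list, sepby(id)

    * Repeated-measures ANOVA: person as a factor, test as the repeated factor
    anova score id testnum, repeated(testnum)
    ```
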
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.


    • #3
      Thanks! This is the error I get:

      Code:
      reshape long prevalence, i(Participant) j(testnum)
      (note: j = 1 2 3 4)
      variable id does not uniquely identify the observations
          Your data are currently wide.  You are performing a reshape long.  You specified i(Participant) and j(testnum).  In the current wide form, variable Participant should
          uniquely identify the observations.  Remember this picture:
      
               long                                wide
              +---------------+                   +------------------+
              | i   j   a   b |                   | i   a1 a2  b1 b2 |
              |---------------| <--- reshape ---> |------------------|
              | 1   1   1   2 |                   | 1   1   3   2  4 |
              | 1   2   3   4 |                   | 2   5   7   6  8 |
              | 2   1   5   6 |                   +------------------+
              | 2   2   7   8 |
              +---------------+
          Type reshape error for a list of the problem observations.
      r(9);
      
      . 
      . anova prevalence Participant, repeated(testnum)
      variable testnum not found
      (error in option repeated())


      • #4
        I was operating under the assumption that you have a unique ID variable for each person. Do you actually have multiple observations for each person already?

        This is why, under the FAQ, we ask for examples of your data using -dataex- (some details in my signature). Can you post some example data?
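
        A minimal sketch of how one might check whether an ID variable uniquely identifies the observations before running reshape. The tiny dataset is made up for illustration (two valid rows plus two empty rows); only the variable name Participant is borrowed from the error message above.

        ```stata
        * Made-up data: two valid rows plus two empty rows with a missing ID
        clear
        input byte Participant float(prev1 prev2)
        1  2.9 1.6
        2 14.1 4.4
        .    .   .
        .    .   .
        end

        * isid exits with an error if Participant is missing or repeated;
        * capture noisily shows the message without stopping a do-file
        capture noisily isid Participant

        * How many IDs are repeated, and how many are missing?
        duplicates report Participant
        count if missing(Participant)
        ```

        If empty rows turn out to be the cause, one option is to drop them (for example, drop if missing(Participant)) before reshaping.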

        • #5
          I'd add that percentage risks are often better considered on a logit scale, partly because of how those quantities behave and partly because of their substantive interpretation, just as I don't care whether my chance of getting wet in the rain changes from 50% to 51% but I do care much more whether my chance of being hit by lightning changes from 1% to 2%. (Or even from 0.000001% to twice that.)
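
          To make the point about scale concrete, here is a small sketch using Stata's logit() function on the two pairs of probabilities mentioned above. The same one-percentage-point change is roughly 0.04 on the logit scale near 50% but roughly 0.70 near 1%.

          ```stata
          * Same one-percentage-point change, very different change on the logit scale
          display logit(0.51) - logit(0.50)
          display logit(0.02) - logit(0.01)
          ```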

          Concordance correlation measures agreement, and there is a community-contributed Stata command for it (concord).


          • #6
            Hi,

            Thanks for your help. I assume I am doing something wrong:

            Code:
            . ssc install dataex
            checking dataex consistency and verifying not already installed...
            all files already exist and are up to date.
            
            . -dataex-
            - is not a valid command name
            r(199);
            I also attempted this:

            Code:
             concord logitprev1 logitprev2 logitprev3 logitprev4
            too many variables specified


            • #7
              dataex is the command. Many people on this list surround a command name with hyphens just to set it off, but you should not type the hyphens.


              • #8
                I did try that too:

                Code:
                . dataex
                input statement exceeds linesize limit. Try specifying fewer variables
                r(1000);


                • #9
                  I'll create a new dataset with just the variables required for this question
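
                  An alternative to building a separate dataset is to give dataex a variable list, as its error message suggests. The sketch below uses the auto dataset shipped with Stata purely as a stand-in so that it runs anywhere; the commented line shows what the analogous call might look like for the data in this thread, assuming the variables are named Participant and prev1 to prev4.

                  ```stata
                  * Stand-in dataset shipped with Stata, used here only so the sketch runs
                  sysuse auto, clear

                  * Show only three variables and the first five observations
                  dataex make price mpg in 1/5

                  * For the data in this thread the analogous call might be:
                  *     dataex Participant prev?
                  ```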


                  • #10
                    Code:
                     . dataex
                    
                    ----------------------- copy starting from the next line -----------------------
                    
                    
                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input byte Participant float(prev1 prev2 prev4 prev3 logitprev1 logitprev2 logitprev3 logitprev4)
                     1  2.967858 1.5984064  .8863378 1.4922162          .          .         .  2.0538669
                     2 14.070985 4.4026513 1.1049236  1.942504          .          .         .          .
                     3   9.14358         .         .  3.949243          .          .         .          .
                     4  13.60417  4.816825 2.7212884  6.405684          .          .         .          .
                     5         0  .8983527  .1496917 .18035504          .  2.1790543 -1.513944  -1.737021
                     6  4.482531 1.6181076  .5019374 1.6671507          .          .         . .007749596
                     7   .976722 1.6757084   .754011 1.2927463  3.7366934          .         .  1.1201203
                     8  14.45461  7.827424 4.1563263  4.874775          .          .         .          .
                     9  38.45862 15.632953   6.19238  6.048964          .          .         .          .
                    10  1.049809  .9204037 .38537505   .802256          .   2.447844 1.4004548  -.4667952
                    11  2.045249 1.7633067  .7400888 1.5751708          .          .         .    1.04643
                    12 16.032372  3.197002  4.738522 4.6955466          .          .         .          .
                    14 20.577965 11.138762  8.531375  7.435046          .          .         .          .
                    15  1.178254 .20282856  .2420856  .6580936          . -1.3687087    .65481  -1.141279
                    16  1.436456  1.616737 .22165443  .3152621          .          . -.7756316  -1.256051
                    17 22.599764  12.42271  3.139279  4.037205          .          .         .          .
                    18  5.128489 4.4069552 1.1763887 2.2554524          .          .         .          .
                    19  20.08589 11.428288  4.188088  4.721145          .          .         .          .
                    21   6.91531  4.880307  1.971524  3.730975          .          .         .          .
                    22  6.340435 2.9332335 2.0938213  3.327607          .          .         .          .
                    23  7.027753  .5582108 1.1606071 2.1429203          .   .2339038         .          .
                    24         0 .11128157 .08769826  .1733915          . -2.0777168 -1.561779 -2.3420687
                    25   .799016 1.2400947 .31622165  .6414833  1.3801557          .  .5818079  -.7711904
                    26  7.922649  4.228576 2.1718347  4.859337          .          .         .          .
                    27   .356924 .24251093  .3661532   .354345 -.58874005 -1.1389623 -.5999942  -.5487555
                    29  13.83318  6.872275   2.47116  5.645855          .          .         .          .
                    30 13.870323  4.157504  3.129016  3.417343          .          .         .          .
                    31  10.44165  4.020252  2.829525 3.7610774          .          .         .          .
                     .         .         .         .         .          .          .         .          .
                     .         .         .         .         .          .          .         .          .
                     .         .         .         .         .          .          .         .          .
                     .         .         .         .         .          .          .         .          .
                     .         .         .         .         .          .          .         .          .
                     .         .         .         .         .          .          .         .          .
                    end
                    ------------------ copy up to and including the previous line ------------------

                    Listed 34 out of 34 observations


                    • #11
                      concord (Stata Journal) will only compare two variables at a time. I've seen work on developing a single overarching measure, which I found unconvincing. But you could loop over variables to get a matrix. Such results are only descriptive but will flag which measures are closest (least close). Elsewhere I've suggested looking at the eigenvectors and eigenvalues of that matrix.

                      Token code:

                      Code:
                      clear 
                      set obs 100 
                      set seed 2803 
                      
                      forval j = 1/5 { 
                           gen y`j' = rnormal()
                      } 
                      
                      matrix concord = J(5, 5, 1) 
                      quietly forval i = 1/4 { 
                          local J = `i' + 1 
                          forval j = `J'/5 { 
                              * concordance between variable i and each later variable j
                              concord y`i' y`j' 
                              matrix concord[`i', `j'] = r(rho_c) 
                              matrix concord[`j', `i'] = r(rho_c)
                          }
                      } 
                      
                      matrix li concord
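
                      The eigenvalue idea mentioned above could be sketched as follows. The 3 x 3 matrix C and its values are invented for illustration; in practice one would use the concordance matrix built by the loop.

                      ```stata
                      * Hypothetical 3 x 3 concordance matrix (values invented for illustration)
                      matrix C = (1, .56, .24 \ .56, 1, .59 \ .24, .59, 1)

                      * symeigen requires a symmetric matrix:
                      * X holds the eigenvectors (columns), v the eigenvalues (a row vector)
                      matrix symeigen X v = C
                      matrix list v
                      matrix list X
                      ```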


                      • #12
                        Thank you. I have copied that code (was I supposed to amend it?) It's produced these results which I am not sure how to interpret?

                        Code:
                        . matrix li concord
                        
                        symmetric concord[5,5]
                                    c1          c2          c3          c4          c5
                        r1           1
                        r2           1           1
                        r3           1    .0671523           1
                        r4           1  -.08111923  -.03520745           1
                        r5           1  -.00596733  -.03347908  -.03679992           1


                        • #13
                          Originally posted by Joe Tuckles
                          Thank you. I have copied that code (was I supposed to amend it?) It's produced these results which I am not sure how to interpret?

                          Code:
                          . matrix li concord
                          
                          symmetric concord[5,5]
                                      c1          c2          c3          c4          c5
                          r1           1
                          r2           1           1
                          r3           1    .0671523           1
                          r4           1  -.08111923  -.03520745           1
                          r5           1  -.00596733  -.03347908  -.03679992           1
                          Thanks to Nick for reminding me about the concept of concordance. There are a few related measures for binary data, which you don't have.

                          I could have divined a bit more about your data structure if I'd read your post more closely. I didn't fully apprehend that your 4 measures were risk scores (which is why I mis-named them prevalence).

                          I think Nick is saying that any correlation measure, or ANOVA as well, may be better performed on the logits of the risk scores (since they are percentages). Nick, please correct me if I'm wrong!

                          Whatever he meant, the concordance correlation measure is a modification of the Pearson correlation. Nick's code assembled a matrix of the concordance correlation coefficients from his simulated data. You can manually run concordance correlation measures on each pair of risk scores you have, e.g.

                          Code:
                          concord prev1 prev2
                          ...
                          concord prev3 prev4
                          I think this modification of Nick's code corresponds to your example:

                          Code:
                          matrix concord = J(4, 4, .)
                          forval i = 1/4 {
                              forval j = 1/4 {
                                  concord prev`i' prev`j'
                                  matrix concord[`j', `i'] = r(rho_c)
                                  }
                              }
                          matrix list concord
                          
                          symmetric concord[4,4]
                                     c1         c2         c3         c4
                          r1          1
                          r2  .55961197          1
                          r3  .24427789  .58674752          1
                          r4  .22959564  .53394803   .8182989          1
                          A couple of last notes. First, you have some people with missing data. Second, it looks like your risk scores are in percentage points. Logits are calculated on proportions, so, if I am correct, you would need to divide the risk scores by 100 first. This code would do that:

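                          Code:
                          * remove the logits computed on the percentage scale
                          drop logitprev?
                          forvalues i = 1/4 {
                              * convert percentage points to a proportion, then take the logit
                              generate risk_`i' = prev`i' / 100
                              generate risk_logit_`i' = logit(risk_`i')
                          }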
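                          On the missing data point, a minimal sketch of how one might count the missing risk scores per person. The three rows are copied from the dataex listing earlier in the thread (note its column order); the variable name nmiss is made up for illustration.

                          ```stata
                          * Three rows copied from the dataex listing (note the column order)
                          clear
                          input byte Participant float(prev1 prev2 prev4 prev3)
                          1 2.967858 1.5984064 .8863378 1.4922162
                          3  9.14358         .        .  3.949243
                          5        0  .8983527 .1496917 .18035504
                          end

                          * Number of missing risk scores per person
                          egen byte nmiss = rowmiss(prev1 prev2 prev3 prev4)
                          tabulate nmiss

                          * People observed on both measures of a given pair
                          count if !missing(prev1, prev2)
                          ```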
                          Last edited by Weiwen Ng; 23 Aug 2018, 09:37.

                          • #14
                            Thanks, that makes sense :-) I did generate new variables which were logits, but the majority of the values are missing (as shown in my dataex). I'm not really sure why that is. Do I need to use log instead of logit?


                            • #15
                              My code just uses a toy dataset and has nothing to do with your data. The inputs are just Gaussian noise, so off-diagonal concordance correlations are essentially zero. Noise always agrees with itself so the diagonal concordance correlations are identically 1.
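
                              For reference, the standard sample form of Lin's concordance correlation coefficient, which I assume is what concord returns in r(rho_c), shows why this happens. Here s_xy is the covariance, s_x^2 and s_y^2 are the variances, and x-bar and y-bar are the means; the exact divisor used for the variances is not stated in this thread.

                              ```latex
                              \rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}
                              % A variable compared with itself: s_{xy} = s_x^2 = s_y^2 and \bar{x} = \bar{y},
                              % so \rho_c = 2 s_x^2 / (2 s_x^2) = 1.
                              % Independent noise: s_{xy} \approx 0, so \rho_c \approx 0.
                              ```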

                              Your logit calculations are wrong. Logit requires input that is within (0, 1) so you have an easy fix (divide by 100 first) and a more difficult fix (think what to do about two cases with supposedly 0 percent risk).
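
                              A minimal sketch of both points, using three rows copied from the dataex listing above (note its column order). The new variable names raw_logit1, p1 and logit_p1 are made up for illustration, and the sketch only locates the zero cases; it does not decide what to do about them.

                              ```stata
                              * Three rows copied from the dataex listing (note the column order)
                              clear
                              input byte Participant float(prev1 prev2 prev4 prev3)
                               1 2.967858 1.5984064  .8863378 1.4922162
                               5        0  .8983527  .1496917 .18035504
                              24        0 .11128157 .08769826  .1733915
                              end

                              * logit() returns missing for arguments outside (0, 1), so applying it to
                              * raw percentages gives missing whenever the percentage is 1 or more
                              generate raw_logit1 = logit(prev1)

                              * Rescale to a proportion first; an exact 0 still has no finite logit
                              generate p1 = prev1 / 100
                              generate logit_p1 = logit(p1)

                              list Participant prev1 raw_logit1 p1 logit_p1
                              count if p1 == 0
                              ```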
