What test should I use to see whether two variables are significantly different from each other?

Joe Tuckles

Join Date: Jul 2018

Posts: 180
#1


23 Aug 2018, 01:27

Hi,

Apologies for the rather basic question.

I have one sample of participants. I have four different measures to calculate the percentage risk of developing a disease. All 4 measures are designed to calculate the exact same thing, however they have produced different results for my sample. I want to run a test to see whether the results of each measure are the same or significantly different from each other. The 4 measures are continuous variables and are a percentage.

I hope that makes sense.
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#2

23 Aug 2018, 06:56

This is a very interesting question. Normally, as you know, we have participants nested within two groups. We have the t-test or the t-test for proportions to see if the means (or proportions) of the groups differ. If we have more than two groups, we can use ANOVA to simultaneously test if any one group's mean differs from the rest.

Here, I assume you actually have tests nested within participants. Despite that, my first inclination would still be (repeated measures) ANOVA. If this is the right approach, I think you would need to reshape your data such that each person has 4 observations, then run an ANOVA. Using some fake variable names:

Code:

preserve
rename test_a prevalence1
...
rename test_d prevalence4
keep id prevalence?
reshape long prevalence, i(id) j(testnum)
* the variable named in repeated() must also appear as a term in the model
anova prevalence id testnum, repeated(testnum)
restore

-preserve- and -restore- preserve the original data and restore it, so don't worry that you're throwing away a bunch of variables. You need to rename each test to some stub variable name ending in a number for reshape to work properly. The new variable testnum denotes the number of the test (i.e. is it the first, second, third, or fourth test); reshape will strip the number off the end of each variable starting with prevalence and assign it to -testnum-. The last command runs the ANOVA. You effectively have 4 repeated measures on each person. That's the approach that I think I would run, but I'm by no means certain it's correct.

If you get an error message, post it in the forum. I don't typically use ANOVA, so I may have botched the syntax! In fact, I have an idea of how I botched it, so note the fact that I removed the underscore (_) from the variables starting with prevalence.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

Joe Tuckles

Join Date: Jul 2018
Posts: 180

23 Aug 2018, 07:09

Thanks! This is the error I get:

Code:

reshape long prevalence, i(Participant) j(testnum)
(note: j = 1 2 3 4)
variable id does not uniquely identify the observations
    Your data are currently wide.  You are performing a reshape long.  You specified i(Participant) and j(testnum).  In the current wide form, variable Participant should
    uniquely identify the observations.  Remember this picture:

         long                                wide
        +---------------+                   +------------------+
        | i   j   a   b |                   | i   a1 a2  b1 b2 |
        |---------------| <--- reshape ---> |------------------|
        | 1   1   1   2 |                   | 1   1   3   2  4 |
        | 1   2   3   4 |                   | 2   5   7   6  8 |
        | 2   1   5   6 |                   +------------------+
        | 2   2   7   8 |
        +---------------+
    Type reshape error for a list of the problem observations.
r(9);

. 
. anova prevalence Participant, repeated(testnum)
variable testnum not found
(error in option repeated())


Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

23 Aug 2018, 07:13

I was operating under the assumption that you have a unique ID variable for each person. Do you actually have multiple observations for each person already?

This is why, under the FAQ, we ask for examples of your data using -dataex- (some details in my signature). Can you post some example data?
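One way to check whether an identifier is unique before reshaping is sketched below. It uses toy data rather than the poster's, and the variable name Participant is an assumption at this point in the thread. Note that rows with a missing identifier count as duplicates of one another, and isid also objects to missing values.

```stata
* Toy data, not the poster's: two of the three rows have a missing identifier
clear
input Participant prev1
1 2.9
. 4.4
. 1.1
end

* Tabulate how many copies of each identifier value exist
duplicates report Participant

* Count rows where the identifier itself is missing
count if missing(Participant)

* isid exits with an error when the variable is not a unique, nonmissing ID;
* capture keeps the do-file running and stores the return code in _rc
capture noisily isid Participant
display _rc
```

A nonzero return code here suggests that reshape long would also refuse i(Participant) until the offending rows are dealt with.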

Nick Cox

Join Date: Mar 2014

Posts: 35720
#5

23 Aug 2018, 07:22

I'd add that percentage risks are often better considered on a logit scale, partly because of how those quantities behave and partly because of their substantive interpretation, just as I don't care whether my chance of getting wet in the rain changes from 50% to 51% but I do care much more whether my chance of being hit by lightning changes from 1% to 2%. (Or even from 0.000001% to twice that.)
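As a small illustration of that point (a sketch using Stata's built-in logit() function, which expects a proportion strictly between 0 and 1), the two one-point changes mentioned above are very different in size on the logit scale: roughly 0.04 for 50% to 51% and roughly 0.70 for 1% to 2%.

```stata
* Change on the logit scale when risk moves from 50% to 51%
display logit(0.51) - logit(0.50)

* Change on the logit scale when risk moves from 1% to 2%
display logit(0.02) - logit(0.01)
```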

Concordance correlation measures agreement and there is corresponding Stata stuff.

Joe Tuckles

Join Date: Jul 2018
Posts: 180

23 Aug 2018, 07:41

Hi,

Thanks for your help. I assume I am doing something wrong:

Code:

. ssc install dataex
checking dataex consistency and verifying not already installed...
all files already exist and are up to date.

. -dataex-
- is not a valid command name
r(199);

I also attempted this:

Code:

 concord logitprev1 logitprev2 logitprev3 logitprev4
too many variables specified


Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#7

23 Aug 2018, 07:44

dataex is the command; many people on this list surround a command name with hyphens just to set it off, but you should not type the hyphens.
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#8

23 Aug 2018, 07:45

I did try that too:

Code:

. dataex
input statement exceeds linesize limit. Try specifying fewer variables
r(1000);
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#9

23 Aug 2018, 07:46

I'll create a new dataset with just the variables required for this question
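A possible alternative to building a separate dataset is to hand dataex a variable list and an observation range, since it accepts both. The sketch below runs on Stata's bundled auto data purely for illustration; the variable names make, price and mpg belong to that dataset, not to the data in this thread, and dataex is assumed to be installed.

```stata
* Illustration on the bundled auto data, not the poster's data
sysuse auto, clear

* Restrict the example to a few variables and the first five observations
dataex make price mpg in 1/5
```

With real data, the same pattern would list only the identifier and the measures being compared.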

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#10

23 Aug 2018, 07:48

Code:

 . dataex

----------------------- copy starting from the next line -----------------------


	Code:
	* Example generated by -dataex-. To install: ssc install dataex
clear
input byte Participant float(prev1 prev2 prev4 prev3 logitprev1 logitprev2 logitprev3 logitprev4)
 1  2.967858 1.5984064  .8863378 1.4922162          .          .         .  2.0538669
 2 14.070985 4.4026513 1.1049236  1.942504          .          .         .          .
 3   9.14358         .         .  3.949243          .          .         .          .
 4  13.60417  4.816825 2.7212884  6.405684          .          .         .          .
 5         0  .8983527  .1496917 .18035504          .  2.1790543 -1.513944  -1.737021
 6  4.482531 1.6181076  .5019374 1.6671507          .          .         . .007749596
 7   .976722 1.6757084   .754011 1.2927463  3.7366934          .         .  1.1201203
 8  14.45461  7.827424 4.1563263  4.874775          .          .         .          .
 9  38.45862 15.632953   6.19238  6.048964          .          .         .          .
10  1.049809  .9204037 .38537505   .802256          .   2.447844 1.4004548  -.4667952
11  2.045249 1.7633067  .7400888 1.5751708          .          .         .    1.04643
12 16.032372  3.197002  4.738522 4.6955466          .          .         .          .
14 20.577965 11.138762  8.531375  7.435046          .          .         .          .
15  1.178254 .20282856  .2420856  .6580936          . -1.3687087    .65481  -1.141279
16  1.436456  1.616737 .22165443  .3152621          .          . -.7756316  -1.256051
17 22.599764  12.42271  3.139279  4.037205          .          .         .          .
18  5.128489 4.4069552 1.1763887 2.2554524          .          .         .          .
19  20.08589 11.428288  4.188088  4.721145          .          .         .          .
21   6.91531  4.880307  1.971524  3.730975          .          .         .          .
22  6.340435 2.9332335 2.0938213  3.327607          .          .         .          .
23  7.027753  .5582108 1.1606071 2.1429203          .   .2339038         .          .
24         0 .11128157 .08769826  .1733915          . -2.0777168 -1.561779 -2.3420687
25   .799016 1.2400947 .31622165  .6414833  1.3801557          .  .5818079  -.7711904
26  7.922649  4.228576 2.1718347  4.859337          .          .         .          .
27   .356924 .24251093  .3661532   .354345 -.58874005 -1.1389623 -.5999942  -.5487555
29  13.83318  6.872275   2.47116  5.645855          .          .         .          .
30 13.870323  4.157504  3.129016  3.417343          .          .         .          .
31  10.44165  4.020252  2.829525 3.7610774          .          .         .          .
 .         .         .         .         .          .          .         .          .
 .         .         .         .         .          .          .         .          .
 .         .         .         .         .          .          .         .          .
 .         .         .         .         .          .          .         .          .
 .         .         .         .         .          .          .         .          .
 .         .         .         .         .          .          .         .          .
end
------------------ copy up to and including the previous line ------------------

Listed 34 out of 34 observations


Nick Cox

Join Date: Mar 2014

Posts: 35720
#11

23 Aug 2018, 08:00

concord (Stata Journal) will only compare two variables at a time. I've seen work on developing a single overarching measure, which I found unconvincing. But you could loop over variables to get a matrix. Such results are only descriptive but will flag which measures are closest (least close). Elsewhere I've suggested looking at the eigenvectors and eigenvalues of that matrix.

Token code:

Code:

clear
set obs 100
set seed 2803
forval j = 1/5 {
    gen y`j' = rnormal()
}
matrix concord = J(5, 5, 1)
quietly forval i = 1/4 {
    local J = `i' + 1
    forval j = `J'/5 {
        concord y`j' y`J'
        matrix concord[`j', `J'] = r(rho_c)
        matrix concord[`J', `j'] = r(rho_c)
    }
}
matrix li concord
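A note on the token code: as written, the inner loop compares y`j' with y`J', so y1 is never used and the first row and column of the matrix keep their initial value of 1. A variant that visits every pair i < j might look like the sketch below. It assumes concord is installed and reuses the same simulated noise, so it remains a toy example.

```stata
* Same simulated data as the token code above
clear
set obs 100
set seed 2803
forval j = 1/5 {
    gen y`j' = rnormal()
}

* Start from 1s so the diagonal is already correct
matrix concord = J(5, 5, 1)

* Visit each pair i < j once and fill both triangles of the matrix
forval i = 1/4 {
    local J = `i' + 1
    forval j = `J'/5 {
        quietly concord y`i' y`j'
        matrix concord[`i', `j'] = r(rho_c)
        matrix concord[`j', `i'] = r(rho_c)
    }
}
matrix list concord
```

For real data, the y variables would be replaced by the measures being compared and the matrix dimension adjusted to match.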

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#12

23 Aug 2018, 08:25

Thank you. I have copied that code (was I supposed to amend it?) It's produced these results which I am not sure how to interpret?

Code:

. matrix li concord

symmetric concord[5,5]
            c1          c2          c3          c4          c5
r1           1
r2           1           1
r3           1    .0671523           1
r4           1  -.08111923  -.03520745           1
r5           1  -.00596733  -.03347908  -.03679992           1


Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#13

23 Aug 2018, 09:20

Originally posted by Joe Tuckles

Thank you. I have copied that code (was I supposed to amend it?) It's produced these results which I am not sure how to interpret?

Code:

. matrix li concord

symmetric concord[5,5]
            c1          c2          c3          c4          c5
r1           1
r2           1           1
r3           1    .0671523           1
r4           1  -.08111923  -.03520745           1
r5           1  -.00596733  -.03347908  -.03679992           1

Thanks to Nick for reminding me about the concept of concordance. There are a few related measures for binary data, which you don't have.

I could have divined a bit more about your data structure if I'd read your post more closely. I didn't fully apprehend that your 4 measures were risk scores (which is why I mis-named them prevalence).

I think Nick is saying that any correlation measure, or ANOVA as well, may be better performed on the logits of the risk scores (since they are percentages). Nick, please correct me if I'm wrong!

Whatever he meant, the concordance correlation measure is a modification of the Pearson correlation. Nick's code assembled a matrix of the concordance correlation coefficients from his simulated data. You can manually run concordance correlation measures on each pair of risk scores you have, e.g.

Code:

concord prev1 prev2
...
concord prev3 prev4

I think this modification of Nick's code corresponds to your example:

Code:

matrix concord = J(4, 4, .)
forval i = 1/4 {
    forval j = 1/4 {
        concord prev`i' prev`j'
        matrix concord[`j', `i'] = r(rho_c)
    }
}
matrix list concord

symmetric concord[4,4]
           c1         c2         c3         c4
r1          1
r2  .55961197          1
r3  .24427789  .58674752          1
r4  .22959564  .53394803   .8182989          1

A couple of last notes. First, you have some people with missing data. Second, it looks like your risk scores are in percentage points. Logits are calculated on a proportion, so, if I am correct, you would need to divide the risk scores by 100 first. This code would do that:

Code:

drop logitprev?
forvalues i = 1/4 {
    generate risk_`i' = prev`i' / 100
    generate risk_logit_`i' = logit(risk_`i')
}

Last edited by Weiwen Ng; 23 Aug 2018, 09:37.

Joe Tuckles

Join Date: Jul 2018

Posts: 180
#14

23 Aug 2018, 09:25

Thanks that makes sense :-) I did generate new variables which were logits but the majority of numbers are missing (as shown in my dataex). I'm not really sure why that is. Do I need to log instead of logit?
Nick Cox

Join Date: Mar 2014

Posts: 35720
#15

23 Aug 2018, 09:39

It's just a toy dataset and nothing to do with your data. The inputs are just Gaussian noise, so off-diagonal concordance correlations are essentially zero. Noise always agrees with itself so the diagonal concordance correlations are identically 1.

Your logit calculations are wrong. Logit requires input that is within (0, 1) so you have an easy fix (divide by 100 first) and a more difficult fix (think what to do about two cases with supposedly 0 percent risk).
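To make both fixes concrete, here is a minimal sketch on toy values (not the thread's data; the variable names risk1 and logit1 are made up for illustration). It shows the easy fix and makes the harder one visible, without settling what should be done about exact zeros.

```stata
* Toy percentages, including one exact zero
clear
input prev1
0
1
14
end

* Easy fix: convert percentage points to a proportion before taking logits
generate risk1 = prev1 / 100
generate logit1 = logit(risk1)

* Harder problem: logit() is only defined strictly inside (0, 1),
* so a risk of exactly 0 still yields a missing value
count if prev1 == 0
count if missing(logit1)
list
```

Whatever is decided for the zero-risk cases, counting them first shows how many observations the decision affects.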