What test should I use to see whether two variables are significantly different from each other?

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#16

23 Aug 2018, 09:49

Originally posted by Nick Cox

It's just a toy dataset and nothing to do with your data. The inputs are just Gaussian noise, so off-diagonal concordance correlations are essentially zero. Noise always agrees with itself so the diagonal concordance correlations are identically 1.

Your logit calculations are wrong. Logit requires input that is within (0, 1) so you have an easy fix (divide by 100 first) and a more difficult fix (think what to do about two cases with supposedly 0 percent risk).

I was modifying my answer as Joe's question and Nick's answer came in. I may have addressed some of this in #13.

For the two cases where the risk score is 0, assuming that is a legitimate risk assessment, I'd probably vote for replacing the zero with an arbitrarily small number, e.g. 1 * 10^-15, before taking the logit.

There appear to be 6 or so observations where all information, including participant ID, is missing. These could be data errors. Maybe they're participants with missing risk assessment data, and somehow their IDs got eliminated. Joe will have to decide what to do about them.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#17

24 Aug 2018, 04:21

Hello, thanks for your ongoing help. I attempted your code:

Code:

drop logitprev?
forvalues i = 1 / 4 {
generate risk_`i' = prev`i' / 100
generate risk_logit_`i' = logit(risk_`i')
}

and got the error

Code:

invalid '1' 
r(198);

.

Not really sure what this means. Would this code work instead?

Code:

. generate newvar = prev1 /100
(6 missing values generated)

. generate logitprev1 = logit(newvar)
(8 missing values generated)

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35696
#18

24 Aug 2018, 05:19

The error message means what it says. At some point you typed

Code:

'1'

whereas you should have typed.

Code:

`i'

There are two errors there. First, the character 1 (one) is not the letter i, which is what you need here. That error isn't what Stata is objecting to, but it is wrong anyway. Second, you need left and right single quote marks. Many people rarely use the left quote, perhaps because they rely on MS Word or some such instrument to be smart on their behalf and to convert single quotes on the fly into left and right curly single quotes. But Stata requires different opening and closing quote characters in local macro references, and it has no notion of auto-correcting what you type.
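As a minimal illustration of the quoting rule (a sketch only, not code from the thread), a local macro reference opens with the left single quote (backtick) and closes with the ordinary right single quote (apostrophe):

```stata
* The loop index i is a local macro: refer to it as `i'
* (backtick before the name, apostrophe after it).
* Typing 'i' or '1' with two apostrophes is not a macro reference.
forvalues i = 1/4 {
    display "i is `i', so prev`i' is the variable used on this pass"
}
```

This loop only displays text, so it should run without any data in memory.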

The bigger deal remains that you aren't modifying the calculation for the zeros (and any ones) in your data, which logit can't cope with (hence the report of missing values, as logit(0) and logit(1) are undefined). I'll guess wildly that your percent risks all come from looking at (count of actual cases) and (count of possible cases) (the terminology may not match your field or application). If so, then zero actual cases necessarily imply zero risk, usually an underestimate of risk.

Books such as David Cox and Joyce Snell's (1989) The analysis of binary data (London: Chapman and Hall) recommend

log[(actual + 1/2) / (possible - actual + 1/2)]

as a work-around here.
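
A minimal Stata sketch of this work-around, assuming the counts are held in two variables here called actual and possible. Both names and the toy values are hypothetical, since the thread does not show such variables:

```stata
* Toy data for illustration only; actual and possible are hypothetical names.
clear
input actual possible
 0 40
 3 40
40 40
end

* Adjusted logit: defined even when actual is 0 or equals possible
generate elogit = ln((actual + 0.5) / (possible - actual + 0.5))

* Plain logit of the proportion: missing in the first and last rows
generate plain = logit(actual / possible)

list
```

The comparison column is there only to show which rows the unadjusted logit cannot handle.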
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#19

24 Aug 2018, 06:21

Thanks for your help. I typed this code

Code:

drop logitprev?
forvalues i = 1 / 4 {
generate risk_`i' = prev`i' / 100
generate risk_logit_`i' = logit(risk_`i')
}

I cannot see the 1 other than where it says i=1/4. I have also used all the correct quote marks as far as I can see.
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#20

24 Aug 2018, 07:15

Just to clarify: the missing ID numbers for participants are deliberate, as I have had to remove them. There are two cases where assessment data are missing, so I have not generated a score for them. And there are two cases (the 0s) that cannot be scored because of their age and have been given a zero for that reason. I have changed their score to 1 * 10^-15. I am now a little stumped about how to proceed, as I cannot see where the code is wrong: where there is a 1 or incorrect quote marks.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#21

24 Aug 2018, 07:54

I don't see what identifiers have to do with any of the calculations here, but naturally agree that we don't need to see them.

I am afraid I don't really agree with the advice from Weiwen Ng to change 0 to 1e-15. It's utterly arbitrary and liable to create outliers. In contrast there's literature justifying the work-around given in my last.

You're in denial that you typed what Stata is stating you typed. Sorry, but in these cases I have learned to believe Stata.

That may seem mostly negative, and to compound that I am shortly travelling and unlikely to reply to anything until late Monday. (I don't travel with a lap-top and Statalist won't accept input from my phone.)

The general advice remains what it always has been: show us exactly what you typed into Stata. Copying and pasting the code is the best way to do that.
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#22

24 Aug 2018, 07:56

Just to be clear - the code above is copied and pasted
Comment

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#23

24 Aug 2018, 07:58

Code:

. forvalues i = 1 / 4 {
  2. generate risk_`i' = prev1 `i' / 100
  3. generate risk_logit_`i' = logit(risk_`i')
  4. }
invalid '1' 
r(198);

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35696
#24

24 Aug 2018, 09:11

That is slightly different from the code earlier, but I think you have "l" not "1" in the first line (before the slash) and a spurious space in the second line. I didn't spot the first error before (sorry) but the second error is new.
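
With both slips corrected, the loop would look something like the sketch below. It assumes the variables are named prev1 to prev4 and hold percentages, as the earlier code in this thread suggests, and that risk_1 to risk_4 and risk_logit_1 to risk_logit_4 do not already exist:

```stata
* Assumes prev1-prev4 exist and hold percentages (0 to 100).
* The first line uses the digit 1 before the slash; there is no space
* between prev and the macro reference `i'.
forvalues i = 1/4 {
    generate risk_`i' = prev`i' / 100
    generate risk_logit_`i' = logit(risk_`i')
}
```

logit() will still return missing for any proportion that is exactly 0 or 1, so this fixes the syntax only, not the problem of the zeros.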
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#25

24 Aug 2018, 11:36

Originally posted by Nick Cox

...

I am afraid I don't really agree with the advice from Weiwen Ng to change 0 to 1e-15. It's utterly arbitrary and liable to create outliers. In contrast there's literature justifying the work-around given in my last.
...

Nick's right. I had wires crossed with a different regression model.

Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#26

29 Aug 2018, 04:19

Hi,

I have tried again, but it seems to think the i on the second row is a 1. I assure you it isn't: I typed i and retyped it to check again.

Code:

. forvalues i = 1 / 4 {
  2. generate risk_`i' = prev1`i' / 100
  3. generate risk_logit_`i' = logit(risk_`i')
  4. }
prev11 not found
r(111);
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#27

29 Aug 2018, 04:29

I do not understand how to apply the code:
log[(actual + 1/2) / (possible - actual + 1/2)]

These are all risk percentages so I would guess this means my percentages are all 'possible' cases, as they have not become 'actual cases' yet... the disease hasn't happened yet, this is just their risk of it possibly happening. I'm not sure I have understood what you are saying?

Do I need to change those with scores such as 0.1, 0.2, 1.7, 1.2 etc etc?

Last edited by Joe Tuckles; 29 Aug 2018, 04:32.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#28

29 Aug 2018, 04:55

On #26: Stata is telling you, as I read it, that it doesn't know of a variable or scalar called prev11. We can't comment beyond asking whether that's true. From #10 and #17 it would seem that your variables are called prev1 to prev4.

On #27:

Your percentage risk data came from somewhere. You can't interpret them easily or even correctly if you have no idea of what they are based on.

If you don't have the original numerator and denominator, then you can't apply this work-around. That is on all fours with noting that zero percent risk could be 0 cases out of any denominator from 1 up.

Also, the calculation should be for all observations. It's a consistent way of approximating the logit when faced with zeros or ones as argument. If possible >> 1, actual >> 1, the adjustments are minute.
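
To see how the size of the adjustment shrinks as the counts grow, one can compare the two versions directly. The counts below are made up purely for illustration, with the same proportion (0.3) at two different scales:

```stata
* Made-up counts: 3 of 10 versus 300 of 1000
display logit(3/10)
display ln((3 + 0.5) / (10 - 3 + 0.5))
display ln((300 + 0.5) / (1000 - 300 + 0.5))
```

The third result should sit much closer to the plain logit than the second does.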

Your bottom line may be that you can't apply a logit scale. That would be a pity, because preliminary plots of your data suggest that it helps, with the proviso that you can't plot the zeros.

Last edited by Nick Cox; 29 Aug 2018, 04:59.
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#29

29 Aug 2018, 05:00

Ahh I thought I gave a data example in post #10. Was this something else?

My percentage risk data came from published algorithms using risk factors, with the exception of one measure - the owners of that measure generated the percentages on my behalf because their risk algorithm is not in the public domain. Does this mean I do not have the original numerator and denominator?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#30

29 Aug 2018, 05:04

Indeed. I don't believe zero risk for anything except my voting for Trump.
Comment
