Evaluating the precision of a novel algorithm depending on the current standards

Simon Henry


18 Jul 2023, 05:08

Dear all,

I would be grateful if a more experienced statistician could review my methodology.
I have read extensively on the subject to fine-tune my analysis, but I am not an expert statistician and I am worried that I have made some errors. I would be very happy if one of you could point out any errors and suggest how to improve further.

I have a project in which I am comparing signs seen on a diagnostic tool with a diagnostic variable given by an AI-based algorithm.
The AI algorithm predicts the intensity (range) of the disease (0-5, 5-10, 10-20, 20-30, 30+ %).

Each ID has several pictures, each with a set of 0/1 categorical variables, one for each sign (x) present. There are 5-6 signs per picture, 6-9 pictures per ID, and 117 IDs.
I am generating several variables:

SUMx = sum of pictures where sign x is present, for a given ID.
ALLSUM = sum of all the SUMx for a given ID.
NEOx = categorical variable for a given ID (0 = absence, 1 = presence, or SUMx > chosen cutoff).
ALLNEO = sum of the NEOx for a given ID.
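
As a rough sketch of how these four kinds of variables could be built from picture-level data: it assumes one row per picture and 0/1 indicators named bulles, blanc and plis (a raw plis indicator is an assumption, as it is not shown in the data excerpt), and a single cutoff of 3 taken from the NEOblanc example. Only three signs are used here, so ALLSUM and ALLNEO would need the full list of signs in practice.

```stata
* Sketch only: one row per picture, 0/1 sign indicators per row.
* Drop or rename any existing SUM*/NEO*/ALL* variables before running.
local cutoff 3                      // placeholder cutoff, same for every sign
foreach s in bulles blanc plis {
    * number of pictures per ID in which the sign is present
    bysort ID: egen SUM`s' = total(`s')
    * 1 if the count exceeds the cutoff, 0 otherwise
    generate byte NEO`s' = SUM`s' > `cutoff'
}
* totals across signs (constant within ID)
egen ALLSUM = rowtotal(SUMbulles SUMblanc SUMplis)
egen ALLNEO = rowtotal(NEObulles NEOblanc NEOplis)
```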

The data look like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int ID float range byte(bulles blanc) float(SUMbulles SUMblanc SUMplis ALLSUM NEObulles NEOblanc NEOplis ALLNEOh)
1 3 0 0 0 3 0 10 0 0 0 0
1 3 0 0 0 3 0 10 0 0 0 0
1 3 0 0 0 3 0 10 0 0 0 0
1 3 0 1 0 3 0 10 0 0 0 0
1 3 0 1 0 3 0 10 0 0 0 0
1 3 0 1 0 3 0 10 0 0 0 0
1 3 0 0 0 3 0 10 0 0 0 0
1 3 0 0 0 3 0 10 0 0 0 0
2 4 0 1 1 7 3 17 1 1 1 3
2 4 0 1 1 7 3 17 1 1 1 3
2 4 0 1 1 7 3 17 1 1 1 3
2 4 0 1 1 7 3 17 1 1 1 3
2 4 0 1 1 7 3 17 1 1 1 3
2 4 0 1 1 7 3 17 1 1 1 3
2 4 0 1 1 7 3 17 1 1 1 3
2 4 1 0 1 7 3 17 1 1 1 3
end

(In this example, the variable NEOblanc is coded as 1 if SUMblanc > 3.
Some signs 'x' are not displayed for the sake of the example, so ALLSUM does not match the sum of the SUMx shown.)

In order to determine which signs best determine the AI-produced 'range', I ran a regression.
As range is an ordinal categorical variable (0= 0-5 %, 1.., 2.., 3= 20-30%, 4=30+%), I ran ologit for the SUMx:
ologit range SUMx SUMy SUMz ...., or

I ran the same for the NEOx, using the i. prefix:

ologit range i.NEOx i.NEOy ... , or

I also tried probing for interactions using the ## operator:

ologit range c.SUMx##c.SUMy##c.SUMz, or
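
With the variable names from the data example, the three specifications above might look like the sketch below. It assumes the models are meant to be fitted on one row per ID, since the SUM and NEO variables are constant within ID; onePerID is a hypothetical helper variable, and only the three signs shown in the example are included.

```stata
* Sketch only: flag one row per ID so each ID is counted once,
* not once per picture (assumed unit of analysis).
capture drop onePerID
egen byte onePerID = tag(ID)

* Counts of pictures showing each sign
ologit range SUMbulles SUMblanc SUMplis if onePerID, or

* Presence/absence indicators
ologit range i.NEObulles i.NEOblanc i.NEOplis if onePerID, or

* Main effects plus all two-way and three-way interactions
ologit range c.SUMbulles##c.SUMblanc##c.SUMplis if onePerID, or
```

The two-ID excerpt is far too small to fit any of these; the commands are meant for the full data set of 117 IDs.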

My questions are the following:

1) Results naturally vary depending on whether the ## operator is used. Some variables are significant or not depending on the model and on the other variables included.
Each variable was first tested in a univariate analysis and, when a trend towards association was seen, was included in the subsequent multivariate models.
Variables were then tested for collinearity using regress followed by the vif command.
Even so, multiple models can be generated using this method, and all seem acceptable.
I am tempted to use the last one, with the ## operator, to account for interactions between variables. Is this correct?
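
For reference, the collinearity check described above, and one possible way to compare the model with and without interactions, could be sketched as follows. The variable names come from the data example, onePerID is a hypothetical flag for one row per ID, and the likelihood-ratio test is only one option among several, not necessarily the right criterion for this study.

```stata
* Sketch only: one row per ID (assumed unit of analysis).
capture drop onePerID
egen byte onePerID = tag(ID)

* Collinearity check: VIFs from a linear regression on the same predictors
regress range SUMbulles SUMblanc SUMplis if onePerID
estat vif

* Nested comparison: main effects only vs. full factorial interactions
ologit range c.SUMbulles c.SUMblanc c.SUMplis if onePerID
estimates store main
ologit range c.SUMbulles##c.SUMblanc##c.SUMplis if onePerID
estimates store full
lrtest main full              // likelihood-ratio test of the interaction terms
estimates stats main full     // AIC and BIC for both models
```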

2) When analysing the NEO variables, I would like a method that accounts for the fact that "the presence of 2, or 3, or N of the signs may successfully predict the range given by the algorithm".
The ologit command with the ## operator accounts for interaction, but not for the simultaneous presence of a given 2 or 3 signs, or of any combination of 2 or 3 signs, in predicting the outcome.
I have therefore used ologit as follows:

ologit range ALLNEO

The resulting coefficient can therefore be interpreted as "any increase in ALLNEO (= the presence of more signs) increases the log likelihood prediction of the range".
Is this correct ?
This however does not let me analyse the impact of the presence of two specific signs. Or is this what's given to me by ologit range i.NEOx##i.NEOy and the resulting NEOx#NEOy coefficient ?
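
To make the two approaches concrete, here is a minimal sketch using NEOblanc and NEOplis as the two specific signs (an arbitrary choice for illustration) and a hypothetical onePerID flag. With the or option, ologit reports proportional odds ratios, so each estimate is read as a multiplicative change in the odds of falling in a higher range category, under the proportional-odds assumption.

```stata
* Sketch only: one row per ID (assumed unit of analysis).
capture drop onePerID
egen byte onePerID = tag(ID)

* Number of signs present as a single linear term: the odds ratio applies
* to each additional sign, whichever sign it is
ologit range ALLNEO if onePerID, or

* Two specific signs, with their interaction
ologit range i.NEOblanc##i.NEOplis if onePerID, or

* Odds ratio for both signs present vs. neither present
lincom 1.NEOblanc + 1.NEOplis + 1.NEOblanc#1.NEOplis, or
```

In the second model, the NEOblanc#NEOplis term on its own is a ratio of odds ratios: how much the effect of one sign changes when the other is also present. The lincom line combines it with the two main effects.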

3) Finally, neither ALLNEO nor ALLSUM can be included in the multivariate analysis alongside the individual NEOx or SUMx without causing collinearity issues ("X omitted because of collinearity").
This is because one sign ("plis") seems to predict the severity of the disease very accurately, and therefore the range, the ALLSUM, and the ALLNEO.
I would like to improve the model's predictive ability further by still including the presence of other signs (ALLNEO) and the number of other signs present (SUMx, ALLSUM, etc.).
Is there a way to still include it in the multivariate model, or should it (or one of ALLNEO / ALLSUM) be omitted completely?
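
One point that may be relevant here: ALLSUM is by construction the exact sum of the SUMx, so entering it together with all of the SUMx gives a perfectly collinear set regardless of how predictive "plis" is. A possible workaround, sketched below under that assumption, is to enter "plis" separately and summarise the remaining signs in a total that excludes it. OTHERSUM and onePerID are hypothetical names, not variables from the data set.

```stata
* Sketch only: one row per ID (assumed unit of analysis).
capture drop onePerID
egen byte onePerID = tag(ID)

* Picture counts for all signs except plis; unlike ALLSUM, this total
* does not contain SUMplis, so the two can be entered together
generate OTHERSUM = ALLSUM - SUMplis
ologit range SUMplis OTHERSUM if onePerID, or
```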

Thank you.
As said, I welcome any feedback on my methodology and on how to improve it further without introducing more bias.

Regards,