  • Building multivariable model - categorical variable overall p value

    Hello

    I really do hope someone can help. Clyde Schechter - I saw your post here, and I think you can help...
    Just read this post: How to get an overall p-value for an independent categorical variable using binary logistic regression? - Statalist

    I've got a similar problem

    I'm performing a Cox regression model.

    I have one categorical variable: ASA (a grade of the co-morbid status of patients) with 4 levels:
    ASA1, ASA2, ASA3, ASA4

    The p-values for ASA1-ASA3 are all < 0.001.
    ASA4 has a p-value of 0.551.


    I would like to perform a Cox regression without ASA4 to see how my model fits (I doubt this will change anything as the p-value isn't high, and to be honest there aren't many patients with ASA4). However, I don't know whether:
    A. this makes sense,
    OR
    B. I need to remove ASA as an entire variable.

    Most books:
    Applied Survival Analysis
    Survival Analysis
    UCLA

    all demonstrate examples with a categorical variable with 2 levels (e.g., as if ASA had only ASA1 or ASA2).
    So there is only one p-value, and hence the new model can easily be examined if ASA is removed (as it has 2 levels).

    Code:
    stcox i.var1 i.var2 i.ASA, nohr
    estimates store A
    stcox i.var1 i.var2, nohr
    Last edited by Tara Boyle; 09 Aug 2023, 06:15.

  • #2
    I don't quite get what you want to do. If you want to test the joint significance of all four categories of ASA you can do a Wald test
    Code:
    stcox i.var1 i.var2 i.ASA
    testparm i.ASA
    or you can do a likelihood ratio test
    Code:
    stcox i.var1 i.var2 i.ASA
    estimates store full
    stcox i.var1 i.var2
    estimates store reduced
    lrtest full reduced
    If you only want to look at the effect of the ASA == 4 level, then
    Code:
    stcox i.var1 i.var2 i.ASA
    test 4.ASA
    is the simplest way to do that. Actually, you don't even need the test command: you can just look at the coefficient of 4.ASA in the results, and its test statistic.

    I would not do this by dropping the observations where ASA = 4. Then you are changing the estimation sample, and that is going to affect the test statistics in its own right, even if the coefficient of ASA were itself exactly 0.

    I don't know if any of this answers your question. If not, please post back and try to phrase the question more exactly.

    Comment


    • #3
      Thanks, Clyde Schechter - I was looking forward to reading your reply.

      Yes, I have performed a Wald test; perhaps I am just being pedantic, and perhaps I shouldn't be.
      ASA has 5 levels (apologies, not 4!)

      I performed a Wald test for all ASA levels:
      p < 0.001

      //This would mean keep ASA, right?

      But then I thought: there are only 91 deaths in the ASA4 group and 5 in the ASA5 group, and this is giving me my high p-value. So I performed a Wald test between groups, which does show there is a difference between ASA4 and ASA5. Would it be reasonable to drop ASA4 and ASA5? Or is this changing the estimation sample? Or is the rule to drop ASA as one whole variable rather than subsets (ASA4, ASA5)?

      Here I show how I then performed Wald tests between pairs of levels:

      Code:
      test 2.ASA 3.ASA
      //p < 0.001

      Code:
      test 3.ASA 4.ASA
      //p < 0.001

      Code:
      test 4.ASA 5.ASA
      //p = 0.4601

      I look forward to your reply

      In fact, I performed a log-rank test - here is the graph, which does indicate ASA4/ASA5 are not parallel. p < 0.001, chi2(4) = 506.96.

      [Attached image: Capturekm.PNG - Kaplan-Meier graph]
      Last edited by Tara Boyle; 09 Aug 2023, 08:49.

      Comment


      • #4
        Hi, I was just reading this in a book:

        Survival Analysis by John P. Klein and Melvin L. Moeschberger

        (see photo insert)

        I wonder if I'm coding it all wrong, and whether a categorical variable such as ASA, which has 5 levels, should in fact be recoded as 5 dummy variables:

        ASA1 - 1 if ASA 1, 0 otherwise
        ASA2 - 1 if ASA 2, 0 otherwise
        ASA3 - 1 if ASA 3, 0 otherwise
        ASA4 - 1 if ASA 4, 0 otherwise
        ASA5 - 1 if ASA 5, 0 otherwise

        This would in fact explain why most resources I have seen use categorical variables with just 2 levels.

        This would make it easier to perform LR tests.

        Am I wrong to state this?

        Attached Files

        Comment


        • #5
          First, let's be clear about what the meaning of some of the options available to you would be. The tests being done on these pairs of levels of ASA test whether or not that pair of levels jointly has a hazard function that differs from that of the base level (which I guess is ASA = 1?) So if you were to act on the "non-significant" finding with regard to levels 4 and 5 you would, in effect, be reclassifying ASA 4 and 5 to be ASA 1. This makes no sense to me given the explanations of what those categories are. In fact, this ASA variable (American Society of Anesthesiologists classification?) is really an ordinal variable. So you might want to look into treating it as if it were a continuous variable in your model. (Though from the graphs it appears that failure is not related in an order-preserving way with this variable--so I suspect it really doesn't belong in the model.)
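
          A minimal sketch of that continuous-coding alternative, with a likelihood-ratio check of whether the linear trend is adequate (var1 and var2 as in the earlier commands; the linear model is nested within the factor-variable model):
          Code:
          * treat ASA as a linear (ordinal) trend instead of separate indicators
          stcox i.var1 i.var2 c.ASA
          estimates store linear
          * full factor-variable coding of ASA
          stcox i.var1 i.var2 i.ASA
          estimates store factor
          * does the factor coding fit significantly better than the linear trend?
          lrtest factor linear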

          Next, I would never endorse choosing a model by looking at significance tests. For one thing, p-values are contaminated by the sample size. Looking at your graph, it seems that with the possible exception of ASA-P1, which, oddly, seems to have the highest failure rate, the differences among these groups may be small. It is hard to say for sure because, oddly, the vertical axis has no scale on it: only the upper end of 1.00 is labeled, so it is hard to know if we are seeing a range from 1 down to, say 0.97 or from 1 down to nearly 0 in these graphs. But if your sample size is very large, even meaningless minuscule differences will be "statistically significant." Adding a variable just because it draws a "statistically significant" hazard ratio in the analysis can result in just overfitting the noise in the model. I would be much more concerned with things like the effect of modeling choices on things like model discrimination (-estat concordance-). And -estat ic- produces some information criteria that can be helpful in distinguishing improvements that are likely to be overfitting of noise from those large enough to represent "real" effects.
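
          For concreteness, both postestimation commands mentioned above run directly after fitting the model (a sketch, with the same variable names as earlier in the thread):
          Code:
          stcox i.var1 i.var2 i.ASA
          estat concordance    // Harrell's C: discrimination of the fitted model
          estat ic             // AIC and BIC, for comparing candidate models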

          Next, I would proceed in order: should I include ASA at all? If you decide to include ASA, you can then look at whether some tinkering with its levels might improve the model. But never tinker in ways that make no sense, like merging the death's-door levels with the fit and healthy while leaving intermediate levels of comorbidity separate.

          More general advice: model selection is as much art as it is science. Attempts to delegate variable selection to some automatic statistical tests often produce spectacularly ridiculous models that also fall apart when tested against new data. The "best fitting" model, as measured by any statistic, is usually not a good choice: it usually achieves its best fit, in no small measure, by overfitting noise, and it reproduces poorly in out-of-sample validation. While there are more sophisticated ways of guarding against this, a simple way, if your sample size is large enough, is to split your sample into random halves, develop the model using only the data from one half, and then see how the model holds up in the other half of the data. At the end of the day, judgment must be applied. The variables selected need to be sensibly defined and measured, and they need to have some credible relationship to the outcome being modeled, in addition to performing well from a purely statistical perspective.
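
          A simple split-sample sketch along those lines (the seed and variable names are illustrative):
          Code:
          set seed 12345
          generate byte train = runiform() < 0.5   // random half-sample indicator
          stcox i.var1 i.var2 i.ASA if train       // develop the model on one half
          estat concordance if !train              // check discrimination on the held-out half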

          Comment


          • #6
            Thanks for this




            Would you kindly explain what you mean by this




            So if you were to act on the "non-significant" finding with regard to levels 4 and 5 you would, in effect, be reclassifying ASA 4 and 5 to be ASA 1. T







            And you mention




            ‘you can then look at whether some tinkering with its levels might improve the model’




            How would you recommend with tinkering with its levels because too my mind - a patient who is ASA1 can not be moved to ASA2 ? So i think this is not possible




            Also, what are your thoughts re my post #4 ? If any of course







            And, yes I was just testing model fitting with full and reduced models and using LR test and wald test.




            I was then going to test again using estat and bic levels.




            And agreed with basing on p values. Its just I didnt know what to do with Asa4 and asa5 as there are very few in the sample




            Asa1 have highest failure rate as they are the largest population, healthiest and more likely to have the procedure.




            I altered the scale just to see the graphs more separate , minimal value is 0.97

            I appreicate you taking your time to reply as I’m sure you’re rather busy with all your work

            Comment


            • #7
              So if you were to act on the "non-significant" finding with regard to levels 4 and 5 you would, in effect, be reclassifying ASA 4 and 5 to be ASA 1.
              When you use i.ASA notation, Stata expands this in the regression to a set of indicator ("dummy") variables representing all but one level of ASA. So it is as if you had variables for ASA = 2, ASA = 3, ASA = 4, and ASA = 5. (No variable for ASA = 1; that's the base category.) If you eliminate level 4 from the model, and have only (virtual) indicators for ASA = 2, 3, or 5, then notice that level 4 and level 1 will both be represented as all zeroes on these three indicators. So level 4 becomes equivalent to level 1 in this model.
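
              You can see this expansion without running a model at all; -fvexpand- builds the expanded list of virtual indicator names (a sketch, assuming ASA takes values 1-5 in the data):
              Code:
              fvexpand i.ASA
              display "`r(varlist)'"
              * the b suffix in the result marks the base level, which carries no coefficient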

              How would you recommend tinkering with its levels? To my mind, a patient who is ASA1 cannot be moved to ASA2, so I think this is not possible.
              It is quite possible. Just -recode ASA (2 = 1)- and it is done. And it is certainly more plausible to group levels 1 and 2 together than levels 1 and 4. Levels 1 and 2 are not that far apart in real-world terms: the fully fit vs those with mild illness. Treating them as a single group is not necessarily unreasonable. In general, with ordinal variables where some of the levels are rare, combining adjacent levels to get larger groups is usually reasonable.

              Also, what are your thoughts re my post #4 ? If any of course
              As I have pointed out in my first response in this post, when you use Stata's factor variable notation, behind the scenes, Stata creates the internal equivalent of precisely those variables mentioned in #4. So there is no difference here. Factor variable notation is both more convenient when writing commands, and essential if one goes on to use the -margins- command. The use of separate indicator variables for levels of a single categorical construct is no longer necessary in Stata, except in a very limited number of situations that require older commands that don't allow factor-variable notation.

              It's just that I didn't know what to do with ASA4 and ASA5, as there are very few in the sample.
              Well, 91 failures in ASA 4 is not all that few in terms of being able to support a coefficient estimate with a possibly acceptable level of precision. Granted 5 failures in ASA 5 is really not enough. But probably the most sensible thing to do is then combine categories 4 and 5.
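
              A sketch of that merge, generating a new variable (ASA_c is just an illustrative name) so the original coding is preserved:
              Code:
              * combine the sparse level 5 into level 4
              recode ASA (5 = 4), generate(ASA_c)
              label variable ASA_c "ASA with levels 4 and 5 combined"
              stcox i.var1 i.var2 i.ASA_c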

              ASA1 has the highest failure rate as it is the largest population, the healthiest, and more likely to have the procedure.
              Thanks for clarifying that. I had been assuming that the failure event was something adverse like dying or ending up in the ICU or something like that. Now the results make more sense. (But it still doesn't make sense to combine 4 with 1.)

              I altered the scale just to see the graphs more separated; the minimal value is 0.97.
              So you are dealing with very small differences in a very large sample. With that little real outcome variation, even small amounts of noise can wreck the model as they can easily obscure or mimic the real effects. It also makes the models more sensitive to the choices made about what to include in the model: a confounder that alters the outcome probability by as little as 1 percentage point is having an effect that is comparable to the between-group differences you are seeing. So including or excluding that confounder can lead to models with rather different results. You are in a difficult situation here. Ironically, in situations like this your clinical judgment and biological/medical understanding of the real world mechanisms relating these variables to each other becomes even more important and statistics becomes even less helpful in model building. I suggest you "step away from the keyboard" for a while and spend some time thinking carefully about the real-world processes that relate these variables. Draw diagrams with arrows connecting related variables. Figure out which arrows represent well-understood causal effects and which are of less certain importance. Make sure you understand clearly which of the categorical variables in your model might, like ASA, be better thought of as ordinal. Investing some time in this will probably be more productive than fitting and testing more models at this point.

              Comment


              • #8
                Excellent explanation.

                Yes, I have an alternative to ASA and can use a comorbidity index from Bottle et al.'s paper, which reclassified the Charlson index with modern HES values.

                I just dislike dropping variables; it is my weakness, and I find every excuse to try to keep them in.

                with regards to,

                When you use i.ASA notation, Stata expands this in the regression to a set of indicator ("dummy") variables representing all but one level of ASA. So it is as if you had variables for ASA = 2, ASA = 3, ASA = 4, and ASA = 5. (No variable for ASA = 1, that's the jbase category.)
                - Agreed ASA =1 would be the reference. However with regards to this:

                If you eliminate level 4 from the model, and have only (virtual) indicators for ASA = 2, 3, or 5, then notice that level 4 and level 1 will both be represented as all zeroes on these three indicators. So level 4 becomes equivalent to level 1 in this model.

                with regard to eliminating level 4 and its becoming equivalent to level 1: what code would you have used to show how you eliminated level 4 so that it is represented as 0 in future models?


                Just to be devil’s advocate I can eliminate it by using:

                Code:
                char ASA[omit] 4
                xi: stcox i.ASA i.var2 i.var3
                then level 4 won’t be represented as 0.

                although, with regards to elimination I may have already decided to drop it and use an alternative variable.

                However I thought I’d seek further clarification re the above just to further my understanding

                thank you so much for your time

                Comment


                • #9
                  In factor variable notation, to omit category 4 from the representation, I would code it as ib1o4.ASA. That tells Stata to keep level 1 as the base level, and then also code level 4 as all zeroes on the virtual indicators it creates behind the scenes.
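
                  Applied to the model from earlier in the thread, that would be:
                  Code:
                  stcox i.var1 i.var2 ib1o4.ASA, nohr
                  * level 1 remains the base; level 4 is also omitted, so observations
                  * with ASA = 1 or ASA = 4 are identical on the remaining indicators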

                  Comment


                  • #10
                    Perhaps I am being silly, but I still don't understand post #9 - or perhaps you can refer me to a resource I can read to try to understand.

                    would code it as ib1o4.ASA. That tells Stata to keep level 1 as the base level, and then also code level 4

                    - If I tell Stata
                    Code:
                    stcox i.var1 i.var2 i.ASA
                    estimates store full
                    ASA still has all 5 levels (ASA1 to ASA5).
                    Or do you mean to generate a new variable:
                    Code:
                    gen ib1o4.ASA = 0
                    replace ib1o4.ASA = 1 if ASA == 4
                    then
                    Code:
                    stcox i.var1 i.var2 i.ASA i.ib1o4.ASA
                    estimates store full
                    I'm so sorry for wasting your time.

                    Comment


                    • #11
                      No, no new variable. Re-read -help fvvarlist- and the linked chapter in the PDF documentation that comes installed with your Stata to get a full understanding of factor variable notation.

                      But to just focus on the matter at hand, with the ASA variable having 5 levels, which are numbered 1 through 5 in the data, if you run
                      Code:
                      any_Stata_estimation_command ... i.ASA ...
                      Stata will, behind the scenes, create four indicator "variables." (I'm using scare quotes around variables because these variables never appear in the data set and exist only in Stata's "mind.") Those variables are designated 2.ASA, 3.ASA, 4.ASA, and 5.ASA. There is one for each level of ASA except the first, which serves as the reference category. The "variable" 2.ASA takes on the value 1 whenever ASA = 2, missing (.) when ASA is missing, and 0 for any non-missing value of ASA other than 2. Analogously for 3.ASA through 5.ASA.

                      You can also see this in your output. In the regression table, there will be coefficients given for 2.ASA through 5.ASA. There will be no coefficient for 1.ASA. At most, depending on the particular command and what options you specified, 1.ASA might be listed, but where a coefficient might otherwise be found it will say 0 (base level) instead. (The only way to get a coefficient for 1.ASA is either to specifically designate some level other than 1 as the base level, or to run a no-constant-term command.)

                      Now, if you run the same command with o4.ASA, Stata will first agree to omit level 4 from the representation of ASA as a series of "dummy" "variables." But then Stata will realize that, as one indicator is already out of play, there is no need to omit another to avoid collinearity with the constant term, so 1.ASA will now appear beside 2.ASA, 3.ASA, and 5.ASA. That is, you still have four indicators, and what has changed is that 4 now serves as the base category.

                      But that does not really eliminate 4 from the model in the sense we have been talking about. To do that, you must both remove 4 and still have 1 used as the base category. So this would be run as:
                      Code:
                      any_Stata_estimation_command ... ib1o4.ASA ...
                      With this code, 1 is still omitted as the base category, and 4 is also omitted. The output contains 2.ASA, 3.ASA, and 5.ASA: only three indicators. And any observation with ASA = 1 or 4 has all three of these indicators = 0. So 1 and 4 have been made equivalent in this model.

                      Comment
