Cut-points for continuous variables using information from regression models

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#1


18 Jan 2016, 05:05

Hi Statlisters,

I am finding different methods cited in the literature for categorizing a continuous variable (cumulative lifetime measures of alcohol consumed, tobacco smoked, etc.), e.g., using smoothing functions such as splines to identify specific deflection points or ranges of values that can be used as cut-points. For a binary outcome and a continuous predictor (with or without other covariates), is there any Stata syntax, code, or function for finding the optimal cut-point values for categorizing the continuous predictor? I am looking for something similar in essence to what is available for other software: a) for SAS, "Finding Optimal Cutpoints for Continuous Covariates with Binary and Time-to-Event Outcomes"; b) "Choose Cutpoints for Categorizing a Continuous Predictor" (I have no idea which software this code is for, but it seems to use the information obtained from the regression model to find the optimal cut-points).

thanks
Maarten Buis

Join Date: Mar 2014

Posts: 3467
#2

18 Jan 2016, 05:08

The advice on this list for such questions is just "don't do that". By categorising your continuous variable you are throwing away information, and throwing away information is bad.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#3

18 Jan 2016, 05:29

Thanks, Maarten. I do understand that, and I went through some of the discussions here on this topic, as well as the literature on my outcome (head and neck cancers) and predictors (alcohol, smoking). The reason I am trying to find cut-points is that I am in the middle of an interaction analysis using these predictors and other binary variables (presence or absence of certain genetic markers), and I intend to obtain estimates of interaction on both the multiplicative and additive (relative excess risk due to interaction) scales. Mine is a case-control study, and the sample size is fewer than 1,000. I have also posted this question on the advice of my PhD committee members, who suggested using information from RC splines (which I have already fitted) to find optimal cut-points. Below is one of my RC spline graphs, with pack-years of cigarettes smoked on the x-axis and odds ratios on the log scale on the y-axis. Above the x-axis is a rug plot showing the distribution of participants in my study. The odds ratio increases sharply until around 45 pack-years, and the graph flattens beyond approximately 55 pack-years.

Attached Files

Last edited by Thekke Purakkal; 18 Jan 2016, 05:55. Reason: added spline graph
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#4

20 Jan 2016, 04:27

Solved
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

20 Jan 2016, 05:02

Hi Thekke, I'm glad you now consider your problem "solved", even though I absolutely agree with Maarten's cautionary note (in #2) on the risk of "throwing away information" when categorizing a continuous variable. However, since sharing information and solutions remains among the main aims of this Forum, I would kindly underline that the best closure for a query (in the spirit of this community of Stata users and lovers) is a presentation of the strategy applied to, well, "solve" the specific problem. Sincerely, this is not due to any interest of mine in this particular situation, but I believe there may be fellows, now or in the future, facing a similar issue, hence the potential benefit.

Best,

Marcos

Last edited by Marcos Almeida; 20 Jan 2016, 05:08.

Best regards,

Marcos
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#6

20 Jan 2016, 07:03

Hi Marcos, thank you for your advice. I really do appreciate it. I had gone through the aims of this forum, and I do understand that the proper way to close a thread is to provide the solution to the stated problem. But I was in two minds because: 1) as I mentioned in #3, I had gone through previous discussions on this forum about categorizing a continuous variable, and the method was not encouraged; 2) Maarten's reminder in #2 that the advice on this list for such questions is just "don't do that"; and 3) I had presented my special case and explained why I need to categorize my continuous variable, but no response was given. I did not know whether that was due to the forum's general stance of not encouraging categorization (I mean, no one wanted to give a solution even if they knew it) or because no one knew the solution. At the same time, I had to close the thread, as I had found an alternative route in the meantime. That is why I settled for "solved" rather than giving details of my solution and being "in danger" of overriding the forum's basic advice against categorization. So if anyone can shed light on this, it would be nice.

I solved my issues through a) using the information conveyed by my RC spline graph, and b) a workaround using outcome-based information given by 4 different methods of finding cut-points using the '%findcut' / '%cutpoint' macro in SAS (I do not know much about SAS but had to download it and learn the basics to suit my needs. And since it's not Stata, I do not know if it's OK to discuss it here).
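
For readers wondering what an outcome-based cut-point search can look like, here is a minimal Python sketch. It is not the SAS macro and is not claimed to reproduce any of its methods; it only illustrates one generic idea, which is to scan the observed values of the predictor and keep the split that maximizes the Pearson chi-square statistic against the binary outcome. The function names and the data are made up for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no information
    return n * (a * d - b * c) ** 2 / denom


def best_cutpoint(x, y):
    """Return (cut, statistic) for the split x > cut vs x <= cut that
    maximizes the chi-square statistic against the binary outcome y."""
    best_cut, best_stat = None, -1.0
    # Skip the largest observed value: it would leave the upper group empty.
    for cut in sorted(set(x))[:-1]:
        a = sum(1 for xi, yi in zip(x, y) if xi > cut and yi == 1)   # high, case
        b = sum(1 for xi, yi in zip(x, y) if xi > cut and yi == 0)   # high, control
        c = sum(1 for xi, yi in zip(x, y) if xi <= cut and yi == 1)  # low, case
        d = sum(1 for xi, yi in zip(x, y) if xi <= cut and yi == 0)  # low, control
        stat = chi_square_2x2(a, b, c, d)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat


# Hypothetical pack-years and case status (1 = case); not data from my study.
pack_years = [5, 10, 15, 20, 30, 40, 50, 60, 70, 80]
case = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
cut, stat = best_cutpoint(pack_years, case)
print(cut, round(stat, 3))
```

Because the cut-point is chosen to maximize the statistic, the p-value attached to the selected split is too optimistic unless it is adjusted for the search, so a result like this is better read as exploratory.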

The fact that categorizing a continuous variable leads to loss of information is taught in our basic stats courses, so I am well aware of the points raised against this method. However, arguments continue on this topic based on the "need" for categorization. To name a few:
Avoiding Power Loss Associated with Categorization and Ordinal Scores in Dose-Response and Trend Analysis by Sander Greenland
Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents by Caroline Bennette and Andrew Vickers
Finding Optimal Cutpoints for Continuous Covariates with Binary and Time-to-Event Outcomes by Williams BA et al.

I would like to recap that mine is a case-control genetic epidemiology study with a binary outcome, continuous smoking/alcohol variables as the main predictors, and binary genetic markers as the third variable. I am testing interactions/effect measure modification. The recommendations for presenting interaction results seem to be: presenting point estimates and CIs on both the multiplicative and additive scales, reporting individual odds ratios (as the case may be) within the strata of the effect modifier, and presenting observed and expected joint effects on the multiplicative scale (supra/sub interactions), be it a case-control or cohort study.
A Tutorial on Interaction
Recommendations for presenting analyses of effect modification and interaction by Mirjam J Knol and Tyler J VanderWeele

I am attempting this in my study, which is heavily limited by its sample size of fewer than 1,000 cases and controls for a genetic epi study. I am obviously using my continuous variables as such in my analysis and am reporting interactions using them on the multiplicative scale, presenting results and graphs from RC splines and stratifying the graphs by specific genetic markers so that I get maximum information from this form of the variable. But for the second half of my results, I do not know how to easily present interaction results on an additive scale (RERI) using a continuous predictor. It is much easier to do if I categorize it (I mean, I do know the method for that). Though my research relies heavily on epidemiology and stats, I am basically from a clinical faculty and, as is commonplace, we do not often get access to strong statisticians. My personal quest to solve my problems has succeeded in bringing epidemiologists with a strong stats background onto my PhD committee, but I prefer to attempt to solve my issues first before approaching them. And I must say that Statalist has been quite helpful in that. I too strongly believe that categorizing a continuous variable is not the way to go but, personally, I have to do it for the reasons given.
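
To make the additive-scale measure concrete, here is a minimal Python sketch of the usual point estimate for two binary exposures, RERI = OR11 - OR10 - OR01 + 1, alongside the multiplicative interaction ratio OR11 / (OR10 x OR01). The odds ratios below are hypothetical, the function names are my own, and the sketch assumes the odds ratios approximate risk ratios; confidence intervals (e.g. by the delta method or bootstrap) are not covered.

```python
def reri(or_10, or_01, or_11):
    """Relative excess risk due to interaction from three odds ratios,
    each relative to the doubly unexposed reference group.

    or_10: exposure A only, or_01: exposure B only, or_11: both exposures.
    """
    return or_11 - or_10 - or_01 + 1.0


def multiplicative_ratio(or_10, or_01, or_11):
    """Observed joint odds ratio divided by the joint odds ratio expected
    under no interaction on the multiplicative scale."""
    return or_11 / (or_10 * or_01)


# Hypothetical odds ratios, for illustration only.
or_smoking_only = 2.0   # heavy smoking, marker absent
or_marker_only = 1.5    # marker present, not a heavy smoker
or_both = 4.0           # heavy smoking and marker present

reri_value = reri(or_smoking_only, or_marker_only, or_both)
ratio = multiplicative_ratio(or_smoking_only, or_marker_only, or_both)
print(reri_value, round(ratio, 3))
```

A RERI above 0 suggests a joint effect larger than additive, and a ratio above 1 suggests a joint effect larger than multiplicative; with these made-up numbers both are positive.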

Best

Last edited by Thekke Purakkal; 20 Jan 2016, 07:14.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

20 Jan 2016, 12:47

Dear Thekke, thank you very much for providing the information that helped you to - at least to a given extent - solve the question and entailed the closure of this thread. Kind regards, Marcos.

Best regards,

Marcos
Lin Naing

Join Date: Apr 2020

Posts: 1
#8

26 Apr 2020, 00:42

Hi, just to share!

I also teach statistics and tell students "Do not categorize it" (for the same reason mentioned above), but I add "unless it is necessary".
The saying about "throwing away information" is true, but from a statistical point of view.
There are real needs in practice (dealing with patients, deciding whether to give treatment, and so on). For example: the higher the blood pressure, the higher the risk of stroke. This statement is of very limited use from a clinical point of view. We really need to know below which BP value a patient can be considered safe, above which value a patient is considered at risk and needs to start treatment, and so on. Cut-points are important in real life. Therefore, we statisticians should study and help practitioners with valid ways of finding cut-points.

The important thing is not to categorize a numerical variable when you collect it, so that you keep the possibility of finding an optimal cut-point statistically.
If it has been categorized from data collection onwards, this is a big loss for us. The cut-points used in data collection are mostly baseless [e.g. BP 80-89 ( ); 90-99 ( ); 100-109 ...]. We should not do this.

Cheers
Rich Goldstein

Join Date: Mar 2014

Posts: 4495
#9

26 Apr 2020, 08:10

While I agree with some of what is said in #8, I want to make 2 points: (1) optimal cut-offs are not optimal, and (2) if categorization is "needed" for implementation, the time to do it is after the analysis, not before it. At least one of the "consort"-type statements goes into each of these issues; one example is Moons, KGM, et al. (2015), "Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration", Annals of Internal Medicine, 162:W1-W73. doi:10.7326/M14-0698
