Issues with missing data, N, Cox models

Eduard López

Join Date: Dec 2014

Posts: 48
#1

Issues with missing data, N, Cox models

18 Jan 2016, 16:31

I'm trying to get an article published that assesses whether the risk of a condition lasting longer was higher in one year than in the previous year. I would sincerely appreciate any help you could provide. Apologies for the extra-long post and for asking two questions on this subreddit on the same day, but I feel this is a different (though related) issue. If any mod thinks differently, I will of course merge them.

I have completed the descriptive analysis of a condition and written a table showing, for each variable, N, % and the p25, median and p75 of the condition's duration. I have also fitted a Cox proportional hazards model. All analyses have been stratified by sex (so I have two Cox models, one for each sex).

In most of the literature I've read, a total N is shown at the top of the table. For some reason, my ex-boss and co-author insisted on me using a final row, below all the variables, showing "Total". He also insisted on removing the missing data information from the descriptive table. Now, after I submitted the article for publication, my reviewer asks me to add the missing data to the descriptive table and also the number of observations for each Cox model (there are two, as the analysis has been stratified by sex). I'm using Stata, and running into some difficulties here.

1) When running the Cox Model, I used "If" conditions to restrict the analysis to the condition that was of interest to us. Let's say there is condition type A and condition type B. When calculating total N, I should NOT include condition type B in it, even if it was included in the database, I understand (please correct me if I'm wrong). Now, I've also restricted the analysis to condition episodes that lasted X days (as I was told by my supervisors that condition episodes exceeding those days were virtually impossible, and more likely to be an error when collecting information). Should those episodes be counted when calculating the total N?

2) Is the total N then obtained just by dropping the subjects suffering from condition type B (the one I'm not interested in) and using the "describe function"? I seem to get the total number of subjects that way. And the total N is supposed to be the total number of subjects included in the study, right? So in theory, when adding up each variable's categories including missing data, each variable should add up to the same number, right?

3) Is it OK to show total N like I described (at a bottom row)? As I'm the main author, would it be OK, to change it to the usual N at the top row?

4) My last problem would be with the number of observations information for the Cox Model. My reviewer asks me to include the information on the number of observations for each model. I understand he does not refer to the "XXXXX total observations 0 exclusions". I get that number right (meaning the total observation coincides with the number of observations I get with the describe command).

However, what I think he is referring to is the number on the "Number of obs = XXXXX" line. Assuming I'm using complete case analysis, that number should be lower than the total number of subjects for each variable, right? (The program drops any subject with missing data on any variable.) My co-author instead says that "if there are no missing, models should have an N that's the same as the total of the sample". I can't wrap my head around this. There are missing values; we are just not showing them. How do I reconcile this number with showing information on the missing data?

Again, thanks a lot for any help and apologies for the length.
PS.: I'm aware user names need to be composed of a full name and username, is there any way I can change my username to follow the rules? Thanks!
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

18 Jan 2016, 16:36

I found your title slightly alarming personally as it seemed to be a variation on "issues with N. Cox", but relief all around.

On the PS: explanation at http://www.statalist.org/forums/help#realnames and CONTACT US button at bottom right.
1 like
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#3

18 Jan 2016, 16:53

Hi, Guillem, please email the admin to change your name.

When you restrict with an if clause, only those observations are selected for the model, and rows are dropped if any variable involved in your model has a missing value. You can easily find out the total number of observations used in your Cox model. Run the model and then type:

Code:

// For total observations:
di e(N)
// For number of subjects:
di e(N_sub)
// For other information:
ereturn list
// and then type the following, replacing name with any scalar shown by ereturn list:
di e(name)

This should answer your 1, 2 and 4. For #3, it is really a matter of taste. Go with what you like, unless advised otherwise by the reviewers or others you are obliged to follow. Your co-author is right: if there are no missing values in any of the variables, N equals the total sample. However, if an observation (row) has a missing value in any of the model variables, that observation is dropped from the model. Note also that 'complete case' refers to the cases without any missing values; it does not imply that the dataset has no missing values. Check the summary of the variables; it shouldn't be difficult to investigate.
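As a minimal sketch of the point above, the estimation sample can be flagged with e(sample) and the missing values tabulated alongside it. The example uses Stata's shipped cancer dataset with a few artificially created missing values, not the data from this thread; the variable name insample is made up for illustration.

Code:

```stata
* Illustration on a shipped dataset, not the poster's data
sysuse cancer, clear
replace age = . in 1/5              // artificial missing values, for illustration only
stset studytime, failure(died)
stcox i.drug age, nolog

display e(N)                        // observations used in the fit
display e(N_sub)                    // subjects used in the fit

* Flag the observations that actually entered the model
generate byte insample = e(sample)
count if insample                   // agrees with e(N)

* Missing values per covariate among the stset records
misstable summarize age drug if _st == 1
```

The difference between the number of records that are stset and e(N) is the number of records dropped because of missing covariate values, which is the figure a descriptive table with a "missing" category would need to match.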

Regards,

Last edited by Roman Mostazir; 18 Jan 2016, 16:59.

Roman
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

18 Jan 2016, 18:31

Cross-posted http://stats.stackexchange.com/quest...s-on-cox-model

Also references to "subreddit".

Please see our cross-posting policy http://www.statalist.org/forums/help#crossposting
1 like
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#5

19 Jan 2016, 02:18

Thank you very much Roman, I will try it first thing when I get home. Much appreciated!

Nick, I'm sorry if it sounded like I was singling you out; I promise that wasn't my intention.
I didn't know the cross-posting policy, thanks for the heads up! I already contacted the moderators to change my username. I posted the same question here: https://www.reddit.com/r/AskStatisti...ng_data_total/

And here: http://stats.stackexchange.com/quest...s-on-cox-model

Thanks again guys.

Last edited by Eduard López; 19 Jan 2016, 02:22.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

19 Jan 2016, 02:22

Thanks for this, but please note that the policies we mention are all explained in the FAQ Advice: all posters are asked to read that before posting, once on the home page and once in the prompt when you open a new topic. If you post further, it would be a good idea to read it.
1 like
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#7

19 Jan 2016, 02:24

Definitely, you are right. I'm going to go through it so I don't mess up again.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

19 Jan 2016, 03:43

Thanks for that. No one likes being stopped by a traffic cop....
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#9

19 Jan 2016, 12:22

OK, so I basically got the response from my co-author that I should check that "there are no differences between excluded registers and the included ones" in order to justify our use of complete case analysis (and hence, not using any imputation).

If I understand correctly, this means that I should determine if missing data is MCAR, MAR or MNAR, right?

Following this link: https://www.ssc.wisc.edu/sscc/pubs/stata_mi_decide.htm

I specifically followed the part:
"First create a new indicator variable for each existing variable which is 1 if a given observation is missing that variable and 0 if it is not. The misstable command can do this part automatically with the gen() option. Then run logit models to test if any of the other variables predict whether a given variable is missing. If they do, then the data is MAR rather than MCAR".

My question is: after running logit, where do I see (in the output) if any of the other variables predict whether a given variable is missing?

Thanks for your help and patience.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

19 Jan 2016, 13:01

The concepts of MCAR and MAR are complicated and frequently misunderstood. What you are proposing to do is, in principle, impossible.

Let's start with MCAR, and focus on a single variable y. The meaning of y being MCAR is as follows:

"Missingness of y is independent of the actual (observed or unobserved) value of y."

So suppose that you had a magic oracle that could provide you with a new variable, y_actual, which equals y for the observed values of y, and which equals the actual value of y that would have been observed had you been able to get it for those observations where y is missing. You could then create a variable y_miss coded 0 if y was observed, 1 if y was missing. You would then explore whether y_miss and y_actual are independent in some suitable way. If you concluded that y_miss and y_actual are independent, then your data would be MCAR.

MAR is similar, with just an additional complication. You would perform some test to see if y_miss is independent of y_actual conditional on all the other variables in your model.

The problem, of course, is that you don't have an oracle to tell you y_actual. And if you did, you would not use the values of y_actual for this silly exercise: you would use them in the actual model analysis! Absent such an oracle, there is no way to tell from within the data whether y is MCAR, MAR, or MNAR.

The best you can do is to ponder the process that generates missingness in your data set. If, for example, y is the result of a blood test, and the missing values of y result from a power failure at the lab when those particular specimens were to be processed, then unless the electric company is conspiring to mess up your study, it is likely that such data are MCAR. Similar but more complicated reasoning might be brought to bear on the less stringent MAR condition. At the other extreme, if your data are longitudinal and y is some measure of functional status, then it is highly likely that you will have more missing data from people with poor functional outcome if only because they are less likely to be able to show up for the end-of-study assessment. In that case the data are almost certainly MNAR.

But in the end, your reasoning must always arise from some assessment of the process giving rise to the missingness of data: it cannot be tested within the data set you have. I would call your attention to the following paragraphs from the link you gave in #9:

There is no formal test for determining whether a given set of logit results means the data is MCAR or MAR, but they will give you a sense of how close the data are to MCAR and how big a problem the deviations from MCAR are likely to be. The bigger the deviation the stronger the case for using multiple imputation rather than complete cases analysis.

By definition you cannot determine whether data are MNAR by looking at the observed values. Think carefully about how the data was collected and consider whether some values of the variables might make the data more or less likely to be observed. For example, people with very high or very low incomes might be less willing to disclose them, or people with high BMIs. People with a strong interest in the topic of a survey might be more likely to respond than those who care less. Schools might try very hard to make sure students they expect to do well take standardized tests but put much less effort into having students they expect to do poorly take them. In the last example, adding variables like grades or socioeconomic status that predict test performance and thus probability of taking the test might make the data plausibly MAR.

While I think this quote beats around the bush, it is warning you that the testing procedures it advises do not actually answer the real question but just give you a (vague, and in my opinion not terribly useful) sense of things.

Last edited by Clyde Schechter; 19 Jan 2016, 13:05.
1 like
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#11

19 Jan 2016, 14:44

Thanks a lot for your help, Clyde. So basically, I could only assess MCAR, MAR or MNAR based on knowledge of the data collection process, right? And if I understood correctly, MNAR is the case when the probability of missing data depends on that very variable (for example, if you have a lot of missing data in the Age variable, and in case you had an oracle and could see where the missing data is concentrated, you would see it is so in older people). MAR means the probability of missing data on a variable depends on another variable (missing data in the Age variable depends on the Country variable, with people from X country being less likely to report their age). MCAR means that missing data is absolutely random (no relationship with any other factor).
A couple of questions:

1) My issue is that a reviewer is asking me to explain why I used complete case analysis in the Cox model, as it could lead to a selection bias. As I understand it, a selection bias might be involved if missing data are MNAR. My co-author is, thus, requiring me to check if missing data are MCAR (to explain that no selection bias is happening).
But if I follow your explanation, whether missing data are MCAR cannot be inferred from tinkering with the database in Stata: I would need further information from the data collection process, am I correct?

So, what is the purpose of running logit in Stata as explained in the link I posted? Do the coefficients tell you anything about MCAR, MAR or MNAR? And by anything I mean (assuming that you can only infer MCAR, MAR or MNAR from examining the data collection process) that they at least might preclude the possibility of that variable's missing data being MCAR/MAR/MNAR. If so, how? (i.e. what to look for in Stata's output).

2) Could it be a way to check if there is any selection bias happening to do the following? Choose two independent variables (age, sex, for example) and compare them in the subjects with NO missing data in any observations and those with missing data in at least one observation? If so, how?

Thanks a lot Clyde, appreciate your help.
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#12

19 Jan 2016, 16:30

Given the complexity of the area, an in-depth guideline for investigating and handling the missing data mechanism is nearly impossible here. There are several ways of investigating the distribution of missingness patterns, and often the process goes with the 'objective hunch' of the researcher. You can go for logit, probit, stratification or others. However, whatever you do, the first priority is to understand the concept clearly. Once you understand the concept, you can investigate with any of the methods directed by your objective hunch ('Objective research is a myth'!! but we can get closer to it). As long as your data is MCAR or MAR, a complete case analysis is justified. I attached a short and very lucid paper on the concept, written for general readers; I hope it helps you.
Attached Files

MCAR_vs_MAR_vs_MNAR.pdf (371.2 KB, 1 view)

Roman
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#13

19 Jan 2016, 16:43

Thanks a lot for your help, Clyde. So basically, I could only assess MCAR, MAR or MNAR based on knowledge of the data collection process, right? And if I understood correctly, MNAR is the case when the probability of missing data depends on that very variable (for example, if you have a lot of missing data in the Age variable, and in case you had an oracle and could see where the missing data is concentrated, you would see it is so in older people). MAR means the probability of missing data on a variable depends on another variable (missing data in the Age variable depends on the Country variable, with people from X country being less likely to report their age). MCAR means that missing data is absolutely random (no relationship with any other factor).
A couple of questions:

1) My issue is that a reviewer is asking me to explain why I used complete case analysis in the Cox model, as it could lead to a selection bias. As I understand it, a selection bias might be involved if missing data are MNAR. My co-author is, thus, requiring me to check if missing data are MCAR (to explain that no selection bias is happening).
But if I follow your explanation, whether missing data are MCAR cannot be inferred from tinkering with the database in Stata: I would need further information from the data collection process, am I correct?

Agree with everything you say above.

So, what is the purpose of running logit in Stata as explained in the link I posted? Do the coefficients tell you anything about MCAR, MAR or MNAR? And by anything I mean (assuming that you can only infer MCAR, MAR or MNAR from examining the data collection process) that they at least might preclude the possibility of that variable's missing data being MCAR/MAR/MNAR. If so, how? (i.e. what to look for in Stata's output).

Well, one thing you can get a hint about from this is the distinction between MAR and MCAR. If the missingness indicator is predictable from some of your variables, then MCAR is ruled out. MAR is possible, but it could still be MNAR. To decide requires information about the process that generates missingness which cannot be found in the data set.
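As a rough sketch of what that check might look like in practice, the following uses Stata's shipped auto dataset (where rep78 has some missing values) as a stand-in; the indicator name miss_rep78 and the choice of predictors are assumptions made for illustration only.

Code:

```stata
* Stand-in example on a shipped dataset, not the data from this thread
sysuse auto, clear

* 1 if rep78 is missing, 0 if it is observed
generate byte miss_rep78 = missing(rep78)

* Do the other variables predict missingness of rep78?
logit miss_rep78 price mpg i.foreign, nolog

* Joint Wald test that all predictors have zero coefficients
testparm price mpg i.foreign
```

In the logit output, the places to look are the overall test in the header ("LR chi2" and "Prob > chi2") and the "P>|z|" column for the individual predictors. Clear evidence that some variable predicts missingness speaks against MCAR; as explained above, a null result does not establish MCAR, and nothing here separates MAR from MNAR.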

2) Could it be a way to check if there is any selection bias happening to do the following? Choose two independent variables (age, sex, for example) and compare them in the subjects with NO missing data in any observations and those with missing data in at least one observation? If so, how?

Again, this would only give you a hint about MAR vs MCAR--you would learn whether missingness of age is related to sex and vice versa, so you might be able to rule out MCAR. But that's as far as that will take you. Only a good understanding of how the data were gathered (and how some of it was missed) will take you beyond that.
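A minimal sketch of such a comparison, again on the shipped auto dataset rather than the data from this thread (the names nmiss and incomplete are hypothetical, and the variable list would be replaced by the covariates of the actual model):

Code:

```stata
* Stand-in example on a shipped dataset
sysuse auto, clear

* Number of missing values per observation across the model variables
egen nmiss = rowmiss(price mpg rep78 foreign)
generate byte incomplete = nmiss > 0

* Categorical variable: compare its distribution between the two groups
tabulate foreign incomplete, chi2 column

* Continuous variable: compare means between the two groups
ttest mpg, by(incomplete)
```

With few incomplete cases the chi-squared approximation may be poor, so the exact option of tabulate could be preferable. Differences between the groups would hint that the complete cases are not a random subset; their absence would not prove that they are.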

To respond to your larger predicament, where the reviewer is asking you to justify the use of complete case analysis, you may not be able to unless you have some way to get a handle on the missingness generating process. Besides the simple examples I gave in #10, I found myself in a similar situation once in a study where the outcome variable was a test result obtained within a certain number of days of the anniversary date of patient randomization. We had a substantial amount of missing data because we did not really have any way to push our subjects to get the test done other than reminders and cajoling. But we also knew that the patients' doctors would likely obtain the test at the patient's visit if it were not already available. So, as a "near oracle" we went and got the test results from the doctors' offices (or, occasionally from other sources) that were done a bit too early or a bit too late to qualify for our outcome variable. We rationalized this because the analyte being tested is pretty static over moderate time periods and a result that is a week or two too early is likely to differ little from what would have been obtained on the right date. By expanding our time window we were able to come pretty close to having complete data. And we found that missingness of the rigorously defined outcome appeared to be independent of the expanded time-window outcome. Also, the nature of the outcome was one which the patients would be unlikely to perceive on their own, so that their testing behavior was unlikely to be related to it. These facts persuaded us, and our reviewers, that we were probably in, or close to, an MCAR situation. Moreover, in addition to the original pre-planned analysis, we also published (in an appendix) a sensitivity analysis using the expanded time-window outcome so that people could see that the results were essentially the same. Perhaps something like that is possible for you.
More generally, there may be some proxy for the outcome that you can obtain for the missing cases. And depending on the nature of the variables you are working with, you may be able to do a convincing hand-waving argument that the missing values should be MCAR (or MAR as the case may be).

If nothing of this nature is possible, then you may just have to concede that complete case analysis is a leap of faith. It often is. In that case, you may need to do some robustness analyses. These might include a multiple-imputation analysis, which would cover you if you are fortunate enough to have MAR data (which you will not really know), and some additional analyses in which a few credible models of the missingness process are used to generate synthetic outcome data for the missing cases and the impact on your principal results is assessed.
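For orientation only, a multiple-imputation robustness analysis for a Cox model might be sketched as follows. This uses the shipped cancer dataset with artificially deleted ages; the imputation model (linear regression on drug, the event indicator _d and a Nelson-Aalen cumulative hazard stored in the made-up variable cumhaz), the number of imputations and the seeds are all assumptions, not recommendations for the data in this thread.

Code:

```stata
* Illustration on a shipped dataset with artificial missingness
sysuse cancer, clear
set seed 2016
replace age = . if runiform() < 0.2     // delete some ages, for illustration only

* Declare survival data and compute the Nelson-Aalen cumulative hazard,
* which is often included as a predictor when imputing covariates
stset studytime, failure(died)
sts generate cumhaz = na

* Set up multiple imputation and impute age
mi set wide
mi register imputed age
mi impute regress age i.drug _d cumhaz, add(20) rseed(2016)

* Fit the Cox model in each imputed dataset and combine the results
mi estimate, hr: stcox i.drug age
```

The combined hazard ratios can then be set beside the complete-case estimates; close agreement is reassuring only to the extent that MAR is plausible, which, as noted above, cannot be verified from the data.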
1 like
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#14

21 Jan 2016, 18:29

Thanks a lot for your help guys!

Roman, I wanted to ask you a question about the syntax you posted in your first message (or to anyone who could help).

My database includes a type of condition (typecondition==2) and a group of people (self-employed, which is selforsalaried==1) that I don't want to include in the study. So I restricted the Cox model by using the if clause like this:

Code:

preserve
gen evento1=.
replace evento1=1 if conditiondays>0 & typecondition==1 & selforsalaried==2 & sex==1 & diasenbaja!=.
stset conditiondays if conditiondays<=550, failure(evento1==1) scale(1)
xi: stcox i.year i.agegroup i.state i.industry i.contracttype i.incomegroup i.icd9, nolog
restore

This was therefore the Cox model for men. I then run the syntax you posted on your first message and I get 714198 for both total observations and total subjects (the total number of episodes).
Once it finishes, I then run the model again but changing the "sex==2" group in order to run it for women, again excluding the type of condition I'm not interested in and the self-employed.

Now, the thing is that if I run the syntax you told me, it again gives me 714198 for both total observations and total subjects. Now, when I run the models but instead of using the "if" clause in the syntax, I resort to running:

Code:

drop if typecondition==2
drop if selforsalaried==1

Before running the Cox model, and then run the syntax you told me, I get a different number, 401567. This leads me to think that I'm getting a total number depending on the database, not the Cox model? I just don't get it, because you said "When you restrict with if clause only those observations are selected for the model", so am I doing anything wrong? How to get the number of observations for each model (the one for men and the one for women, but both of them excluding the type of condition 2 and the self-employed)?
And it's not just the number of observations, because when I run the Cox model after dropping the episodes I'm not interested in, I get a different HR than if I do it just using the if clauses (not by a lot, but still, if the model was restricted by using the if clauses, how do I get a different HR when I "restrict" the database in the same sense and then run the Cox model?). Am I terribly messing something up?

Thanks a lot in advance!
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#15

24 Jan 2016, 05:43

On closer inspection I realized that the reviewer misunderstood me as saying that I had 60% missing data, when what I wrote is that covariables with 60% missing data would be dropped.
The problem with the Cox model still remains though... any thoughts? Thanks a lot for the help in advance.
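One possible reading of the code in #14, offered as a sketch rather than a diagnosis of the actual data: there the conditions on typecondition, selforsalaried and sex appear only in the definition of evento1, while the if qualifier of stset contains only conditiondays<=550. A condition that enters only through failure() decides which records count as events; records that do not satisfy it would remain in the sample as censored. The following illustration on the shipped cancer dataset (evento is a made-up variable) shows the difference.

Code:

```stata
sysuse cancer, clear

* Pattern resembling #14: the restriction appears only in the event indicator
generate evento = 1 if died == 1 & drug == 1
stset studytime, failure(evento == 1)
stcox age, nolog
display e(N)        // all records: those with drug != 1 stay in as censored

* Restriction placed on the estimation command instead
stset studytime, failure(died)
stcox age if drug == 1, nolog
display e(N)        // only records with drug == 1
count if e(sample)  // same number, counted from the data
```

If this is what is happening, it would be consistent with e(N) being the same in the model for men and the model for women, and with the hazard ratios changing once the unwanted records are dropped. Moving the restrictions into the if qualifier of stset or stcox would then be the thing to try.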
Comment
