Case-control analysis

Maryam Bidgoli

Join Date: Feb 2016
Posts: 89

10 Jul 2020, 14:02

Hi Stata users,

I apologize in advance if my question is simple. I am working with the HCUP NIS dataset. There are 8,715 breast cancer patients who were hospitalized for a "mastectomy", and 3,412 women who were hospitalized for "breast reconstruction". I would like to compare these two groups in terms of "depression". Does it make sense to do a case-control analysis in which the cases are those who had a "mastectomy" and the controls are those who had "breast reconstruction"? I want to match them on age, length of stay (LOS) in hospital, and the Charlson comorbidity index.

Any suggestion is appreciated.

Here is the data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int AGE byte DIED long LOS byte household_inc float(charlindex breast_cancer) byte depression float(race health_ins patient_resid) byte(mastectomy reconstruction)
50 0   2 4 1 0 1 1 3 1 0 0
77 0   2 2 1 0 0 2 1 2 0 0
39 0   1 3 1 0 0 3 2 2 0 0
65 0   5 1 0 0 0 2 4 4 0 0
58 0   8 1 2 0 1 1 2 4 0 0
68 0   1 4 4 0 0 2 1 1 0 0
32 0  16 1 0 0 0 1 3 3 0 0
62 0   2 1 2 0 0 1 1 1 0 0
73 0   3 2 0 0 0 1 1 3 0 0
27 0   1 1 1 0 0 1 3 3 0 0
82 0   3 4 5 0 0 2 1 1 0 0
75 0   6 1 4 0 0 3 1 3 0 0
38 0  12 4 3 0 0 2 2 1 0 0
50 0   6 4 4 0 1 1 2 1 0 0
56 0   3 2 2 0 0 1 4 3 0 0
61 0   2 4 1 0 0 1 3 1 0 0
78 0   4 1 4 0 0 2 1 1 0 0
61 0   3 2 0 0 0 1 3 4 0 0
48 0   2 2 0 0 1 1 3 1 0 0
58 0   1 1 2 0 1 1 3 2 0 0
82 0   2 1 0 0 0 1 1 2 0 0
61 0  14 4 0 0 0 4 1 1 0 0
51 0   2 3 1 0 0 2 2 2 0 0
69 0   1 4 0 0 0 1 1 2 0 0
82 0  10 4 4 0 0 3 1 1 0 0
21 0  26 1 0 0 0 3 2 1 0 0
62 0   2 3 2 0 0 1 2 1 0 0
32 0  19 3 0 0 0 1 4 1 0 0
71 0   4 3 1 0 0 1 1 4 0 0
61 0   1 1 1 0 0 1 1 2 0 0
60 0   6 2 0 0 0 1 3 2 0 0
18 0   1 3 0 0 0 1 2 1 0 0
57 0   2 3 1 0 1 1 3 2 0 0
46 0   2 2 0 0 0 1 4 3 0 0
78 0   1 2 1 1 1 1 1 2 0 0
59 0  16 4 4 0 0 1 1 1 0 0
22 0   3 2 0 0 1 3 2 1 0 0
37 0   3 2 0 0 0 1 3 2 0 0
74 0   1 3 2 0 0 3 1 1 0 0
30 0   2 1 0 0 0 2 3 1 0 0
64 0   2 4 0 0 1 1 3 1 0 0
50 0   3 1 4 0 0 3 4 1 0 0
82 0   3 3 2 0 0 1 1 1 0 0
29 0   3 4 0 0 0 1 3 1 0 0
60 0   2 1 1 0 1 1 3 4 0 0
61 0   2 1 2 0 0 3 2 2 0 0
44 0   3 1 1 0 0 1 4 2 0 0
72 0  19 1 2 0 1 1 1 1 0 0
74 0   2 2 3 0 0 2 1 1 0 0
31 0   1 4 0 0 0 1 3 2 0 0
42 0   4 4 0 0 1 1 3 1 0 0
40 0   3 4 0 0 0 1 3 1 0 0
24 0   2 2 1 0 0 2 3 2 0 0
27 0   2 1 0 0 0 3 3 1 0 0
47 0   8 1 6 0 0 2 1 3 0 0
59 0   0 1 1 0 0 2 1 4 0 0
40 0   0 3 0 0 0 2 3 1 0 0
64 0   9 1 5 0 0 4 2 1 0 0
67 0   2 3 2 0 0 1 1 1 0 0
72 0   9 1 3 0 0 1 1 2 0 0
55 0   2 3 5 0 0 1 1 2 0 0
75 0   2 3 1 0 0 1 1 2 0 0
39 0  12 1 0 0 0 2 2 1 0 0
42 0   1 1 1 0 0 2 4 1 0 0
65 0   2 2 1 0 0 1 1 2 0 0
53 0   1 1 0 0 1 1 1 4 0 0
61 0   2 1 3 0 0 1 3 3 0 0
49 0   1 1 1 0 0 3 2 1 0 0
44 0   3 2 2 0 0 2 1 2 0 0
67 0   1 4 0 0 0 2 1 1 0 0
80 0   3 4 3 0 0 1 1 1 0 0
66 0   1 4 0 0 0 1 1 1 0 0
57 0   2 3 1 0 0 1 3 1 0 0
44 0   3 1 1 0 1 1 3 2 0 0
61 0   3 4 0 0 0 1 3 2 0 0
26 0   1 1 1 0 1 1 1 2 0 0
43 0   2 2 0 0 1 1 1 4 0 0
72 0   3 1 1 0 1 1 1 3 0 0
21 0   2 1 0 0 0 1 3 2 0 0
68 0   3 2 4 0 1 1 1 1 0 0
71 0   7 2 5 0 0 1 1 1 0 0
49 0   4 3 1 0 1 1 2 2 0 0
79 0   6 1 2 0 0 1 1 1 0 0
81 0   1 2 5 0 0 1 1 4 0 0
80 0   4 4 2 0 0 1 1 2 0 0
61 0   1 4 3 0 0 1 2 1 0 0
66 0   7 3 1 0 0 1 1 1 0 0
72 0  14 4 1 0 0 1 1 1 0 0
80 1   4 4 2 0 0 1 1 1 0 0
60 0 107 2 3 0 0 1 1 3 0 0
37 0   4 1 0 0 0 1 3 1 0 0
81 0   7 4 1 0 0 1 1 1 0 0
34 0   1 1 0 0 0 1 4 3 0 0
55 0  12 3 5 0 0 3 2 2 0 0
30 0   0 3 0 0 0 2 3 1 0 0
71 0   6 3 4 0 0 3 1 1 0 0
58 0   7 1 0 0 0 2 3 4 0 0
58 0   3 4 3 0 0 1 1 1 0 0
79 0   3 2 3 0 0 1 1 4 0 0
48 0   1 4 0 0 0 2 2 1 0 0
end

Tags: Case-control Match, Suggestion

Maryam Bidgoli

Join Date: Feb 2016
Posts: 89

10 Jul 2020, 15:06

This is the code:

Code:

keep if reconstruction==1 & mastectomy==0
rename * *_control
rename AGE_control AGE
rename LOS_control LOS
rename charlindex_control charlindex
save "C:\Users\mjafaribidgoli\Desktop\reconstruction_controls.dta",replace

clear
use "C:\Users\mjafaribidgoli\Desktop\HCUP\HCUP_NIS_2016\NIS_2016_Core_cleaned.dta"
gen long patient_id = _n
keep if reconstruction==0 & mastectomy==1
rename * *_case
rename AGE_case AGE
rename LOS_case LOS
rename charlindex_case charlindex

* Now join on AGE, LOS, and charlindex
joinby AGE LOS charlindex using "C:\Users\mjafaribidgoli\Desktop\reconstruction_controls.dta"

* Randomly select one match if there are more
set seed 1234
gen double shuffle = runiform()
by patient_id_case (shuffle), sort: keep if _n == 1
drop shuffle

mcc depression_case depression_control


Maryam Bidgoli

Join Date: Feb 2016

Posts: 89
#3

13 Jul 2020, 12:23

Any suggestion?
Chris Boudreaux

Join Date: Jul 2020

Posts: 83
#4

13 Jul 2020, 16:04

Hi Maryam,

I'm not familiar with the matched case-control command (mcc). Here is a Stata manual you might find helpful (https://www.stata.com/manuals/repitab.pdf).

You might also consider another matching technique like propensity score matching or coarsened exact matching. I hope you can find something useful here.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#5

13 Jul 2020, 17:15

The proposed study is definitely not what any source in epidemiology would call a "case-control" study. In case-control studies, the case vs. control variable is an *outcome,* not a predictor. If Maryam was looking for literature that would help, this terminological difference would have led to serious confusion.

Now, that being said, it is perfectly legitimate to use the two treatment groups (mastectomy vs. reconstruction) as a predictor of depression, and there would be various ways one might control for variables such as age, LOS, etc., but this would just be an ordinary comparison of groups, not a case-control study. The -mcc- command would not be relevant here, as it is intended to be used when the *outcome* variable is binary. The outcome here is depression, presumably measured on some standard quasi-interval scale. I think coarsened exact matching or even a conventional regression analysis would be reasonable choices here, but definitely not -mcc- unless "depression" is a binary variable. I'd recommend starting with the simplest analysis, i.e., a conventional regression.
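A minimal sketch of that simplest analysis, assuming the variable names from the -dataex- example in #1 (depression, mastectomy, reconstruction, AGE, LOS, charlindex) and assuming depression is coded 0/1. If depression were a scale score instead, -regress- would take the place of -logistic-. The covariate list is only illustrative.

```stata
* Keep the two comparison groups: mastectomy only vs. reconstruction
keep if (mastectomy == 1 & reconstruction == 0) | ///
        (mastectomy == 0 & reconstruction == 1)

* Logistic regression of depression on treatment group,
* adjusting for age, length of stay, and Charlson index
logistic depression i.reconstruction c.AGE c.LOS c.charlindex
```

The coefficient on reconstruction is then an adjusted odds ratio for depression, reconstruction vs. mastectomy only.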

One more suggestion for Maryam here: Per the FAQ, people on StataList come from all sorts of disciplinary backgrounds. Using abbreviations and terminology from a particular discipline (cancer epidemiology) will reduce your chances of getting a helpful response, as relatively few of us will know what, e.g., "HCUP NIS" is. There certainly are some very competent and helpful epidemiologists here on StataList, but it's worth noticing that none of them have yet responded to your question. So, the less discipline-specific terminology you use in a question, the better your chance of getting an answer.
Paul Dickman

Join Date: Apr 2014

Posts: 294
#6

14 Jul 2020, 05:40

I agree with Mike. Your question is about epidemiology rather than statistics, so to answer we need more information about the study design (not so applicable in this case), data, and your research question or hypothesis (most relevant here).

I've not worked with the HCUP NIS dataset, but it apparently contains data on hospital admissions (with which I have extensive experience). Some key questions are:

1. When were the admissions in relation to the breast cancer diagnosis (was the surgery done shortly following diagnosis or later)?
2. When was the information on depression obtained?

In the inpatient care databases with which I have worked, the observation contains only information collected during that particular admission. I'm guessing depression was probably assessed at admission. It might possibly be history of depression, but it would be rare for the source data to contain episodes of future depression (unless you have merged future depression episodes with the treatment episodes in which case that would have been useful to know).

If you have identified future admissions for depression, then you will have a time-to-event study and you probably should record the time between the treatment admission and the depression admission.

Mike's response is based on the assumption that you are interested in studying whether treatment (mastectomy vs. reconstruction) is associated with depression. This is probably the most likely research question, but one could also potentially study whether depression affects choice of treatment. If depression was assessed concurrently then your design is cross-sectional and if it was assessed prior to the admission (i.e., the depression represents history of depression) then your design could actually be case-control.

If you are interested in studying if treatment causally affects depression then there are a lot of issues to consider (e.g., confounding by indication, appropriateness of the positivity assumption).
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

14 Jul 2020, 07:57

I had not considered Paul's point about depression being plausibly analyzed as a cause of treatment choice. If that's the perspective here, then it's conceivable that -mcc- might be relevant here, but I still wouldn't think of this as a case-control study.
Maryam Bidgoli

Join Date: Feb 2016

Posts: 89
#8

14 Jul 2020, 11:22

Originally posted by Chris Boudreaux (#4)

Thanks, Chris, for the suggestion. Actually, I am interested in comparing two groups of patients: (1) those who only had a mastectomy and (2) those who had breast reconstruction following mastectomy. My outcome of interest is "depression". That is why I was thinking of a case-control analysis. In a separate analysis, I have applied propensity score matching using the psmatch2 and teffects commands, for example, to see the impact of breast cancer, mastectomy, or breast reconstruction on "depression".
Chris Boudreaux

Join Date: Jul 2020

Posts: 83
#9

14 Jul 2020, 11:41

Thank you for the explanation. Again I am no expert on this type of method, so I will defer to you, Mike, and Paul.

I only mentioned matching techniques because I am somewhat familiar with these, and I thought certain designs might be applicable to your study. For instance, there is a paper by Kautonen et al. (2017) that examines whether job switching has an effect on the quality of life of entrepreneurs. They use propensity score matching to compare three groups: (1) Switching to Entrepreneurship vs. Staying in the Same Job, (2) Switching to a New Job vs. Staying in the Same Job, and (3) Switching to Entrepreneurship vs. Switching to a New Job. Their idea is that job switchers might have some unobserved characteristic that would confound any estimates of the effect of switching to entrepreneurship. But, if you compare switching to entrepreneurship vs. switching to a new job, you are more likely to eliminate this unobserved trait.

I'm not sure if this is useful to you, but when I saw your problem I thought you might be able to design a similar analysis. You could compare (1) those who had a mastectomy to those who did not, (2) those who had a breast reconstruction to those who did not, and (3) those who had a breast reconstruction against those who only had a mastectomy.

This may or may not be helpful to you, but I thought I would share.

Kautonen, T., Kibler, E., & Minniti, M. (2017). Late-career entrepreneurship, income and quality of life. Journal of Business Venturing, 32(3), 318-333.
Maryam Bidgoli

Join Date: Feb 2016

Posts: 89
#10

14 Jul 2020, 11:54

Originally posted by Mike Lacy (#5)

Thank you so much Mike for your reply. I found it very helpful.
Sorry for the confusion. You are absolutely right, I should have explained more about the dataset (HCUP, NIS). Here is more information:
The National Inpatient Sample (NIS) is a large inpatient care database in the U.S., containing data on more than 7 million hospital stays (35 million when weighted). I use the 2016 NIS file, which includes diagnosis and procedure coding for the inpatient data (ICD-10-CM and ICD-10-PCS).

I used the diagnosis and procedure codes to create binary variables for breast cancer, mastectomy, breast reconstruction, and depression. My hypothesis is that women who had breast reconstruction following mastectomy are less likely to be depressed than those who only had a mastectomy.

Mike, "depression" is a binary variable. So do you think -mcc- is relevant? Below is the result that I got from -mcc-:

Code:

. mcc depression_case depression_control

                 | Controls               |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
         Exposed |       106         370  |        476
       Unexposed |       453        1867  |       2320
-----------------+------------------------+------------
           Total |       559        2237  |       2796

McNemar's chi2(1) =      8.37    Prob > chi2 = 0.0038
Exact McNemar significance probability       = 0.0042

Proportion with factor
        Cases       .1702432
        Controls    .1999285     [95% Conf. Interval]
                   ---------     --------------------
        difference -.0296853     -.0501227  -.0092478
        ratio       .8515206      .7635821   .9495866
        rel. diff. -.0371033     -.0627005   -.011506

        odds ratio   .816777      .7100281   .9391025   (exact)

Thank you so much again for all your helpful suggestions.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#11

14 Jul 2020, 12:15

(Maryam's response crossed with what I had written, and made most of what I said irrelevant, so I deleted it.)

Last edited by Mike Lacy; 14 Jul 2020, 12:23.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#12

14 Jul 2020, 13:34

Taking into account Maryam's most recent posting, I'd recommend -clogit- over -mcc-. -clogit- is more versatile than -mcc-, and -mcc-, in my opinion, is one of the more thinly and confusingly documented commands in Stata. Also, -mcc- requires the data for two matched individuals to be on the same observation, which is unlike almost any other Stata command. -clogit- accommodates continuous as well as binary predictors. Finally, -mcc- only permits 1:1 matching, meaning that in the current situation, Maryam might well have to throw away a lot of useful observations. Among the many matched sets that might be formed here, one might have (say) 2 mastectomies and 5 reconstructions, another might have 3 mastectomies and 1 reconstruction, etc. One would want to use all of these. Another possibility here is to use -mhodds- (preferable to -mcc- but not to -clogit-). This requires creating a variable to identify the matched sets, as would be done for -clogit-, and then using -mhodds- with the matched set variable as the stratifying variable.

Here's an illustration of the three approaches to estimate an odds ratio, which give the same parameter estimate, with some minor differences for the CI or p-values depending on Stata's use of asymptotic vs. exact methods. Note that this example has only 1:1 matching:

Code:

use http://www.stata-press.com/data/r15/lowbirth2, clear
keep low smoke pairid
clogit low smoke, group(pairid) or
mhodds low smoke pairid
// The hard way, mcc, requires an inconvenient and
// nonintuitive data layout.
reshape wide smoke , i(pairid) j(low)
rename (smoke0 smoke1) (smoke_control smoke_case)
mcc smoke_case smoke_control

Finally: Since this is not a case-control study, there's no reason to have to use the odds ratio. If a matching or similar approach is desired, and a difference of proportions is of interest as an effect measure, other approaches (perhaps something in the -teffects- suite of commands), would be possible.
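As one hedged illustration of that last point, the sketch below uses -teffects psmatch- to estimate a difference in proportions rather than an odds ratio. It assumes a single dataset restricted to the two treatment groups, a 0/1 treatment indicator reconstruction, a 0/1 outcome depression, and the covariates AGE, LOS, and charlindex named as in #1. The covariate list and the choice of the effect on the treated are assumptions for illustration, not recommendations.

```stata
* Propensity-score matching: the treatment model (logit by default)
* predicts reconstruction from the matching covariates; the reported
* effect is a difference in the proportion depressed (ATET)
teffects psmatch (depression) (reconstruction AGE LOS charlindex), atet

* Check covariate balance in the matched sample
tebalance summarize
```

Standardized differences near zero in the -tebalance summarize- output would suggest the matching balanced these covariates; it says nothing about unmeasured confounders.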
Maryam Bidgoli

Join Date: Feb 2016

Posts: 89
#13

14 Jul 2020, 13:43

Originally posted by Paul Dickman (#6)

First of all, thank you so much for taking the time to comment. Here is more information, along with answers to your questions:
More info on the data: NIS is the National Inpatient Sample. I purchased the data for 2010 through 2016. However, to keep it simple, I decided to work only with the 2016 NIS because of the transition in the coding system from ICD-9 to ICD-10 in 2015.

Answer 1: There are up to 30 diagnosis codes (DX1, DX2, ..., DX30). DX1 indicates the primary reason for hospitalization. I generated a binary variable for breast cancer based on all 30 diagnosis codes. For instance, someone may be hospitalized for diabetes (DX1 = diabetes), but if she had a breast cancer history (for example, in DX2), she is captured in the treatment group. I am not sure whether it is correct to look at all DXs for breast cancer instead of only DX1, the primary reason for hospitalization (I came across a study that used only the primary diagnosis at admission, DX1, for the breast cancer variable). There are up to 15 procedure codes (PC1, PC2, ..., PC15). To create binary variables for those who only had a mastectomy vs. breast reconstruction following mastectomy, I used all 15 procedure codes.

Answer 2: You are right; depression was assessed at admission. So I used DX2-DX30 to see whether someone was diagnosed with depression.

Yes, this is a cross-sectional study, and it is not a real case-control analysis. I just thought I might be able to use the same technique to compare those who had mastectomy vs. reconstruction in terms of depression.

Thank you so much again.
Maryam Bidgoli

Join Date: Feb 2016

Posts: 89
#14

14 Jul 2020, 15:01

Originally posted by Mike Lacy (#12)

Thanks, Mike! This is very relevant and helpful. Can I still use -clogit- with a cross-sectional data structure? I didn't use it because I thought it was meant for panel data.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#15

14 Jul 2020, 16:55

Yes, you definitely can use -clogit-. The conditional logit model, from a practical point of view, applies whenever you want to control for some variable but do not need to estimate the effect of that variable. In a "panel" setting, that variable would commonly be the ID of the individual, with multiple observations having the same value of the ID variable. Here, the matching variable defines sets of individuals that share the same set of background characteristics. You want to control for those factors but not estimate their effects. Conditional logit thus enables a finely stratified analysis, but without estimating the effect of the stratifying variable. Note that the -xtlogit- command, designed for panels, includes -clogit- as a special case (fixed effects), and you could also use it.
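A minimal sketch of how that could look for the comparison in this thread, assuming one observation per patient, data already restricted to the two treatment groups, and the variable names from #1. The set identifier setid is a hypothetical name introduced here.

```stata
* One matched set per unique combination of the matching variables
egen long setid = group(AGE LOS charlindex)

* Conditional logit: compares reconstruction vs. mastectomy within sets;
* sets in which depression does not vary are dropped automatically
clogit depression i.reconstruction, group(setid) or

* Mantel-Haenszel odds ratio stratified on the same matched sets
mhodds depression reconstruction setid
```

Unlike the 1:1 -joinby- approach, this keeps every patient in a set of any size, so no matched observations need to be discarded at random.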
