Handling missing data in matched case-control data-set

Moon Lu

Join Date: Apr 2018
Posts: 17

04 Apr 2019, 13:05

Dear all,

I would like to ask about handling missing data in a 1:2 matched case-control data set, matched on age and sex. Let's say there are 1000 cases and 2000 controls.

1) I would like to remove every matched group that contains missing values in education.
For example, 100 cases have no information on education, so I want to remove all of those cases as well as their respective controls (200 controls). I have a group variable identifying each case-control set.

2) As shown in the table below, sex is missing for all controls. How can I fill in sex for each control so that it matches the respective case?

ID	case	group	education	age	Sex
1	1	1	.	25	Male
2	0	1	1	25	.
3	0	1	2	25	.
4	1	2	.	30	Female
5	0	2	2	30	.
6	0	2	1	30	.
7	1	3	.	40	Male
8	0	3	1	40	.
9	0	3	1	40	.
2998	1	1000	1	35	Male
2999	0	1000	2	35	.
3000	0	1000	1	35	.

Thanks in advance,

Kind Regards,
Moon Lu

Mike Lacy

Join Date: Apr 2014

Posts: 2425
#2

04 Apr 2019, 14:23

To replace the missing sex values if Sex is a string variable with "." for a missing value indicator:

Code:

bysort group (Sex) : replace Sex = Sex[_N]

If Sex is actually a numeric variable, and what you show are value labels, the missing value of . for Sex will sort last, and you need:

Code:

bysort group (Sex): replace Sex = Sex[1]
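A minimal sketch of the numeric case, using a small made-up data set (the values and the label name `sexlbl` are assumptions for illustration, not the poster's data). It relies on each matched set having exactly one distinct non-missing value of Sex; the `if missing(Sex)` qualifier is an extra safeguard so that only missing values are overwritten.

```stata
* Hypothetical toy data: two matched sets, Sex known only for the case
clear
input int id byte case int group byte Sex
1 1 1 1
2 0 1 .
3 0 1 .
4 1 2 2
5 0 2 .
6 0 2 .
end
label define sexlbl 1 "Male" 2 "Female"
label values Sex sexlbl

* Numeric missing (.) sorts last, so within each group the first
* observation after sorting holds the case's non-missing value
bysort group (Sex): replace Sex = Sex[1] if missing(Sex)

sort id
list, sepby(group) noobs
```

If a matched set could contain two different non-missing values, it would be worth checking for that before filling, since this approach would then silently pick the smallest one.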

You will not need to "drop" the individuals in matched groups with missing values on education: any matched analysis technique I can think of (e.g., -clogit-) will not use data from groups in which the case's education is missing, as such groups are noninformative. Regardless, you should probably just flag such groups with a variable rather than drop them:

Code:

bysort group (case]: gen byte CaseMissingEduc = missing(education[_N])
Mike Lacy

Join Date: Apr 2014

Posts: 2425
#3

04 Apr 2019, 16:52

I made a typographical error in #2: it should be -bysort group (case): ...-. I inadvertently typed "]" instead of ")" after "case".
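For reference, a small sketch of the corrected command on made-up data (the values are illustrative assumptions). The second indicator, `AnyMissingEduc`, is not from the thread; it is a hypothetical variant that also flags sets in which a control is missing education.

```stata
* Hypothetical toy data: the case in group 1 has no education recorded
clear
input int id byte case int group byte education
1 1 1 .
2 0 1 1
3 0 1 2
4 1 2 1
5 0 2 2
6 0 2 1
end

* Sorting on case within group puts the case (case == 1) last,
* so education[_N] is the case's education
bysort group (case): gen byte CaseMissingEduc = missing(education[_N])

* Assumed variant: flag the set if anyone in it is missing education
bysort group: egen byte AnyMissingEduc = max(missing(education))

tabulate CaseMissingEduc

* If dropping is really wanted, as in the original question:
* drop if CaseMissingEduc
```

Keeping the flag and adding `if !CaseMissingEduc` to later commands leaves the full data set intact, which is the approach suggested in #2.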

Moon Lu

Join Date: Apr 2018
Posts: 17

05 Apr 2019, 03:14

Dear Mike,

Thanks for your kind help and suggestions.

1) I tried the suggested command to fill in the missing values for a numeric variable, but the error "weights not allowed" r(101) appeared. So I tried the command without the [], but then no missing values were replaced.

2) I also want to stratify my results by tumor type, and I used the command:

by tumor_type, sort : clogit case X1, group(group) or

But the results show "outcome does not vary in any group".

Could you please advise how I can fix these?

ID	case	group	education	age	Sex	tumor_type
1	1	1	.	25	Male	1
2	0	1	1	25	.	.
3	0	1	2	25	.	.
4	1	2	.	30	Female	1
5	0	2	2	30	.	.
6	0	2	1	30	.	.
7	1	3	.	40	Male	2
8	0	3	1	40	.	.
9	0	3	1	40	.	.
2998	1	1000	1	35	Male	2
2999	0	1000	2	35	.	.
3000	0	1000	1	35	.	.

With regards,
Moon Lu

Mike Lacy

Join Date: Apr 2014

Posts: 2425
#5

05 Apr 2019, 09:11

There could be various things wrong here that led to question 1, including of course that I made a mistake. Please look at the FAQ again and learn about using -dataex- to display example data. Use it to post an example of your data (not an "almost example" like the one you posted here). Then, regarding question 1, copy and paste from your Results window to show us exactly what you typed that led to the "weights not allowed" error. One guess, I might add, is that you, like me, mistakenly typed a "[" where you wanted a "(". But when you don't show us exactly what you typed, we generally can't help you.

Regarding "outcome does not vary ...": This likely happened because you have coded all the controls as missing on tumor_type. They are therefore excluded from the analysis by -clogit-, and the only observations that Stata analyzes are cases, all of which have the value 1 on the outcome. If you want to do this stratified analysis, code tumor_type for each control to be the same as for its corresponding case.
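A sketch of that recoding on made-up data (the rows are illustrative assumptions), using the same within-group fill as for Sex. The -clogit- line is left as a comment because the toy data contain no X1 and are far too small to fit a model.

```stata
* Hypothetical toy data: tumor_type recorded only for the case
clear
input int id byte case int group byte tumor_type
1 1 1 1
2 0 1 .
3 0 1 .
7 1 3 2
8 0 3 .
9 0 3 .
end

* Missing sorts last, so tumor_type[1] is the case's value
bysort group (tumor_type): replace tumor_type = tumor_type[1] if missing(tumor_type)

* Check: every matched set now carries one non-missing tumor type
bysort group (tumor_type): assert tumor_type[1] == tumor_type[_N]
assert !missing(tumor_type)

* With the full data, the stratified model could then be run as:
* bysort tumor_type: clogit case X1, group(group) or
```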

In general, you seem to have some misunderstandings about how missing values work in statistical analysis programs like Stata. One important thing to know is that an observation with a missing value on any variable used in an analysis is excluded from that analysis. In general, this happens because any arithmetic operation involving a missing value is defined to be missing, i.e., (1 + .) == .
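A few lines that illustrate this behavior; the two-variable data set is made up purely for the demonstration.

```stata
* Arithmetic with a missing value yields missing
* (this displays a single period)
display 1 + .

* missing() returns 1 (true) for the result
display missing(1 + .)

* Numeric missing is treated as larger than any number, so a
* condition such as "if x > 1000" is also true for missing x
display (. > 1000)

* Hypothetical toy data: only complete rows would enter a model of y on x
clear
input byte y byte x
1 2
0 .
1 3
end
count if !missing(y, x)
```

Here the count is 2, which is the number of observations an estimation command using both y and x would keep.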
Moon Lu

Join Date: Apr 2018

Posts: 17
#6

06 Apr 2019, 10:54

The error message came up when I typed the following:

Code:

bysort caseset ( tumor_localization ): replace tumor_localization= tumor_localization [1]

When I removed the space before "[1]", there was no error message and the replacement was made. :-)

Code:

bysort caseset ( tumor_localization ): replace tumor_localization= tumor_localization[1]

I am now able to stratify my results after replacing the missing values.
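The difference between the two commands above can be reproduced on a tiny made-up data set (the values are assumptions for illustration). The error message suggests that, with a space in front of it, Stata takes [1] as a weight specification rather than as a subscript on the variable; that reading is inferred from the message, not confirmed here.

```stata
* Hypothetical toy data: one matched set, value known for one member
clear
input int caseset byte tumor_localization
1 2
1 .
1 .
end

* With the space: this was reported above to fail with
* "weights not allowed"; capture lets the do-file continue
capture noisily bysort caseset (tumor_localization): replace tumor_localization = tumor_localization [1]
display _rc

* Without the space, [1] is an explicit subscript and the fill works
bysort caseset (tumor_localization): replace tumor_localization = tumor_localization[1]

list, noobs
```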

Thanks for referring me to the FAQ on data examples. Next time I have a question, I will use dataex and post the command and error message together.

Kind Regards,
Moon Lu