Individual matching for two exposure variables in a follow up cohort

Deependra Singh

Join Date: Mar 2015
Posts: 8


01 Dec 2017, 02:35

Dear Stata users,
I have a large follow-up data set of more than 4 million visits (for cancer screening) made by about 1 million women. A woman can have at most 10 visits. There are two symptom variables in the data set, and a symptom can be reported at any of visits 1-10. A woman can also have more than one symptom. In the example data below, woman w1 had a symptom at her 6th visit. I want to find a similar woman who had no symptoms at the same visit number (i.e., the 6th visit), matched on 4 background variables (not shown below). If a woman reported symptoms at more than one visit, I consider only the first visit with symptoms. How do I find a comparable non-symptomatic visit for a given symptomatic visit, matched on the 4 variables? Is it possible to match on two exposure variables (visits with symptoms) against the unexposed (visits without symptoms)?

My exposure group is visits with symptoms and the comparison group is visits without symptoms. The matching ratio is 1:1, and I have no difficulty finding non-symptomatic visits using the 4 matching variables. Follow-up starts at the date of the symptomatic visit and ends at the exact date of death (from cancer or another cause) or at the last visit date/loss to follow-up.

Woman	visit number	year of visit	symptom 1	symptom 2	death
w1	1	1992	0	0	0
	2	1994	0	0	0
	3	1996	0	0	0
	4	1998	0	0	0
	5	2000	0	0	0
	6	2002	1	0	0
	7	2004	0	0	0
	8	2006	0	0	0
	9	2008	0	0	1
	10	2010	.	.	.
w2	1	1996	0	0	0
	2	1998	1	0	0
	3	2000	0	0	0
	4	2002	0	0	0
	5	2004	0	0	0
	6	2006	0	0	0
	7	2008	0	1	0
	8	2010	0	0	0
	9	2012	0	0	0
	10	2014	0	0	0

Thank you.

kind regards,
Deependra


Dave Airey

Join Date: Apr 2014

Posts: 398
#2

01 Dec 2017, 10:15

You could reshape the data to wide and concatenate each symptom variable into one string variable, e.g., "000001000." for symptom 1 of w1. Then you could use regular expressions (grep-style matching) to search for patterns in those string variables.

Dave Airey

Join Date: Apr 2014
Posts: 398

01 Dec 2017, 13:23

Given the size of the dataset, note that it is possible to concatenate the strings in long format without reshaping.

Code:

clear
input id symptom
1 0
1 0
1 0
1 1
1 0
1 .
2 0
2 0
2 0
2 0
2 0
2 0
end
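* convert to string; a missing value becomes "."
tostring symptom, generate(sympstr)
* stable sort keeps the existing (visit) order within id
sort id, stable
* build the running concatenation visit by visit
by id: generate concat = sympstr if _n == 1
by id: replace concat = concat[_n-1] + sympstr if _n > 1
* copy the complete history string to every row of the woman
by id: replace concat = concat[_N]
list, sepby(id)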
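
Since the original question involves two symptom variables and only the first symptomatic visit counts, a rough sketch of how the index visit might be flagged in long format follows. The variable names (visit, symptom1, symptom2, anysymp, first_symp, index_visit) are assumptions for illustration and are not taken from the real data.

```stata
clear
input id visit symptom1 symptom2
1 1 0 0
1 2 1 0
1 3 0 1
2 1 0 0
2 2 0 0
2 3 . .
end
* 1 if either symptom was reported; missing if either symptom variable is missing
generate byte anysymp = (symptom1 == 1 | symptom2 == 1) if !missing(symptom1, symptom2)
* visit number of the first symptomatic visit; missing for women never symptomatic
bysort id: egen first_symp = min(cond(anysymp == 1, visit, .))
* flag the index visit (always 0 for women without any symptomatic visit)
generate byte index_visit = (visit == first_symp)
list, sepby(id)
```

This avoids strings altogether; whether a visit with only one symptom variable missing should count as missing is a choice that depends on the data.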


Deependra Singh

Join Date: Mar 2015

Posts: 8
#4

08 Dec 2017, 03:46


Hi Dave,
Sorry for the delay; I had no internet access for a few days.

Thank you very much for the code. It works perfectly in long format.

I now have the new symptom variable, which shows at which round a symptom occurred and how often.

There are 40,000 visits with symptoms and the remaining 3... million visits are without symptoms. I would like to match every symptomatic visit with an asymptomatic visit at a 1:1 ratio, and also by visit order. For example, if a symptom was reported at a woman's 4th visit, I want to find an asymptomatic woman and match only on her 4th visit, given that she had no symptoms reported at her first three visits. The new variable generated above takes many different values, depending on the visit at which the symptom was reported (or at which the visit was missing). How do I pick the right nth symptomatic visit and match it to the right nth asymptomatic visit? I have not used the 'grep' command that you mentioned above.

After that, I have the visit date variable and the exact date of death, so I can easily calculate the follow-up time.
I look forward to your kind help.

kind regards,
Deependra
Dave Airey

Join Date: Apr 2014

Posts: 398
#5

08 Dec 2017, 09:36

You can search online for help with regular expressions (grep) if you don't have any books on the topic. For example:

https://www.stata.com/support/faqs/d...r-expressions/
https://stats.idre.ucla.edu/stata/fa...r-expressions/
https://www.stata.com/meeting/wcsug0...ros_reg_ex.pdf

Here is a toy example matching on the first id using your requirements.

Code:

clear
input id str5 sympstr
1 "01001"
2 "00010"
3 "0000."
4 "01000"
end
generate match_id_1 = regexm(sympstr, "00[0-1\.][0-1\.][0-1\.]")

. list, clean

       id   sympstr   match_~1
  1.    1     01001          0
  2.    2     00010          1
  3.    3     0000.          1
  4.    4     01000          0
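
As a possible alternative to writing one pattern per index position, the position of the first "1" in the history string can be found with strpos(), which returns 0 when the substring is absent. The sketch below reuses the toy data; the local macro k and the variables first_symp and eligible_k are hypothetical names, and the eligibility rule (no symptom before visit k and an observed, symptom-free visit k) is an assumption about what is wanted.

```stata
clear
input id str5 sympstr
1 "01001"
2 "00010"
3 "0000."
4 "01000"
end
* position of the first reported symptom; 0 if the woman never reported one
generate first_symp = strpos(sympstr, "1")
* index visit number of the case to be matched (assumed to be 2 here)
local k = 2
* eligible control: no symptom up to and including visit k, and visit k observed
generate byte eligible_k = (first_symp == 0 | first_symp > `k') & substr(sympstr, `k', 1) == "0"
list, clean
```

With k set to 2 this flags ids 2 and 3, the same women selected by the regexm() pattern in the toy example, and changing k handles other index positions without editing a pattern.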
Deependra Singh

Join Date: Mar 2015

Posts: 8
#6

12 Dec 2017, 03:37


Hi Dave,
Thank you for the reply.
I still do not understand the matching step. Below is an example of a woman who had 5 visits and reported a symptom (symp_str) at her 2nd visit; after concatenating the symptom variable, the values run "0", "01", "010", etc. I need a control for symp_str at the index visit, i.e., the visit at which the symptom was reported. That is, the newly generated variable should be "1" at her second visit and missing at her other visits, and matching should then find a woman with no symptoms ("0") in her whole visit history.
Using the code you posted above, I can create a new matching variable, but it is "0" at every visit. The idea, however, is to find another, randomly chosen woman with no symptoms (all "0") in her visit history, who would then be marked "0" at her second visit, with her other visits left missing. Thus each woman with a symptom at a given visit would have another woman without symptoms at that same visit number.

n_obs   true_order_inv   symp_str   concat   match_id_symp
883     1                0          0        0
883     2                1          01       0
883     3                0          010      0
883     4                0          0100     0
883     5                0          01000    0

kind regards,
Deependra
Dave Airey

Join Date: Apr 2014

Posts: 398
#7

12 Dec 2017, 11:28

In my toy example, the first woman had a symptom at the second visit, i.e., her string starts with "01". So I searched the remaining women with the regexm() pattern above, which requires "00" in the first two positions and allows any combination (0, 1, or .) afterwards. You would need to modify the pattern for each possible index position.
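
Once index visits and eligible control visits are flagged, one way the 1:1 pairing itself might be done is to order cases and controls randomly within each stratum and pair them by rank. This is only a sketch under assumptions: case and eligible are hypothetical 0/1 flags, agegrp stands in for the four background variables, and each woman contributes at most one candidate visit.

```stata
clear
set seed 12345
input id visit case eligible agegrp
1 2 1 0 1
2 2 0 1 1
3 2 0 1 1
4 2 1 0 2
5 2 0 1 2
6 3 1 0 1
end
* keep only index visits and eligible control visits
keep if case == 1 | eligible == 1
* random order within stratum, separately for cases and controls
generate double u = runiform()
bysort visit agegrp case (u): generate rank = _n
* a case and a control sharing the same rank in a stratum form a pair
bysort visit agegrp rank (case): generate byte matched = (_N == 2)
egen pair_id = group(visit agegrp rank) if matched
list, sepby(visit agegrp)
```

Cases in strata without a control (id 6 here) stay unmatched, and surplus controls are simply not used. If a woman could appear as a candidate control at several visit numbers, an extra step would be needed to stop her from being selected more than once.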