arrays in stata?

Vishal Sharma

Join Date: Sep 2018

Posts: 60
#1

arrays in stata?

02 Apr 2019, 10:37

hi all,

I m wondering if there is a way in stata to use arrays ( like in sas) to search through multiple icd codes.

the data i have is in this format: (each line is a hospitalization)
id date diag1 diag2 diag3 diag4 diag5

1 20jan2018 t40.0

1 16oct2018 t40.6

2 04apr2018 t40.3

each hospitalization (each line) has the date of the event and 5 diag codes( icd codes)

i would like to identify all of the diagnosis codes that are t40.0-t40.6 such that if present, then i can create a variable pat_diagnosis=1, if absent then pat_diagnosis=0

with sas we set up an array , but with stata, i can only think of using for loops like this:

gen pat_diagnosis=0
forv i=1/5{
replace pat_diagnosis`i'=1 if diag`i'>= "t40.0" & diag`i'<="T40.6"
}

any thoughts on a better way than this?

thanks in advance
vishal
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35634
#2

02 Apr 2019, 11:07

https://www.stata.com/support/faqs/d...nt-sas-arrays/

In this particular case, your loop is confused. I think you mean

Code:

gen pat_diagnosis = 0 forv i=1/5 { replace pat_diagnosis = 1 if diag`i'>= "t40.0" & diag`i'<="t40.6" }

Last edited by Nick Cox; 02 Apr 2019, 11:10.
1 like
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

02 Apr 2019, 11:15

Vishal,

Quick note that Stata has two specialized commands to clean ICD-9 and ICD-10 codes, or to generate a flag if a variable contains a code in a certain range. The range feature is a nice convenience command for your purpose. You could simply do:

Code:

forv i=1/5{ icd10 generate pat`v' = diag`v', range(T400/T406) } egen pat_flag = rowmax(pat?) drop pat?

I don't believe there's a direct equivalent of SAS arrays. If you needed to speed the above up, it might run faster if you reshaped the data to long (Stata command is reshape long, SAS equivalent is PROC TRANSPOSE, I believe).

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
1 like
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#4

02 Apr 2019, 11:59

The original loop as modified by Nick is fine and looks like the most efficient solution.
If you are looking for something different, you can think along the lines of:

Code:

egen diag = concat(diag*), punct("|") generate flag = strpos(diag,"t40.")>0

Clearly you can adjust the expression for flag to focus only on some diagnoses.

Best, Sergiy

Last edited by Sergiy Radyakin; 02 Apr 2019, 12:10.
Comment
mark butum

Join Date: Jul 2018

Posts: 6
#5

16 Feb 2020, 16:24

Originally posted by Nick Cox View Post

https://www.stata.com/support/faqs/d...nt-sas-arrays/

In this particular case, your loop is confused. I think you mean

Code:

gen pat_diagnosis = 0 forv i=1/5 { replace pat_diagnosis = 1 if diag`i'>= "t40.0" & diag`i'<="t40.6" }

What of you you have 3 or more diagnosis codes, like D45, D473 , D474. How will you create a loop to extract.
Thank you very much!
Comment
mark butum

Join Date: Jul 2018

Posts: 6
#6

16 Feb 2020, 16:26

I have had similar issue and Nick's code was extremely helpful, but I got stuck beyond having 3 or more codes.
Any help is extremely appreciated.
Comment
mark butum

Join Date: Jul 2018

Posts: 6
#7

16 Feb 2020, 16:29

This exmple gives me error:

gen pat_diag = 0
forv i=1/30 {
replace pat_diag = 1 if I10_DX`i'<="D45" & I10_DX`i'>= "D473"& I10_DX`i'<="D474"
}
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35634
#8

17 Feb 2020, 05:51

I can't see a syntax error in #7. I note that

Code:

DX`i' <= "D45"

is generously inclusive, so that anything beginning "A" "B" "C" "D1" "D2" "D3" and some things beginning "D4" would fit. What was the error message?
Comment
mark butum

Join Date: Jul 2018

Posts: 6
#9

17 Feb 2020, 19:11

Thanks Nick.
Sorry, I was not specific enough. I meant, when I ran above code, and tabulate results, I get
pat_diagnosis | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,135,090 100.00 100.00
------------+-----------------------------------
Total | 7,135,090 100.00
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4458
#10

17 Feb 2020, 19:14

unlike Nick, I do see a problem with the code in #7 - you need parentheses after the ampersand (and at the end)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35634
#11

18 Feb 2020, 01:49

Sorry; I missed something more fundamental -- as did my friend Rich Goldstein, It's impossible for a string to be BOTH earlier than or equal to D45 in alphanumeric sort order AND between D473 and D474 . Hence the syntax is legal, but no observations will satisfy the mutually contradictory criteria, as you found, You need some different mix of operators.
Comment
mark butum

Join Date: Jul 2018

Posts: 6
#12

19 Feb 2020, 12:21

Indeed, I am interested in adding D45, D473 and D474. I modify the code as below and seems to work

gen pat_diag = 0
forv i=1/30 {
replace pat_diag = 1 if I10_DX`i' =="D45" |I10_DX`i'=="D473" | I10_DX`i'=="D474"

pat_diag | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,104,464 99.57 99.57
1 | 30,626 0.43 100.00
------------+-----------------------------------
Total | 7,135,090 100.00

Based on Nick's code, I am confident this result is correct.

My question now is, how to get separate group of patients with D45, D473, and D474.

This what I did:
gen pv=0

. forv i=1/30 {
2. replace pv = 1 if I10_DX`i' =="D45"

3. }

ta pv

pv | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,130,991 99.94 99.94
1 | 4,099 0.06 100.00
------------+-----------------------------------
Total | 7,135,090 100.00

gen et=0

. forv i=1/30 {
2. replace et = 1 if I10_DX`i' =="D473"
3. }

. ta et

et | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,108,341 99.63 99.63
1 | 26,749 0.37 100.00
------------+-----------------------------------
Total | 7,135,090 100.00

gen mf=0

. forv i=1/30 {
2. replace mf = 1 if I10_DX`i' =="D474"
3. }

ta mf

mf | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,134,964 100.00 100.00
1 | 126 0.00 100.00
------------+-----------------------------------
Total | 7,135,090 100.00

Now I am intersted in percentages of each diagnosis:

I am not 100% confident this method I used to eventually get these percents is correct.

[ just to remind that, I looking for these 3 diagnosis: polycythemia, essential thrombocythemia, and myelofibrosis, from Diagnosis( I10_DX1 to I10_DX30). I want to capture all those diagnosis in the database. I then want to separate them into individual diagnoss with their frequencies]

I will need some reassurance or otherwise.

Thank you.
Comment
mark butum

Join Date: Jul 2018

Posts: 6
#13

19 Feb 2020, 12:22

i meant *frequncies
Comment
zaed hamady

Join Date: Sep 2020

Posts: 1
#14

01 Jun 2021, 05:35

I am wondering if someone can help me to sort out this array, where code is CD10, dates means date of this code retrieved from central registry, there are 70 codes var, and corresponding dates, in each code var there are hundreds of CD10 codes
code1 code2 code3 date1 date2 date3

c25 01/01/2015

c25 01/02/2016

c25 02/02/2017

Thanks
zaed
Comment
Vishal Sharma

Join Date: Sep 2018

Posts: 60
#15

01 Sep 2021, 10:00

thanks all!
Comment

id	date	diag1	diag2	diag3	diag4	diag5
1	20jan2018	t40.0
1	16oct2018			t40.6
2	04apr2018		t40.3

code1	code2	code3	date1	date2	date3
c25			01/01/2015
	c25			01/02/2016
		c25			02/02/2017

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment