manipulating two sets of variables at once

Amin Saad

Join Date: Apr 2015

Posts: 56
#1

manipulating two sets of variables at once

03 Jun 2015, 16:12

I have two sets of columns PR1 to PR20 (type str12) contains procedure codes and PRDAY1 to PRDAY20 (type double) contains the day of the procedure

I need to write a code that would do to tasks.

first find the code in the PR columns, and then in PRDAY column that has the similar number (PR5 and PRDAY5) I want to calculate the mean day of the procedure and compare it between two different groups.

pr 1 pr 2 pr 3 prday3 prday3 prday3 type 1 type 2
133 111 122 2 3 4 1 0
133 155 578 2 3 4 0 1
133 111 122 7 8 9 0 1
155 111 321 2 3 4 1 0
155 555 777 2 3 4 1 0

what I want to do is, calculate the mean "day to procedure" for procedue 111 and compare it between type 1 and type 2
gen time2pr=.
forvalues j = 1/20 {
replace time2pr= sum PRDAY`j' if inlist(PR`j',"111")
}

something like that, I got stuck
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#2

03 Jun 2015, 16:44

I don't understand your explanation of what you want to do. Can we clarify a few things:

1. In your sample data, whenever type1 = 1, type2 = 0 and vice versa. Is that always true? It seems then that you would only need one of these variables. (If there are cases where type1 and type 2 are both 1 or both 0, then just what kind of comparison do you want to do between the two groups?)

2. pr 1, pr 2, and pr 3 are not legal variable names because embedded blanks are not allowed. I assume you mean pr1, pr2, pr3? Also you have three variables all named prday3. That's not legal either, and wouldn't make any sense even if it were. Did you mean prday1, prday2, and prday3? That seems to fit with the overall sense of what you are trying to do? (Also type 1 and type 2 are not allowed: do you mean type1 and type2?)

3. In the sample data you posted, procedure code 111 occurs three times, once each in observations 1, 3, and 4, and always as pr2. So do you want the result for time2pr for code 111 to be 3 + 8 + 3 = 14?

If your answers to my questions above are all "yes", then I think this will get you your time2pr variable:

Code:

drop type2 rename type1 type gen obs_no = _n reshape long pr prday, i(obs_no) j(_j) egen time2pr = total(prday), by(pr)

From there, I don't understand what kind of comparison across types you want to do on time2pr. Because you have added up all values of prday* associated with any given pr, the value of time2pr will be the same for all those occurrences, regardless of what the value of type1 (or type2) is. (The only exception would be if some particular pr only occurs with one value of type1, in which case there is literally no comparison to be made.)
Comment
Amin Saad

Join Date: Apr 2015

Posts: 56
#3

03 Jun 2015, 21:27

for question 1, that's true when type1= 1 then type2=0, I need both variables to calculate time to procedure.(time2pr)

for question 2, pardon the space before each number it is just a typing habit. yes it is PR1, PR2, PR3 and yes PRDAY1, PRDAY2, PRDAY3.

for question 3, the real sample size is very large and I just made this dataset as an example. Since you took procedure 111 as an example, for type1 it was listed twice in PR2 and the relavent PRDAY2 was 3 and 3, for type2 it was listed once in PR2 and the PRDAY2 is 8. what I want to do is to calculate the means of PRDAY for code 111 when type1=1 and when type2=1

I hope this helps!
Comment
Amin Saad

Join Date: Apr 2015

Posts: 56
#4

03 Jun 2015, 21:35

In the real dataset, type 1 is code by one ICD-9 code, lets call it (X) which will be found in the PR1 to PR20. Type 2 is codes by two codes, (Y) and (Z), that are found in PR1 to PR20 as well. These ICD-9 codes are procedural codes. I want to know how long did it take type 1 patients to receive procedure X and how long did it take type2 to receive either procedure Y or Z.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#5

04 Jun 2015, 08:17

Thanks for those clarifications. I don't think you need to calculate any sums unless you have some interest in hand-calculating your t-test. The main thing is to -reshape- the data to long, and then you can do a t-test. (And as usual, almost all analyses in Stata are easier in long layout than wide. Whenever you have data in wide layout you should ask yourself if there is any reason not to -reshape- it to long. And if you are ever thinking of reshaping long data to wide, you should ask yourself at least 3 times why.)

By the way, you do not need the two variables type1 and type2. There is no extra information beyond what is in either one alone.

One more thing, you need a patient_id variable in each record. I'm going to write the code as if it exists.

Code:

// RESHAPE TO LONG LAYOUT reshape long pr prday, i(patient_id) j(_j) // CONTRAST TIMES TO PROCEDURE BY TYPE ttest prday, by(type1) // AND EVEN THOUGH I DON'T SEE ANY NEED FOR THE // SUMS FOR YOUR PURPOSE, HERE IS HOW TO GET THEM: egen total_time_to_pr = total(prday), by(pr type1)

Since the numbers in prday represent times to events, you may not want to use a t-test to contrast the groups. In that case you could replace that -ttest- command by:

Code:

// DECLARE DATA TIME-TO-EVENT DATA stset prday // LOGRANK TEST OF EQUALITY OF TIMES BY TYPE sts test type1

will give you a non-parametric log-rank test of equality of time to procedure in the two types.

Finally there are other analytic subtleties to think about. You have multiple procedures listed for each patient. There may be patient-level factors that affect time to procedure beyond those captured by the type1/type2 distinction. If so, the separate observations within patient lack independence, so that the assumptions underlying either a ttest or a logrank test are violated and your tests may be anti-conservative. You might want to consider a mixed-effects parametric survival model (-help mestreg-) or a Cox model with shared-frailty (-help stcox-).
Comment
Amin Saad

Join Date: Apr 2015

Posts: 56
#6

06 Jun 2015, 22:03

I'm not sure if the data needs converting from wide to long

This is what I have came up with so far.

gen ORDAY=.

then

for values j= 1/20 {
replace ORDAY = 0 if inlist(PR`j', "3845" , "3957") & ORDAY==. &; PRDAY`j' == 0
replace ORDAY = 1 if inlist(PR`j', "3845" , "3957") & ORDAY==. &; PRDAY`j' == 1
replace ORDAY = 2 if inlist(PR`j', "3845" , "3957") & ORDAY==. &; PRDAY`j' == 2
/
/ and change the ORDAY==n+1 at the beginning and end of the line until I reach
// the number of days I am interested in the final line I would have look as below, so that you will then capture all procedures down after n days
// in this case 21 or above, labelled as 21
replace ORDAY = 167 if inlist(PR`j', "3845" , "3957") & ORDAY==. & PRDAY`j' = up to 167

The problem here is the range of length of hospital stay (LOS) is 0 to 167. So I will have 167 lines, Is there a ways to adjust that?

// after I compute this, i will be able to compare the means

ttest ORDAY, by(TYPEB_OPEN)
Comment
Amin Saad

Join Date: Apr 2015

Posts: 56
#7

06 Jun 2015, 22:09

Finally there are other analytic subtleties to think about. You have multiple procedures listed for each patient. There may be patient-level factors that affect time to procedure beyond those captured by the type1/type2 distinction. If so, the separate observations within patient lack independence, so that the assumptions underlying either a ttest or a logrank test are violated and your tests may be anti-conservative. You might want to consider a mixed-effects parametric survival model (-help mestreg-) or a Cox model with shared-frailty (-help stcox-).

You make very valid points. My study is on acute aortic dissection where I am comparing the outcomes in people with and without risk factors for aortic dissection. for the outcome, I have adjusted for inherent demographics and comorbid conditions. For mortality, I have found a statistically significant difference facoring patients "with" risk factors which is counter intuitive. By comparing the time to operation, i'm attempting to explain this finding. I'm hypothesizing that delayed treatment/surgery could be one of the reasons for this high mortality found in the "no risk factors" group.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#8

07 Jun 2015, 11:43

If I understand what you are trying to do in #6 it can be greatly simplified. It appears to be equivalent to:

Code:

gen ORDAY = . forvalues j = 1/20 { replace ORDAY = PRDAY`j' if inlist(PR`j', "3845" , "3957") & ORDAY==. }

What this code does is look for the first occurring j between 1 and 20, if any, for which PR`j' is either 3845 or 3957 and PRDAY`j' is not missing, and sets ORDAY equal to the corresponding value of PRDAY. (It is the first occurring such j because of the & ORDAY == . condition.)

As for whether you need to convert you data from wide to long, the definitive answer depends on specifically how you plan to analyze your data. Most analyses are easier in long form than in wide. The particular analysis you propose in #6, however, seems as if it can be done just as easily with the wide layout you currently have.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#9

11 Jun 2015, 19:37

Well, without a sample of your data it's hard to troubleshoot this. But for one thing, the results you got for ORDAY3 are very peculiar. It shows negative numbers. But my code just copies one of the PRDAY values into ORDAY, and the PRDAY values are never negative, at least not in what little data you've shown, and given your description of the data I don't think a negative value of PRDAY would even make sense. So I'm suspicious that you didn't actually run exactly the code in #8, or you did something to your data before running it that interfered.

Now, one thing I notice is that in #6 you talk about going up to 167, although you also make the comment that after 20 everything else is just coded as 21. My code only goes up to 20, because I couldn't understand what you meant beyond 20.

Let me suggest a different approach, because, as noted previously, working with wide layout data is difficult in Stata, long layout is much easier.

Code:

reshape long pr prday, i(patient_id) j(_j) // SET A VARIABLE TO prday WHENEVER pr CODE IS 3845 OR 3957 gen orday = prday if inlist(pr, "3845", "3957") // NOW SPREAD THE FIRST NON-MISSING VALUE // OF orday OVER THE OTHER OBSERVATIONS // FOR THE PATIENT gen miss_orday = missing(orday) by patient_id miss_orday (_j), sort: replace orday = orday[1] if _n > 1 by patient_id (orday), sort: replace orday = orday[1] drop miss_orday reshape wide pr prday, i(patient_id) j(_j)

As before, I strongly urge you to consider abandoning using the wide layout with your data. Just as this is done at least as easily in wide as long, and your first request was much easier to do in long layout, you will find that almost everything you will want to do with your data in Stata is easier in long. And if you keep having to -reshape- back and forth between wide and long for every step of your analysis, you will be wasting a lot of time. So think about sticking with long--meaning delete the last line of the code shown above.

If this doesn't do what you need and you would like further help with this, post a sample of your data that is fuller than in #1, full enough to illustrate the calculation of orday for several patients with different combinations of the other variables (and be sure to include some observations with pr = 3845 or 3957). Then show what ORDAY should be for each of those patients. Then perhaps we can resolve this.
Comment

Announcement

manipulating two sets of variables at once

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment