  • Repeated measures independent variable(s) with binary outcome

    Hello Statalisters,

    This is my first message on this forum, so please bear with me :-).

    I'm trying to assess the impact of several pre-operative independent variables on the outcome of interest (death, yes/no). One of these independent variables has been measured repeatedly (on admission and again before entering the OR), but only in stable subjects. Unstable ones were rushed directly to the OR, so we have only a single measurement for them.

    I'd like to see:

    1) Whether the mentioned variable is independently significant
    2) Which of the two time points is more closely associated with the outcome of interest

    Which test would be most appropriate to use?

    Thank you so much, Horea

  • #2
    Let's call the independent variable in question x. You have multiple measurements of it in people who were stable, and only a single measurement in the people who were not. It is also quite likely that those who were stable are less likely to have subsequently died. So the presence or absence of multiple measures of x is, itself, informative about the death outcome. So any comparison involving multiple measures of x in terms of an effect on mortality is quite likely to be substantially biased.

    I think I would probably simply select the first measure of x in all cases and disregard the rest. Then I would model the death outcome in terms of x. As your outcome is dichotomous, you might want to do a logistic regression. If x itself is a discrete variable, you might just do a cross-tabulation of x and the outcome and use a chi-square or Fisher exact test. Another possibility, if x is continuous, is to do a t-test or perhaps a nonparametric comparison such as the Wilcoxon rank-sum test of x over categories of the outcome. There are many ways to go here, and the choice depends on the nature of the variables involved, the distributions, and the sample size.
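
    Purely for illustration, here is a minimal sketch of those options in Stata, assuming the first measurement is stored in a variable called x1 and the outcome in death (both names hypothetical):

    Code:
    * if x is continuous: logistic regression, t-test, or rank-sum test
    logistic death x1
    ttest x1, by(death)
    ranksum x1, by(death)
    * if x is discrete: cross-tabulation with chi-square and Fisher exact tests
    tabulate x1 death, chi2 exact
    Which of these is sensible depends, as noted, on the variables, their distributions, and the sample size.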

    With regard to the second question, you can only answer this among the stable patients. You have to be careful about what you mean by "more closely associated with." If the distribution of x is more or less the same at each time, then you can do, for example, a two-level logistic regression with your death outcome regressed on x, an indicator for time period (call it t), and the interaction of x and t. The details, again, depend on information you haven't disclosed.

    If this response has set you on the path and you can see your way to your goal from here, that's good. If you need more concrete advice, you will have to say more about what you are doing, and if you want help with code, it is crucial to show example data using the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
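
    For convenience, the commands just mentioned, followed by a purely hypothetical example call (replace the variable list with your own):

    Code:
    * install only if -dataex- is not already part of your Stata
    ssc install dataex
    help dataex
    * then, for example:
    dataex death x1 x2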

    • #3
      Hello Clyde,

      Thank you so much for a detailed and kind response. It means a lot to me.

      Here is (part of) the data I'm talking about:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte(death hta) double a byte b float(c y1 y2 d timeofy1 age timeofy2 delta_time)
      1 1  1.649999976158142 55 0 6.89 6.89 0  1.354705e+12 45.95617  1.354734e+12  8.009955
      1 1  1.899999976158142 45 1 6.93 7.25 1 1.4514084e+12  70.2411 1.4516027e+12  53.99438
      0 1 1.2999999523162842 55 1 7.03 7.25 1  1.708869e+12 62.97534 1.7088768e+12 2.1481245
      1 1  2.200000047683716 60 1 7.03 7.15 1  1.449601e+12 46.84109 1.4496083e+12 2.0024889
      1 1                  3 30 1 7.07 7.07 1  1.771435e+12 74.31781 1.7714376e+12    .65536
      1 1 2.7799999713897705 50 1 7.08 7.28 0 1.6855507e+12 51.35342 1.6856317e+12 22.500694
      1 1 3.0999999046325684 50 0  7.1 7.32 0 1.6964414e+12 56.16712 1.6965036e+12 17.257813
      1 0 1.2599999904632568 45 1 7.11 7.18 1  1.676945e+12 69.50411 1.6769556e+12  2.985529
      1 0               2.07 45 0 7.12 7.29 1 1.7876675e+12 50.41918 1.7877348e+12  18.67776
      1 1 1.7899999618530273 55 0 7.13 7.26 1 1.6862076e+12 72.41918 1.6862328e+12  6.990507
      end
      format %tc timeofy1
      format %tc timeofy2
      I did a logistic regression with death as the dependent variable, using the variables with p < 0.2 in the univariate analysis (the ones from the excerpt above).
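
      For what it's worth, here is a minimal sketch of one way such a univariate screen could be coded, using the variable names from the excerpt above (this is only an illustration, not necessarily what I actually ran):

      Code:
      * fit each candidate predictor on its own and display its p-value
      foreach v of varlist hta a b c d age y1 y2 {
          quietly logistic death `v'
          display "`v'  p = " %6.4f 2*normal(-abs(_b[`v']/_se[`v']))
      }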

      The variables with repeated measures are y1 and y2, and they are continuous. As I mentioned in my previous post, unstable patients had a single measurement of y done, so y1 and y2 are the same in those cases. I'd like to see which measurement is independently associated with the outcome of interest (death, in this case); when entered in separate logistic regression models, both of them are. Then I'd like to determine cutoff points with prognostic value using ROC analysis and the "cutoff" package from SSC.

      I have never done a two-level logistic regression. Could you please be more specific on that topic?

      Thank you so much, Horea
      Last edited by Horea Feier; 26 Mar 2018, 04:33.

      • #4
        Well, this is an unfortunate way of setting up the data and I'm not sure how to work with it. The problem is that imputing y2 = y1 in the unstable patients makes it impossible for you to distinguish unstable patients (where y2 was never actually measured) from patients for whom just coincidentally y2 actually was done and came out the same as y1. In the data you show, there are two observations with y2 = y1. But they have different values for timeofy1 and timeofy2. If I take the timeof* variables seriously, then these are, I suppose, stable patients with two measures of y that happened to come out the same. Can we rely on this assumption:

        If y2 was never actually observed, then not only did you set y2 = y1 but you also set timeofy2 = timeofy1. By contrast, if y1 and y2 were both really done, we will always have timeofy2 != timeofy1; in fact, we will have timeofy2 > timeofy1?

        If this is right, then the first step is to undo the damage:

        Code:
        replace y2 = . if timeofy2 <= timeofy1
        Now we can do our two-level logistic regression:

        Code:
        gen long id = _n
        reshape long y timeofy, i(id) j(time)
        xtset id time
        xtlogit death c.y##i.time
        
        margins time, dydx(y)
        The -margins- output will give you the separate average marginal effects of y1 and y2 on the probability of death. That is, the rate at which the probability of death increases (or decreases) per unit change in y averaged over the observed distribution of values of y. If you want a formal hypothesis test that they differ, you can look at the row of output labeled time#c.y in the -xtlogit- output table.
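
        Equivalently, you can request that test explicitly after -xtlogit-; a minimal sketch, assuming the factor-variable notation used above:

        Code:
        test 2.time#c.y
        This is the Wald test that the coefficient of y is the same at both time points.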

        I am not familiar with the -cutoff- package you refer to, and it does not appear to be currently available on SSC, so I can't advise you about its use. I will tell you that I am extremely skeptical of programs that purport to identify cutoffs in situations like this. The determination of such cutoffs is within the realm of decision theory, not frequentist statistics, and if the program does not require you to give it values of loss associated with misclassifications (many of them don't) then it is peddling snake oil and you should avoid it. If it does, then it may have a good theoretical foundation and you might be able to trust it.

        • #5
          By contrast, if y1 and y2 were both really done, we will always have timeofy2 != timeofy1; in fact, we will have timeofy2 > timeofy1?
          You are right about timeofy1 and timeofy2, that's exactly the case.

          The first (timeofy1) was admission time, but there we don't have all the data, because y1 was not always taken at that time (it was not actually observed). By contrast, y2 was always taken in the OR, and in those cases I replaced y1 (the first value of y harvested) with y2, since I considered it to be both the first and the last value of y observed. So then I should maybe do:
          Code:
          replace y1=. if y2==y1
          That way the values that were not actually harvested at timeofy1 would be coded as missing again.

          I am not familiar with the -cutoff- package you refer to, and it does not appear to be currently available on SSC
          Please excuse me, I was talking about the "cutpt" package; I wrote "cutoff" by mistake.

          Thank you so much for providing the code to help me out with the two-level logistic regression; I appreciate it very much.
          Last edited by Horea Feier; 26 Mar 2018, 10:52.

          • #6
            I would not do
            Code:
            replace y1=. if y2=y1
            because it is possible that a real y1 exists and is, coincidentally, equal to y2. (Also, -if y2 = y1- is a syntax error; you need -if y2 == y1-.)

            So I would do it as
            Code:
            replace y1 = . if timeofy1 >= timeofy2
            I am not familiar with -cutpt- either. But looking at its description at SSC, it appears to offer three different flavors of snake oil.

            • #7
              I tried running the xtlogit model as you suggested (including the replacement syntax); however, I get the error:
              cannot compute an improvement -- discontinuous region encountered
              r(430);
              I am not sure what that means.

              I am not familiar with -cutpt- either. But looking at its description at SSC, it appears to offer three different flavors of snake oil.
              :-)) Thank you so much for providing your opinion on cutpoint packages. I am a physician, not a statistician. Finding cutoff points is like finding the holy grail: one hopes to be able to provide definite answers about outcomes simply by looking at the value of a clinical variable.
              Last edited by Horea Feier; 26 Mar 2018, 11:16.

              • #8
                For help troubleshooting the -xtlogit-, please do the following:

                1. Note the number of iterations completed before you received the error message.
                2. Re-run the -xtlogit- command, adding the -iterate(#)- option, replacing # with one less than the number of iterations you noted (see the sketch just after this list).
                3. Post the exact command as well as the entire regression output (between code delimiters).
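
                For example, if the error appeared after iteration 5, the re-run would look like this (the number in -iterate()- is hypothetical; use one less than the iteration count you actually observed):

                Code:
                xtlogit death c.y##i.time, iterate(4)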

                I will try to help you figure out what is causing the problem and suggest some ways to fix it (or somebody may beat me to it). Be advised, however, that although reasonable solutions to this kind of problem are usually available, sometimes this kind of problem cannot be solved, or can only be solved by radically revising the analysis plan in ways that may not achieve all of the original research goals. Sometimes the particular data are simply not able to provide estimates of the desired model parameters. We'll see.

                • #9
                  First of all, I'd like to express my gratitude for taking the time to mentor me through this procedure.

                  Here is the command and output:

                  Code:
                   xtlogit death c.y##i.time, iterate(0)
                  
                  Fitting comparison model:
                  
                  Iteration 0:   log likelihood = -191.88231  
                  Iteration 1:   log likelihood = -175.38125  
                  Iteration 2:   log likelihood = -175.22894  
                  Iteration 3:   log likelihood = -175.22885  
                  Iteration 4:   log likelihood = -175.22885  
                  
                  Fitting full model:
                  
                  tau =  0.0     log likelihood = -175.22885
                  tau =  0.1     log likelihood = -172.27697
                  tau =  0.2     log likelihood = -169.01725
                  tau =  0.3     log likelihood = -165.40506
                  tau =  0.4     log likelihood = -161.38051
                  tau =  0.5     log likelihood = -156.85747
                  tau =  0.6     log likelihood = -151.70209
                  tau =  0.7     log likelihood = -145.68745
                  tau =  0.8     log likelihood = -138.39451
                  Iteration 0:   log likelihood = -145.68577  (not concave)
                  convergence not achieved
                  
                  Random-effects logistic regression              Number of obs     =        306
                  Group variable: id                              Number of groups  =        153
                  
                  Random effects u_i ~ Gaussian                   Obs per group:
                                                                                min =          2
                                                                                avg =        2.0
                                                                                max =          2
                  
                  Integration method: mvaghermite                 Integration pts.  =         12
                  
                                                                  Wald chi2(3)      =      13.86
                  Log likelihood  = -145.68577                    Prob > chi2       =     0.0031
                  
                  ------------------------------------------------------------------------------
                         death |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                             y |  -7.049468   2.383186    -2.96   0.003    -11.72043   -2.378509
                        2.time |   1.174035   23.30363     0.05   0.960    -44.50024    46.84831
                               |
                      time#c.y |
                            2  |  -.1263155   3.178675    -0.04   0.968    -6.356404    6.103773
                               |
                         _cons |   50.75799   17.42926     2.91   0.004     16.59728    84.91871
                  -------------+----------------------------------------------------------------
                      /lnsig2u |   .8472979          .                             .           .
                  -------------+----------------------------------------------------------------
                       sigma_u |   1.527525          .                             .           .
                           rho |   .4149475          .                             .           .
                  ------------------------------------------------------------------------------
                  LR test of rho=0: chibar2(01) = 59.09                  Prob >= chibar2 = 0.000

                  • #10
                    OK. The program is having trouble from two sources that may be related to each other.

                    1. You have a very small sample: only 15 patients. This is not large enough to support the method that I set out. I had imagined that your data were much more copious.

                    2. The constant term of 50.8 tells me that nearly everybody in your sample dies. That is going to make it extremely difficult to accomplish your research goals.

                    There is something else peculiar about the results you show. Stata is saying that among these 15 people, you have two observations in the estimation sample for every one of them. But in describing your problem originally, you indicated that for some subset of the patients, you have only one measurement of y. If you did the replacement of y1 by a missing value where it is not a real measurement, as suggested in earlier posts, then those people should contribute only one observation each to the estimation sample, as observations with missing values are excluded automatically.

                    But seeing that you have only 15 patients, and nearly all of them die, my advice is to abandon ship. You cannot make any reasonable predictions about rare (survival) outcomes from a sample of 15. (And it will only get worse when you have to exclude a few observations on people who never had y1 actually measured.)

                    • #11
                      My original sample has 153 patients in wide format; you can see that on the right-hand side of the image I posted. After converting to long format it has 306 observations, so that might not be the cause. As for the death rate, it was around 30%.

                      The original example set I posted was limited to 10 observations.

                      Last edited by Horea Feier; 26 Mar 2018, 20:26.

                      • #12
                        Ah yes, on my screen the final digits of 306 and 153 were cut off on the right end, and I didn't scroll over to see them. There is still the issue of how you ended up with two observations for every subject: that contradicts your claim that y1 wasn't measured in some of them.
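
                        As a quick check on that, in the long-form data you could count how many non-missing values of y each patient actually contributes (a sketch, assuming the id, time, and y variables created by the earlier -reshape-):

                        Code:
                        bysort id: egen n_y = count(y)
                        tabulate n_y if time == 1
                        If everyone shows n_y = 2, then the never-measured y1 values were not, in fact, set to missing.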

                        The constant term of 50 is also not so bad as I had thought. I see that in your sample data the values of y1 and y2 are very tightly clustered around 7 (are these arterial blood pH measurements in people with severe acidosis?). The coefficient of about -7 for the y variable then brings xb into the neighborhood of 50 - 49 = 1 (the constant of roughly 50 minus about 7 × 7), which corresponds more to a fatality rate of 70% than to 30%; but as these are not converged results, they aren't accurate anyway. But at least now I see how things may be in the right ballpark here. Here are a couple of suggestions for dealing with the convergence problem:
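
                        To make that arithmetic concrete, a quick back-of-the-envelope check using the (unconverged) coefficients above and a typical y value of about 7.1:

                        Code:
                        display invlogit(50.75799 - 7.049468*7.1)
                        which comes out to roughly 0.67.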

                        1. You could try running it as
                        Code:
                        melogit death c.y##i.time || id:
                        That estimates the same model as -xtlogit-, but uses a different algorithm and might converge when -xtlogit- fails.

                        2. If that, too, fails, you could try
                        Code:
                        meqrlogit death c.y##i.time || id:
                        which is yet a third estimation algorithm for the same model, and often succeeds where -melogit- fails.

                        In addition, both of those approaches might benefit from adding the -difficult- option if they don't work as shown here.
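
                        For example, appending it to the first of those commands (shown only as a sketch):

                        Code:
                        melogit death c.y##i.time || id:, difficult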

                        But perhaps it would be better to take a different approach altogether. Go back to your original data with y1 and y2 as separate variables (but with y1's that were never measured replaced by missing values) and run
                        Code:
                        roccomp death y1 y2
                        This will let you determine whether y1 or y2 gives the better ROC curve area. The ROC curve area is a measure of the discriminatory ability of a variable to predict the outcome. The -roccomp- command will calculate both ROC curves and also give you a test of the null hypothesis that the ROC areas are equal. For your purposes this approach may well be preferable, as it relates more directly to issues of sensitivity and specificity than any logistic model does. And there is no possibility of a convergence problem here.

                        • #13
                          Originally posted by Horea Feier:
                          . . . I'm trying to assess the impact of several pre-operative independent variables on the outcome of interest (death, yes/no). One of these independent variables has been measured repeatedly (on admission and again before entering the OR), but only in stable subjects. Unstable ones were rushed directly to the OR, so we have only a single measurement for them.

                          I'd like to see:

                          1) Whether the mentioned variable is independently significant
                          2) Which of the two time points is more closely associated with the outcome of interest
                          1) If by "independently significant" you're interested whether y's regression coefficient is statistically significantly different from zero even after accounting for all of the other predictors, then you're going to need to fit a regression model. If that's what you're after, then you'll need to include hta, a, b, c, d and age as well as y, and roccomp cannot do that for you. Although they'll be related to the linear predictions and not to values of y alone, you can create ROC plots after logistic regression—at Stata's command line, type
                          Code:
                          help logit_postestimation
                          and click on the hyperlink for lroc. (You might be interested in lsens, as well.)
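
                          For instance, a minimal sketch (the predictor list here is just the set of variables from your -dataex- excerpt; adjust as needed):

                          Code:
                          logistic death y2 hta a b c d age
                          lroc
                          lsens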

                          You won't be able to include both y1 and y2, because by design or happenstance you haven't measured them both in each patient. You can include the one time point that is measured in all patients, which it seems is the intraoperative one.

                          In contrast to Clyde's recommendation, I would be wary of xtlogit or other multilevel models inasmuch as you've perforce measured the outcome only once in any patient.

                          2) As Clyde mentions, you can answer this question only for patients who have had two measurements of y. Your convenience sample just won't allow you to answer the question for both those who were admitted and those rushed straight to the operating room. Absent other information, I would guess, like Clyde does, that the temporally more proximal measurement will be the one that more closely associates in some manner with the outcome.

                          The dates in your snippet of the dataset span nearly a decade and a half. You might want to consider the possibility that changes in medical practices, advances in technology, demographic shifts and so on might have occurred during the period of data collection and that these could affect the outcome more profoundly than anything that you've happened to record.

                          • #14
                            In contrast to Clyde's recommendation, I would be wary of xtlogit or other multilevel models inasmuch as you've perforce measured the outcome only once in any patient.
                            This is a good point.

                            You won't be able to include both y1 and y2, because by design or happenstance you haven't measured them both in each patient. You can include the one time point that is measured in all patients, which it seems is the intraoperative one.
                            I may have misunderstood how the question was originally posed, but I don't think this is correct. My understanding is that some patients, those who were "stable," had two measurements done, but the "unstable" ones only had a single measurement. So, if we restrict to the sample of patients who had both measurements done, it is possible to include both y1 and y2 in the analysis. But we need to be very clear that no generalization from there to all patients is possible, because the very fact of having both measurements done identifies the patient as "stable," and that, in turn, presumably affects the outcome probability.

                            I am usually reluctant to advise people to compare coefficients of two variables in a regression model as a way to discern which is "more important" or "better." But in this case, at least in the example data shown, the distributions of y1 and y2 appear to be very similar (though I have not investigated this thoroughly, and really could not in an example of 10 cases). If that is really the case, then the comparison of the y1 and y2 coefficients would in fact be a reasonable approach here, which is the direction in which I was originally pointing (though doing it by -reshape-ing long and using -xtlogit- was not a good idea).
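
                            One simple way to make that comparison, restricted to the stable patients who have both measurements, would be something like the sketch below (it assumes y1 has already been set to missing wherever it was never actually measured):

                            Code:
                            logistic death y1 y2 if !missing(y1, y2)
                            test y1 = y2
                            The -test- command gives a Wald test of the equality of the two coefficients; but, as I said, nothing learned here generalizes to the patients who were rushed straight to the OR.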

                            • #15
                              Hello Joseph,

                              You won't be able to include both y1 and y2, because by design or happenstance you haven't measured them both in each patient. You can include the one time point that is measured in all patients, which it seems is the intraoperative one.
                              Yes, that is absolutely the case. I ran
                              Code:
                              logistic death y2 hta a b c d age
                              as I had y2 data for every patient (intraoperatively). But finding a predictive factor on admission has better "utility" for medical practice, so I strove to:

                              1) compare y2 and y1, although I didn't have y1 data for all patients
                              2) find cutoff points using ROC analysis for y1 and y2 values that might predict death

                              In contrast to Clyde's recommendation, I would be wary of xtlogit or other multilevel models inasmuch as you've perforce measured the outcome only once in any patient.
                              So, if I understand correctly, mixed-effects models can only be used on data in which you measure the covariates AND the outcome repeatedly over time?

                              Kind regards, Horea
