How can I select a study population matching a published prospective trial?

Jonathan Tward

Join Date: Jan 2018

Posts: 18
#1

How can I select a study population matching a published prospective trial?

01 Feb 2022, 13:56

Dear Experts,

I have a very large population database of patients who have been treated for localized prostate cancer with various therapies (surgery, radiation, systemic drugs, etc). It contains oncologic outcomes (e.g., time to recurrence, metastasis, death, death from prostate cancer, next therapy, or last followup if none of the above). This database has about 20,000 individuals in it longitudinally tracked over time, and is mostly used for comparative effectiveness research centered around survival analyses.

I would like to perform analyses whereby I select a subset of patients from this database having a similar distribution of baseline characteristics to patients from published, prospectively performed clinical trials.

For example, the published RTOG 9601 trial was a 2 arm randomized trial that investigated the addition (or omission) of a drug therapy to patients receiving radiation therapy after surgery for localized prostate cancer

Here is a table from the publication showing the baseline characteristics of the population:

How would I go about selecting a population from my database of 20,000 patients that would match the distribution of these published patients?

The end goal is to see how different therapies applied to a similar, matched patient population compare to published studies. without the individual patient level data from the published studies, I am not sure how to proceed.

Here is my data structure that matches the categories above:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(race Age Karnofsky personid Gleason_Score) byte pick3 5 65.41547 3 50 1 1 7 66.90486 9 155 1 1 7 78.17933 3 200 1 1 7 60.4846 3 278 0 1 5 62.89117 9 339 1 1 7 . 9 500 0 1 7 69.65366 9 527 1 1 7 66.49418 9 901 2 1 7 73.58248 9 1103 1 1 7 66.428474 9 2193 2 1 7 . 9 5267 0 1 7 63.96714 3 5343 0 1 6 . 9 5388 0 1 4 . 9 5465 0 1 7 77.16632 3 7921 0 1 6 77.37167 9 8124 0 1 6 72.1013 9 8556 2 1 7 . 9 8992 0 1 7 69.89733 9 8994 2 1 7 81.27584 8 9017 2 1 7 65.98768 9 9126 0 1 6 70.57358 3 9855 2 1 7 58.82272 9 10155 1 1 5 70.8063 9 10244 2 1 7 71.66872 8 10395 1 1 7 60.94456 9 10734 0 1 7 65.420944 9 10863 1 1 7 69.9165 9 11436 1 1 7 66.49145 9 11453 1 1 7 70.87474 3 11537 0 1 7 64.19439 3 11622 1 1 7 73.52772 9 11716 2 1 7 51.40862 3 11844 0 1 7 63.01985 9 12173 1 1 7 56.53388 9 12293 1 1 7 74.06434 9 12427 2 1 7 65.94661 9 12914 1 1 7 47.20329 9 13012 2 1 7 68.98015 9 13277 2 1 7 59.82204 9 13591 2 1 end label values race race label def race 4 "Native Am.", modify label def race 5 "Other", modify label def race 6 "Unknown", modify label def race 7 "White", modify label values Karnofsky RT_KPS label def RT_KPS 3 "(Karnofsky) 100 - Normal, no co", modify label def RT_KPS 8 "(Karnofsky) 80 - Normal activit", modify label def RT_KPS 9 "(Karnofsky) 90 - Able to carry", modify label values Gleason_Score GS label def GS 0 "2-6", modify label def GS 1 "7", modify label def GS 2 "8-10", modify

Any help appreciated.

Cheers,

JT
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#2

01 Feb 2022, 15:02

Well, before we get to the how-to, let's talk about the "does it really make sense to do this?"

I assume your ultimate goal is to do some analyses about outcomes in your data base and be able to claim that the sample you used is similar in these baseline characteristics to the sample from the study you referenced. But that claim will probably fail anyway because the table you show provides only the marginal distributions of these variables. But the joint distribution is not fully determined by these marginals. There can be varying degrees of correlation among them. Now, if you really care only about matching the marginals, you can synthesize a joint distribution by treating them as independent, so that the probability of being in any combination of the categories is the product of those categories' individual probabilities. But I suspect this will be clinically and epidemiologically wrong: wouldn't we expect the older participants to have somewhat greater frequency of low performance status? Wouldn't we expect a higher proportion of high Gleason scores among the Black patients of any given age? And these differences may well have a material impact on the outcome distributions you will be looking at as well.

To do this well, I would recommend writing to the lead author of the study and ask if he or she would share the joint distribution with you in a data file of some kind. Because you are not requesting individual level data, you probably won't have to go through any extensive rigmarole to get it.

To illustrate what you would do once you had that joint distribution, here's an example. For simplicity and compactness, I'll restrict my attention to the performance status and Gleason score, ignoring the other variables. Suppose the joint distribution looks like this:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(Karnofsky Gleason_Score) float(prob n_desired) 8 0 .003384 0 8 1 .00654 0 8 2 .002076 0 9 0 .06486 1 9 1 .12535 3 9 2 .03979 1 3 0 .213756 4 3 1 .41311 8 3 2 .131134 3 end label values Karnofsky RT_KPS label def RT_KPS 3 "(Karnofsky) 100 - Normal, no co", modify label def RT_KPS 8 "(Karnofsky) 80 - Normal activit", modify label def RT_KPS 9 "(Karnofsky) 90 - Able to carry", modify label values Gleason_Score GS label def GS 0 "2-6", modify label def GS 1 "7", modify label def GS 2 "8-10", modify

Choose a total sample size you want to recruit. For illustration purposes I'll make it 20, so it can be pulled from your example data set without replacements. (In the joint distribution shown above, the variable n_desired is just prob*20, rounded to the nearest integer.)

Then you would do this. Start by loading that joint distribution into a data frame named joint_distribution

Code:

frame change default * Example generated by -dataex-. For more info, type help dataex clear input float(race Age Karnofsky personid Gleason_Score) byte pick3 5 65.41547 3 50 1 1 7 66.90486 9 155 1 1 7 78.17933 3 200 1 1 7 60.4846 3 278 0 1 5 62.89117 9 339 1 1 7 . 9 500 0 1 7 69.65366 9 527 1 1 7 66.49418 9 901 2 1 7 73.58248 9 1103 1 1 7 66.428474 9 2193 2 1 7 . 9 5267 0 1 7 63.96714 3 5343 0 1 6 . 9 5388 0 1 4 . 9 5465 0 1 7 77.16632 3 7921 0 1 6 77.37167 9 8124 0 1 6 72.1013 9 8556 2 1 7 . 9 8992 0 1 7 69.89733 9 8994 2 1 7 81.27584 8 9017 2 1 7 65.98768 9 9126 0 1 6 70.57358 3 9855 2 1 7 58.82272 9 10155 1 1 5 70.8063 9 10244 2 1 7 71.66872 8 10395 1 1 7 60.94456 9 10734 0 1 7 65.420944 9 10863 1 1 7 69.9165 9 11436 1 1 7 66.49145 9 11453 1 1 7 70.87474 3 11537 0 1 7 64.19439 3 11622 1 1 7 73.52772 9 11716 2 1 7 51.40862 3 11844 0 1 7 63.01985 9 12173 1 1 7 56.53388 9 12293 1 1 7 74.06434 9 12427 2 1 7 65.94661 9 12914 1 1 7 47.20329 9 13012 2 1 7 68.98015 9 13277 2 1 7 59.82204 9 13591 2 1 end label values race race label def race 4 "Native Am.", modify label def race 5 "Other", modify label def race 6 "Unknown", modify label def race 7 "White", modify label values Karnofsky RT_KPS label def RT_KPS 3 "(Karnofsky) 100 - Normal, no co", modify label def RT_KPS 8 "(Karnofsky) 80 - Normal activit", modify label def RT_KPS 9 "(Karnofsky) 90 - Able to carry", modify label values Gleason_Score GS label def GS 0 "2-6", modify label def GS 1 "7", modify label def GS 2 "8-10", modify // SHUFFLE THE ORDER OF THE PATIENTS WITHIN GROUPS set seed 1234 // OR WHATEVER NUMBER SEED YOU LIKE gen double shuffle = runiform() sort Karnofsky Gleason shuffle frlink m:1 Karnofsky Gleason, frame(joint_distribution) frget n_desired, from(joint_distribution) by Karnofsky Gleason, sort: keep if _n <= n_desired

Now, as here, you may not get exactly the distribution you want, because there just might not be enough in some cross-classified categories. If that happens in your actual full data set, and it's only off by a slight amount I would just let it go at that. If it's off by a lot, you might want to reconsider the whole project: if your data set is that mismatched to the RTOG data distribution, is it really plausible to try to emulate it? But if you decide to go ahead and want to polish off that rough edge, there is a way to change the code to do sampling with replacement, which will enable you to re-use observations and reach the full desired counts. Post back if you find yourself in that situation.

Last edited by Clyde Schechter; 01 Feb 2022, 15:22.
1 like
Comment
Jonathan Tward

Join Date: Jan 2018

Posts: 18
#3

01 Feb 2022, 17:31

Clyde,

Thank you for this very prompt and insightful answer. It seems like I have a lot to think about on the "should I" side of this equation. Nevertheless, your explanation and sample code are extremely helpful in helping me understand the mechanics (and pitfalls) of trying something like this.

I would view this project as looking at some exploratory analyses, and if there was a "signal" in the data that suggested I should more rigorously pursue the concept, then I could go through the "rigamorole" of requesting the full datasets from the cooperative groups for formal analyses.

All the best,

JT
Comment
Danika Zhang

Join Date: Oct 2022

Posts: 1
#4

25 Oct 2022, 01:03

Originally posted by Clyde Schechter View Post

Well, before we get to the how-to, let's talk about the "does it really make sense to do this?"

I assume your ultimate goal is to do some analyses about outcomes in your data base and be able to claim that the sample you used is similar in these baseline characteristics to the sample from the study you referenced. But that claim will probably fail anyway because the table you show provides only the marginal distributions of these variables. But the joint distribution is not fully determined by these marginals. There can be varying degrees of correlation among them. Now, if you really care only about matching the marginals, you can synthesize a joint distribution by treating them as independent, so that the probability of being in any combination of the categories is the product of those categories' individual probabilities. But I suspect this will be clinically and epidemiologically wrong: wouldn't we expect the older participants to have somewhat greater frequency of low performance status? Wouldn't we expect a higher proportion of high Gleason scores among the Black patients of any given age? And these differences may well have a material impact on the outcome distributions you will be looking at as well.

To do this well, I would recommend writing to the lead author of the study and ask if he or she would share the joint distribution with you in a data file of some kind. Because you are not requesting individual level data, you probably won't have to go through any extensive rigmarole to get it.

To illustrate what you would do once you had that joint distribution, here's an example. For simplicity and compactness, I'll restrict my attention to the performance status and Gleason score, ignoring the other variables. Suppose the joint distribution looks like this:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(Karnofsky Gleason_Score) float(prob n_desired) 8 0 .003384 0 8 1 .00654 0 8 2 .002076 0 9 0 .06486 1 9 1 .12535 3 9 2 .03979 1 3 0 .213756 4 3 1 .41311 8 3 2 .131134 3 end label values Karnofsky RT_KPS label def RT_KPS 3 "(Karnofsky) 100 - Normal, no co", modify label def RT_KPS 8 "(Karnofsky) 80 - Normal activit", modify label def RT_KPS 9 "(Karnofsky) 90 - Able to carry", modify label values Gleason_Score GS label def GS 0 "2-6", modify label def GS 1 "7", modify label def GS 2 "8-10", modify

Choose a total sample size you want to recruit. For illustration purposes I'll make it 20, so it can be pulled from your example data set without replacements. (In the joint distribution shown above, the variable n_desired is just prob*20, rounded to the nearest integer.)

Then you would do this. Start by loading that joint distribution into a data frame named joint_distribution

Code:

frame change default * Example generated by -dataex-. For more info, type help dataex clear input float(race Age Karnofsky personid Gleason_Score) byte pick3 5 65.41547 3 50 1 1 7 66.90486 9 155 1 1 7 78.17933 3 200 1 1 7 60.4846 3 278 0 1 5 62.89117 9 339 1 1 7 . 9 500 0 1 7 69.65366 9 527 1 1 7 66.49418 9 901 2 1 7 73.58248 9 1103 1 1 7 66.428474 9 2193 2 1 7 . 9 5267 0 1 7 63.96714 3 5343 0 1 6 . 9 5388 0 1 4 . 9 5465 0 1 7 77.16632 3 7921 0 1 6 77.37167 9 8124 0 1 6 72.1013 9 8556 2 1 7 . 9 8992 0 1 7 69.89733 9 8994 2 1 7 81.27584 8 9017 2 1 7 65.98768 9 9126 0 1 6 70.57358 3 9855 2 1 7 58.82272 9 10155 1 1 5 70.8063 9 10244 2 1 7 71.66872 8 10395 1 1 7 60.94456 9 10734 0 1 7 65.420944 9 10863 1 1 7 69.9165 9 11436 1 1 7 66.49145 9 11453 1 1 7 70.87474 3 11537 0 1 7 64.19439 3 11622 1 1 7 73.52772 9 11716 2 1 7 51.40862 3 11844 0 1 7 63.01985 9 12173 1 1 7 56.53388 9 12293 1 1 7 74.06434 9 12427 2 1 7 65.94661 9 12914 1 1 7 47.20329 9 13012 2 1 7 68.98015 9 13277 2 1 7 59.82204 9 13591 2 1 end label values race race label def race 4 "Native Am.", modify label def race 5 "Other", modify label def race 6 "Unknown", modify label def race 7 "White", modify label values Karnofsky RT_KPS label def RT_KPS 3 "(Karnofsky) 100 - Normal, no co", modify label def RT_KPS 8 "(Karnofsky) 80 - Normal activit", modify label def RT_KPS 9 "(Karnofsky) 90 - Able to carry", modify label values Gleason_Score GS label def GS 0 "2-6", modify label def GS 1 "7", modify label def GS 2 "8-10", modify // SHUFFLE THE ORDER OF THE PATIENTS WITHIN GROUPS set seed 1234 // OR WHATEVER NUMBER SEED YOU LIKE gen double shuffle = runiform() sort Karnofsky Gleason shuffle frlink m:1 Karnofsky Gleason, frame(joint_distribution) frget n_desired, from(joint_distribution) by Karnofsky Gleason, sort: keep if _n <= n_desired

Now, as here, you may not get exactly the distribution you want, because there just might not be enough in some cross-classified categories. If that happens in your actual full data set, and it's only off by a slight amount I would just let it go at that. If it's off by a lot, you might want to reconsider the whole project: if your data set is that mismatched to the RTOG data distribution, is it really plausible to try to emulate it? But if you decide to go ahead and want to polish off that rough edge, there is a way to change the code to do sampling with replacement, which will enable you to re-use observations and reach the full desired counts. Post back if you find yourself in that situation.

Dear Professor Schechter,

Thank you very much for the detailed and helpful answer! I want to know that whether it is necessary (or better) to sample with replacement when the sample is small, in order to ensure enough individuals in some categories? If so, could you please kindly offer some further explanations about sampling with replacement? With advance thanks for your attentions and response!

Best Regards,

Danika
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#5

25 Oct 2022, 10:29

It isn't strictly speaking a matter of sample size. You could have a data set with only four observations that you're trying to match to an external sample--if the four observations all have commonly-occuring values of the matching variables, you might be able to come up with dozens or hundreds of matches for each. In fact, a large sample may be more difficult because there is a greater probability that a large data set will contain unusual combinations of matching variable values that are difficult to pair up with an external sample. It's really more about the extent to which the joint distribution of the matching variables in your sample matches that of the external sample.

That said, as I tried to make clear earlier in the thread, from a statistical perspective there is no reason to prefer matching without replacement. In my own work I always match with replacement unless coerced into doing otherwise by my project collaborators' insistence for no good reason. It is a simple exercise in counting to see that matching with replacement always results in a match that is at least as good as that of a match without replacement, and, in most circumstances, a better one.

If so, could you please kindly offer some further explanations about sampling with replacement?

I'm not sure what you have in mind. The question is unfocused. If you are looking for code, please respond with -dataex- examples of the data sets you are trying to match. Be sure the examples shown include potential matches to each other, and specify what the matching criteria are. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Announcement

How can I select a study population matching a published prospective trial?

Comment

Comment

Comment

Comment