  • extract random subsample

    I am applying logistic regression to an imbalanced dataset.
    In order to improve specificity, I would like to resample the majority (negative) class, extracting a number of random records equal to the size of the minority class.
    Any suggestion on how to perform this in Stata?
    Thanks

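    For concreteness, a minimal sketch of one way to draw such a subsample in Stata is below; it assumes the binary outcome is stored in a 0/1 variable called ecg, which is a placeholder name rather than anything specified in the question.

        * keep every positive record and an equally sized random draw of negatives
        set seed 12345                    // make the random draw reproducible
        count if ecg == 1                 // size of the minority (positive) class
        local npos = r(N)
        * sample, count draws exactly `npos' observations from those meeting the
        * if condition; observations not meeting it are kept at a rate of 100%
        sample `npos' if ecg == 0, count
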
  • #2
    There was a recent post that is germane. Please see the thread whose opening post begins:
    "Hi Forum I have searched the forum for a possible answer to the following question but to no avail. If I have a dataset with some rare events, often the case ..."
    Stata/MP 14.1 (64-bit x86-64)
    Revision 19 May 2016
    Win 8.1

    • #3
      Thanks, that post was useful.

      • #4
        I'm not sure I see the justification for subsampling here. The majority (negative) group is the denominator of a specificity estimate; decreasing its size can only reduce precision. Subsampling can be desirable for other reasons, so if you will tell us more about your study, including the numbers of positives and negatives, perhaps we can give better advice.
        Last edited by Steve Samuels; 22 Mar 2016, 11:58.
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2

        • #5
          Hello Steve,

          Thanks for your reply; I am happy to share my research.
          I have a large dataset with 60,938 observations (individual sports medical examinations). Each observation has 7 variables: age, height, weight, pulse rate, minimum and maximum blood pressure, and electrocardiography (ECG) outcome (either positive, 1, or negative, 0). All variables but ECG are continuous. My aim is to predict the binary ECG outcome by means of logistic regression, using the 6 continuous covariates. Only 8.8% of observations (5,375) are positive, so it is a fairly rare event.

          Using logistic straight away gives 0% sensitivity (missing all the interesting observations, i.e. the positive ones) and 100% specificity. Therefore, as also suggested by King & Zeng (2001), I undersampled the majority class to improve sensitivity. I realised, though, that classification performance changes with the amount of undersampling. Also, even though I selected "use estimation sample", it seems that Stata is using the whole dataset to evaluate classification performance. Besides, I would like to test the model on a sample with the same class distribution as the population (91.2% 0s and 8.8% 1s).

          I then applied Firth logit (Firth, Bias reduction of maximum likelihood estimates, 1993) and rare-events logistic regression (King and Zeng, Logistic regression in rare events data, 2001), but there seems to be no postestimation classification command for these estimators in Stata.

          Thanks for your precious insights.
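          One possible workaround, sketched here purely for illustration (firthlogit is the user-written SSC command; the covariate names are placeholders, and it is assumed that predict, xb is available after estimation), is to tabulate the classification by hand from the linear predictor:

              * after, e.g.,  firthlogit ecg age height weight pulse bpmin bpmax
              predict double xbhat, xb                 // linear predictor
              generate double phat = invlogit(xbhat)   // predicted probability
              generate byte class = phat >= 0.5 if !missing(phat)
              tabulate class ecg, column               // predicted vs. true status
              roctab ecg phat                          // area under the ROC curve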


          • #6
            You are asked in the FAQ to give full bibliographic references. The reference to King and Zeng (2001) is below.

            The most important information in your post is that your goal is prediction.

            I see no benefit of subsampling negatives and some disadvantages. You previously wrote that subsampling would improve specificity; now you write that it will improve sensitivity. Neither statement is true, I believe. (If you have a reference that says otherwise, please provide it.) Indeed, the words "sensitivity" and "specificity" do not appear in the King-Zeng article. One important justification for subsampling is that the entire sample is too big to process. That is not the case here: roughly 61,000 observations is a small data set these days.

            In fact, random subsampling will make a predictive analysis more complicated and less accurate. Prediction will be more complicated because you would need to weight your model in order to get estimates of probabilities; it will be less accurate because subgroups will not be represented exactly in proportion to their population size. You can try stratification and post-stratification techniques, but why, when it is unnecessary?
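            To make the weighting point concrete, here is a minimal sketch, not taken from the thread, of inverse-selection-probability weights after keeping all positives and a random fraction of negatives (ecg and the covariate names are placeholders):

                * p_neg = fraction of negatives retained; keeping as many negatives
                * as positives would give roughly 5,375/55,563 here
                local p_neg = 5375/55563
                generate double pw = cond(ecg == 1, 1, 1/`p_neg')
                logit ecg age height weight pulse bpmin bpmax [pweight = pw]
                predict double phat, pr   // probabilities on the full-population scale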

            I also don't think that you are in a "rare event" situation. Figure 3 on p. 16 of King and Zeng (2001) shows, on the vertical axis, D, the maximum absolute difference between logit and Bayes estimates of risks. For the curve with n = 20,000 and a percentage of events close to 8.8% (1,760 events), D is about 0.1%, already a small number. Extrapolate to n = 60,000 and 5,400 events (your situation) and D < 0.01% is likely. Prediction will require some kind of validation. If you split the entire sample into equal-sized training and validation data sets, I'd guess D would be no more than 0.05%.

            Analyses with the goal of prediction can be very different from other kinds. If you want to ask about prediction, start another topic.


            References:

            Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2011. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics). Springer, New York. https://web.stanford.edu/~hastie/loc...LII_print4.pdf

            James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics). Springer, New York. http://www-bcf.usc.edu/~gareth/ISL/I...20Printing.pdf

            King, Gary, and Langche Zeng. 2001. Logistic regression in rare events data. Political Analysis 9(2): 137-163.
            Preprints of all of Gary King's papers on rare events, including this one, can be found at http://gking.harvard.edu/category/re...s/rare-events.
            Last edited by Steve Samuels; 25 Mar 2016, 08:54.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2

            • #7
              In my first post I wrote "specificity", but meant "sensitivity" (my mistake), and that is what I would like to improve, since post-estimation classification in my case resulted in 0% sensitivity:

              Logistic model for ECG

                            -------- True --------
              Classified |         D            ~D  |      Total
              -----------+--------------------------+-----------
                   +     |         0             0  |          0
                   -     |      5305         53331  |      58636
              -----------+--------------------------+-----------
                 Total   |      5305         53331  |      58636

              Classified + if predicted Pr(D) >= .5
              True D defined as ECG != 0
              --------------------------------------------------
              Sensitivity                   Pr( +| D)      0.00%
              Specificity                   Pr( -|~D)    100.00%


              All individuals were classified as negative: this is what usually happens when one class is heavily under-represented (even if, strictly speaking, you do not consider this a rare-events situation).
              I agree that my dataset is not big at all. Still, undersampling the negative class improved sensitivity:

              Logistic model for ECG

                            -------- True --------
              Classified |         D            ~D  |      Total
              -----------+--------------------------+-----------
                   +     |      3170          2476  |       5646
                   -     |      2135          2684  |       4819
              -----------+--------------------------+-----------
                 Total   |      5305          5160  |      10465

              Classified + if predicted Pr(D) >= .5
              True D defined as ECG != 0
              --------------------------------------------------
              Sensitivity                   Pr( +| D)     59.75%
              Specificity                   Pr( -|~D)     52.02%

              I understand the problem of the altered class proportions. That is why I selected "use estimation sample" for the estat classification command.
              Maybe this is not sufficient: I am not familiar with Stata, which is why I am asking here.
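              For what it is worth, a random train/test split is one way to obtain an evaluation sample that keeps the population class distribution; here is a minimal sketch (placeholder covariate names; estat classification accepts if to restrict the observations being classified):

                  set seed 98765
                  generate byte test = runiform() < 0.5      // random half for evaluation
                  logit ecg age height weight pulse bpmin bpmax if !test
                  estat classification if test               // held-out half only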

              You say stratification is not necessary, but you do not suggest what to do instead, which is what I am looking for.
              Prediction has been my aim since the beginning of the thread (even if I did not write it explicitly, though why should that matter?), so I don't think I should start a new one.
              The administrator may advise me to do so, though.

              • #8
                Subsampling negatives won't improve sensitivity either, because sensitivity is calculated as (correctly classified positives / all positives); negatives don't enter into the equation. Similarly, specificity is a function of counts of negatives only. Subsampling them won't hurt your estimates of specificity much, because so many negatives will be left. However, you will be unable to estimate quantities such as false positive and false negative rates without weighting your data.
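                Using the counts from the subsampled table above, for example, both measures depend on only one column of the table:

                    display "sensitivity = " 3170/5305    // positives only -> .5975
                    display "specificity = " 2684/5160    // negatives only -> .5202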

                However, I now see your problem. estat classification gave you zero sensitivity because it assumes the prediction rule: predict positive if the estimated probability is >50%; 50% is the default "cutpoint". However, you have it backwards: one goal of a prediction analysis is to choose a cutpoint that gives good sensitivity and specificity. You should have looked at lroc and lsens. A better cutpoint for your data will be very small. You can estimate optimal cutpoints with John Clayton's cutpt command from SSC. However, these commands alone will give you biased estimates of sensitivity and specificity.
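                A minimal sketch of that workflow on the full data set (placeholder covariate names; cutpt must be installed from SSC, and the cutoff of .088, roughly the prevalence, is only an illustrative choice):

                    logit ecg age height weight pulse bpmin bpmax
                    lroc                                  // ROC curve and its area
                    lsens                                 // sensitivity/specificity vs. cutoff
                    predict double phat, pr
                    cutpt ecg phat                        // empirical cutpoint estimate (ssc install cutpt)
                    estat classification, cutoff(.088)    // re-classify at a lower cutoff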

                So the bottom line is: Analyze the entire dataset. Your reason for subsampling is based on a misinterpretation of estat classification. And yes, you should start a new thread about prediction. Nobody interested in prediction will look at a thread that is titled "extract random subsample"; I looked only because I'm interested in sampling. Having given you my considered answer to your original question, I don't plan to respond further here.
                Last edited by Steve Samuels; 07 Apr 2016, 11:26.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2
