Imbalanced data, bootstrap resampling, and logistic regression

Ali Balapour

Join Date: May 2017

Posts: 3
#1

02 May 2017, 13:36

Hi,

I am using Stata to perform logistic regression, but my dependent variable (DV) is imbalanced. My DV is binary, with about 13% zeros and 87% ones: 57 zeros and 374 ones out of 431 observations. According to the literature, an imbalanced DV is a problem that should be remedied by balancing the zeros and ones in the dataset. Here is what I have done:

*========= Stata Command ==========
sample 57 if DV == 1, count
* This draws 57 random observations from those with DV == 1 and keeps all 57 observations with DV == 0; the remaining ones are dropped. The resulting sample has 114 observations: 57 ones and 57 zeros.

logit DV IV1 IV2 IV3
*========= Stata Command ends ========
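
As a quick check of the counts described in the comment above, here is a minimal sketch. It assumes hypothetical stand-in data (the IV values are random draws made up purely for illustration, and the seed is arbitrary); only the 57/374 split comes from the post.

```stata
* Hypothetical stand-in data with the thread's counts: 57 zeros, 374 ones
clear
set seed 12345
set obs 431
generate byte DV = _n > 57
generate IV1 = rnormal()
generate IV2 = rnormal()
generate IV3 = rnormal()

* Keep 57 randomly chosen ones; observations that fail the if condition
* (the zeros) are all kept
sample 57 if DV == 1, count

* Inspect the balanced subsample: DV should now split evenly
tabulate DV
count
```

Setting the seed before sample makes the draw reproducible, which matters once the same step is repeated many times.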

This draws only one random sample from the ones. What I want is a bootstrap or loop in which a random sample of 57 observations is drawn from the ones, combined with the zeros (whose number stays fixed at 57), and then used for a logistic regression. I want this process to iterate 1000 times and, at the end, report the results aggregated across the 1000 iterations.

In other words, I want the commands above to iterate 1000 times and report a single table of logistic regression results.

In case it is not clear yet, here is more detail on what I am looking for:
*========= Stata Command ============
*iteration/ replication 1
sample 57 if DV == 1, count
* Generating sample A: 57 random 1, 57 fixed 0
logit DV IV1 IV2 IV3

*iteration/ replication 2
** Generating sample B: 57 random 1, 57 fixed 0
sample 57 if DV == 1, count
logit DV IV1 IV2 IV3

.
.
.

*iteration/ replication 1000
** Generating sample ZZZ: 57 random 1, 57 fixed 0
sample 57 if DV == 1, count
logit DV IV1 IV2 IV3

report: one table of results from all 1000 logistic regressions. This should not give me 1000 betas, but one beta coefficient for each IV. It is something like bootstrapping, except that I want to resample only part of my sample.

Please advise. How could I do that? What is the best approach?

Last edited by Ali Balapour; 02 May 2017, 13:39.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

03 May 2017, 15:28

You are more likely to obtain a helpful answer if you follow the FAQ on asking questions: provide Stata code in code delimiters, Stata output, and sample data using dataex. Also, try to simplify your presentation to the core issues.

Maybe others will see this differently, but I don't see why this procedure helps. As far as I know, logit does not require the same number of 0's and 1's. There is a literature on rare event logit that I don't follow.

So, you're running logit 1000 times and getting 1000 sets of results. You can collect the 1000 betas simply by generating an output variable before the loop, along with a macro that serves as a counter. After each logit, replace the beta value in the observation number given by the counter, then increment the counter by one. You end up with a variable holding all your betas, and you can then look at their distribution. Something like:

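* count tracks which observation receives the next estimate
local count 1
g beta=.
* note: beta can hold 1000 estimates only if the dataset has at least 1000 observations
forvalues i=1/1000 {
    * **whatever** (draw the subsample and run logit here)
    replace beta=_b[IV1] in `count'
    local count=`count' + 1
}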

You may have to write the _b[IV1] into a macro before writing it into beta (I'm not sure).
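
A fuller sketch of the loop described above, under stated assumptions: the data are hypothetical stand-ins (random IVs, arbitrary seed, with only the 57/374 split taken from the thread), and the estimates are collected with postfile rather than written into a variable in the working dataset, which is a substitution for the approach above and not the method it describes. Each replication uses preserve and restore so that every draw starts from the full 431 observations. The block is meant to be run as a single do-file, because tempname and tempfile create local macros.

```stata
* Hypothetical stand-in data: 57 zeros and 374 ones, with made-up predictors
clear
set seed 12345
set obs 431
generate byte DV = _n > 57
generate IV1 = rnormal() + 0.5*DV
generate IV2 = rnormal()
generate IV3 = rnormal()

* Results file that receives one row of coefficients per replication
tempname results
tempfile draws
postfile `results' rep b_IV1 b_IV2 b_IV3 using `draws'

forvalues i = 1/1000 {
    * Work on a copy so the full sample is available again next time
    preserve
    * Keep 57 random ones; all 57 zeros are retained
    sample 57 if DV == 1, count
    * capture suppresses the output and traps any estimation error
    capture logit DV IV1 IV2 IV3
    if _rc == 0 {
        post `results' (`i') (_b[IV1]) (_b[IV2]) (_b[IV3])
    }
    restore
}
postclose `results'

* One row per successful replication; summarize the coefficients across draws
use `draws', clear
summarize b_IV1 b_IV2 b_IV3
```

The means reported by summarize give one aggregated coefficient per IV. The standard deviations reflect only the variation from re-drawing the ones, so they probably should not be read as standard errors.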