Imbalanced data

ajay pasi

Join Date: Jan 2019

Posts: 170
#1

18 Dec 2022, 14:09

Hi Statalist community,

The main variable (CA) of my regression is highly imbalanced.

Code:

tab CA

CA          Frequency    Percentage
fc_yes            900            5%
fn_yes           5000           28%
no              12000           67%

Since this is the main variable of the analysis, I am wondering whether it is appropriate to test my hypothesis with such an imbalanced distribution.

The dataset is a nationally representative survey.

Could anyone please guide me on how to go about it?

regards,
Ajay

Last edited by ajay pasi; 18 Dec 2022, 14:13.
Andrew Musau

Join Date: Oct 2014

Posts: 10188
#2

18 Dec 2022, 15:13

5% of 18,000 observations is not small, if that is what you are worried about. Even if you were comparing this category with the other two, you could use regular logistic regression rather than penalized logit or other estimation methods for rare events. Paul Allison has a nice discussion of what a rare event means in the context of a binary dependent variable model; see https://statisticalhorizons.com/logi...r-rare-events/. So I would advise that you proceed with standard methods.
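A minimal sketch of what "standard methods" might look like here, under assumptions that are not in the original post: CA is taken to be a string variable, the comparison of interest is fc_yes against the other two categories, and x1 and x2 are placeholder covariates.

```stata
* Hypothetical sketch: assumes CA is a string variable.
* x1 and x2 are placeholder covariates, not variables from the post.

* Indicator for the smallest category (fc_yes) versus the other two,
* left missing wherever CA itself is missing
generate byte fc = (CA == "fc_yes") if !missing(CA)

* Check the number of events, not just the share:
* about 900 of roughly 18,000 observations
tabulate fc

* Ordinary logistic regression, with no rare-events correction
logit fc x1 x2
```

If CA is instead a labelled numeric variable, the comparison in the generate line would need to use the numeric code for fc_yes rather than the string.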
ajay pasi

Join Date: Jan 2019

Posts: 170
#3

18 Dec 2022, 21:38

Thank you Andrew sir, I appreciate your help.