
  • Imbalanced data

    Hi Statalist community,

    The main variable (CA) of my regression is highly imbalanced.
    Code:
    tab CA
                  Frequency     Percentage
    fc_yes              900             5%
    fn_yes             5000            28%
    no                12000            67%
    Since CA is the main variable of the analysis, I am wondering whether it is valid to test my hypothesis with such an uneven distribution.

    The dataset is a nationally representative survey.


    Could anyone please guide me on how to go about it?

    Regards,
    Ajay
    Last edited by ajay pasi; 18 Dec 2022, 14:13.

  • #2
    5% of 18,000 observations is not small, if that is what you are worried about; it is still about 900 cases. Even if you were comparing this category with the other two, you would use regular logistic regression rather than penalized logit or other estimation methods for rare events. Paul Allison has a nice discussion of what a rare event means in the context of a binary dependent variable model; see https://statisticalhorizons.com/logi...r-rare-events/. So I would advise that you proceed with standard methods.
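
    For illustration only, here is a minimal sketch on simulated data (not the survey in question). It generates 18,000 observations with a binary outcome that occurs in roughly 5% of cases and fits a standard logit. The variable names y and x, the seed, and the coefficients are all made up for this example. The commented-out lines assume the community-contributed firthlogit command from SSC, in case you want to compare against a penalized fit.
    Code:
    * Simulated illustration: y, x, the seed and the coefficients are hypothetical
    clear
    set seed 12345
    set obs 18000
    generate x = rnormal()
    * Binary outcome that equals 1 in roughly 5% of observations
    generate byte y = runiform() < invlogit(-3.1 + 0.5*x)
    tabulate y
    * Standard maximum-likelihood logistic regression
    logit y x
    * Optional comparison with Firth's penalized likelihood
    * (community-contributed; assumes installation via: ssc install firthlogit)
    * firthlogit y x
    With several hundred events, the two sets of estimates would be expected to be very close, which is the sense in which 5% of a sample this size is not "rare".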



    • #3
      Thank you Andrew sir, I appreciate your help.
