Independent variable selection for ordinal logistic regression model

Ashley Riddle

Join Date: Jan 2020
Posts: 20

22 Jan 2020, 11:48

Hello all,

I’m helping with the analysis for a clinical research project examining associations between socioecologic environment (measured by values such as median income, obesity rate, etc. for a patient’s ZIP code) and severity of a specific disease (separated into stages 1, 2, and 3).

I’d like to fit an ordinal logistic regression model to the data, likely using gologit2. I have many potential IVs and only 150 subjects, so I need to narrow down my IVs quite a bit. All variables have some plausible theoretical association with the outcome, so theory-based model building isn’t helping me eliminate much. What would be the best way to select variables for inclusion in this model? I’ve found information on this topic for logistic and linear regression, but not ordinal logistic regression. I’ve read a little about principal component analysis, but it seems like that could make the model more difficult to interpret.

A data sample is below in case it’s helpful:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float diseasestage byte black double(enroll_age bmi_calc percentobese percentinactive percentexerciseaccess percentuninsured pcprate percentunemployed percentchildpoverty percentfoodinsec) long medhouseincome double segregationindexbw
1 0   22 27.3 31.4 23.1 92.454842897 13.174296752           80.56414  5.214064915 25.6 18.5 47754 49.775083188
1 0 21.1 40.7 25.1 19.4 92.205217762 12.874479923           86.93802 4.6838267454 18.1 16.4 63197 52.502188707
1 0 38.9 42.5 22.9   17 92.220361312 9.7436039952           85.53034 4.2366938009 11.5 13.5 76173 42.958102877
1 0 34.9 44.2 35.9 31.8 65.025632406 14.778388044           48.18779 5.9127625202 32.3 21.1 39341 42.761927748
1 0 45.7 30.5 35.1 26.6 44.889398998 12.430834434           59.62585 6.5778725722 25.1 19.8 47403 28.750982275
1 1 41.3 47.7 22.9   17 92.220361312 9.7436039952           85.53034 4.2366938009 11.5 13.5 76173 42.958102877
1 1 43.4 25.9 26.4 20.3 91.522757085 13.747080414          121.94636 4.4949827786 23.7 17.9 54255 39.727826668
1 0 36.6 24.8   30   26 78.562932655 12.451845907 100.65854999999999 5.5272060737 24.2 21.6 45918 31.948632196
1 0 21.7 25.4 33.4 25.4 79.081867446 10.771492023           74.11113 6.3310939223 27.2 19.9 45286 30.424114337
1 0 15.8 21.6 22.9   17 92.220361312 9.7436039952           85.53034 4.2366938009 11.5 13.5 76173 42.958102877
1 0 11.6 31.5 35.2 28.1 68.549485427 13.705615284           30.16266 4.6305645799 19.2 13.1 55174  25.37213294
1 0 25.2 34.2 38.2 30.2 51.412051573 13.751507841           49.14124 5.8252788796 29.2 18.9 42421 39.477852689
1 1 19.7 45.7 22.9   17 92.220361312 9.7436039952           85.53034 4.2366938009 11.5 13.5 76173 42.958102877
1 1 27.1 37.9 32.9 25.9 85.653505899 14.144132931           48.01739 4.7319970116 23.8 15.4 46060 42.314389935
1 0 57.8 31.7 22.9   17 92.220361312 9.7436039952           85.53034 4.2366938009 11.5 13.5 76173 42.958102877
1 0 15.5 25.9 35.2 27.3 76.792875515 14.734513274           64.77231 9.1764303429 38.7 25.5 35138 30.307836833
1 1 17.9 28.3 22.9   17 92.220361312 9.7436039952           85.53034 4.2366938009 11.5 13.5 76173 42.958102877
1 1 17.3 20.9 26.4 20.3 91.522757085 13.747080414          121.94636 4.4949827786 23.7 17.9 54255 39.727826668
1 1 58.8 32.9 26.4 20.3 91.522757085 13.747080414          121.94636 4.4949827786 23.7 17.9 54255 39.727826668
1 0   29   42 22.9   17 92.220361312 9.7436039952           85.53034 4.2366938009 11.5 13.5 76173 42.958102877
end
label values black black
label def black 0 "Unchecked", modify
label def black 1 "Checked", modify

Thanks so much, and please let me know if there is any more information I can provide!

-Ashley

Richard Williams

Join Date: Apr 2014

Posts: 4983
#2

22 Jan 2020, 13:30

Welcome to Statalist.

Have you considered data reduction techniques, like creating scales or using factor analysis? You may be able to come up with fewer but higher-quality measures that way. You may have, say, 50 possible independent variables, but they may be reflections of a much smaller number of underlying latent variables, e.g. several items may just be different ways of measuring the same thing.
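As a rough illustration of the factor-analysis route (a sketch, not a recommendation for this particular dataset): the commands below assume the full 150-subject data are in memory with the variable names from the -dataex- listing in #1. Retaining two factors and naming the scores f1 and f2 are arbitrary choices made for the example; the eigenvalues and loadings from your own data should drive that decision.

```stata
* Run from a do-file (the /// line continuations do not work interactively).
* Assumes the full dataset is loaded; variable names are taken from post #1.
local envvars percentobese percentinactive percentexerciseaccess ///
    percentuninsured pcprate percentunemployed percentchildpoverty ///
    percentfoodinsec medhouseincome segregationindexbw

* Principal-component factor extraction; factors(2) is an assumption
factor `envvars', pcf factors(2)

* Orthogonal varimax rotation to make the loadings easier to read
rotate, varimax

* Save factor scores under hypothetical names f1 and f2
predict f1 f2

* Use the scores as predictors in place of the ten original variables
ologit diseasestage f1 f2
```

The same two scores could be passed to gologit2 instead of ologit if the proportional-odds assumption looks doubtful.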

For some basic stuff on scale construction, see

https://www3.nd.edu/~rwilliam/stats2/l23.pdf
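A minimal sketch of building a simple additive scale, assuming (purely for illustration) that four of the ZIP-level variables from #1 tap a common "economic hardship" dimension. The grouping of items and the scale name hardship are hypothetical; check the reported alpha and the item-rest correlations before relying on any such scale.

```stata
* Assumes the full dataset from post #1 is in memory.
* Standardize the items, show item-level diagnostics, and save the
* scale score in a new variable (the name "hardship" is hypothetical).
alpha percentuninsured percentunemployed percentchildpoverty ///
    percentfoodinsec, std item generate(hardship)

* The single scale then stands in for the four original variables
ologit diseasestage hardship
```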

Also, you can use stepwise selection with ologit, but most people prefer to avoid that because stepwise selection is the work of Satan. See https://www.stata.com/support/faqs/s...sion-problems/

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Ashley Riddle

Join Date: Jan 2020

Posts: 20
#3

23 Jan 2020, 08:04

Hi Richard,

I appreciate your help! I think factor analysis would actually be perfect. I remember seeing another one of your posts disparaging stepwise selection, so I'll steer clear of that for now (not today, Satan!). Thank you so much!

Ashley