Differing results with stepwise regression (& categorical predictors)

Bryony Simmons

Join Date: Jan 2018
Posts: 37

02 Aug 2019, 06:12

Hi,

I am having trouble interpreting the results of my logistic regression analysis.

I am running an exploratory analysis using stepwise regression (I understand this method has limitations) including categorical variables as follows:

Code:

xi: stepwise, pr(0.2): logistic outcome (i.nationality) gender age (i.medications) (i.quality)

I reran the final model without using stepwise regression (ie, input all the variables that are retained in the model), with either of the following:

Code:

logistic outcome i.nationality i.quality
logistic outcome _Inationali_2 _Inationali_3 _Inationali_4 _Iquality_2 _Iquality_3

The results of these final two regressions are identical to each other, but they differ slightly from the stepwise regression results. The stepwise analysis also retains fewer observations.

I am having trouble understanding the mathematics/reasons behind these models giving different results. Could anyone help with this? I am using Stata 14.2 & have included my data below.

Thank you in advance for any advice,
Bryony

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(outcome nationality) byte gender float(age age_cat medications    quality)
0 1 1   43 2 6 2
1 2 1   32 1 4 2
0 2 0 39.5 1 6 1
0 1 0   41 2 4 1
0 1 0 32.5 1 3 1
0 1 1 24.5 1 2 1
0 3 0   35 1 3 1
0 2 0 36.5 1 4 2
0 1 1   39 1 3 2
0 2 0   44 2 1 2
1 2 0 26.5 1 2 2
. 1 0 47.5 2 3 1
0 1 0   37 1 6 1
. 1 0 52.5 3 1 1
. 1 0 36.5 1 1 2
. 2 0   42 2 3 3
0 1 0   53 3 5 1
0 4 0   32 1 2 1
1 1 0   52 3 6 3
. 1 0 35.5 1 1 1
. 1 0   27 1 4 1
0 2 0   42 2 1 1
0 2 0   39 1 1 1
0 2 0 47.5 2 4 3
. 1 0 40.5 2 1 3
. 2 0   42 2 6 .
0 2 0   40 1 2 2
0 1 0   40 1 4 1
0 1 1   28 1 3 3
0 2 0   37 1 6 1
1 1 0    . . 1 3
0 1 0 34.5 1 4 2
0 2 0   28 1 3 1
0 1 1 28.5 1 4 1
0 1 0   56 3 4 1
0 4 0   43 2 3 2
0 1 0 38.5 1 4 1
0 1 0 60.5 3 3 2
0 1 0   46 2 1 1
0 1 0   44 2 2 2
. 1 0 55.5 3 2 1
0 1 0   35 1 5 1
0 1 1   32 1 3 3
0 4 0   39 1 3 1
. 2 0   46 2 5 1
0 2 0   36 1 4 1
1 4 0 55.5 3 5 .
0 4 0 28.5 1 3 1
1 1 1   55 3 1 1
1 1 0   34 1 4 2
1 1 0 65.5 3 1 .
1 1 0   55 3 4 3
. 2 0 33.5 1 1 2
0 1 0 42.5 2 1 1
0 2 0   40 2 4 2
0 2 0 27.5 1 1 1
0 2 0 22.5 1 6 3
. 1 0 45.5 2 4 1
1 1 0   50 2 1 1
0 1 0 48.5 2 5 2
0 1 0 42.5 2 6 2
0 2 0 51.5 3 3 1
. 3 0   43 2 1 1
0 2 0 30.5 1 4 1
0 4 0   47 2 3 2
0 2 0 36.5 1 1 1
0 2 0   52 3 4 2
0 2 0   35 1 1 1
0 2 0 35.5 1 5 3
1 1 1   48 2 2 2
0 1 0 25.5 1 2 1
. 1 0   45 2 4 2
. 1 0 47.5 2 5 1
. 1 0 49.5 2 3 1
1 4 0 35.5 1 2 3
0 1 0 30.5 1 4 3
0 1 0   47 2 2 1
. 2 0 36.5 1 4 1
. 1 0 55.5 3 3 1
. 2 0 40.5 2 2 2
. 1 0   38 1 4 2
0 2 1 25.5 1 4 3
. 1 0 60.5 3 2 1
. 1 0 29.5 1 3 1
0 2 0 50.5 3 5 1
. 1 0   32 1 2 2
. 1 0 30.5 1 4 1
0 1 0    . . 1 1
. 1 0 28.5 1 3 2
. 1 0 26.5 1 3 1
0 2 0 44.5 2 2 1
. 2 0 38.5 1 4 3
0 2 0    . . 2 1
1 3 0 42.5 2 4 2
1 1 1   32 1 4 3
. 2 0   30 1 6 2
. 2 0   34 1 5 2
1 2 1 30.5 1 1 3
. 2 0   32 1 3 1
. 2 0 40.5 2 2 1
end

Tags: categorical, logistic, regression, stepwise

Richard Williams

Join Date: Apr 2014

Posts: 5008
#2

02 Aug 2019, 06:33

It is likely because of missing data in the variables that were not selected. From the manual:

Whether you use backward or forward estimation, stepwise forms an estimation sample by taking observations with nonmissing values of all the variables specified (except for depvar1 and depvar2 for intreg). The estimation sample is held constant throughout the stepping. Thus if you type

. stepwise, pr(.2) hierarchical: regress amount sk edul sval

and variable sval is missing in half the data, that half of the data will not be used in the reported model, even if sval is not included in the final model.
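The effect is easy to reproduce. The sketch below is an illustration only, not the analysis from #1: it assumes Stata's shipped auto dataset, where rep78 is missing for 5 of the 74 cars, and treats rep78 as a candidate predictor that ends up outside the final model. A refit by hand only matches the stepwise-style sample once observations missing on any candidate predictor are excluded.

```stata
* Illustration on the shipped auto data (an assumption for demonstration only)
sysuse auto, clear

* Complete cases across all candidate variables, as -stepwise- would form them
count if !missing(foreign, mpg, weight, rep78)
local n_complete = r(N)

* Refit without rep78: uses every observation with nonmissing mpg and weight
logistic foreign mpg weight
local n_refit = e(N)

* Same refit, restricted to observations that are also nonmissing on rep78
logistic foreign mpg weight if !missing(rep78)
local n_restricted = e(N)

display "complete: `n_complete'  plain refit: `n_refit'  restricted refit: `n_restricted'"

* For the data in #1 the analogous refit would be along these lines (untested here):
* logistic outcome i.nationality i.quality if !missing(gender, age, medications)
```

The plain refit uses all 74 cars, while the restricted refit uses the same 69 complete cases that a stepwise run over mpg, weight and rep78 would be limited to.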

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Bryony Simmons

Join Date: Jan 2018

Posts: 37
#3

02 Aug 2019, 06:49

That's really helpful, thank you!
Richard Williams

Join Date: Apr 2014

Posts: 5008
#4

02 Aug 2019, 08:11

Incidentally, in Stata 16 lasso is being suggested as an alternative to stepwise. I told StataCorp people that lasso sounded like high-tech stepwise regression to me and was therefore the work of the devil, but they didn't agree. Here are two presentations on lasso from the 2019 Chicago Stata Conference:

https://www.stata.com/meeting/chicag...cago19_Liu.pdf

https://www.stata.com/meeting/chicag...19_Drukker.pdf

Bryony Simmons

Join Date: Jan 2018

Posts: 37
#5

02 Aug 2019, 15:30

Thank you for the lasso comment, Richard. I thought I understood the stepwise process, but there is another thing I do not quite understand.

In the dummy dataset below, I run the same stepwise regression on two different outcomes (outcome1 and outcome2), keeping the covariates the same. Both outcomes are fully observed (no missing values).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(outcome1 outcome2 cat1 cont1 cat2 bin1 nonmiss)
0 1 1 12 . 0 1
1 1 1 13 2 0 0
1 1 1 14 3 0 0
0 1 . 11 1 1 1
0 0 2 12 2 1 0
0 0 2 13 3 1 0
1 0 2 14 1 0 0
1 0 2  . 2 0 1
0 0 2 12 3 0 0
0 0 3 14 1 1 0
1 0 3 10 2 1 0
1 1 3 12 3 1 0
0 1 3  . 1 0 1
0 1 3 14 2 0 0
1 0 4 15 3 0 0
1 0 4 16 1 . 1
1 1 4 12 2 1 0
0 1 4 13 3 1 0
0 1 4 15 1 0 0
end

Code:

xi: stepwise, pr(0.2): logistic outcome1 (i.cat1) cont1 (i.cat2) bin1
xi: stepwise, pr(0.2): logistic outcome2 (i.cat1) cont1 (i.cat2) bin1

I understood that the estimation sample would be formed by taking all observations with non-missing values (ie, n=14). However, the number of observations included in the model differs for outcome1 (n=14) and outcome2 (n=10). Is it possible to explain the reason for this?

My apologies for my misunderstanding & thank you for your patience.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#6

03 Aug 2019, 05:50

It says observations are being dropped because of estimation problems. The sample is very small. It is a problem with the model and data, not with stepwise.
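One way to see where the four observations go is sketched below, with the data from #5 re-entered so the block runs on its own. Among the 14 complete cases, every observation with cat1 == 2 has outcome2 == 0. A category like that predicts failure perfectly, and logistic omits the corresponding dummy together with its observations, which would account for 14 - 4 = 10. For outcome1 the same category contains both 0s and 1s, so nothing is dropped. The variable name complete is introduced here purely for illustration.

```stata
clear
input float(outcome1 outcome2 cat1 cont1 cat2 bin1 nonmiss)
0 1 1 12 . 0 1
1 1 1 13 2 0 0
1 1 1 14 3 0 0
0 1 . 11 1 1 1
0 0 2 12 2 1 0
0 0 2 13 3 1 0
1 0 2 14 1 0 0
1 0 2  . 2 0 1
0 0 2 12 3 0 0
0 0 3 14 1 1 0
1 0 3 10 2 1 0
1 1 3 12 3 1 0
0 1 3  . 1 0 1
0 1 3 14 2 0 0
1 0 4 15 3 0 0
1 0 4 16 1 . 1
1 1 4 12 2 1 0
0 1 4 13 3 1 0
0 1 4 15 1 0 0
end

* complete (hypothetical name) flags rows with no missing predictor
generate byte complete = !missing(cat1, cont1, cat2, bin1)
count if complete

* Compare the two outcomes across cat1 within the complete cases
tabulate cat1 outcome1 if complete
tabulate cat1 outcome2 if complete

* Complete cases in cat1 == 2 with outcome2 == 1: none, so cat1 == 2 predicts failure perfectly
count if complete & cat1 == 2 & outcome2 == 1
```

With a sample this small, checking each categorical predictor against the outcome in this way before fitting is a quick screen for empty cells, though it will not catch every estimation problem.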
