Bug in forward stepwise command? It doesn't seem to operate as described.

Eric Rasmusen

Join Date: Dec 2014

Posts: 7
#1

13 Feb 2015, 12:29

I tried forward stepwise regression of the variable conservative7 on two macro lists of variables, xissues and xidentity, like this:

stepwise, forward pr(.0000000105) pe(.00000001): regress conservative7 `xissues' `xidentity' [pweight=weights];

This is mixed forward and backward stepwise regression. It ran normally. My problem is that it doesn't do what the online description says it does. The procedure is supposed to start by finding the x-variable that has the highest t-statistic (or correlation, or R2; these are all equivalent) when conservative7 is regressed on it. It doesn't. It picks a variable that does worse than at least two others. I noticed this when I looked at the simple bivariate correlations, and I verified it by using the regress command to run bivariate (with constant) regressions. I haven't checked manually what stepwise does at later stages.
Since people don't run stepwise to stop after finding just one variable, this may be an overlooked bug.
Does anyone have any idea what might be happening?
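
For reference, the equivalence claimed above can be written out for a bivariate regression with a constant, n observations, and conventional (non-robust) OLS standard errors, where r is the sample correlation between the outcome and the candidate variable. This is a standard textbook identity, not something taken from the stepwise documentation:

```latex
R^2 = r^2, \qquad
t = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad
t^2 = (n-2)\,\frac{R^2}{1-R^2}
```

So for a fixed n, ranking candidates by |t|, |r|, or R2 gives the same order. The rankings need not agree if the candidates are evaluated on different sets of cases, or if the t-statistics come from robust standard errors.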
ben earnhart

Join Date: May 2014

Posts: 1027
#2

13 Feb 2015, 14:43

Have you tried manually regressing your variables? I wouldn't be surprised if it pays attention to multicollinearity when choosing what to include or exclude. It could be that your favorite is the strongest predictor as a single variable, but some combination of others (one that would be impossible with your favorite included) has a higher r^2.

I don't think you'll get a lot of help with stepwise here; it's like asking how to summon demons on a forum for priests.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

13 Feb 2015, 15:21

I don't think you'll get a lot of help with stepwise here; it's like asking how to summon demons on a forum for priests.

Well said, Ben!
Richard Williams

Join Date: Apr 2014

Posts: 4987
#4

13 Feb 2015, 15:59

Did I hear somebody calling me?

A replicable example, or showing the output from what you did, might help. One thing I would check is to make sure that your use of weights and listwise deletion was consistent.

EDIT: If you used pwcorr to get the correlations, be sure you used the listwise option.
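
To illustrate why the sample matters, here is a minimal Python sketch on made-up data (unweighted, and not Stata code). The function names `pearson` and `corr_with_y`, the variable names, and the values are all assumptions chosen for illustration. It compares correlations computed pairwise (each pair uses whatever rows it has) with correlations computed listwise (only rows complete on every variable).

```python
import math

def pearson(xs, ys):
    """Unweighted Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def corr_with_y(rows, xname, complete_on):
    """Correlate y with xname using only rows that are non-missing (not None)
    on every variable named in complete_on. Returns (r, rows_used)."""
    kept = [r for r in rows if all(r[v] is not None for v in complete_on)]
    return pearson([r[xname] for r in kept], [r["y"] for r in kept]), len(kept)

# Made-up data: x2 is missing (None) in the last three rows.
rows = [
    {"y": 1,  "x1": 2,  "x2": 1},
    {"y": 2,  "x1": 1,  "x2": 2},
    {"y": 3,  "x1": 4,  "x2": 3},
    {"y": 4,  "x1": 3,  "x2": 5},
    {"y": 5,  "x1": 5,  "x2": 4},
    {"y": 10, "x1": 10, "x2": None},
    {"y": 11, "x1": 11, "x2": None},
    {"y": 12, "x1": 12, "x2": None},
]

all_vars = ["y", "x1", "x2"]
# Pairwise: each correlation uses every row available for that pair.
pair_x1, n_pair_x1 = corr_with_y(rows, "x1", ["y", "x1"])
pair_x2, n_pair_x2 = corr_with_y(rows, "x2", ["y", "x2"])
# Listwise: both correlations use only rows complete on all variables.
list_x1, n_list_x1 = corr_with_y(rows, "x1", all_vars)
list_x2, n_list_x2 = corr_with_y(rows, "x2", all_vars)

print(f"pairwise: x1 r={pair_x1:.3f} (n={n_pair_x1}), x2 r={pair_x2:.3f} (n={n_pair_x2})")
print(f"listwise: x1 r={list_x1:.3f} (n={n_list_x1}), x2 r={list_x2:.3f} (n={n_list_x2})")
```

On these made-up numbers x1 has the higher correlation pairwise, but x2 comes out ahead once both correlations are computed on the common sample. Whether anything like this explains the result in the original post depends on the actual missing-data pattern.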

Last edited by Richard Williams; 13 Feb 2015, 16:03.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
ben earnhart

Join Date: May 2014

Posts: 1027
#5

14 Feb 2015, 00:05

Begone, Dr. Williams! I realize that data mining with tools like CHAID is valid, but old techniques like stepwise regression are still anathema. Begone once, begone twice, begone thrice! Begone!
Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

14 Feb 2015, 02:52

A rumour that stepwise is buggy would be a good idea. I'd just want to flag that your threshold P-values seem unusually low.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#7

14 Feb 2015, 05:52

Eric:
Along the same lines as the previous replies to your query, I remember a post from last year by Maarten Buis (http://www.statalist.org/forums/foru...oodness-of-fit) in which he quoted an old Stata FAQ that basically banned stepwise regression (http://www.stata.com/support/faqs/st...sion-problems/).

Kind regards,
Carlo
(Stata 19.0)
Richard Williams

Join Date: Apr 2014

Posts: 4987
#8

14 Feb 2015, 07:21

At least I didn't show him how to do it in SPSS.

The p values are not only low but weird. I can see having a bunch of zeros followed by a 1, but why .0000000105?

My theory continues to be that, because of missing data, Eric was not always analyzing the same cases as he conducted his various analyses.

Eric Rasmusen

Join Date: Dec 2014

Posts: 7
#9

14 Feb 2015, 08:03

Thanks for the responses. Yes, stepwise wrecks all of my t-tests beyond repair, and it was roundly condemned by MIT economics when I took my econometrics there, but it's OK for my unusual purpose. Mark Ramseyer and I are trying to find the variables that best predict liberal-conservative ideology using 50 or so variables. A monster regression with all of them leaves 10 or 15 significant, though the problem of looking at separate t-values rather than some kind of joint test is pretty much as bad as stepwise. In any case, we want something parsimonious, even if it's biased. Rather than using the Akaike criterion or such things to tell us how many variables to include, we want to show what you'd use if you were only allowed to ask one question, two questions, three questions, etc., up to ten questions, and show the reader how fast the R2 changes as you do that. We'll show stepwise, best subsets (that is, try all combinations), and lasso. They all come out about the same. We'll probably do some fancier stuff with bootstrapping and splitting the sample up into 10 parts too, but since we're leaving classical statistics territory we'll emphasize transparency. That's a big reason to use mean imputation, which seems to be universally deprecated.

Edit: Also, we have no theory--- in fact, we want to exclude as far as feasible our personal priors on what variables matter, so we include all issue-opinion variables in the dataset. We do limit the number of non-issue variables, but for them our question is whether identity-politics variables turn out to matter or not. And we do have to decide which variables are essentially ideology-summary variables--- "Republican", for example, or "Voted for Obama".

Last edited by Eric Rasmusen; 14 Feb 2015, 08:28.
Eric Rasmusen

Join Date: Dec 2014

Posts: 7
#10

14 Feb 2015, 08:17

Originally posted by ben earnhart

Have you tried manually regressing your variables? I wouldn't be surprised if it pays attention to multicollinearity when choosing what to include or exclude. It could be that your favorite is the strongest predictor as a single variable, but some combination of others (one that would be impossible with your favorite included) has a higher r^2.

I don't think you'll get a lot of help with stepwise here; it's like asking how to summon demons on a forum for priests.

Yes, I've done manual stepwise, which is pretty easy. Stepwise does in effect pay attention to multicollinearity. It starts with the single best variable, and if the one with the next highest simple correlation is multicollinear with that first one, it skips it. In fact, in my application, the best-2 regression doesn't include the best-1 variable for that reason when I do manual stepwise.
I've thought of an easy way to do manual stepwise, by the way--- one that avoids having to do lots of combination-regressions at each stage. Start by looking at the correlations of yvar with all the xvars. Pick the xvar with the highest correlation and call it xvar1. Regress yvar on xvar1 and call the residuals yvar2. Find the correlations of yvar2 with all the xvars (note that xvar1's correlation will be 0 now). Pick the xvar with the highest correlation and call it xvar2. Regress yvar on xvar1 and xvar2 and call the residuals yvar3.

Find the correlations of yvar3 with all the xvars (note that xvar1's and xvar2's correlations will be 0 now). Pick the xvar with the highest correlation and call it xvar3. Regress yvar on xvar1, xvar2, and xvar3 and call the residuals yvar4. Now do something different. Look at the last regression and see if xvar3 has a higher t-statistic than xvar1. If it does, kick out xvar1, regress yvar on xvar2 and xvar3 as the best-2 regression, and call ITS residuals yvar3. Proceed according to the add-remove pattern described.
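
The forward part of the manual procedure described above can be sketched in a few lines of Python. This is only an illustration on made-up data: the names `forward_select`, `residualize`, and the `partial` flag are hypothetical, weights are ignored, and the backward (kick-out) check is left out. It records R2 after each addition and stops at `max_vars`. One caveat worth hedging: correlating the y residuals with the raw x's, as described, is not quite the same as ranking by the entry t-statistic, which corresponds to the partial correlation (the candidate x is also residualized on the variables already included). Setting `partial=True` switches to that criterion; the two can rank candidates differently when candidates are correlated with the included variables.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residualize(v, basis):
    """Subtract from v its projection on each vector of a mutually orthogonal basis."""
    out = list(v)
    for q in basis:
        coef = dot(out, q) / dot(q, q)
        out = [o - coef * b for o, b in zip(out, q)]
    return out

def forward_select(y, xs, max_vars, partial=False):
    """Greedy forward selection. xs maps name -> list of values.
    Returns [(name, r2_after_adding), ...] in order of entry."""
    yc = center(y)
    tss = dot(yc, yc)
    resid = yc      # residuals of y given the variables chosen so far
    basis = []      # orthogonalized versions of the chosen variables
    picked = set()
    path = []
    while len(path) < max_vars:
        if dot(resid, resid) <= 1e-12 * tss:
            break   # nothing left to explain
        best = None
        for name, x in xs.items():
            if name in picked:
                continue
            xc = center(x)
            x_res = residualize(xc, basis)
            if dot(x_res, x_res) <= 1e-12 * dot(xc, xc):
                continue  # constant, or numerically collinear with chosen variables
            cand = x_res if partial else xc
            r = dot(resid, cand) / math.sqrt(dot(resid, resid) * dot(cand, cand))
            if best is None or abs(r) > abs(best[1]):
                best = (name, r, x_res)
        if best is None:
            break
        name, _, q = best
        picked.add(name)
        basis.append(q)
        resid = residualize(resid, [q])
        path.append((name, 1 - dot(resid, resid) / tss))
    return path

# Made-up data: y is roughly 3*a + 2*b plus a little noise.
a = [1, -1, 1, -1, 1, -1]
b = [1, 1, -1, -1, 0, 0]
c = [1, 1, 1, -1, -1, -1]
y = [5, -1, 1, -5, 4, -4]

path = forward_select(y, {"a": a, "b": b, "c": c}, max_vars=3)
for name, r2 in path:
    print(f"added {name}: R2 = {r2:.4f}")
```

Capping `max_vars` is one way to get "the best 1, 2, ..., 10 questions" style of output without choosing significance thresholds, though it is not what Stata's stepwise does internally.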
Eric Rasmusen

Join Date: Dec 2014

Posts: 7
#11

14 Feb 2015, 08:20

Originally posted by Richard Williams

Did I hear somebody calling me?

A replicable example, or showing the output from what you did, might help. One thing I would check is to make sure that your use of weights and listwise deletion was consistent.

EDIT: If you used pwcorr to get the correlations, be sure you used the listwise option.

Thanks. I used corr to get the correlations instead of pwcorr, but that's exactly the kind of thing that makes me wonder whether I made an error. I'll see about showing some output. I have 50,000 or so observations, so I'm not sure about an example, but I could post the dataset on the web, and we should do that eventually for the paper anyway.
Eric Rasmusen

Join Date: Dec 2014

Posts: 7
#12

14 Feb 2015, 08:24

Originally posted by Richard Williams

At least I didn't show him how to do it in SPSS.

The p values are not only low but weird. I can see having a bunch of zeros followed by a 1, but why .0000000105?

My theory continues to be that, because of missing data, Eric was not always analyzing the same cases as he conducted his various analyses.

Yes, I've got an unusual case. Really, I want stepwise with an option to stop with just, say, 5 x-variables, but I have to use the significance level option in Stata, and that's awkward when you have 50,000 observations since variables get wildly significant. Also, the way forward-and-backwards stepwise works in Stata is that you specify an add-variable significance and a drop-variable significance. The drop-variable significance can't be smaller (or you'd drop what you just added) and I suppose it can't be identical either-- I forget. So I set it to be just a little bigger.
Richard Williams

Join Date: Apr 2014

Posts: 4987
#13

14 Feb 2015, 11:41

Stepwise tells you what happens at each step so it would be easy enough to stop at 5 variables if that is what you wanted.

Haron Smith

Join Date: May 2017

Posts: 4
#14

04 Jun 2017, 02:31

Dear Eric,

could you please tell me how to obtain the values for pr and pe?
Thank you in advance!

Best regards,
Haron
Nick Cox

Join Date: Mar 2014

Posts: 35696
#15

04 Jun 2017, 03:04

Haron: If you don't already have strong views on what they should be, stepwise is more than usually inappropriate.