May I "displace" non-correlating variables into "side" regressions to avoid overfitting?

Simon Kuhn

Join Date: Jul 2018

Posts: 15
#1

07 Aug 2018, 08:21

Dear Stata experts,

I am working on a data set where the regressand is binary (whether someone owns a house or not).

Unfortunately, I only have 25 successes (and 101 failures). Guidelines of 10 events per independent variable therefore restrict my analysis quite a lot. If I followed the "one in ten rule", for example, I could analyse no more than two or three variables, while at least seven variables are significant and influential.
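To make that arithmetic explicit, here is a minimal sketch in Python (not Stata). The function name `max_predictors` is made up for illustration, and the sketch assumes the rule is applied to the rarer of the two outcomes, which is the usual reading of the "one in ten rule".

```python
def max_predictors(events: int, non_events: int, per_variable: int = 10) -> float:
    """Rough upper bound on predictors under an events-per-variable rule of thumb.

    The limiting sample size is the smaller of the two outcome counts
    (an assumption of this sketch, not something stated in the thread).
    """
    limiting = min(events, non_events)
    return limiting / per_variable


# Counts from the post: 25 homeowners, 101 non-owners.
limit = max_predictors(25, 101)
print(limit)  # 2.5, i.e. two or at most three predictors
```

This is only a rule of thumb, so the result is a rough guide rather than a hard limit.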

That's why I thought of "displacing" some variables that didn't make it into the narrow main model, but that I still want to expand on.

So in my homeownership analysis I'd run a second regression focussing solely on personal features, omitting other (important main) variables such as the size of the town. This is provided, of course, that these variables are not correlated with the variables in the main model, to avoid omitted variable bias.

Is this a workable way to avoid overfitting, or what would you recommend?

Thanks a lot for your time and expertise!
Simon
Bruce Weaver

Join Date: May 2014

Posts: 1129
#2

07 Aug 2018, 12:02

Hi Simon. Unfortunately, there is no magic solution to the problem of having only 25 events. I like Mike Babyak's very readable article on over-fitting. It addresses most of the potential problems, and solutions that people might attempt. You may find some helpful advice in it.
https://www.cs.vu.nl/~eliens/sg/loca...verfitting.pdf

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Simon Kuhn

Join Date: Jul 2018

Posts: 15
#3

08 Aug 2018, 07:37

Thanks a lot for referring to this interesting article, Bruce! I had already skimmed through it as an introduction to the issue of over-fitting.

But my question remains: may I use "side" regressions to examine a few more variables than I can actually fit into the main model? Using suitably tentative language, of course.
Bruce Weaver

Join Date: May 2014

Posts: 1129
#4

08 Aug 2018, 08:08

Hi Simon. I'm going to recycle and expand on something I posted in another forum back in 2008. The following is an excerpt from the preface to Robert P. Abelson’s book, Statistics as Principled Argument.
For many students, statistics is an island, separated from other aspects of the research enterprise. Statistics is viewed as an unpleasant obligation, to be dismissed as rapidly as possible so that they can get on with the rest of their lives. Furthermore, it is very hard to deal with uncertainty, whether in life or in the little world of statistical inference. Many students try to avoid ambiguity by seizing upon tangible calculations, with stacks of computer output to add weight to their numbers. Students become rule-bound, thinking of statistical practice as a medical or religious regimen. They ask questions such as, "Am I allowed to analyze my data with this method?" in the querulous manner of a patient or parishioner anxious to avoid sickness or sin, and they seem to want a prescriptive answer, such as, "Run an analysis of variance according to the directions on the computer package, get lots of sleep, and call me in the morning."

For years, I always responded to students who asked, "Can I do this?" by saying something like, "You can do anything you want, but if you use method M you’ll be open to criticism Z. You can argue your case effectively, however, if you use procedure P and are lucky enough to get result R. If you don’t get Result R, then I’m afraid you’ll have to settle for a weaker claim."

Eventually, I began to appreciate an underlying implication of the way I found myself responding: namely, that the presentation of the inferences drawn from statistical analysis importantly involves rhetoric. When you do research, critics may quarrel with the interpretation of your results, and you had better be prepared with convincing counterarguments. (These critics may never in reality materialize, but the anticipation of criticism is fundamental to good research and data analysis. In fact, imagined encounters with antagonistic sharpsters should inform the design of your research in the first place.) There are analogous features between the claims of a statistical analyst and a case presented by a lawyer--the case may be persuasive or flimsy (even fishy), the style of inference may be loose or tight, prior conventions and rules of evidence may be invoked or flouted, and so on. (p. xii)
I know I've not given a direct, unambiguous answer to your question. That's because I don't think there is one. Much will depend on which critics (or sharpsters) you run into.

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Simon Kuhn

Join Date: Jul 2018

Posts: 15
#5

08 Aug 2018, 08:54

Haha, thanks Bruce!
The more questions and problems I face when running statistical analyses, the more I notice that gut feeling can help quite a lot. So I think I still need to cut my teeth on statistics.