Bruteforcing regressions

Pratap Pundir

Join Date: Oct 2018

Posts: 143
#1

04 Dec 2018, 20:40

From a set of about 300 variables, I've pared down to about 30 potential x variables. Assuming a linear model, I am hoping to run regressions on all combinations of these, i.e. the smallest model will have 1 x and the biggest will have 30. The idea is that once I have the results (R², p, t, F, etc.), I can choose a model, keeping in mind all factors, such as the principle of parsimony.

The problem is: I am not quite sure how to run a loop that chooses x variables and then does regression on them...

Code:

forvalues i = 1/30 {
    forsomething (some loop for choosing which combination of i variables to pick) {
        regress y x*
    }
}

Thank you for your help!

Stata SE/17.0, Windows 10 Enterprise
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

04 Dec 2018, 21:36

First, there are 2³⁰-1 (a little over a billion) models in contention here. This is known as a "combinatorial explosion." So you had better leave instructions in your will to your grandchildren about what to do with the results. They are likely to find sorting through a billion regression outputs a bit tedious, too, so be sure to be generous in your bequests to them.
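The arithmetic is easy to check in Stata itself. The snippet below is only a quick sketch: it prints the number of non-empty subsets for a few values of k (the values of k other than 30 are picked arbitrarily for comparison) and the size of the largest single group of models when k = 30.

```stata
* Number of non-empty subsets of k candidate regressors is 2^k - 1
foreach k of numlist 5 10 20 30 {
    display "k = " %2.0f `k' "   models = " %15.0fc (2^`k' - 1)
}

* With 30 candidates, the models with exactly 15 regressors
* form the largest single group: comb(30, 15) of them
display "15-regressor models: " %15.0fc comb(30, 15)
```

Each additional candidate variable doubles the count, which is why trimming the list by a few variables does little to make the exhaustive search manageable.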

Even if the number of models were small enough to manage, this is a terrible way to select a model. It is a recipe for generating irreproducible results that overfit the noise in your data. You need a plan that is based on science here. I suggest that you consult an expert in your subject matter area and pick a sensible model, and perhaps a small number of credible alternatives. Then split your data set into two halves. Try the different models in the first half of the data, then examine the results and choose what you like best, and then test that same model in the other half of the data to see whether it reproduces there. This approach will leave you with a somewhat credible result. (There are fancier ways to do model development and validation, but this is the simplest.)
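A minimal sketch of that split-sample workflow might look like the following. Everything specific in it is an assumption for illustration: it uses Stata's shipped auto data as a stand-in, the three candidate models are arbitrary, the seed is arbitrary, and the variable names u, train, and yhat are made up here. Comparing held-out predictions with observed values is one possible way to judge whether the model "reproduces"; comparing the coefficients across the two halves is another.

```stata
* Stand-in data (an assumption; substitute your own data set)
sysuse auto, clear

* Random 50/50 split; fixing the seed makes the split reproducible
set seed 12345
generate double u = runiform()
sort u
generate byte train = _n <= _N/2

* Development half: fit the small set of credible candidate models
regress price weight if train
regress price weight mpg if train
regress price weight mpg foreign if train

* Suppose the second model is the one chosen. Refit it on the
* development half and predict into the held-out half.
quietly regress price weight mpg if train
predict double yhat if !train

* Held-out fit: squared correlation of predicted and observed values
quietly correlate price yhat if !train
display "Holdout R-squared: " %5.3f r(rho)^2

* Refit the same model in the held-out half to compare coefficients
regress price weight mpg if !train
```

The key discipline is that the held-out half is touched only once, after the model has been fixed.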
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#3

05 Dec 2018, 01:45

Pratap:
as an aside to Clyde's (as always) excellent advice, I would take a look at the literature in your research field and see what others did in the past when presented with the same research topic.

Kind regards,
Carlo
(Stata 19.0)
wbuchanan

Join Date: Mar 2014

Posts: 1362
#4

05 Dec 2018, 05:13

I can’t remember the name of the command, but Charles Lindsey from StataCorp put together a command to identify the best subset of regressors for a given output. Perhaps that would be a simpler way to accomplish a simplified version of your goal?
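For a handful of candidates, an exhaustive search can also be written by hand. The sketch below is not the command mentioned above; it is a plain loop, shown under several assumptions: the auto data stand in for the real data, four arbitrary candidates stand in for the 30, and adjusted R-squared is used as the selection criterion purely for illustration. Each integer m from 1 to 2^k - 1 is read as a bitmask in which bit j decides whether the j-th candidate enters the model.

```stata
* Stand-in data and a deliberately small candidate list (assumptions)
sysuse auto, clear
local xvars mpg weight length turn
local k : word count `xvars'
local nmodels = 2^`k' - 1

local best_r2a = .
local best_rhs
forvalues m = 1/`nmodels' {
    * Build the regressor list for subset number m
    local rhs
    forvalues j = 1/`k' {
        * Bit j of m is 1 when the j-th candidate is included
        if mod(floor(`m' / 2^(`j' - 1)), 2) == 1 {
            local rhs `rhs' `: word `j' of `xvars''
        }
    }
    quietly regress price `rhs'
    * Keep track of the subset with the highest adjusted R-squared
    if `m' == 1 | e(r2_a) > `best_r2a' {
        local best_r2a = e(r2_a)
        local best_rhs `rhs'
    }
}
display "Best adjusted R-squared: " %6.4f `best_r2a'
display "Regressors: `best_rhs'"
```

With four candidates this runs 15 regressions. The same loop with 30 candidates would run 2^30 - 1 of them, and whatever subset wins on the full sample is still subject to the overfitting concern raised in #2, so it would need to be validated on data not used for the search.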