  • Bruteforcing regressions

    From a set of about 300 variables, I've pared down to about 30 potential x variables. Assuming a linear model, I am hoping to run regressions on all combinations of these, i.e. the smallest model will have 1 x and the biggest will have 30. The idea is that once I have the results, with the r^2, p, t, F, etc., I can choose a model, keeping in mind all factors, such as the principle of parsimony.

    The problem is: I am not quite sure how to run a loop that chooses x variables and then does regression on them...
    Code:
    forvalues i = 1/30 {
        forsomething {    // placeholder: some loop for choosing which combination of i variables to pick
            regress y x*
        }
    }
    Thank you for your help!

    Stata SE/17.0, Windows 10 Enterprise

  • #2
    First, there are 2^30 - 1 (a little over a billion) models in contention here. This is known as a "combinatorial explosion." So you had better leave instructions in your will to your grandchildren about what to do with the results. They are likely to find sorting through a billion regression outputs a bit tedious, too, so be sure to be generous in your bequests to them.
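
    To make the count concrete, here is a minimal sketch of the arithmetic in Stata. It assumes only the 30 candidate regressors mentioned above and counts every non-empty subset; the second line shows how many of those models contain exactly 15 regressors, the largest single size class.

    ```stata
    * Number of non-empty subsets of 30 candidate regressors: 2^30 - 1
    display %15.0fc 2^30 - 1

    * Number of models with exactly 15 of the 30 regressors: 30 choose 15
    display %15.0fc comb(30, 15)
    ```

    The first line should display 1,073,741,823, which is where "a little over a billion" comes from.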

    Even if the number of models were small enough to manage, this is a terrible way to select a model. It is a recipe for generating irreproducible results that overfit the noise in your data. You need a plan that is based on science here. I suggest that you consult an expert in your subject matter area and pick a sensible model, and perhaps a small number of credible alternatives. Then split your data set into two halves. Try the different models in the first half of the data, then examine the results and choose what you like best, and then test that same model in the other half of the data to see whether it reproduces there. This approach will leave you with a somewhat credible result. (There are fancier ways to do model development and validation, but this is the simplest.)
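
    The split-half procedure described above might be sketched roughly as follows. This is only an illustration: the variable names y, x1, x2 and x3, the seed, and the two candidate models are hypothetical placeholders, not anything taken from the original data.

    ```stata
    * Hypothetical names: y is the outcome; x1 x2 x3 stand in for candidate regressors
    set seed 20240101                      // arbitrary seed, so the split is reproducible
    generate double u = runiform()         // random sort key
    sort u
    generate byte half = 1 + (_n > _N/2)   // 1 = development half, 2 = validation half

    * Development half: compare the small set of credible candidate models here
    regress y x1 x2 x3 if half == 1
    regress y x1 x2    if half == 1

    * Validation half: refit only the model chosen above and see whether it reproduces
    regress y x1 x2 x3 if half == 2
    ```

    Setting the seed before drawing the random numbers keeps the assignment to halves the same from one run to the next, provided the data start in the same order.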

    • #3
      Pratap:
      as an aside to Clyde's (as always) excellent advice, I would take a look at the literature in your research field and see what others did in the past when presented with the same research topic.
      Kind regards,
      Carlo
      (Stata 18.0 SE)

      • #4
        I can’t remember the name of the command, but Charles Lindsey from StataCorp put together a command to identify the best subset of regressors for a given output. Perhaps that would be a simpler way to accomplish a simplified version of your goal?
