  • LASSO (LARS) - SELECTION OF MANY NOT SIGNIFICANT VARIABLES

    Hello,
    I am using LARS to select, from a pool of about 80 regressors (independent variables), the ones that best explain my dependent variable.

    LASSO selects almost 70 of the 80 variables (using the Cp criterion), but when I use all of these variables to estimate my final model specification, almost half of the regressors selected by LARS appear to be insignificant (p-value >> 5%)!

    Is there something I can do when using LASSO so that I do not end up with a model containing so many insignificant variables?

    Or do you have any other ideas on how I can overcome this issue with my dataset?

    Thank you in advance,
    Nikos

  • #2
    Nick:
    model maquillage should be discouraged.
    That said, and without knowing why you want to select "the best predictors" for your dependent variable, you may be interested in taking a look at -help stepwise-, with the usual warning that there are very good reasons to avoid that approach (i.e., overfitting and subsequent poor out-of-sample prediction) (please see a related thread at http://www.statalist.org/forums/foru...ise-regression).
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      I'll go on the assumption you're doing something where data-mining techniques *are* appropriate: a purely predictive exercise where causality and theory are moot. Most people around here have backgrounds that make data mining anathema, but I took a class in it from the B-school here, and for some purposes (marketing, credit card fraud detection, etc.) it's potentially OK so long as the results are validated on new datasets. Anyway, a few possibilities:
      • Are they anywhere near .05 or .10? Near-significant variables can have predictive value. The magic .05 cut-off is an arbitrary construct. Something totally insignificant is one thing, but if it's close to significance, it may pass the Cp criterion.
      • What happens to the other regressors when you remove the non-significant ones? Do significance levels go up or down? It's possible LASSO includes things like suppressor effects, or sets of dummies where only one out of five is actually significant.
      • Have you tried CHAID (try -findit chaid- to get it)? LASSO seems a blunt instrument compared to CHAID, which is another method of data mining.
      Of course, depending on what you're really up to, data mining may not be appropriate anyway.
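
The point above about validating on new data can also be used to choose how hard the LASSO penalizes, since a larger penalty keeps fewer variables. The block below is only a minimal sketch of that idea in plain Python, not the procedure used in this thread: the data are simulated (20 regressors, 3 of which truly matter), the penalty grid is arbitrary, the model has no intercept, and all names (`lasso_cd`, `simulate`, `lambdas`) are hypothetical.

```python
import random

def soft_threshold(z, g):
    """Shrink z toward zero by g; values within [-g, g] become exactly 0."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, kept up to date after every update
    col_ss = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # covariance of column j with the residual that excludes column j
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

def mse(X, y, beta):
    """Mean squared prediction error of the linear fit."""
    errs = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    return sum(e * e for e in errs) / len(errs)

def simulate(n, p, true_beta, rng):
    """Independent standard-normal regressors plus unit-variance noise (assumption)."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 1) for row in X]
    return X, y

rng = random.Random(12345)
p = 20
true_beta = [2.0, -1.5, 1.0] + [0.0] * (p - 3)  # only the first 3 regressors matter
X_train, y_train = simulate(100, p, true_beta, rng)
X_valid, y_valid = simulate(100, p, true_beta, rng)  # held-out data

lambdas = [0.01, 0.05, 0.1, 0.2, 0.4]  # arbitrary illustrative grid
fits, results = {}, {}
for lam in lambdas:
    beta = lasso_cd(X_train, y_train, lam)
    fits[lam] = beta
    n_selected = sum(1 for b in beta if b != 0.0)
    results[lam] = (n_selected, mse(X_valid, y_valid, beta))
    print(f"lambda={lam:<5} selected={n_selected:>2} holdout MSE={results[lam][1]:.3f}")

# Keep the penalty with the smallest held-out error
best_lam = min(lambdas, key=lambda l: results[l][1])
best_beta = fits[best_lam]
print("chosen lambda:", best_lam)
```

Reading the printed table from the smallest penalty to the largest shows how the number of selected variables changes while the held-out error is tracked alongside it; with real data, the training and validation sets would be separate samples (or folds) of the actual dataset.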

      • #4
        Regarding the issue you raised, I am providing below some further clarifications:
        - I think that data mining techniques are appropriate in my study, since the current literature selects the regressors (accounting ratios) arbitrarily, while in reality there is a wide range of variables that could be employed but have not been given any special attention.
        - Almost half of the variables selected have a p-value above 30% (highly insignificant).
        - When I remove the non-significant regressors, the remaining ones are not materially affected (a slight increase in significance). The R-squared of the model remains almost unchanged (it decreases by about 0.1%).
        - I will also try CHAID.
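
The observation that R-squared barely moves when the insignificant regressors are dropped can be checked formally with a joint (partial) F-test on the dropped block. The sketch below only illustrates the arithmetic: the sample size and R-squared values are hypothetical placeholders, not figures from this thread, and `partial_f` is a made-up helper name.

```python
def partial_f(r2_full, r2_restricted, n_obs, k_full, q_dropped):
    """F statistic for H0: all q_dropped coefficients are jointly zero.

    k_full counts the regressors in the full model, excluding the intercept.
    Returns (F, numerator df, denominator df).
    """
    df_resid = n_obs - k_full - 1  # residual degrees of freedom of the full model
    numerator = (r2_full - r2_restricted) / q_dropped
    denominator = (1.0 - r2_full) / df_resid
    return numerator / denominator, q_dropped, df_resid

# Hypothetical inputs: 70 selected regressors, 35 of them dropped,
# R-squared falling from 0.600 to 0.599, and 1000 observations.
f_stat, df1, df2 = partial_f(r2_full=0.600, r2_restricted=0.599,
                             n_obs=1000, k_full=70, q_dropped=35)
print(f"F({df1}, {df2}) = {f_stat:.4f}")
```

An F statistic far below 1, as these placeholder numbers produce, would mean the dropped block adds no detectable joint explanatory power. In Stata, the analogous check after fitting the full model with -regress- is -testparm- on the dropped variables.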

        Thank you in advance for your valuable help.
        Nikos