  • Can I combine feature selection and feature creation method?

    Hello,
    I am working on my thesis, on the topic "Factors determining life satisfaction in the USA". I was given a dataset with 265 variables and more than 2,000 observations. My aim is to compare the effects of economic and social problems (which are not already variables in the dataset) on happiness.
    I would first like to use LASSO (lasso2 depvar indepvar, lic(aic)) to choose suitable variables for the regression. Then I would apply PCA to the selected variables to combine them into only 10 factors, including economic and social problems. The last step is to build a model from these factors, with happiness as the dependent variable.
    In this case, LASSO selected 98 variables, which is too many to build a model. Therefore, I would like to use PCA both to reduce the dimension and to extract the latent variables I need.
    May I ask whether combining LASSO and PCA in this way is acceptable? If not, could you please suggest a better method?

    Thank you very much!

  • #2
    Van:
    welcome to this forum.
    Using PCA as a last step sounds reasonable.
    In my opinion, the main risk lies ahead in the regression model (endogeneity due to reverse causation): when adjusted for the other predictors, happiness may well contribute to variation in life satisfaction, but the reverse holds, too.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thank you very much for your answer.
      So this means that I can use PCA after lasso feature selection, doesn't it? By the way, should I use cvlasso to choose the value of lambda before running lasso?
      Anyway, here I treat happiness as the same thing as life satisfaction, not as two different variables.



      • #4
        Why do you want to combine lasso and PCA? Both lasso and PCA are regularization (dimension reduction) techniques, but they work in different ways and are usually used with a different aim in mind.

        The lasso is fine if you want to predict happiness or identify which variables determine happiness (endogeneity issues aside). But consider using EBIC or AICc instead of AIC; both are more appropriate when you have many regressors.
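
        As a minimal sketch of switching the information criterion, the lines below run lasso2 on simulated data. The variable names (happiness, x1-x30), the sample size, and the data-generating process are assumptions made for illustration only, not the thesis dataset, and lasso2 requires lassopack (ssc install lassopack).

        ```stata
        * Simulated stand-in data (hypothetical; replace with the real dataset)
        clear
        set seed 12345
        set obs 2000
        forvalues j = 1/30 {
            generate x`j' = rnormal()
        }
        generate happiness = 0.5*x1 - 0.3*x2 + rnormal()

        * Lasso with lambda chosen by EBIC instead of AIC
        lasso2 happiness x1-x30, lic(ebic)

        * Same model with lambda chosen by AICc, for comparison
        lasso2 happiness x1-x30, lic(aicc)
        ```

        Comparing the two outputs shows how many regressors each criterion retains; which criterion suits the actual data is a judgment call.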

        If I understand correctly, you are interested in latent factors; so why not apply PCA to the full set of regressors and regress happiness against a subset of those components? In this way PCA does the regularization for you.
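
        A rough sketch of that principal components regression, again on simulated data: the variable names, the number of regressors, and the choice of five components are assumptions for illustration, and the number of components to keep would have to be decided from the real data.

        ```stata
        * Simulated stand-in data (hypothetical; replace with the real dataset)
        clear
        set seed 12345
        set obs 2000
        forvalues j = 1/30 {
            generate x`j' = rnormal()
        }
        generate happiness = 0.5*x1 - 0.3*x2 + rnormal()

        * PCA on the full set of regressors, keeping the first 5 components
        pca x1-x30, components(5)

        * Store the component scores as new variables pc1-pc5
        predict pc1-pc5, score

        * Regress happiness on the retained components
        regress happiness pc1-pc5
        ```

        Note that the components are built without reference to happiness, so the leading components are not guaranteed to be the most predictive ones.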

        You might also want to consider ridge regression; principal components regression and ridge are closely related (see https://web.stanford.edu/~hastie/ElemStatLearn/).
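
        One possible way to try ridge with lassopack is sketched below on simulated data, with lambda chosen by cross-validation through cvlasso. The variable names, the seed, and the simulated data are assumptions for illustration; alpha(0) selects the ridge penalty in lassopack's elastic net parameterization.

        ```stata
        * Simulated stand-in data (hypothetical; replace with the real dataset)
        clear
        set seed 12345
        set obs 2000
        forvalues j = 1/30 {
            generate x`j' = rnormal()
        }
        generate happiness = 0.5*x1 - 0.3*x2 + rnormal()

        * Ridge (alpha = 0) with lambda chosen by cross-validation;
        * lopt re-estimates the model at the lambda that minimises the CV MSPE
        cvlasso happiness x1-x30, alpha(0) lopt seed(123)
        ```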
        --
        Tag me or email me for ddml/pdslasso/lassopack/pystacked related questions. I don't check Statalist.

