Hi,
I have a methodological question concerning lasso regressions and the lasso linear command in Stata.
I have a dataset on daily investment flows of firms and a large collection of dummy variables representing daily signals on which the firms may invest.
There are more than one million observations, more than 2,000 dummy variables (D_*), and a few further controls (C_*).
I want to find out which of the dummy variables are most relevant to explain the dependent variable (FLOW).
To do so, I estimate a lasso linear regression of FLOW on D_*, with the C_* variables always included. Because computation over the whole sample takes a long time, I first ran the command on a random subsample of 10,000 observations.
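For reference, a minimal sketch of how an indicator such as random_sample could be drawn. The seed value and the helper variable u are assumptions made for illustration, not necessarily what I used, and the `//` comments assume the lines are run from a do-file:

```stata
* Flag a random draw of 10,000 observations (illustrative sketch only)
set seed 12345                               // arbitrary seed, for reproducibility
generate double u = runiform()               // u: hypothetical helper variable
sort u                                       // random ordering of the observations
generate byte random_sample = (_n <= 10000)  // 1 for the first 10,000 after sorting
drop u
```

Note that sorting on u changes the order of the data, so a panel would need to be re-sorted afterwards.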
I ran the command reproduced further below on this subsample and obtained the output shown there.
From a conceptual point of view, only explanatory variables with a positive effect on the dependent variable are of interest (i.e., those that can be thought of as a positive stimulus to invest). Explanatory variables with a significantly negative effect, as well as those with a negligible effect, are not of interest. However, my lasso specification also selects variables with large negative coefficients if they exist (and they do, as I found out after checking the variables in the model at the selected lambda).
Therefore, my question: Is it possible to run lasso such that it sets to zero all coefficients that are close to zero or negative, keeping only the clearly positive ones?
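One way to check the signs of the selected coefficients, sketched here under the assumption that it is run directly after the lasso command so that its results are still in memory (the choice of standardized coefficients is just for illustration):

```stata
* List the selected variables with their coefficients, ordered by coefficient size
lassocoef, display(coef, standardized) sort(coef, standardized)
```

Negative entries in this list are the variables I would like the procedure to drop.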
Thanks
Code:
lasso linear FLOW (C_*) D_* if random_sample == 1
Code:
Lasso linear model                          No. of obs        =     10,000
                                            No. of covariates =      2,179
Selection: Cross-validation                 No. of CV folds   =         10

--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
       1 |    first lambda     612.519        32       0.1412     1.13e+08
       6 |   lambda before    384.6798        34       0.1427     1.13e+08
     * 7 | selected lambda    350.5059        35       0.1427     1.13e+08
       8 |    lambda after    319.3679        35       0.1426     1.13e+08
      12 |     last lambda    220.1279        57       0.1407     1.14e+08
--------------------------------------------------------------------------