I'm running a linear LASSO on a large dataset, with many categorical variables in the donor pool, each of which has many levels. I want to keep only the levels of the categorical variables that the LASSO procedure selects, to be used in subsequent analysis. I have a very messy workaround for this, but I think there must be something cleaner and quicker.
Minimal reproducible example:
Code:
* load data
sysuse auto
* set up variable lists, specifying mpg as categorical for sake of example
vl set
vl move (price trunk weight length turn displacement) vlcontinuous
vl move (mpg) vlcategorical
vl substitute catvars = i.vlcategorical
vl move (price) vlother
* run the lasso
lasso linear price $vlcontinuous $catvars, rseed(42)
I now get that the selected variables are:
Code:
. di "`e(allvars_sel)'"
headroom weight turn displacement 4bn.rep78 5bn.rep78 0bn.foreign 1bn.foreign 14bn.mpg 15bn.mpg 17bn.mpg 18bn.mpg 19bn.mpg 20bn.mpg 21bn.mpg 22bn.mpg 23bn.mpg 24bn.mpg 25bn.mpg 35bn.mpg
I understand this to mean that a dummy variable for whether mpg==14 is selected, but not for mpg==12.
How do I generate dummies for each level of mpg and keep those which are selected in the LASSO? Currently I have something like this, but it's a mess and there must be an easier way:
Code:
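* loop over the selected terms, e.g. 14bn.mpg
local selvars `e(allvars_sel)'
foreach word in `selvars' {
    * strpos() is nonzero only for factor-variable terms containing "bn."
    local seppos = strpos("`word'","bn.")
    local catvar
    local catval
    if `seppos' != 0 {
        * the variable name follows "bn."; the level value precedes it
        local startpos = `seppos' + 3
        local catvar = substr("`word'",`startpos',.)
        local endpos = `seppos' - 1
        local catval = substr("`word'",1,`endpos')
        di "`catvar' at `catval'"
        cap drop keepme_`catvar'_`catval'
        g keepme_`catvar'_`catval' = `catvar' == `catval' if !mi(`catvar')
    }
}
* keep only the generated dummies (drops every other variable)
keep keepme_*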
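
For comparison, the same parsing step can be sketched more compactly with a regular expression. This is only an illustration, not a built-in lasso feature: the local `selvars` below is a hand-typed stand-in for e(allvars_sel), using a few of the terms from the output above, and the pattern assumes every factor term has the form `<non-negative integer>bn.<varname>`. Interactions, omitted-level markers, and negative or non-integer levels would not match and would need extra handling.

```stata
* Sketch only: selvars is a hand-typed stand-in for e(allvars_sel)
sysuse auto, clear
local selvars weight 4bn.rep78 0bn.foreign 14bn.mpg 35bn.mpg

foreach term of local selvars {
    * match "<level>bn.<varname>"; continuous terms such as weight do not match
    if regexm("`term'", "^([0-9]+)bn\.([A-Za-z_][A-Za-z0-9_]*)$") {
        local lvl = regexs(1)
        local var = regexs(2)
        capture drop keepme_`var'_`lvl'
        * 1 if the variable equals this level, 0 otherwise, missing stays missing
        generate byte keepme_`var'_`lvl' = (`var' == `lvl') if !missing(`var')
    }
}
* list the dummies that were created
describe keepme_*
```

After a real lasso run, the stand-in line would be replaced by `local selvars `e(allvars_sel)'`. Unlike the loop above, this sketch does not end with `keep keepme_*`, so the selected continuous variables remain in the dataset.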
