Variable selection

Kjell Weyde

Join Date: May 2016
Posts: 129

19 Nov 2018, 06:52

Dear Statalist users,
I am interested in exploring several two-way interactions in multiply imputed datasets. I am not sure about the best way to do this, but I have tried to follow a suggestion found here: https://rss.onlinelibrary.wiley.com/...8.2010.00740.x , using cvlasso (Ahrens, A., Hansen, C.B., Schaffer, M.E. 2018. cvlasso: Program for cross-validation using lasso, square-root lasso, elastic net, adaptive lasso and post-OLS estimators.). I want to loop over the imputed datasets, apply cvlasso to bootstrapped samples, and then post the results to a file. In the cvlasso call, I have specified that the main-effect variables are not to be penalized; only the interaction terms are. The code looks as follows:

Code:

 while _mi_m < 11 {
 bsample
 cvlasso M_5mC_pct logAs logHg logCd logMn logPb Se_std MORS_ALDER Maternal_edu Parity KJONN SMOKING_X logJod logFOLAT c.logAs#(c.logCd c.logHg c.logMn c.logPb c.Se_std) /*
  */ c.logCd#(c.logHg c.logMn c.logPb c.Se_std) c.logHg#(c.logMn c.logPb c.Se_std) c.logMn#(c.logPb c.Se_std) c.logPb#c.Se_std /*
  */ i.KJONN#(c.logAs c.logCd c.logHg c.logMn c.logPb c.Se_std) c.logJod#(c.logAs c.logCd c.logHg c.logMn c.logPb c.Se_std), /*
  */ notpen(logAs logHg logCd logMn logPb Se_std KJONN SMOKING_X logJod logFOLAT) /*
  */ lopt postest tolzero(1e-8) /*alphacount(5)*/
  mat allecoef=e(betaAll)
  local `b1' = allecoef[1,1]
  local `b2' = allecoef[1,2]
  local `b3' = allecoef[1,3]
  local `b4' = allecoef[1,4]
  local `b5' = allecoef[1,5]
  local `b6' = allecoef[1,6]
  local `b7' = allecoef[1,7]
  local `b8' = allecoef[1,8]
  local `b9' = allecoef[1,9]
  local `b10' = allecoef[1,10]
  local `b11' = allecoef[1,11]
  local `b12' = allecoef[1,12]
  local `b13' = allecoef[1,13]
  local `b14' = allecoef[1,14]
  local `b15' = allecoef[1,15]
  local `b16' = allecoef[1,16]
  local `b17' = allecoef[1,17]
  local `b18' = allecoef[1,18]
  local `b19' = allecoef[1,19]
  local `b20' = allecoef[1,20]
  local `b21' = allecoef[1,21]
  local `b22' = allecoef[1,22]
  local `b23' = allecoef[1,23]
  local `b24' = allecoef[1,24]
  local `b25' = allecoef[1,25]
  local `b26' = allecoef[1,26]
  local `b27' = allecoef[1,27]
  local `b28' = allecoef[1,28]
  local `b29' = allecoef[1,29]
  local `b30' = allecoef[1,30]
  local `b31' = allecoef[1,31]
  local `b32' = allecoef[1,32]
  local `b33' = allecoef[1,33]
  local `b34' = allecoef[1,34]
  local `b35' = allecoef[1,35]
  local `b36' = allecoef[1,36]
  local `b37' = allecoef[1,37]
  local `b38' = allecoef[1,38]
  local `b39' = allecoef[1,39]
  local `b40' = allecoef[1,40]
  local `b41' = allecoef[1,41]
  local `b42' = allecoef[1,42]
  local `b43' = allecoef[1,43]
  local `b44' = allecoef[1,44]
  local `b45' = allecoef[1,45]
  local `b46' = allecoef[1,46]
  local `b47' = allecoef[1,47]
 

 post mysim4 (`b1')(`b2')(`b3')(`b4')(`b5')(`b6')(`b7')(`b8')(`b9')(`b10')(`b11')(`b12')(`b13')(`b14')(`b15')(`b16')(`b17')(`b18')(`b19')(`b20')(`b21')(`b22')(`b23')(`b24')(`b25')(`b26')(`b27')(`b28')(`b29')(`b30')(`b31')(`b32')(`b33')(`b34')(`b35')(`b36')(`b37')(`b38')(`b39')(`b40')(`b41')(`b42')(`b43')(`b44')(`b45')(`b46')(`b47')
}
postclose mysim4

However, the end of the output gives me:

Code:

_= invalid name

1) Are there more appropriate ways in Stata 15 to evaluate many two-way interactions?
2) Where does "_" come from?

Best regards,
Kjell Weyde
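
On question 2, a likely explanation (a reading of the posted code, not something confirmed in this thread): in lines such as local `b1' = allecoef[1,1], the macro b1 has not been defined yet, so `b1' expands to nothing and Stata reads the command as "local = allecoef[1,1]", i.e. a request to define a local macro named "=". Local macros are stored internally with a leading underscore, which would be why the message reports "_=" as the invalid name. The macro name should be given without quotes when it is defined, and quoted only when it is read. Separately, nothing in the loop body is designed to change _mi_m, so the while condition may never become false. The sketch below shows the unquoted definition in a loop; the toy matrix, the file name mysim4_demo, and the macro names k, vars and postlist are assumptions made for illustration.

```stata
// Toy 1 x 3 row vector standing in for e(betaAll); the values are made up
matrix allecoef = (0.5, 0, -1.2)
local k = colsof(allecoef)

// Declare one posted variable per coefficient (file name is hypothetical)
local vars
forvalues j = 1/`k' {
    local vars `vars' b`j'
}
tempname sim
postfile `sim' `vars' using mysim4_demo, replace

// Define each macro by its unquoted name; use the quoted form only to read it
local postlist
forvalues j = 1/`k' {
    local b`j' = allecoef[1,`j']
    local postlist `postlist' (`b`j'')
}
post `sim' `postlist'
postclose `sim'
```

Building the list in a loop also removes the need to type 47 separate local and post entries, since the number of columns is taken from the matrix itself.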


daniel klein

Join Date: Mar 2014

Posts: 3842
#2

19 Nov 2018, 07:31

I have never worked with *lasso type models, so I cannot give specific advice on this.

However, if you do not account for all possible interactions during the imputation process, you cannot (validly) test for these interactions in the imputed datasets later. Moreover, as far as I know, there is not an agreed-on method for combining multiple imputations with bootstrap procedures; I remember vaguely reading that you should actually make the imputation part of the bootstrap, i.e., bootstrap a sample with missing data, then impute the missing data, then analyze the imputed data, then draw the next bootstrap sample.

To summarize: I believe that you will need code that (1) resamples repeatedly from the original data with missing values, (2) performs the imputations while accounting for any interactions that you wish to test for, (3) runs the *lasso analysis on each imputed dataset, and (4) combines the results from these analyses in an appropriate way.
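
The four steps above could be laid out roughly as follows. This is only a skeleton under several assumptions: the original, not yet mi set data are stored in a file called original.dta (hypothetical), the variable lists are cut down to two exposures and one interaction, the interaction is imputed as a variable of its own (one of several possible approaches), and the cvlasso options simply mirror those in the first post. The values of B and M are arbitrary, and step (4) is left open because no agreed-on combining rule is known here.

```stata
// Skeleton only: shortened variable lists, illustrative settings throughout
local B = 200        // number of bootstrap replicates (arbitrary choice)
local M = 10         // imputations per replicate (arbitrary choice)

forvalues b = 1/`B' {
    use original, clear                  // hypothetical file with missing values
    bsample                              // (1) resample rows, missing values included

    // (2) impute inside the bootstrap sample; the interaction term is
    //     created first and imputed alongside its components
    generate double AsXCd = logAs*logCd
    mi set wide
    mi register imputed logAs logCd AsXCd
    mi impute chained (regress) logAs logCd AsXCd = M_5mC_pct MORS_ALDER, add(`M')

    // (3) run the lasso on each completed dataset
    forvalues m = 1/`M' {
        preserve
        mi extract `m', clear
        cvlasso M_5mC_pct logAs logCd AsXCd, notpen(logAs logCd) lopt postest
        matrix b_`b'_`m' = e(betaAll)    // keep the coefficient vector
        restore
    }

    // (4) combine the M coefficient vectors of replicate b (method left open)
}
```

In practice the coefficient vectors would more likely be written out with postfile than kept as separate matrices, and the imputation model would need every interaction that is to be tested later.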

Best
Daniel