Dear Statalisters,
I have some binary outcome data I wish to analyse: there are 201 participants divided into 3 groups (grp=1,2,3) according to which of 3 drugs they are taking. The focus of the study is to see whether or not there are differences in 'outcome' according to whether participants are in group 1, 2 or 3.
The dependent variable (outcome) is binary. There are several explanatory variables: some are binary (x1, x2, x3, x4, proc), and there is also 'age'.
I previously tried to analyse these data with -logit-, and received some useful help on how to approach the modelling. An important issue is that the outcome is infrequent:
Code:
           |       outcome
       grp |         0          1 |     Total
-----------+----------------------+----------
         1 |       102          6 |       108
         2 |        43          3 |        46
         3 |        43          4 |        47
-----------+----------------------+----------
     Total |       188         13 |       201
I have since read more about this type of modelling, and found a paper recommending the use of penalised regression methods such as Lasso in this type of situation [1]. Given the low number of events per variable, I hope to use Lasso to select a more parsimonious set of explanatory variables. As recommended in the article, I then plan to bootstrap the results.
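As a rough illustration of how low the events per variable is here, the arithmetic can be sketched in Stata. This assumes the 13 events from the table above and counts the seven terms in the model (grp, age, x1, x2, x3, x4, proc), with grp entered as a single term; the scalar names are made up for the illustration.

```stata
* Events taken from the outcome==1 column of the table above
scalar events = 6 + 3 + 4

* Assumption: seven candidate terms, with grp counted once
scalar nvars = 7

* Events per variable, to two decimal places
display "Events per variable: " %4.2f events/nvars
```

This gives roughly 1.86 events per variable, far below the commonly quoted rule of thumb of about 10. Entering grp as two indicator variables would lower it further.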
Using the -lars- command (package lars from http://fmwww.bc.edu/RePEc/bocode/l), I get the following output:
Code:
. lars outcome grp age x1 x2 x3 x4 proc, a(lasso)
NOTE: Deleting all matrices
ade[8,7]  mu[1,1]  meanx[1,7]  R2[1,8]  RSS[1,8]  r2[1,1]  rss[1,1]  cp[1,8]  normx[1,7]  beta[8,7]  sbeta[8,7]  error[1,1]

sbeta[8,7]
            c1          c2          c3          c4          c5          c6
r1           0           0           0           0           0           0
r2           0           0           0           0           0   .02670329
r3           0   .11976175           0           0           0   .14646504
r4           0   .27622894           0           0   .17078073   .35209206
r5  -.07296245   .33714013           0           0   .20257659    .4012965
r6  -.17237555   .41173968           0    .0360826   .24108976   .45939028
r7  -.20946439   .44511875  -.02437064   .05116421   .25880156   .48053282
r8  -.22669804   .46500635  -.03745335   .06197856   .27635318   .50026581

            c7
r1           0
r2           0
r3           0
r4           0
r5           0
r6           0
r7           0
r8  -.02249678

Algorithm is lasso

Cp, R-squared and Actions along the sequence of models

+---------------------------------------+
| Step |     Cp      | R-square | Action|
|------+-------------+----------+-------|
|    1 |     1.5636  |   0.0000 |       |
|    2 |     3.1899  |   0.0019 | +x4   |
|    3 |     2.4317  |   0.0159 | +age  |
|    4 |     1.0717 *|   0.0329 | +x3   |
|    5 |     2.5161  |   0.0357 | +grp  |
|    6 |     4.0876  |   0.0378 | +x2   |
|    7 |     6.0150  |   0.0382 | +x1   |
|    8 |     8.0000  |   0.0383 | +proc |
+---------------------------------------+
* indicates the smallest value for Cp

The coefficient values for the minimum Cp

+-------------------------+
| Variable | Coefficient  |
|----------+--------------|
|      age |       0.0012 |
|       x3 |       0.0256 |
|       x4 |       0.0607 |
+-------------------------+
I acknowledge the low number of events per variable; nevertheless, I would like to be able to make a more detailed analysis of the data than simple descriptive statistics.
My questions:
1. Given the main question of whether there is a difference in 'outcome' between groups (grp 1, 2 or 3), is Lasso followed by -bootstrap- the best way to explore this with my set of data?
2. In the final table in the output (Variable and Coefficient), are these the 3 variables that -lars-/Lasso suggests as significant predictors of 'outcome'? If so, is this for p<0.05? (The -lars- help file does not appear to answer this.)
3. In general terms, with Lasso are we hoping for the lowest value of Cp and the highest R-squared?
4. Given grp is a nominal variable, is there a way to get -lars- to compare groups 1, 2 and 3 in the model, in the same way that you can with other Stata commands using i.grp? I get the following error if I try to use i.grp with -lars-:
Code:
. lars outcome i.grp age x1 x2 x3 x4 proc, a(lasso)
factor variables and time-series operators not allowed
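One possible workaround, offered as an untested sketch rather than a recommendation, is to build the group indicators by hand with -tabulate, generate()- and pass them to -lars- as ordinary variables. The stub name grp_ is hypothetical, and the first lines only rebuild grp and outcome from the table above so that the snippet runs on its own.

```stata
* Sketch only: rebuild grp and outcome from the 3x2 table above.
* With the real data already in memory, skip straight to -tabulate-.
clear
input byte grp byte outcome int n
1 0 102
1 1   6
2 0  43
2 1   3
3 0  43
3 1   4
end
expand n
drop n

* One 0/1 indicator per level of grp: grp_1, grp_2, grp_3
tabulate grp, generate(grp_)

* With the full dataset, the indicators would replace grp in the call,
* leaving out grp_1 as the reference group (not tested with -lars-):
* lars outcome grp_2 grp_3 age x1 x2 x3 x4 proc, a(lasso)
```

Lasso would presumably treat grp_2 and grp_3 as two separate predictors, so it might keep one and drop the other. That is not the same as comparing the three groups jointly, as i.grp does in other commands.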
Jem
Reference
1. Pavlou M, et al. How to develop a more accurate risk prediction model when there are few events. BMJ 2015;351:h3868. doi: 10.1136/bmj.h3868
Stata/IC 14.2
Mac