Help with lars / Lasso for binary outcome data

Jem Lane

Join Date: Nov 2014
Posts: 60

17 Dec 2017, 07:15

Dear Statalisters,

I have some binary outcome data I wish to analyse: there are 201 participants divided into 3 groups (grp=1,2,3) according to which of 3 drugs they are taking. The focus of the study is to see whether or not there are differences in 'outcome' according to whether participants are in group 1, 2 or 3.

The dependent variable (outcome) is binary, and there are several explanatory variables: the binary variables x1, x2, x3, x4 and proc, plus 'age'.
I previously tried to analyse these data with -logit-, and received some useful help on how to approach the modelling. An important issue is that the outcome is infrequent:

Code:

           |        outcome
       grp |         0          1 |     Total
-----------+----------------------+----------
         1 |       102          6 |       108 
         2 |        43          3 |        46 
         3 |        43          4 |        47 
-----------+----------------------+----------
     Total |       188         13 |       201

I have since read more about this type of modelling, and found a paper recommending the use of penalised regression methods such as Lasso in this type of situation¹. Given the low number of events per variable, I hope to create a more parsimonious set of explanatory variables using Lasso. As recommended in the article, I then plan to bootstrap the results.
Using the -lars- command (package lars from http://fmwww.bc.edu/RePEc/bocode/l), I get the following output:

Code:

. lars outcome grp age x1 x2 x3 x4 proc, a(lasso)
NOTE: Deleting all matrices
          ade[8,7]
           mu[1,1]
        meanx[1,7]
           R2[1,8]
          RSS[1,8]
           r2[1,1]
          rss[1,1]
           cp[1,8]
        normx[1,7]
         beta[8,7]
        sbeta[8,7]
        error[1,1]

sbeta[8,7]
            c1          c2          c3          c4          c5          c6
r1           0           0           0           0           0           0
r2           0           0           0           0           0   .02670329
r3           0   .11976175           0           0           0   .14646504
r4           0   .27622894           0           0   .17078073   .35209206
r5  -.07296245   .33714013           0           0   .20257659    .4012965
r6  -.17237555   .41173968           0    .0360826   .24108976   .45939028
r7  -.20946439   .44511875  -.02437064   .05116421   .25880156   .48053282
r8  -.22669804   .46500635  -.03745335   .06197856   .27635318   .50026581

            c7
r1           0
r2           0
r3           0
r4           0
r5           0
r6           0
r7           0
r8  -.02249678

Algorithm is lasso

Cp, R-squared and Actions along the sequence of models

+----------------------------------------+
| Step |      Cp     | R-square | Action |
|------+-------------+----------+--------|
|    1 |     1.5636  |  0.0000  |        |
|    2 |     3.1899  |  0.0019  | +x4    |
|    3 |     2.4317  |  0.0159  | +age   |
|    4 |     1.0717 *|  0.0329  | +x3    |
|    5 |     2.5161  |  0.0357  | +grp   |
|    6 |     4.0876  |  0.0378  | +x2    |
|    7 |     6.0150  |  0.0382  | +x1    |
|    8 |     8.0000  |  0.0383  | +proc  |
+----------------------------------------+
* indicates the smallest value for Cp

The coefficient values for the minimum Cp

+-------------------------+
| Variable |  Coefficient |
|----------+--------------|
| age      |       0.0012 |
| x3       |       0.0256 |
| x4       |       0.0607 |
+-------------------------+

The low number of events per variable is acknowledged; nevertheless, I would like to be able to make more detailed analysis of the data than simply descriptive statistics.
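
For reference, the events-per-variable figure can be worked out from the outcome-by-grp table. This is a minimal sketch of the arithmetic, assuming the candidate predictors are the seven listed in the -lars- call (grp, age, x1, x2, x3, x4, proc), and that entering grp as two indicator variables would add one more parameter:

```stata
* 13 events in total (from the outcome-by-grp table)
* 7 candidate predictors when grp is entered as a single variable
display %5.3f 13/7
* 8 parameters if grp were entered as two indicator variables instead
display %5.3f 13/8
```

Either way the figure is below 2, far short of the commonly quoted rule of thumb of about 10 events per variable.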

My questions:
1. Given the main question of whether there is a difference in 'outcome' between groups (grp 1, 2 or 3), is Lasso followed by -bootstrap- the best way to explore this with my set of data?
2. In the final table in the output (Variable and Coefficient), is this what -lars-/Lasso suggests as the 3 significant predictors of 'outcome'? If so, is this for p<0.05? (The -lars- help file does not appear to answer this.)
3. In general terms, with Lasso are we hoping for the lowest value of Cp, and the highest R²?
4. Given grp is a nominal variable, is there a way to get -lars- to compare groups 1, 2 and 3 in the model, in the same way that you can with other Stata commands using i.grp? I get the following error if I try to use i.grp with -lars-:

Code:

. lars outcome i.grp age x1 x2 x3 x4 proc, a(lasso)
factor variables and time-series operators not allowed
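
One possible workaround, sketched here under the assumption that -lars- accepts ordinary numeric indicator variables, is to create the indicators by hand with -tabulate, generate()- and pass two of them in place of grp. The toy dataset below is rebuilt from the outcome-by-grp table only, so it contains no covariates; the names grp_1 to grp_3 come from the chosen stub and are not part of the original data.

```stata
* Rebuild a toy dataset from the outcome-by-grp table (no covariates)
clear
input byte grp byte outcome int n
1 0 102
1 1 6
2 0 43
2 1 3
3 0 43
3 1 4
end
expand n
drop n

* Create indicator variables grp_1, grp_2, grp_3 (one per level of grp)
tabulate grp, generate(grp_)

* With the real data in memory, the indicators for groups 2 and 3 could then
* replace grp, leaving group 1 as the reference category (untested with -lars-):
* lars outcome grp_2 grp_3 age x1 x2 x3 x4 proc, a(lasso)
```

Note that a standard lasso penalises each indicator separately, so one group indicator may be dropped while the other is kept.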

Jem

Reference
1. Pavlou M et al, How to develop a more accurate risk prediction model when there are few events. BMJ 2015;351:h3868. doi: 10.1136/bmj.h3868

Stata/IC 14.2
Mac

Jem Lane

Join Date: Nov 2014

Posts: 60
#2

18 Dec 2017, 10:29

Hi,

Is anyone able to offer any advice on this, please?

Thanks

Jem
Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#3

18 Dec 2017, 11:13

Jem Lane --

I'm not sure if it helps, but I have programmed a few binary choice lasso models myself in Mata. My approach follows the paper by Max Farrell on Inference when a lasso is employed. The paper is here.

All of my work is in notebook form on the github site for the project, which is here. There is a workbook titled "Developing Group Lasso Programs In Stata" there that has a little bit of code interlaced with discussion.

I'm not sure if that helps, but I would be happy to further discuss if it seems like it might!

Best,

Matthew J. Baker
Jem Lane

Join Date: Nov 2014

Posts: 60
#4

18 Dec 2017, 15:53

Dear Matthew,

Many thanks for your reply. I am grateful for your pointers, though I'm afraid the level of statistics is well beyond my understanding (my background is medical/biological sciences).
After reading the article cited in my first post, I read around logistic regression and penalised methods, though only at a fairly basic level. While I'd love to understand things more fully, I'm really looking for help with using an existing Stata command and interpreting its output. Are you able to advise on the -lars- /Lasso commands above?

best

Jem
Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#5

19 Dec 2017, 08:41

Jem Lane

I did a little poking around, and it seems like what you want is the plogit command, which is a user-written command that estimates a lassoed logistic model. I was going to use this command for my work that I pointed to above, but I also needed a lassoed multinomial logit.

In my experience, you might have to dive into the literature a little bit just so you can think about where the penalty function weight comes from. This is really the toughest issue. Also, note that plogit doesn't show up with a search, so you have to go to the web page of the author to install it!

Hope that helps!

Matthew J. Baker
Jem Lane

Join Date: Nov 2014

Posts: 60
#6

19 Dec 2017, 10:59

Thanks a lot for your help Matthew. I will have a look at this.

best wishes

Jem