  • Help with lars / Lasso for binary outcome data

    Dear Statalisters,

    I have some binary outcome data I wish to analyse: there are 201 participants divided into 3 groups (grp=1,2,3) according to which of 3 drugs they are taking. The focus of the study is to see whether or not there are differences in 'outcome' according to whether participants are in group 1, 2 or 3.

    The dependent variable (outcome) is binary, and there are several explanatory variables: five binary ones (x1, x2, x3, x4, proc) and 'age'.
    I previously tried to analyse these data with -logit-, and received some useful help on how to approach the modelling. An important issue is that the outcome is infrequent:

    Code:
               |        outcome
           grp |         0          1 |     Total
    -----------+----------------------+----------
             1 |       102          6 |       108 
             2 |        43          3 |        46 
             3 |        43          4 |        47 
    -----------+----------------------+----------
         Total |       188         13 |       201
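
    A table like this one, with row percentages and a Fisher's exact test of grp against outcome, can be produced as sketched below. This assumes the variables are named grp and outcome as in the post; the exact test is requested because, with only 13 events, some expected cell counts are small.

    ```stata
    * Cross-tabulate drug group against outcome.
    * row   : add the percentage of each group with outcome == 0 and outcome == 1
    * exact : report Fisher's exact test (suited to small expected cell counts)
    tabulate grp outcome, row exact
    ```

    This gives only an unadjusted comparison of the three groups; it does not account for age or the other covariates.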

    I have since read more about this type of modelling, and found a paper recommending the use of penalised regression methods such as Lasso in this type of situation [1]. Given the low number of events per variable, I hope to create a more parsimonious set of explanatory variables using Lasso. As recommended in the article, I then plan to bootstrap the results.
    Using the -lars- command (package lars from http://fmwww.bc.edu/RePEc/bocode/l), I get the following output:

    Code:
    . lars outcome grp age x1 x2 x3 x4 proc, a(lasso)
    NOTE: Deleting all matrices
              ade[8,7]
               mu[1,1]
            meanx[1,7]
               R2[1,8]
              RSS[1,8]
               r2[1,1]
              rss[1,1]
               cp[1,8]
            normx[1,7]
             beta[8,7]
            sbeta[8,7]
            error[1,1]
    
    sbeta[8,7]
                c1          c2          c3          c4          c5          c6
    r1           0           0           0           0           0           0
    r2           0           0           0           0           0   .02670329
    r3           0   .11976175           0           0           0   .14646504
    r4           0   .27622894           0           0   .17078073   .35209206
    r5  -.07296245   .33714013           0           0   .20257659    .4012965
    r6  -.17237555   .41173968           0    .0360826   .24108976   .45939028
    r7  -.20946439   .44511875  -.02437064   .05116421   .25880156   .48053282
    r8  -.22669804   .46500635  -.03745335   .06197856   .27635318   .50026581
    
                c7
    r1           0
    r2           0
    r3           0
    r4           0
    r5           0
    r6           0
    r7           0
    r8  -.02249678
    
    Algorithm is lasso
    
    Cp, R-squared and Actions along the sequence of models
    
    +---------------------------------------+
    | Step |      Cp     | R-square |  Action |
    |------+-------------+----------+-------|
    |    1 |     1.5636  |  0.0000  |       | 
    |    2 |     3.1899  |  0.0019  | +x4   | 
    |    3 |     2.4317  |  0.0159  | +age  | 
    |    4 |     1.0717 *|  0.0329  | +x3   | 
    |    5 |     2.5161  |  0.0357  | +grp  | 
    |    6 |     4.0876  |  0.0378  | +x2   | 
    |    7 |     6.0150  |  0.0382  | +x1   | 
    |    8 |     8.0000  |  0.0383  | +proc | 
    +---------------------------------------+
    * indicates the smallest value for Cp
    
    The coefficient values for the minimum Cp
    
    +-------------------------+
    | Variable |  Coefficient |
    |----------+--------------|
    | age      |       0.0012 |
    | x3       |       0.0256 |
    | x4       |       0.0607 |
    +-------------------------+
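
    For reading the Cp column: the standard definition of Mallows' Cp for a submodel with p fitted parameters (intercept included) is given below. That -lars- uses exactly this form, and that the step number equals p, are assumptions; they are consistent with the last row of the table, where the full model at step 8 has Cp = 8.0000.

    ```latex
    % n: number of observations used; RSS_p: residual sum of squares of the submodel;
    % \hat\sigma^2: residual variance estimated from the full model with P parameters.
    C_p = \frac{\mathrm{RSS}_p}{\hat\sigma^2} - n + 2p,
    \qquad
    \hat\sigma^2 = \frac{\mathrm{RSS}_P}{n - P}
    % For the full model (p = P): C_P = (n - P) - n + 2P = P.
    ```

    A smaller Cp indicates a better trade-off between fit and model size, whereas R-squared does not decrease as the penalty is relaxed and variables enter, so it is not a selection target on its own. Cp and R-squared are also least-squares quantities, which suggests the binary outcome is being treated as a continuous response here.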

    The low number of events per variable is acknowledged; nevertheless, I would like to be able to make more detailed analysis of the data than simply descriptive statistics.

    My questions:
    1. Given the main question of whether there is a difference in 'outcome' between groups (grp 1, 2 or 3), is Lasso followed by -bootstrap- the best way to explore this with my set of data?
    2. In the final table in the output (Variable and Coefficient), is this what -lars-/Lasso suggest as the 3 significant predictors of 'outcome'? If so, is this for p<0.05? (The -lars- Help File does not appear to answer this).
    3. In general terms, with Lasso are we hoping for the lowest value of Cp, and the highest R2?
    4. Given grp is a nominal variable, is there a way to get -lars- to compare groups 1, 2 and 3 in the model, in the same way that you can with other Stata commands using i.grp? I get the following error if I try to use i.grp with -lars-:

    Code:
    . lars outcome i.grp age x1 x2 x3 x4 proc, a(lasso)
    factor variables and time-series operators not allowed

    Jem
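
    One possible workaround for the factor-variable error, sketched under the assumption that -lars- accepts ordinary 0/1 variables like any other regressor: create the indicators by hand and enter them in place of i.grp. The names grp_1, grp_2 and grp_3 are simply what -tabulate, generate()- produces from the stub grp_.

    ```stata
    * Create one 0/1 indicator per level of grp: grp_1, grp_2, grp_3
    tabulate grp, generate(grp_)

    * Leave out grp_1 so that group 1 acts as the reference category
    lars outcome grp_2 grp_3 age x1 x2 x3 x4 proc, a(lasso)
    ```

    The lasso treats grp_2 and grp_3 as two separate variables, so it may keep one and drop the other; a retained indicator then contrasts that group with the other two combined rather than with group 1 alone.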

    Reference
    1. Pavlou M, et al. How to develop a more accurate risk prediction model when there are few events. BMJ 2015;351:h3868. doi:10.1136/bmj.h3868



    Stata/IC 14.2
    Mac

  • #2
    Hi,

    Is anyone able to offer any advice on this, please?

    Thanks

    Jem



    • #3
      Jem Lane --

      I'm not sure if it helps, but I have programmed a few binary choice lasso models myself in Mata. My approach follows the paper by Max Farrell on inference when a lasso is employed. The paper is here.

      All of my work is in notebook form on the GitHub site for the project, which is here. There is a workbook there titled "Developing Group Lasso Programs In Stata" that has a little bit of code interlaced with discussion.

      I'm not sure if that helps, but I would be happy to further discuss if it seems like it might!

      Best,

      Matthew J. Baker



      • #4
        Dear Matthew,

        Many thanks for your reply. I am grateful for your pointers, though I'm afraid the level of statistics is well beyond my understanding (my background is medical/biological sciences).
        After reading the article cited in my first post, I read around logistic regression and penalised methods, though at a comparatively basic level. While I'd love to understand things more fully, I'm really looking for help in using an existing command in Stata and in interpreting its output. Are you able to advise on the -lars-/Lasso commands above?

        best

        Jem



        • #5
          Jem Lane


          I did a little poking around, and it seems like what you want is the plogit command, which is a user-written command that estimates a lassoed logistic model. I was going to use this command for my work that I pointed to above, but I also needed a lassoed multinomial logit.

          In my experience, you might have to dive into the literature a little bit just so you can think about where the penalty function weight comes from. This is really the toughest issue. Also, note that plogit doesn't show up with a search, so you have to go to the web page of the author to install it!
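
          For orientation, the "penalty function weight" is the tuning parameter λ in the usual lasso-penalised logistic likelihood, written below in its textbook form. Whether plogit scales or parameterises the penalty in exactly this way is not stated in this thread, so treat the notation as an assumption.

          ```latex
          % y_i: binary outcome; x_i: covariate vector; \beta_0: intercept (unpenalised)
          % \lambda \ge 0: penalty weight; larger values shrink more coefficients to exactly zero
          \hat\beta = \arg\max_{\beta_0,\beta}
            \sum_{i=1}^{n} \Bigl[ y_i(\beta_0 + x_i^\top\beta)
              - \log\bigl(1 + e^{\beta_0 + x_i^\top\beta}\bigr) \Bigr]
            - \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert
          ```

          With λ = 0 this reduces to ordinary logistic regression; λ is commonly chosen by cross-validation.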

          Hope that helps!

          Matthew J. Baker



          • #6
            Thanks a lot for your help Matthew. I will have a look at this.

            best wishes

            Jem
