  • Analyzing nested response scales using nlogit: example from McCullagh GLM textbook, Section 5.2.5

    Suppose that in a study of the effects of radiation individuals are classified as dead or not dead (stage 1).

    The nature of the study requires that deaths are classified as "due to cancer" or "not due to cancer". (stage 2)

    Death from cancer can either be "leukemia deaths" or "deaths from other cancers." (stage 3)

    The 4 mutually exclusive groups are (also depicted in the figure attached from McCullagh GLM book, pg 161):

    1. alive
    2. death from causes other than cancer
    3. death from cancers other than leukemia
    4. death from leukemia
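
    Written as stage-wise conditional probabilities, the four group probabilities factor as shown below. The symbols p_1, p_2, p_3 and \pi are introduced here only for illustration and are not necessarily McCullagh's notation.

    ```latex
    \begin{aligned}
    p_1 &= \Pr(\text{dead}), \qquad
    p_2 = \Pr(\text{cancer} \mid \text{dead}), \qquad
    p_3 = \Pr(\text{leukemia} \mid \text{cancer death}), \\
    \pi_{\text{alive}} &= 1 - p_1, \\
    \pi_{\text{non-cancer death}} &= p_1\,(1 - p_2), \\
    \pi_{\text{other cancer death}} &= p_1\,p_2\,(1 - p_3), \\
    \pi_{\text{leukemia death}} &= p_1\,p_2\,p_3 .
    \end{aligned}
    ```

    The four \pi terms sum to 1 for any values of p_1, p_2, p_3 in [0, 1].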

    Say that age and gender are the two factors we want to examine at each stage. My [incorrect] attempt at a nested logit model, using a simulated example (data attached), is shown below.

    What is the correct syntax for fitting this model? Thanks

    __________________________________________________ ____


    // viz data
    . tab choice_c4

    choice_c4 | Freq. Percent Cum.
    ------------+-----------------------------------
    0 | 18 18.00 18.00 (alive)
    1 | 33 33.00 51.00 (death from causes other than cancer)
    2 | 33 33.00 84.00 (death from cancers other than leukemia )
    3 | 16 16.00 100.00 (death from leukemia)
    ------------+-----------------------------------
    Total | 100 100.00

    . list id choice_c4 age sex_c2 in f/5,noobs

    +-----------------------------------+
    | id choice~4 age sex_c2 |
    |-----------------------------------|
    | 1 3 27.66556 1 |
    | 2 2 45.42451 0 |
    | 3 1 53.2578 0 |
    | 4 3 41.86436 1 |
    | 5 2 35.91204 1 |
    +-----------------------------------+

    .
    . // create choice indicators
    . generate choice0 = (choice_c4 == 0)

    . generate choice1 = (choice_c4 == 1)

    . generate choice2 = (choice_c4 == 2)

    . generate choice3 = (choice_c4 == 3)

    .
    . // format to long
    . reshape long choice, i(id) j(myclass)
    (j = 0 1 2 3)

    Data Wide -> Long
    -------------------------------------------------------------------
    > ----------
    Number of observations 100 -> 400
    Number of variables 8 -> 6
    j variable (4 values) -> myclass
    xij variables:
    choice0 choice1 ... choice3 -> choice
    -------------------------------------------------------------------
    > ----------

    . drop choice_c4


    . list id myclass age sex_c2 choice in f/10,noobs

    +-------------------------------------------+
    | id myclass age sex_c2 choice |
    |-------------------------------------------|
    | 1 0 27.66556 1 0 |
    | 1 1 27.66556 1 0 |
    | 1 2 27.66556 1 0 |
    | 1 3 27.66556 1 1 |
    | 2 0 45.42451 0 0 |
    |-------------------------------------------|
    | 2 1 45.42451 0 0 |
    | 2 2 45.42451 0 1 |
    | 2 3 45.42451 0 0 |
    | 3 0 53.2578 0 0 |
    | 3 1 53.2578 0 1 |
    +-------------------------------------------+

    .
    . // we will produce the tree architecture for the nested logistic
    > regression
    .
    . nlogitgen top = myclass(A: 0, BCD: 1|2|3)
    New variable top is generated with 2 groups
    label list lb_top
    lb_top:
    1 A
    2 BCD

    . nlogitgen middle = myclass(B: 1, CD: 2|3)
    New variable middle is generated with 2 groups
    label list lb_middle
    lb_middle:
    1 B
    2 CD

    . nlogitgen bottom = myclass(C: 2, D: 3)
    New variable bottom is generated with 2 groups
    label list lb_bottom
    lb_bottom:
    1 C
    2 D

    .
    . // run the nested logistic regression model
    . nlogit choice age sex_c2 || top: || middle: || bottom:, case(id)
    no cases remain after removing invalid observations
    r(2000);

    end of do-file

    r(2000);

    Last edited by Jesus Vazquez; 17 Apr 2024, 13:30.

  • #2
    Let me start by saying that this is not my area and I'm not familiar with -nlogit-, so take what follows with a heap of salt. My naive impression after skimming the manual for -nlogit- is that this might not be the best model. The PDF manual offers a technical note in Example 1 suggesting that the nested logit model does not imply or require the temporal ordering of choices that the tree structure implies. Here, that ordering would mean that people "choose" to be alive or dead; if dead, the cause is cancer or not cancer; and if cancer, leukemia or another cancer. The reason I bring this up is that in the manual's example the choices are among different restaurants and all choices are mutually exclusive. In your example, the structure is more hierarchical (e.g., death due to any cause is further subdivided by cause).

    When I've seen these structures, I usually think of sequential logistic regression models. The idea is to first fit a logistic regression of died vs alive. The next step is to model cancer deaths vs other causes of death among those who died. The last level is a model for leukemia deaths vs other cancers among those who died of cancer. As you proceed down the hierarchy, the available sample gets (appropriately) smaller. Below I show an example using your data and show that the conditional probabilities of death at each stage are preserved when modeled without covariates. You can, of course, add covariates at each stage (and I think they may differ across stages if you wish).

    Code:
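    * Stage indicators; each is missing outside its risk set, so the later
    * logits are fit only among those who reach that stage
    gen byte died = inrange(choice_c4, 1, 3)
    gen byte died_cancer = inrange(choice_c4, 2, 3) if died
    gen byte died_leukemia = choice_c4==3 if died & died_cancer
    
    tab choice_c4
    
    * Intercept-only logits: margins returns the conditional proportion at each stage
    qui logit died
    margins
    qui logit died_cancer
    margins
    qui logit died_leukemia
    margins
    
    * -groups- is community-contributed (available from SSC)
    groups choice_c4 died*, missing abbrev(20)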
    Selected results


    . tab choice_c4

    choice_c4 | Freq. Percent Cum.
    ------------+-----------------------------------
    0 | 18 18.00 18.00
    1 | 33 33.00 51.00
    2 | 33 33.00 84.00
    3 | 16 16.00 100.00
    ------------+-----------------------------------
    Total | 100 100.00

    . qui logit died
    . margins
    Expression: Pr(died), predict()

    ------------------------------------------------------------------------------
    | Delta-method
    | Margin std. err. z P>|z| [95% conf. interval]
    -------------+----------------------------------------------------------------
    _cons | .82 .0384187 21.34 0.000 .7447006 .8952994
    ------------------------------------------------------------------------------

    . qui logit died_cancer
    . margins
    Expression: Pr(died_cancer), predict()

    ------------------------------------------------------------------------------
    | Delta-method
    | Margin std. err. z P>|z| [95% conf. interval]
    -------------+----------------------------------------------------------------
    _cons | .597561 .0541545 11.03 0.000 .4914202 .7037018
    ------------------------------------------------------------------------------

    . qui logit died_leukemia
    . margins
    Expression: Pr(died_leukemia), predict()

    ------------------------------------------------------------------------------
    | Delta-method
    | Margin std. err. z P>|z| [95% conf. interval]
    -------------+----------------------------------------------------------------
    _cons | .3265306 .066992 4.87 0.000 .1952287 .4578325
    ------------------------------------------------------------------------------

    . di 82/100
    .82

    . di 49/82
    .59756098

    . di 16/49
    .32653061
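
    With covariates, the same sequential approach could look like the sketch below. It assumes the data are still in wide format (one row per person), with age and sex_c2 as listed in #1 and the three stage indicators created above. The p_* and pr* variable names are made up for illustration.

    ```stata
    * Stage-wise logits; each is fit only where its outcome is nonmissing
    logit died c.age i.sex_c2
    predict double p_died, pr          // Pr(died)
    logit died_cancer c.age i.sex_c2
    predict double p_cancer, pr        // Pr(cancer | died), predicted for everyone
    logit died_leukemia c.age i.sex_c2
    predict double p_leuk, pr          // Pr(leukemia | cancer death)

    * Multiply down the tree to get the four category probabilities
    gen double pr0 = 1 - p_died                        // alive
    gen double pr1 = p_died * (1 - p_cancer)           // non-cancer death
    gen double pr2 = p_died * p_cancer * (1 - p_leuk)  // other cancer death
    gen double pr3 = p_died * p_cancer * p_leuk        // leukemia death

    * Sanity check: the four probabilities sum to 1 for every person
    assert reldif(pr0 + pr1 + pr2 + pr3, 1) < 1e-8
    ```

    Because -predict- after -logit- also fills in out-of-sample observations, every person gets all three conditional probabilities, including those who never reached the later stages.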



    • #3
      Thank you Leonardo for the thoughtful response.

      Prior to posting my question, I tried the approach that you are recommending. For prediction purposes (as opposed to estimation), this process should suffice.

      My problem is that, when applied to the real data, everyone gets assigned to the alive category at the first stage. Thus, I don't have anyone else to classify at the 2nd (cancer or not) and 3rd stages (leukemia or not).

      At the first stage, everyone gets assigned to the alive category because the predicted probabilities for this category are all above 0.5 due to high class imbalance. This leaves me with two options:

      1. try to use a nested logistic regression model as specified by nlogit (what I tried).
      2. tune the predicted-probability cutoff

      I was hoping the first option would work, because it is not obvious how best to implement the second. Any leads would be helpful.
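
      For reference, a minimal sketch of how the second option might be explored for the first stage with Stata's built-in postestimation tools. It assumes a 0/1 indicator died as created in #2 and the covariates age and sex_c2. Using the sample prevalence as the cutoff is just one common heuristic, shown as an assumption rather than a recommendation.

      ```stata
      logit died c.age i.sex_c2
      estat classification              // classification table at the default cutoff of 0.5
      lsens                             // graph sensitivity and specificity against the cutoff

      * Try the observed proportion of deaths in the estimation sample as the cutoff
      summarize died if e(sample), meanonly
      estat classification, cutoff(`r(mean)')
      ```

      The same steps could be repeated for the later stages on their smaller risk sets.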

      Thanks
      Last edited by Jesus Vazquez; 18 Apr 2024, 09:29.



      • #4
        Well it's still not clear what your aim is. I came to this thread thinking you wanted to reproduce results from a textbook, and now you have real data with an unstated research question. I'll make 2 general remarks.

        1) Assigning class membership based on an arbitrary cut-point of 0.5 doesn't make sense when classes are imbalanced. The regression constant in the unconditional model is the "tuned" turning point between classes. However, this suggests that you should be performing a simulation that uses whatever cutoff you choose as the probability of assignment.
        2) Discrete models such as logistic regression may not be of interest if you have time-to-event data, where competing risks or other kinds of Cox regression may be more suitable to your question.
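
        One possible reading of remark 1 is to assign each person by a random draw with a given probability instead of a fixed 0.5 threshold. The sketch below uses the model's predicted probability for that purpose, which is an interpretation and not something stated in the thread. It assumes the 0/1 indicator died from #2 and the covariates age and sex_c2; phat, sim_died, and the seed are arbitrary choices for illustration.

        ```stata
        set seed 20240418                        // arbitrary seed, for reproducibility
        logit died c.age i.sex_c2
        predict double phat, pr                  // predicted Pr(died)
        gen byte sim_died = runiform() < phat    // one Bernoulli draw per person
        tab died sim_died                        // compare observed and simulated assignments
        ```

        Repeating the draw many times would show how variable the simulated assignments are.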

