Two part model

kusum shekhawat

Join Date: Jun 2022
Posts: 19

11 Jun 2022, 05:22

Hi,
I fitted a two-part model to my healthcare cost data using the following code:
twopm total_cost age i.age_grp i.sex i.comorb_cat ib2.health_insurance i.wealth_tertile i.facility1 i.level1 i.treatment1 i.flu1 ib2.sample_type_final ib2.Site, ///
firstpart(logit, nolog) secondpart(glm, family(gamma) link(log) nolog)

but I am getting a different number of observations in the first part, and I don't understand why.

Code:

. ta total_cost if total_cost==0

 total_cost |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        575      100.00      100.00
------------+-----------------------------------
      Total |        575      100.00

. sum total_cost

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
  total_cost |      3,729    974.1922    1202.411          0   19366.55

 twopm total_cost age i.age_grp i.sex i.comorb_cat ib2.health_insurance i.wealth_tertile i.facility1 i.level1 i.treatment1 i.flu1
>  ib2.sample_type_final ib2.Site, ///
> firstpart(logit, nolog) secondpart(glm, family(gamma) link(log) nolog)

Fitting logit regression for first part:
note: 2.level1 != 0 predicts success perfectly
      2.level1 dropped and 117 obs not used

note: 3.level1 != 0 predicts success perfectly
      3.level1 dropped and 53 obs not used

note: 3.treatment1 != 0 predicts success perfectly
      3.treatment1 dropped and 30 obs not used


Fitting glm regression for second part:

Two-part model
------------------------------------------------------------------------------
Log pseudolikelihood = -26187.565                 Number of obs   =       3529

Part 1: logit
------------------------------------------------------------------------------
                                                  Number of obs   =       3529
                                                  LR chi2(17)     =     941.09
                                                  Prob > chi2     =     0.0000
Log likelihood = -1098.1144                       Pseudo R2       =     0.3000

Part 2: glm
------------------------------------------------------------------------------
                                                   Number of obs   =      3154
Deviance         =  2317.199533                    (1/df) Deviance =  .7396104
Pearson          =  2677.580772                    (1/df) Pearson  =   .854638

Variance function: V(u) = u^2                      [Gamma]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  15.92292
Log likelihood   = -25089.45093                    BIC             = -22923.59
-----------------------------------------------------------------------------------
       total_cost |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
logit             |
              age |  -.0057328   .0036571    -1.57   0.117    -.0129006    .0014349
                  |
          age_grp |
           65-69  |  -.0207899   .1372475    -0.15   0.880      -.28979    .2482103
    70 and above  |   .0716346   .1354971     0.53   0.597    -.1939347     .337204
                  |
              sex |
               M  |  -.2404908   .1097937    -2.19   0.028    -.4556825    -.025299
                  |
       comorb_cat |
             One  |   .1405211   .1296914     1.08   0.279    -.1136694    .3947116
   More than one  |   .2788304   .1397698     1.99   0.046     .0048866    .5527743
                  |
 health_insurance |
             Yes  |  -.1972704   .1624971    -1.21   0.225    -.5157588    .1212181
                  |
   wealth_tertile |
               2  |   .4299188   .1629887     2.64   0.008     .1104669    .7493707
               3  |   .2415955   .1526641     1.58   0.114    -.0576207    .5408117
                  |
        facility1 |
         Private  |   3.134358   .6235388     5.03   0.000     1.912245    4.356472
                  |
           level1 |
         Primary  |   .1779982   .2299553     0.77   0.439    -.2727059    .6287023
                  |
       treatment1 |
      Ambulatory  |   3.532288   .7454025     4.74   0.000     2.071326     4.99325
                  |
             flu1 |
         flu/RSV  |    .727569   .2962698     2.46   0.014     .1468909    1.308247
                  |
sample_type_final |
            ALRI  |   .8066968   .1361144     5.93   0.000     .5399174    1.073476
                  |
             Site |
         Chennai  |   1.463906   .3477832     4.21   0.000     .7822639    2.145549
         Kolkata  |   2.587106   .3871506     6.68   0.000     1.828305    3.345907
            Pune  |   1.377833   .3393943     4.06   0.000     .7126324    2.043033
                  |
            _cons |  -.1149095   .1850908    -0.62   0.535    -.4776808    .2478617
------------------+----------------------------------------------------------------
glm               |
              age |  -.0031744   .0010904    -2.91   0.004    -.0053116   -.0010372
                  |
          age_grp |
           65-69  |  -.0194222   .0405161    -0.48   0.632    -.0988323    .0599878
    70 and above  |   .1175043   .0417505     2.81   0.005     .0356748    .1993339
                  |
              sex |
               M  |   .0591213   .0345553     1.71   0.087    -.0086058    .1268483
                  |
       comorb_cat |
             One  |   .1126224   .0449867     2.50   0.012       .02445    .2007947
   More than one  |   .2371152   .0446565     5.31   0.000     .1495901    .3246403
                  |
 health_insurance |
             Yes  |  -.0276218   .0589608    -0.47   0.639    -.1431829    .0879393
                  |
   wealth_tertile |
               2  |  -.0503335   .0443209    -1.14   0.256    -.1372007    .0365338
               3  |   -.026519   .0531616    -0.50   0.618    -.1307139    .0776758
                  |
        facility1 |
         Private  |   .2387775    .054445     4.39   0.000     .1320672    .3454877
                  |
           level1 |
         Primary  |  -.2222971   .0740025    -3.00   0.003    -.3673394   -.0772548
       Secondary  |  -.1561688   .1202328    -1.30   0.194    -.3918207    .0794831
        Tertiary  |   .1886037   .1490505     1.27   0.206    -.1035298    .4807372
                  |
       treatment1 |
      Ambulatory  |    .610182   .0699323     8.73   0.000     .4731172    .7472468
   Emergency/IPD  |   1.130927   .1708076     6.62   0.000     .7961499    1.465703
                  |
             flu1 |
         flu/RSV  |   .1072076   .0673063     1.59   0.111    -.0247104    .2391255
                  |
sample_type_final |
            ALRI  |   .4263658    .037677    11.32   0.000     .3525203    .5002113
                  |
             Site |
         Chennai  |    .217353    .079374     2.74   0.006     .0617827    .3729232
         Kolkata  |   -.237404   .0842827    -2.82   0.005    -.4025952   -.0722129
            Pune  |   .0250936   .0820798     0.31   0.760    -.1357799    .1859672
                  |
            _cons |   6.444186    .063974   100.73   0.000     6.318799    6.569573
-----------------------------------------------------------------------------------

.
end of do-file


Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

11 Jun 2022, 06:57

Kusum:

Code:

Fitting logit regression for first part:
note: 2.level1 != 0 predicts success perfectly
      2.level1 dropped and 117 obs not used

note: 3.level1 != 0 predicts success perfectly
      3.level1 dropped and 53 obs not used

note: 3.treatment1 != 0 predicts success perfectly
      3.treatment1 dropped and 30 obs not used

They sum to 200 observations, that is, the difference between the 3,729 observations in your data and the 3,529 used in the first part of your two-part model.

Kind regards,
Carlo
(Stata 19.0)


John Mullahy

Join Date: Dec 2016

Posts: 752
#3

11 Jun 2022, 07:12

The 200 observations dropped due to perfect prediction explain the reduction in sample size from 3,729 to 3,529 in part one.
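
A minimal sketch of how this could be checked in the data, assuming the variable names from #1; `anycost` is a hypothetical indicator introduced only for illustration. A factor level with no observations in the zero-cost column is one that logit reports as predicting success perfectly.

```stata
* Hypothetical helper: 1 if any cost was incurred, 0 if total_cost is zero
generate byte anycost = total_cost > 0 if !missing(total_cost)

* Levels with an empty anycost == 0 column predict success perfectly
tabulate level1 anycost
tabulate treatment1 anycost

* Observation accounting, using the counts shown in the output in #1
* Observations dropped by logit
display 117 + 53 + 30
* First-part (logit) sample
display 3729 - 200
* Second-part (glm) sample, positive costs only
display 3729 - 575
```

The three display lines should return 200, 3529 and 3154, matching the sample sizes reported for the two parts.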

kusum shekhawat

Join Date: Jun 2022
Posts: 19

12 Jun 2022, 06:37

I ran the following postestimation commands.
predict gives me 3,729 observations, but margins and margins, dydx(*) give 3,529 observations,
so I have the following questions:
1) margins gives the predicted mean (if I am correct), so why does it differ from the mean after predict (estimated mean 977.46)? And since this is a two-part model, how do I explain the 200 missing observations if I use margins?
2) How should dy/dx be interpreted? An example would help.
3) How do I get a combined coefficient (B) for this model, and would that combined coefficient be interpreted on the log scale?

Code:

 predict twopmhat
 sum twopmhat total_cost

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    twopmhat |      3,729    977.4612    617.3496   176.1645   6136.087
  total_cost |      3,729    974.1922    1202.411          0   19366.55

. margins
Warning: cannot perform check for estimable functions.

Predictive margins                              Number of obs     =      3,529

Expression   : twopm combined expected values, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   920.0791   17.81262    51.65   0.000      885.167    954.9912
------------------------------------------------------------------------------

. margins, dydx(*)
Warning: cannot perform check for estimable functions.

Average marginal effects                        Number of obs     =      3,529

Expression   : twopm combined expected values, predict()
dy/dx w.r.t. : age 2.age_grp 3.age_grp 2.sex 1.comorb_cat 2.comorb_cat 1.health_insurance 2.wealth_tertile 3.wealth_tertile
               2.facility1 1.level1 2.level1 3.level1 2.treatment1 3.treatment1 2.flu1 1.sample_type_final 1.Site 3.Site 4.Site

-----------------------------------------------------------------------------------
                  |            Delta-method
                  |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
              age |  -3.394332   1.051936    -3.23   0.001    -5.456088   -1.332576
                  |
          age_grp |
           65-69  |  -18.74307   37.27306    -0.50   0.615    -91.79693    54.31079
    70 and above  |   117.2507   41.37961     2.83   0.005     36.14818    198.3533
                  |
              sex |
               M  |   34.42272   33.42079     1.03   0.303    -31.08082    99.92626
                  |
       comorb_cat |
             One  |    105.404   38.92887     2.71   0.007     29.10485    181.7032
   More than one  |   235.9656   41.05419     5.75   0.000     155.5008    316.4303
                  |
 health_insurance |
             Yes  |  -41.54754   54.18024    -0.77   0.443    -147.7389    64.64379
                  |
   wealth_tertile |
               2  |  -10.24832    43.0106    -0.24   0.812    -94.54756    74.05091
               3  |   -3.28981    50.6795    -0.06   0.948    -102.6198    96.04018
                  |
        facility1 |
         Private  |    393.412   65.35237     6.02   0.000     265.3237    521.5003
                  |
           level1 |
         Primary  |  -192.9078   71.31992    -2.70   0.007    -332.6923   -53.12336
       Secondary  |   -148.369   109.9265    -1.35   0.177    -363.8209    67.08292
        Tertiary  |   212.9941   180.4417     1.18   0.238    -140.6651    566.6534
                  |
       treatment1 |
      Ambulatory  |   867.8743   102.3894     8.48   0.000     667.1948    1068.554
   Emergency/IPD  |   1561.687   384.6989     4.06   0.000     807.6912    2315.683
                  |
             flu1 |
         flu/RSV  |   160.5259    66.9739     2.40   0.017     29.25945    291.7923
                  |
sample_type_final |
            ALRI  |   494.7202   44.08234    11.22   0.000     408.3204      581.12
                  |
             Site |
         Chennai  |    369.941   88.66665     4.17   0.000     196.1576    543.7244
         Kolkata  |  -37.47339   74.12326    -0.51   0.613    -182.7523    107.8055
            Pune  |   154.0903   82.19243     1.87   0.061    -7.003907    315.1845
-----------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
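
To illustrate the note above, here is a hedged sketch using facility1 from this model. For a factor variable, the reported dy/dx should equal the difference between the predictive margin at that level and the predictive margin at the base level. It is expressed in the units of total_cost, not on the log scale.

```stata
* Average predicted total_cost with every observation set to each facility1 level in turn
margins facility1

* Discrete change from the base level; compare with the gap between the two margins above
margins, dydx(facility1)
```

Read this way, the 393.41 reported for Private would be the average difference in expected total_cost between private facilities and the base category, holding the other covariates at their observed values.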


John Mullahy

Join Date: Dec 2016

Posts: 752
#5

12 Jun 2022, 06:53

On your point #1, predict and margins treat the estimation sample differently. predict will use all available information unless you wish to restrict the prediction to the estimation sample, in which case you can use

Code:

predict twopmhat if e(sample)

Conversely margins will use only the estimation sample unless you instruct it not to do so

Code:

margins, noesample
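
As a sketch of how the two commands could be reconciled (the variable name `twopmhat_es` is made up for this example): restricting predict to the estimation sample should give a mean that matches the overall predictive margin reported by margins.

```stata
* Predict the combined expected cost only for observations in the estimation sample
predict double twopmhat_es if e(sample)
summarize twopmhat_es

* Overall predictive margin; by default this also uses e(sample) only
margins
```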
kusum shekhawat

Join Date: Jun 2022

Posts: 19
#6

16 Jun 2022, 14:57

Hi,
I was doing some exploratory analysis in R.
As I said in #1, I used a twopm model for my cost data,
and Carlo Lazzaro suggested that cost data follow a gamma distribution. But when I checked the distribution of my data excluding zeros, a lognormal distribution fits better (I compared Weibull, gamma, and lognormal) based on AIC and Q-Q plots, and I am not sure what to make of it.
Is it OK to say that my cost data follow a lognormal distribution?
If so, what family should I specify in glm instead of gamma when using the twopm command in Stata?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#7

16 Jun 2022, 15:17

Kusum:
as we know, omitting zeros (or any other value) is clearly arbitrary.
As far as healthcare costs are concerned, you may have a (hopefully very small) fraction of your patients who passed away just after entering the study (zero costs).
That said, sticking with -glm-, you may want to explore -link(log)- with -family(gaussian)-.
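
A sketch of what that could look like, assuming the same specification as in #1 with only the second-part family changed, and assuming it is run from a do-file so that the /// line continuations work. Note that a Gaussian family with a log link models the log of the mean cost, which is not the same as assuming lognormal costs.

```stata
* Same covariates as in #1; only the second-part family differs
twopm total_cost age i.age_grp i.sex i.comorb_cat ib2.health_insurance ///
    i.wealth_tertile i.facility1 i.level1 i.treatment1 i.flu1         ///
    ib2.sample_type_final ib2.Site,                                   ///
    firstpart(logit, nolog)                                           ///
    secondpart(glm, family(gaussian) link(log) nolog)
```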

Kind regards,
Carlo
(Stata 19.0)
kusum shekhawat

Join Date: Jun 2022

Posts: 19
#8

16 Jun 2022, 15:28

I tried including the zeros in R to check the distribution of my data, but gamma and Weibull throw errors because there are zeros in the data, so I had to exclude them.
So I'll ask a very basic question now:
how do I check the distribution and skewness of my data in Stata, with or without the zeros? I need to show the distribution in a graph and identify the distribution type to support my choice of glm family.
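
One possible starting point in Stata, offered as a sketch and not as a recommended workflow; `lncost` is a hypothetical variable name. It summarizes skewness with the zeros included, then looks at positive costs on the log scale, where an approximately normal shape would be consistent with lognormal positive costs.

```stata
* Summary statistics including skewness and kurtosis, zeros included
summarize total_cost, detail

* Histogram of all costs; the spike at zero is the mass that the first part models
histogram total_cost, percent

* Positive costs only, on the log scale (lncost is a hypothetical variable name)
generate double lncost = ln(total_cost) if total_cost > 0
histogram lncost, normal
qnorm lncost
```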
John Mullahy

Join Date: Dec 2016

Posts: 752
#9

16 Jun 2022, 15:35

Whether you use glm or twopm it is strongly recommended that you specify some form of robust vce as an option, e.g. vce(robust). This would be especially true if you decided to use a one- or two-part glm model with a log link, as Carlo suggested in #7.
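
A minimal sketch of that suggestion applied to the model in #1, assuming vce(robust) is passed as an option to twopm itself and that the command is run from a do-file so that the /// line continuations work:

```stata
* Two-part model from #1 with robust standard errors requested
twopm total_cost age i.age_grp i.sex i.comorb_cat ib2.health_insurance ///
    i.wealth_tertile i.facility1 i.level1 i.treatment1 i.flu1         ///
    ib2.sample_type_final ib2.Site,                                   ///
    firstpart(logit, nolog)                                           ///
    secondpart(glm, family(gamma) link(log) nolog)                    ///
    vce(robust)
```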
