  • Fitting a bimodal distribution using fmm, need some insights

    Hi everybody:

    Background. I'm working on a friend's dataset of 135 beetles from the north of Spain belonging, allegedly, to two different subspecies (European and African), characterised by a variable called E/R (the ratio of total wing length to wing color patch length). I used Hartigan's dip test to reject unimodality on the complete dataset (it succeeded; p=0.005). Then my friend, a p-value worshipper, asked me to verify whether bimodality was also present in a subset of 43 cases from a very restricted geographical area. Unfortunately, although violin plots suggest bimodality, the dip test fails to reach significance there. I have been unable to convince him that absence of evidence is not evidence of absence (lack of significance does not prove unimodality).

    So I tried a different approach: showing that AIC is lower for a two-component mixture than for a single normal distribution. I used -fmm- (by Partha Deb, City University of New York) to fit a two-component mixture to the overall dataset:

    Code:
    . fmm e_r, mix(normal) comp(2)
    
    Fitting Normal regression model:
    
    Iteration 0:   log likelihood = -168.39943  
    Iteration 1:   log likelihood = -168.39757  
    Iteration 2:   log likelihood = -168.39757  
    
    Fitting 2 component Normal model:
    
    Iteration 0:   log likelihood = -168.71114  (not concave)
    Iteration 1:   log likelihood = -168.41975  (not concave)
    Iteration 2:   log likelihood = -168.39071  (not concave)
    Iteration 3:   log likelihood = -168.39012  (not concave)
    Iteration 4:   log likelihood = -168.38867  (not concave)
    Iteration 5:   log likelihood = -168.38748  (not concave)
    Iteration 6:   log likelihood = -168.38619  (not concave)
    Iteration 7:   log likelihood = -168.38379  (not concave)
    Iteration 8:   log likelihood =  -168.3813  (not concave)
    Iteration 9:   log likelihood = -168.19305  (not concave)
    Iteration 10:  log likelihood = -163.63375  (not concave)
    Iteration 11:  log likelihood = -158.61347  (not concave)
    Iteration 12:  log likelihood = -151.78992  
    Iteration 13:  log likelihood = -147.56396  
    Iteration 14:  log likelihood =  -146.9277  
    Iteration 15:  log likelihood = -146.92397  
    Iteration 16:  log likelihood = -146.92397  
    
    2 component Normal regression                     Number of obs   =        135
                                                      Wald chi2(0)    =          .
    Log likelihood = -146.92397                       Prob > chi2     =          .
    
    ------------------------------------------------------------------------------
             e_r |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    component1   |
           _cons |   3.372406   .0441419    76.40   0.000     3.285889    3.458922
    -------------+----------------------------------------------------------------
    component2   |
           _cons |   4.825329   .0813699    59.30   0.000     4.665847    4.984811
    -------------+----------------------------------------------------------------
     /imlogitpi1 |  -.5489003   .2208433    -2.49   0.013    -.9817452   -.1160554
       /lnsigma1 |  -1.426499   .1336279   -10.68   0.000    -1.688404   -1.164593
       /lnsigma2 |  -.5804232   .1125089    -5.16   0.000    -.8009366   -.3599098
    ------------------------------------------------------------------------------
          sigma1 |   .2401483   .0320905                      .1848142    .3120497
          sigma2 |   .5596615   .0629669                      .4489083    .6977392
             pi1 |   .3661196   .0512524                      .2725456    .4710187
             pi2 |   .6338804   .0512524                      .5289813    .7274544
    ------------------------------------------------------------------------------
    
    . estat ic
    
    -----------------------------------------------------------------------------
           Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
    -------------+---------------------------------------------------------------
               . |    135           .    -146.924      5     303.8479    318.3743
    -----------------------------------------------------------------------------
    Now I'm stuck trying to fit a single normal distribution to the same data so that I can compare the AICs.
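
    One possible route, sketched under the assumption that only the log likelihood and AIC of the single normal are needed rather than a full fit: use the closed-form ML solution, remembering that the ML variance uses the N divisor rather than N-1.

    Code:
    * single-normal log likelihood and AIC by hand; r(Var) from summarize
    * uses the N-1 divisor, so it is rescaled to the ML (divisor N) variance
    quietly summarize e_r
    local n   = r(N)
    local s2  = r(Var) * (`n' - 1) / `n'
    local ll  = -0.5 * `n' * (ln(2*_pi) + ln(`s2') + 1)
    local aic = -2*`ll' + 2*2
    display "single-normal log likelihood = " `ll'
    display "AIC (2 parameters: mean, sigma) = " `aic'

    Incidentally, the "Fitting Normal regression model" step at the top of the -fmm- output above already reports the single-normal log likelihood (-168.39757), which this calculation should reproduce.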

    Another question: is there something in this output I can use to compute a bimodality p-value for my friend, the p-value worshipper? I can test that the two means (3.37 and 4.83) are significantly different, and I can show that pi1 is significantly different from 0. Is that enough?
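
    For what it's worth, a sketch of one p-value-flavoured quantity: the likelihood-ratio statistic between the single-normal and two-component fits. A loud caveat: because the single-normal null puts the mixing proportion on the boundary of the parameter space, the usual chi-squared reference distribution is only approximate for tests of the number of components, and a bootstrap of the statistic is generally preferred; treat the nominal p-value as a rough guide.

    Code:
    * LR statistic between the single normal and the 2-component mixture;
    * ll1 is the single-normal log likelihood (the "Fitting Normal
    * regression model" step above), ll2 is the 2-component log likelihood
    scalar ll1 = -168.39757
    scalar ll2 = -146.92397
    scalar lr  = 2*(ll2 - ll1)
    display "LR statistic = " lr
    display "nominal p-value, chi2(3) = " chi2tail(3,lr)

    Even then, a better-fitting two-component mixture is not quite the same thing as bimodality: a mixture of two normals can still be unimodal when the means are close relative to the sigmas.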

    Thanks in advance

    Marta GG

  • #2
    Follow-up to my own question

    I kept trying on my own, and I finally ended up with this:

    Code:
    program define normal
        version 12.1
        // linear-form (lf) ML evaluator for a single normal distribution:
        // equation 1 is the mean (mu), equation 2 is the std. dev. (sigma)
        args lnf mu sigma
        tempvar z
        quietly {
            gen double `z' = ($ML_y1 - `mu')/`sigma'
            // observation-level log likelihood of the normal density
            replace `lnf' = ln(normalden(`z')) - ln(`sigma')
        }
    end
    
    . ml model lf normal (e_r=) (e_r=)
    . ml maximize
    
    initial:       log likelihood =     -<inf>  (could not be evaluated)
    feasible:      log likelihood = -4107.3034
    rescale:       log likelihood = -314.56306
    rescale eq:    log likelihood = -177.76262
    Iteration 0:   log likelihood = -177.76262  
    Iteration 1:   log likelihood = -168.66962  
    Iteration 2:   log likelihood = -168.39832  
    Iteration 3:   log likelihood = -168.39757  
    Iteration 4:   log likelihood = -168.39757  
    
                                                      Number of obs   =        135
                                                      Wald chi2(0)    =          .
    Log likelihood = -168.39757                       Prob > chi2     =          .
    
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    eq1          |
           _cons |   4.293385   .0724987    59.22   0.000      4.15129     4.43548
    -------------+----------------------------------------------------------------
    eq2          |
           _cons |   .8423589   .0512643    16.43   0.000     .7418826    .9428351
    ------------------------------------------------------------------------------
    
    . estat ic
    
    -----------------------------------------------------------------------------
           Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
    -------------+---------------------------------------------------------------
               . |    135           .   -168.3976      2     340.7951    346.6057
    -----------------------------------------------------------------------------
                   Note:  N=Obs used in calculating BIC; see [R] BIC note


    So now my question is: how can I compare the two AICs (303.8479 vs. 340.7951)?
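
    A minimal sketch of the usual comparison, for what it's worth: the model with the lower AIC is preferred, and the difference can be re-expressed as Akaike weights (the relative support for each model).

    Code:
    * comparing the two AICs reported by estat ic; lower is better, and the
    * difference converts into Akaike weights for the two candidate models
    scalar aic_mix    = 303.8479
    scalar aic_normal = 340.7951
    scalar d = aic_normal - aic_mix
    display "delta AIC = " d
    display "Akaike weight of the mixture = " 1/(1+exp(-d/2))

    With a difference of roughly 37 AIC points, the weight on the single normal is effectively zero.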

    Thanks a lot

    Marta



    • #3
      I would (also) do something much more exploratory, which is to generate a series of density estimates with the same kernel shape but increasing halfwidth and to keep track of how many modes are visible in the different estimates of the density function. It should be 2 over a wide range of halfwidths if the distribution is genuinely bimodal. There is a small literature on this method: Google "mode trees".

      Disclaimer: I don't know of a Stata implementation of the displays that search would find, but minimally a portfolio of, say, 16 density estimates is easily obtainable by a loop over 16 halfwidths, as sketched below.
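
      A minimal sketch of such a loop, assuming that halfwidths from 0.05 to 0.80 are a sensible range for the scale of e_r (that grid is a guess and would need adjusting to the data):

      Code:
      * portfolio of 16 kernel density estimates of e_r over a grid of
      * halfwidths; the grid (0.05 to 0.80) is illustrative only
      forvalues i = 1/16 {
          local h : display %4.2f 0.05*`i'
          quietly kdensity e_r, bwidth(`h') nograph generate(x`i' d`i')
          line d`i' x`i', name(g`i', replace) title("halfwidth `h'") nodraw
      }
      graph combine g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16, cols(4)

      Counting the visible modes in each panel as the halfwidth grows gives a rough, visual version of the mode-tree idea.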



      • #4
        Thanks a lot, I'll try that.
