
  • Advice/tips for numerical optimization

    Hi All:

    I am dealing with a hard-to-maximize likelihood function. My experience so far is that cycling between multiple optimization algorithms via ml's technique() option is very useful for finding the peaks of difficult-to-maximize functions. I have used technique(bhhh 4 nr 5), technique(nr 5 dfp 5), and so on. Doing so achieves convergence, but only after a large number of iterations (70-90), and the maximization is quite slow.

    Following the excellent advice in Gould et al.'s Maximum Likelihood Estimation with Stata (4th edition), examining the trace reveals that the Newton-Raphson (NR) steps stay in the problematic region of the likelihood for most of the run, although NR eventually escapes it and stabilizes towards the end, when convergence is achieved. In contrast, the Davidon-Fletcher-Powell (DFP) steps are (expectedly) much faster and find the function's peak in fewer iterations, apart from backing up in the first few iterations.

    I am interested in your thoughts on using DFP instead of NR in this scenario. We have around 50k observations, and calculating the Hessian is very expensive because simulation is involved in the maximization. The assumption of a random sample seems reasonable in this particular application, so I am inclined towards using the empirical OPG variance estimator, but I am curious whether I might be missing something here.
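
    For concreteness, a minimal sketch of the kind of call I have in mind follows; the evaluator, equation, and variable names are just placeholders for my actual model:

    Code:
    * Sketch only: pure DFP for the maximization and the OPG (BHHH-style)
    * variance estimator, so the numerical Hessian is not needed for the
    * standard errors.
    ml model lf mysim_ll (xb: y = x1 x2 x3) /lnsig, technique(dfp) vce(opg)
    ml maximize, difficult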

    In case it helps, I am trying to maximize a joint discrete-continuous choice likelihood, with integrals involved both in the likelihood itself and in computing nuisance heterogeneity parameters, so I am using maximum simulated likelihood in the likelihood evaluator program.
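
    For what it is worth, a heavily simplified skeleton of the evaluator pattern I am using is below (it would be defined before issuing the ml model call above); everything in it, including the random-intercept probit structure, the draw variables draw1-draw50, and the names, is a hypothetical stand-in for my actual model, just to show the averaging-over-draws idea:

    Code:
    * Skeleton only: lf-type evaluator for a probit with a normal random
    * intercept, estimated by maximum simulated likelihood. Assumes 50 draws
    * per observation were generated beforehand and stored in draw1-draw50
    * (e.g., Halton draws).
    program define mysim_ll
        version 17
        args lnfj xb lnsig
        tempvar psum
        quietly generate double `psum' = 0
        forvalues r = 1/50 {
            * accumulate the conditional probability at each draw
            quietly replace `psum' = `psum' + ///
                normal(cond($ML_y1 == 1, 1, -1)*(`xb' + exp(`lnsig')*draw`r'))
        }
        * simulated log likelihood = log of the average over draws
        quietly replace `lnfj' = ln(`psum'/50)
    end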

    Thanks much for any suggestions!

  • #2
    What works, works. So if Davidon-Fletcher-Powell (DFP) does the job to your satisfaction, go for it. Another maximization method that works well for difficult integrands is adaptive quadrature. However, these methods don’t have to be mutually exclusive.

    You can start with DFP or adaptive quadrature, and once you achieve convergence, switch to Newton-Raphson for faster refinement. This hybrid approach is used in practice; for example, the gllamm command from SSC follows a similar strategy (see below).

    Code:
    webuse tvsfpors, clear
    gen cctv = cc*tv
    gllamm thk prethk cc tv cctv, i(school) family(binomial) link(ologit) adapt
    Res.:

    Code:
    . gllamm thk prethk cc tv cctv, i(school) family(binomial) link(ologit) adapt
    
    Running adaptive quadrature
    Iteration 0:    log likelihood = -2123.8577
    Iteration 1:    log likelihood = -2120.0494
    Iteration 2:    log likelihood = -2119.7702
    Iteration 3:    log likelihood = -2119.7605
    Iteration 4:    log likelihood = -2119.7506
    Iteration 5:    log likelihood = -2119.7444
    Iteration 6:    log likelihood = -2119.7442
    
    
    Adaptive quadrature has converged, running Newton-Raphson
    Iteration 0:  Log likelihood = -2119.7442  
    Iteration 1:  Log likelihood = -2119.7442  (backed up)
    Iteration 2:  Log likelihood = -2119.7428  
    Iteration 3:  Log likelihood = -2119.7428  
     
    number of level 1 units = 1600
    number of level 2 units = 28
     
    Condition Number = 16.579687
     
    gllamm model 
     
    log likelihood = -2119.7428
     
    ------------------------------------------------------------------------------
             thk | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    thk          |
          prethk |   .4032892     .03886    10.38   0.000      .327125    .4794533
              cc |   .9237908   .2040746     4.53   0.000     .5238119     1.32377
              tv |   .2749939   .1977431     1.39   0.164    -.1125754    .6625633
            cctv |  -.4659264   .2845972    -1.64   0.102    -1.023727     .091874
    -------------+----------------------------------------------------------------
    _cut11       |
           _cons |  -.0884495   .1641067    -0.54   0.590    -.4100927    .2331937
    -------------+----------------------------------------------------------------
    _cut12       |
           _cons |   1.153364   .1656165     6.96   0.000     .8287615    1.477966
    -------------+----------------------------------------------------------------
    _cut13       |
           _cons |    2.33195   .1734203    13.45   0.000     1.992052    2.671847
    ------------------------------------------------------------------------------
     
     
    Variances and covariances of random effects
    ------------------------------------------------------------------------------
    
     
    ***level 2 (school)
     
        var(1): .07351208 (.03831112)
    ------------------------------------------------------------------------------



    • #3
      Many thanks for your response! DFP worked with my likelihood this time. We have a 7-dimensional rectangular integral, so (adaptive) quadrature was unfortunately not an option. As a side note, looking at your output, it appears that gllamm uses adaptive quadrature as an optimizer, beyond its typical use for numerical integration (e.g., as in Mata's Quadrature() function). I'll look into this further. Thanks.



      • #4
        Originally posted by Behram Wali
        looking at your output, it appears gllamm uses adaptive quadrature as an optimizer
        Indeed, the idea is to use adaptive quadrature to navigate the problematic regions of the likelihood function where Newton-Raphson struggles. Once you move closer to the maximum, where the likelihood surface is better behaved, you can switch to Newton-Raphson, as it is typically more efficient in that region. To implement this strategy, you would generally set the tolerance level for the adaptive-quadrature stage not too conservatively, allowing for a smooth transition to Newton-Raphson.
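
        If you want to mimic this two-stage idea with ml rather than gllamm, a rough sketch is below; the evaluator and variable names are hypothetical, and the tolerance values are only illustrative:

        Code:
        * Stage 1: DFP with loose convergence criteria to get near the maximum
        * cheaply (nonrtolerance turns off the scaled-gradient criterion).
        ml model lf mysim_ll (xb: y = x1 x2 x3) /lnsig, technique(dfp)
        ml maximize, tolerance(1e-3) ltolerance(1e-5) nonrtolerance
        matrix b_dfp = e(b)              // carry the stage-1 estimates forward

        * Stage 2: restart Newton-Raphson from the stage-1 estimates for the
        * final refinement.
        ml model lf mysim_ll (xb: y = x1 x2 x3) /lnsig, technique(nr)
        ml init b_dfp, copy
        ml maximize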

        In your case, you could experiment with using DFP as an optimizer, as it can provide a good balance between convergence stability and computational efficiency.
