  • Running time is extremely long with GSEM

    Hello, everyone,

    I am trying to estimate a multilevel multinomial logistic model with -gsem-, but it runs extremely slowly, and I have never gotten a result within five days. Does anyone have a solution?

    Here are some details.

    The dataset has 5.24 million observations nested within 60,000 second-level groups (tid), which are in turn nested within 500 top-level groups (pid).

    The dependent variable has six categories, and there are three level-2 categorical indicators and seven level-3 categorical indicators.

    The command and the results are:
    ____________________________________________________________________
    . gsem (i.speed_pattern <- b0.weekend_holiday b0.time_period b0.light b1.bikelane_type b1.bridge_tunnel_type ///
    >     b0.intersection_type b0.intersection_ad b0.turn_ad b0.slope_type b0.land_use_type Le1[tid] Le2[pid>tid]), mlogit startvalues(zero)

    Refining starting values:

    Grid node 0: log likelihood = .
    Grid node 1: log likelihood = -9410006.8
    Grid node 2: log likelihood = -9409911.4
    Grid node 3: log likelihood = -9409916.5

    Fitting full model:
    ____________________________________________________________________


    Because of error 1400, I used startvalues(zero). However, it stays at "Fitting full model" without finishing even one iteration. I also tried a null model with no predictors, and it likewise failed to complete a single iteration.
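
    In case it helps, the null model I tried looks roughly like this (same random-effects structure as above, with no covariates):

    . gsem (i.speed_pattern <- Le1[tid] Le2[pid>tid]), mlogit startvalues(zero)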

    Can you help me with this problem?
    Last edited by Hong Yan; 09 Mar 2023, 13:30.

  • #2
    You are fitting a very complicated model to a very large dataset. When I have run problems of similar complexity on a data set of about that size, the run time on a mid-range Windows box with 4 cores has been a little over two weeks. So I think you are in for a long haul, at best.

    The numerical overflow error you got with the default starting values is very worrisome. Setting the starting values to zero does not guarantee that you will avoid it. If this calculation is going to converge, sooner or later it has to get into the part of parameter space where the log likelihood starts to look like what you got previously, and at that point there is a good chance you will overflow again. The overflow problem is a function of sample size: the log likelihood is added up over the observations, so it scales as O(estimation sample size). With a data set this big, it is very easy for the accumulated total of the log-likelihood calculation to run over the maximum that can be held in a double. You might get lucky, and I wish you well. But be prepared to see error 1400 again.

    If you haven't already done so, I would do a test run on a modest-sized data set to make sure your code does what you expect it to do and that the model is not inherently unidentifiable. After all, you don't want to run the real thing for a couple of weeks only to end up with (foreseeable) non-convergence!
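
    For instance, here is a rough, untested sketch of how I might draw a test sample of whole pid clusters rather than individual observations (the 2% fraction and the seed are arbitrary; the variable names come from your command):

    set seed 12345                        // any seed, just for reproducibility
    preserve
    egen pick = tag(pid)                  // flag one observation per pid cluster
    gen double u = runiform() if pick     // one random draw per cluster
    bysort pid (u): replace u = u[1]      // copy that draw to the whole cluster
    keep if u < 0.02                      // keep roughly 2% of the pid clusters
    gsem (i.speed_pattern <- b0.weekend_holiday b0.time_period b0.light b1.bikelane_type b1.bridge_tunnel_type ///
        b0.intersection_type b0.intersection_ad b0.turn_ad b0.slope_type b0.land_use_type ///
        Le1[tid] Le2[pid>tid]), mlogit startvalues(zero)
    restore

    Sampling whole clusters keeps the multilevel structure intact, which matters if you want the test run to tell you anything about the variance components.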
    Last edited by Clyde Schechter; 09 Mar 2023, 14:14.



    • #3
      Many thanks, Clyde,

      I use the SE version of Stata (only one core can be used), so it takes much more time.

      I tried estimating smaller samples: a 1% sample took 3 hours, a 2% sample took 7 hours, and a 3% sample took 15 hours. When I tried a 4% sample, it failed to converge. After changing the starting values to zero, I did not get a result within 5 days (I am using a supercomputer whose run-time limit is 5 days).

      Do you have other suggestions?

      Best,
      Hong



      • #4
        I'm sorry, but I can't think of anything that would let you estimate this model in a more timely way. It's a very complicated model. If this were my project, I would probably try to simplify the model. Among the things I would look into:
        1. Can the categories of speed_pattern be collapsed in a meaningful way so that you could use -logit- instead of -mlogit-? Or are some categories of speed_pattern rare enough that you might just eliminate those observations altogether? (A rough sketch appears after this list.)
        2. What are tid and pid? Do you really need random intercepts at both of those levels? Your results from the 1% and 2% samples you already ran might shed light on this. If the variance component at either level is close to zero, eliminating that level's random intercept would greatly simplify the model and speed up the calculations.
        3. What about all those explanatory variables: are there any with a large number of categories that might be collapsed, or observations with rare values that could be eliminated?
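
        As a rough illustration of points 1 and 2 (untested; which categories to merge is a substantive decision, not a statistical one, and the cutpoints below are purely hypothetical):

        // sketch: collapse speed_pattern into a binary outcome and keep only the tid-level intercept
        recode speed_pattern (1 2 = 0) (3/6 = 1), generate(speed_binary)
        melogit speed_binary i.weekend_holiday i.time_period i.light i.bikelane_type ///
            i.bridge_tunnel_type i.intersection_type i.intersection_ad i.turn_ad ///
            i.slope_type i.land_use_type || tid:

        -melogit- fits the same kind of random-intercept logistic model as -gsem- with a logit response, and with a single outcome equation and a single random intercept it should run much faster.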



        • #5
          Thanks, Clyde,

          Indeed, the extra level slows down the modelling.

          I had tried all the things you mentioned: (1) a three-level binary model (e.g. 1.speed_pattern vs 2.speed_pattern), (2) two-level models, and (3) a three-level null model (no predictors). I can get results with the second approach, namely two-level models (tid and the lowest level). The variance component at the pid level is close to zero, so a two-level model seems to be an option.
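
          For reference, by the two-level model I mean roughly the command from #1 with the pid-level intercept dropped:

          gsem (i.speed_pattern <- b0.weekend_holiday b0.time_period b0.light b1.bikelane_type b1.bridge_tunnel_type ///
              b0.intersection_type b0.intersection_ad b0.turn_ad b0.slope_type b0.land_use_type Le1[tid]), ///
              mlogit startvalues(zero)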

          However, I am also confused about several things.

          1) How do the higher-level characteristics affect running speed? I tested two two-level models: one with tid (more than 60,000 groups) and the lowest level, and another with pid (around 500 groups) and the lowest level. I assumed the second model would be faster because it has fewer clusters, but in fact the first one is much faster. Is there a reason for this? Is it related to the variance at these levels, i.e. does a small variance make the model difficult to estimate?

          2) I do not fully understand how the effects of the higher-level variables are calculated, and I did not find an explanation in the Stata reference manual. Do the higher-level variables affect the group-mean log odds of being in a specific class?



          • #6
            Is it related to the variance at these levels, i.e. does a small variance make the model difficult to estimate?
            Yes, that's correct. The estimates are calculated with the likelihood function parameterized not in terms of the variance components themselves but in terms of their natural logarithms. So if you have a variance component that is close to zero, its log is "close to negative infinity," which means the maximizer has a long climb along the likelihood in a region where it takes very large changes in the log of the variance component to make even a modest difference in the estimate of the variance component itself, i.e. the likelihood is relatively flat there.
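
            In symbols, a schematic of that parameterization (with $\sigma_u^2$ a variance component and $\theta$ the parameter the maximizer actually works with):

            $$\theta = \ln \sigma_u^2, \qquad \sigma_u^2 = e^{\theta}, \qquad \frac{d\,\sigma_u^2}{d\theta} = e^{\theta} \to 0 \ \text{as} \ \theta \to -\infty,$$

            so near $\sigma_u^2 = 0$ even very large steps in $\theta$ barely move $\sigma_u^2$ or the log likelihood, and the maximizer crawls.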

            Do the higher-level variables affect the group-mean log odds of being in a specific class?
            Yes. In this regard it is somewhat similar to introducing indicator variables for the pid/tid categories into a non-hierarchical regression. But when the variance of those higher-level effects is close to zero, it is almost like introducing a "constant variable" into the model, except that it doesn't get dropped due to collinearity. The implication is that it has only a minimal effect on the group-mean estimates.
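
            Schematically (a sketch of the linear predictor only, ignoring how -gsem- constrains the latent variables across the outcome equations), for an observation $i$ in tid $j$ within pid $p$, and outcome category $k$ versus the base category:

            $$\ln\frac{\Pr(y_{ijp}=k)}{\Pr(y_{ijp}=\text{base})} = \alpha_k + \mathbf{w}_{jp}\boldsymbol{\gamma}_k + \mathbf{z}_{p}\boldsymbol{\delta}_k + u_{jp} + v_{p},$$

            where $\mathbf{w}$ and $\mathbf{z}$ are the tid- and pid-level covariates and $u_{jp}$, $v_p$ the random intercepts. A pid-level covariate shifts this linear predictor for every observation in that pid, i.e. it shifts that group's mean log odds for category $k$; and when $\operatorname{Var}(v_p) \approx 0$, the $v_p$ term adds essentially nothing, which is the "constant variable" analogy above.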
            Last edited by Clyde Schechter; 23 Mar 2023, 12:01.
