  • Logit estimation and Iteration

    Hello Users,

    I have a problem with iteration. I am trying to estimate a logit regression, and the iterations Stata performs are taking more than one week to produce results because the data set is very large. I have been advised to use
    Code:
    set iterlog off
    and also to use the
    Code:
    iterate(#)
    option.
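
    For reference, here is a minimal sketch of how I understand that advice (with a hypothetical outcome y and regressor x):
    Code:
    set iterlog off          // turn off the iteration log display
    logit y x, iterate(200)  // allow at most 200 iterations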

    Does anyone know what differences I might see in my results if I chose to turn the iteration log off? And my second question: if I do keep iterating, what is the optimal maximum number of iterations to specify?

    Thank you
    Li
    Last edited by Jade Li; 20 Aug 2023, 12:07.

  • #2
    More information is needed to give specific advice. There are many possible problem situations here, as well as the possibility that nothing at all is wrong and you just need to be more patient.

    Since the range of possibilities is so large, let's narrow it down with a few screening questions.

    1. What exact command are you running? (Please provide the complete command, exactly as you are giving it to Stata: do not leave anything out or edit anything.)
    2. What is happening to the log likelihood? Does it continue to increase in each iteration? Or does it at some point either get stuck at one value (perhaps accompanied by a notation that it is "backed up" or "not concave")? Or does it go around in circles after some point?
    3. How many observations are there in your data set (including observations that will not be in the estimation sample)?
    4. About how many observations do you expect in the estimation sample?
    5. Exactly how long has it been iterating, and how many iterations have occurred in that time? (There is a difference between a large number of iterations, and a moderate number each of which takes a long time.)

    With answers to those questions, we can narrow down the possibilities to a smaller number and then focus on diagnosing and solving the problem within those bounds.



    • #3
      We need information like Clyde suggests. My own experience is that if something takes forever to converge there is often a problem with the model or the data. Depending on what you are doing, the following may have some useful tips on things to try.

      https://www3.nd.edu/~rwilliam/xsoc73994/L02.pdf

      Occasionally the difficult option, or just rescaling some variables, will work miracles, and both are easy to do.
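
      As a hedged illustration of both ideas (hypothetical variable names; -difficult- is one of Stata's standard maximize options):
      Code:
      gen double income_k = income/1000     // rescale a large-magnitude regressor
      xtset panelvar
      xtlogit y x income_k, difficult       // try the alternative stepping algorithm in nonconcave regions
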
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Dear Mr. Schechter and dear Mr. Williams,

        Thank you for your replies.

        I am using confidential data from France. I do not have answers to all of your questions because I cannot see what Stata shows while the code is running.

        Question 1.

        I am running two commands, and according to the data administrator both of them cause this iteration issue:
        Code:
        xtset panelvariable
        xtprobit  dep_main c.prop##c.inco  c.prop##c.rev c.prop2##c.inco  c.prop2##c.rev, vce(cluster sec)
        The code above takes 2 weeks to generate the results.

        Code:
        xtset panelvariable
        xtlogit dep_main c.prop##c.inco  c.prop##c.rev c.prop2##c.inco  c.prop2##c.rev, vce(cluster sec)
        This code took around 11 to 12 days (in another specification I may use control variables).

        Question 2.
        I do not know, because I do not run the code myself.

        Question 3.
        The total number of observations: 16,487,660

        Question 4.
        They are between 13,000,000 and 15,000,000

        Question 5.
        It has been running for more than 3 weeks now, although the admin also asked me to change something in the code. During the first week it completed 729 iterations and was still running.



        • #5
          This clarifies that you are running a pretty massive problem. You may just need to be patient.

          Have you tried running a much simpler model with fewer variables? If you build a model gradually you may identify where a problem is.

          How are your variables scaled? If, say, you have income in dollars, income in thousands of dollars may work better. Even computers can have precision problems with very large numbers.

          Do you really need to analyze all the cases? Could you take a subsample of, say, 100,000 cases? Even if you eventually want/need all the cases, preliminary work might be done with a subsample. If there is a serious flaw in your model I’d rather find out in a day than in two weeks.
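
          A hedged sketch of the subsampling idea with the built-in -sample- command (note this draws individual observations, which splits panels; drawing whole panels may be preferable for an -xt- model):
          Code:
          set seed 12345
          preserve
          sample 100000, count      // keep a random subsample of exactly 100,000 observations
          * ... fit the candidate model on the subsample here ...
          restore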

          In short, everything may be fine. But I personally would have started with simpler models and built up. I’d also look at the coding and scaling.
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            I will try using a sample then and see how this would work.
            Thank you!



            • #7
              Just to add: I need to use the variables I mentioned above. I cannot use fewer variables because these are the main variables in the model (if I understood you correctly). I also checked my income, investment, and revenue variables; they are all scaled in billions of euros.



              • #8
                For sampling, you may want to use the user-written gsample command with the cluster option. For example, if there are 100,000 groups in your sample, you could sample 1,000 of them, including all the records for each of them.
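
                A hedged sketch of that approach (exact option names as documented in -gsample-'s help file once installed; panelvariable is the panel identifier from #4):
                Code:
                ssc install gsample                        // user-written command by Ben Jann
                set seed 2023
                gsample 1000, wor cluster(panelvariable)   // keep all records for 1,000 randomly drawn panels, sampled without replacement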

                Do you absolutely positively need to have all the interactions in?

                Even if you think this absolutely positively has to be the final model, there is nothing that says you can't start more simply and build up. You might identify a problem point which you can then try to work around.

                Also just try basic descriptive stuff and data cleaning, e.g. make sure there aren't any possible coding errors and that missing data is being handled correctly.
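
                For example, a few quick checks along those lines (variable names from #4):
                Code:
                summarize dep_main prop inco rev prop2, detail    // check ranges and implausible values
                misstable summarize dep_main prop inco rev prop2  // see how much is missing, and where
                tab dep_main, missing                             // confirm the outcome is coded as expected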

                Again, you may be fine. But if I were going to submit something that will take weeks to run, I'd do my best to catch any errors beforehand. I wouldn't run something this massive as the first thing I tried. Further, depending on your purposes, a sample may be perfectly fine. Or get it to work with a sample and then let it grind away for two weeks if you want/need all the cases.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam



                • #9
                  I agree with everything Richard Williams has said. I'll just note that my own experience running this kind of multi-level binary outcome model with a data set of this size is quite similar to what you are experiencing: it takes weeks. With only 729 iterations in each week, it follows that the average iteration is taking about 14 minutes--which is a pretty long time. This is natural when the data set is so large and the calculations require reading every observation in the estimation sample in every iteration and then doing a lot of number crunching. But on top of that, the fact that the model still hadn't converged after 729 iterations suggests that it is very difficult to fit this model to the data.

                  You should be able to get good estimates of your parameters with a much smaller sample of the data unless the outcome you are modeling is very rare. (When I have had to run models like this for weeks it is because I am trying to model an outcome that only takes on the 1 value in about 1 out of every 10,000 cases.)

                  Another possible cause of difficulty converging is if the variance of the random intercepts is very close to zero. While you are experimenting with simpler models and samples, I suggest you pay attention to the values of rho you see. If they are very close to zero, it is likely that you can go to a single-level model with -probit- (or -logit-). These converge much more easily and run much more quickly.
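
                  A minimal sketch of that check on a subsample (interaction terms abbreviated; this assumes -xtlogit- leaves the estimate in e(rho)):
                  Code:
                  xtlogit dep_main c.prop##c.inco c.prop##c.rev, vce(cluster sec)
                  display e(rho)       // share of the latent-scale variance attributable to the panel-level intercepts
                  * if rho is essentially zero, a pooled single-level model may be adequate:
                  logit dep_main c.prop##c.inco c.prop##c.rev, vce(cluster sec)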

                  If you do have to stick with an -xt- model and the full sample, you might consider decreasing the -intpoints()- option. The default value is 12. The larger the value of -intpoints()- the longer the computations take. Of course, you pay a price for a smaller -intpoints()- value: your estimates will be less accurate. This is another thing you might consider exploring in smaller samples to see if you can reduce that and not lose too much accuracy.
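
                  For example (interaction terms abbreviated; -quadchk- can then be run to gauge how sensitive the results are to the number of quadrature points):
                  Code:
                  xtlogit dep_main c.prop##c.inco c.prop##c.rev, vce(cluster sec) intpoints(7)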

                  I’ll just mention in passing that if all of your model variables were discrete instead of continuous there is a way to massively speed things up by -contract-ing the data to a single observation for each combination of all the model variables (plus a new variable _freq showing the count of original observations it represents) and then running the model with -fweights-. Doing this can reduce weeks of execution time to minutes! But, alas, it cannot be applied with continuous variables.
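
                  A generic sketch of that trick for an all-discrete model (hypothetical discrete covariates d1 and d2; it does not apply to the continuous variables in this thread):
                  Code:
                  contract dep_main d1 d2                     // one record per combination, with its count in _freq
                  logit dep_main i.d1 i.d2 [fweight=_freq]    // fits on the collapsed data with identical results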



                  • #10
                    Dear Mr. Williams and dear Mr. Schechter,

                    I appreciate your detailed explanation. It is absolutely helpful. Thank you! I will consider all the points you mentioned and try to solve this issue accordingly.

                    Kind regards,
                    Jade



                    • #11
                      -xtlogit- has -iterate- and -trace- options which can be used to diagnose problems in non-linear estimation. It may be that the optimizer is fussing around the same point for hundreds of iterations without announcing convergence. In that case you might legitimately stop it and declare convergence yourself. Or it may be that one coefficient is going off to plus infinity and another to minus infinity (collinearity among explanatory variables), in which case the equation is probably not estimable from the data at hand. See also http://www.nber.org/stata/efficient/non-linear.html
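
                      A hedged sketch of using those diagnostics with the model from #4 (interaction terms abbreviated; both are standard maximize options):
                      Code:
                      xtset panelvariable
                      xtlogit dep_main c.prop##c.inco c.prop##c.rev, vce(cluster sec) iterate(100) trace
                      * -trace- prints the coefficient vector at each iteration, so you can see whether the
                      * optimizer is circling one point or a coefficient is drifting off toward plus or minus infinity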

