  • 'Not concave' iterations in xtlogit regression

    Hello,

    I'm trying to run logistic regressions on panel data where the dependent and most of the independent variables are factor variables.

    This was going well, or so I thought, all of yesterday. I locked my computer for the night, then came back this morning to find that Stata had closed (it must have crashed overnight). I reopened it and tried to continue with the regressions, but I started getting the 'not concave' message next to the iterations. The regressions now take forever and give me different coefficients for the same variables than they did yesterday. I've looked through the data quite extensively and can't find any changes from yesterday (I've been using the same data file since Feb 26th). I don't understand that, but I'm not sure whether it's relevant in the end.

    So for example if I try to regress these two variables:
    tabulate Sigdum

       1 stands |
            for |
      signatory |
    ever, 0 for |
            non |
      signatory |
           ever |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      4,565       66.21       66.21
              1 |      2,330       33.79      100.00
    ------------+-----------------------------------
          Total |      6,895      100.00

    and
    tabulate invtype1

    invsubtype= |
          =Bank |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      5,975       86.66       86.66
              1 |        920       13.34      100.00
    ------------+-----------------------------------
          Total |      6,895      100.00

    collin Sigdum invtype1
    (obs=6895)

      Collinearity Diagnostics

                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
        Sigdum      1.06    1.03    0.9403      0.0597
      invtype1      1.06    1.03    0.9403      0.0597
    ----------------------------------------------------
      Mean VIF      1.06

                               Cond
            Eigenval          Index
    ---------------------------------
        1     1.6991          1.0000
        2     0.9754          1.3198
        3     0.3255          2.2846
    ---------------------------------
     Condition Number         2.2846
     Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
     Det(correlation matrix)    0.9403

    I get this:

    xtlogit Sigdum invtype1

    Fitting comparison model:

    Iteration 0: log likelihood = -4410.3884
    Iteration 1: log likelihood = -4155.0693
    Iteration 2: log likelihood = -4141.9113
    Iteration 3: log likelihood = -4141.7319
    Iteration 4: log likelihood = -4141.7318

    Fitting full model:

    tau = 0.0 log likelihood = -4141.7318
    tau = 0.1 log likelihood = -3859.1785
    tau = 0.2 log likelihood = -3589.5994
    tau = 0.3 log likelihood = -3329.4891
    tau = 0.4 log likelihood = -3075.1791
    tau = 0.5 log likelihood = -2822.4534
    tau = 0.6 log likelihood = -2565.8646
    tau = 0.7 log likelihood = -2297.524
    tau = 0.8 log likelihood = -2004.2833

    Iteration 0: log likelihood = -2297.6501
    Iteration 1: log likelihood = -922.75302 (not concave)
    Iteration 2: log likelihood = -903.64802 (not concave)
    Iteration 3: log likelihood = -880.97104 (not concave)
    Iteration 4: log likelihood = -880.97104 (not concave)
    Iteration 5: log likelihood = -762.70679 (not concave)
    Iteration 6: log likelihood = -685.72911
    Iteration 7: log likelihood = -662.53369
    Iteration 8: log likelihood = -662.13046
    Iteration 9: log likelihood = -662.12966
    Iteration 10: log likelihood = -662.12966

    Random-effects logistic regression              Number of obs      =      6895
    Group variable: AccountName_~m                  Number of groups   =      1379

    Random effects u_i ~ Gaussian                   Obs per group: min =         5
                                                                   avg =       5.0
                                                                   max =         5

                                                    Wald chi2(1)       =    147.23
    Log likelihood  = -662.12966                    Prob > chi2        =    0.0000

    ------------------------------------------------------------------------------
          Sigdum |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        invtype1 |  -7.662794   .6315268   -12.13   0.000    -8.900564   -6.425024
           _cons |  -1.080827   .2690903    -4.02   0.000    -1.608234   -.5534194
    -------------+----------------------------------------------------------------
        /lnsig2u |   3.566572   .0485766                      3.471363     3.66178
    -------------+----------------------------------------------------------------
         sigma_u |   5.949373   .1445001                      5.672793    6.239437
             rho |   .9149573   .0037798                      .9072505    .9220788
    ------------------------------------------------------------------------------
    Likelihood-ratio test of rho=0: chibar2(01) =  6959.20 Prob >= chibar2 = 0.000

    And it takes forever, as mentioned. I've been looking for solutions all day. I'm not very advanced in Stata or statistics in general, but what I read seemed to link this issue to collinearity. Hence I ran the test above to check for collinearity, but the VIF seems fine since it's below 2.5, which I understand is where it starts to become problematic.

    This is the output for the same regression from yesterday, which I had saved in Excel. As you can see, the coefficient is much smaller in magnitude here:
    xtlogit Sigdum invtype1
                                                    Wald chi2(1)       =     254.8
    Log likelihood  = -4141.73                      Prob > chi2        =         0
    ------------------------------------------------------------------------------
          Sigdum |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        invtype1 |   -2.61532   0.163843   -15.96       0     -2.93645     -2.2942
           _cons |   -0.47572   0.026609   -17.88       0     -0.52787    -0.42357
    -------------+----------------------------------------------------------------
        /lnsig2u |   -37.5239    5223377                     -1.02E+07    1.02E+07
    -------------+----------------------------------------------------------------
         sigma_u |   7.11E-09   0.018566                             0           .
             rho |   1.54E-17   8.02E-11                             0           .
    ------------------------------------------------------------------------------
    Likelihood-ratio test of rho=0: chibar2(01) =     0 Prob >= chibar2 = 1

    It may be clearer from this screenshot:

    I have the same problem with virtually any other variable in the dataset, including continuous variables, and also when I run the regression with more than one explanatory variable, keeping Sigdum as the factor dependent variable. I would be very grateful if anyone had an idea what the problem might be. Please ask for extra information if needed; I'm not sure what is most useful or necessary to include in this question.

    Many thanks in advance.

  • #2
    Well, the coefficients aren't the only thing that has changed. Look at sigma_u and rho. In today's data the numbers look very reasonable. In yesterday's, the variance component at the group level is effectively zero. While that is possible, it is unusual and absent other information, I would suspect that today's results are more correct than yesterday's.

    You don't show us the top of yesterday's output, the number of observations and number of groups. Did that change? If so you are not running the same analysis on the same data.

    As for the (not concave) message, as long as that isn't present at the final iteration, it doesn't matter. You can ignore it.
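
    A minimal sketch of how one might confirm that the final fit is usable despite the earlier (not concave) messages. It assumes the data are already -xtset-; the value of 24 integration points is an arbitrary illustration and not a recommendation made in this thread.

    ```stata
    * Refit the random-effects model; e(converged) is 1 if the optimizer converged
    xtlogit Sigdum invtype1, re
    display "converged = " e(converged)

    * Refit with more quadrature points than the default (24 is an arbitrary choice)
    xtlogit Sigdum invtype1, re intpoints(24)

    * Refit with different numbers of quadrature points and compare the estimates
    quadchk
    ```

    Note that -quadchk- refits the model, so on a dataset of this size it will itself be slow.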

    As for it taking a long time to run, that's not surprising: -xtlogit- is quite computationally intensive and in a data set of this size, I would expect it to feel slow on a typical desktop configuration. If it didn't take a long time to run yesterday, then I suspect you were (without realizing it) running it on some relatively small subset of your data. Again, -xtlogit- is computationally intensive, and if it just spat out results for you quickly yesterday then it wasn't doing the job.

    The combination of the strange results for sigma_u and rho, and the fact that you didn't find it slow yesterday, suggests to me that you had dropped some substantial part of your data set when you ran it yesterday. Check the observations and groups in your logs from yesterday.
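
    A short sketch of that check, assuming the same variable names as in the output above. It only displays what Stata currently uses as the panel structure and what the fitted model actually used.

    ```stata
    * Show the current panel and time settings
    xtset

    * Describe the panel structure (number of panels, observations per panel)
    xtdescribe

    * Refit, then display the sample size, number of groups, and grouping variable
    xtlogit Sigdum invtype1, re
    display "obs = " e(N) "   groups = " e(N_g)
    display "group variable = `e(ivar)'"
    ```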



    • #3
      Dear Clyde,

      Thank you so much for your answer. I checked and you are right. The xtset is as follows:
      xtset
      panel variable: AccountName_num (strongly balanced)
      time variable: Datayear, 2007 to 2011
      delta: 1 unit

      Two days ago, the 'quick' logistic regression was running with the time variable as the group variable, so the number of groups was 5. The 'slow' logistic regressions ran with the panel variable as the group variable, making the number of groups 1379, that is, 1379 firms with 5 annual observations per firm. The total number of observations was the same (6895); only the group variable changed. So from what you are saying, this is likely to be the cause? Which should be the group variable in this regression?

      Thank you so much again



      • #4
        (From looking at this http://dss.princeton.edu/training/Panel101.pdf it looks to me like having the panel variable as the group variable is the correct way to do it, but I would be much surer of this if you confirmed it.)



        • #5
          Which should be the group variable in this regression?
          In principle, it depends on what your research question is and what you are trying to estimate. But suffice it to say that except in unusual situations, the grouping variable would normally be the panel variable.
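
          As a sketch of the usual setup, assuming the variable names shown in the -xtset- output in #3 (AccountName_num for firms, Datayear for years):

          ```stata
          * Declare firms as the panel (group) variable and years as the time variable
          xtset AccountName_num Datayear

          * Random-effects logit; the grouping variable is taken from -xtset-
          xtlogit Sigdum invtype1, re
          ```

          The header of the output should then report 1379 groups rather than 5.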



          • #6
            I'm trying to estimate what factors influence whether or not the firms (panel variable) enter a certain group (factor variable that is the dependent variable), over a period of five years (time variable).
            Last edited by Sue Bowers; 13 Mar 2015, 09:13.



            • #7
              Well, it sounds like you would want the firm to be the panel variable here. An additional question is whether time also needs to be explicit in the model in some way. That is, are there time trends influencing this transition, or were there any "shocks" promoting or inhibiting that transition during the years under study? That is a content question that I'm not in any position to comment on.
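
              One possible way to make time explicit, shown only as a sketch: year indicators via factor-variable notation. Whether year effects belong in the model is the content question raised above; the variable names are those from earlier in the thread.

              ```stata
              * Add year indicators so each year gets its own intercept shift
              xtlogit Sigdum invtype1 i.Datayear, re

              * Joint Wald test of the year indicators
              testparm i.Datayear
              ```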



              • #8
                Hi,
                I have the same problem with a log-binomial regression. I am sharing my code and a PNG screenshot. I could not get it to work. Can you help with this?


                • #9
                  Didem:
                  please read the FAQ on how to post a readable query (screenshots are probably the worst way to post your doubts/concerns).
                  That said, whenever a regression model does not converge, the usual recipe is to add one predictor at a time and see when Stata starts choking.
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Thank you Carlo, but can you give an example of the code? My code is:
                    Code:
                    glm bki_kat ogrgrpuhskcok caliskat3 yas_kat if Cinsiyet==2, fam(bin) link(identity)
                    Last edited by didem okmen; 11 Sep 2018, 03:56.



                    • #11
                      Didem:
                      I would go through the following three steps, and at each one I would check whether Stata starts to choke:
                      Code:
                      glm bki_kat ogrgrpuhskcok if Cinsiyet==2, fam(bin) link(identity)
                      Code:
                      glm bki_kat ogrgrpuhskcok caliskat3 if Cinsiyet==2, fam(bin) link(identity)
                      Code:
                      glm bki_kat ogrgrpuhskcok caliskat3 yas_kat if Cinsiyet==2, fam(bin) link(identity)
                      Kind regards,
                      Carlo
                      (Stata 19.0)



                      • #12
                        The last one is choking. The others are working.



                        • #13
                          Didem:
                          - frisk the culprit and search for weird values (e.g., erroneous data entry) or other features that could explain Stata's choking, and deal with them (if feasible);
                          - if the above does not work, consider re-specifying your regression model.
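
                          A minimal sketch of such a check, assuming yas_kat is the culprit (it is the predictor added in the third step) and that the variables are numeric:

                          ```stata
                          * Distribution of the suspect predictor in the estimation sample, with missing values
                          tabulate yas_kat if Cinsiyet==2, missing

                          * Cross-tabulate against the outcome to spot empty or very sparse cells
                          tabulate bki_kat yas_kat if Cinsiyet==2, missing

                          * Range checks for impossible codes
                          summarize bki_kat yas_kat if Cinsiyet==2, detail
                          ```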
                          Kind regards,
                          Carlo
                          (Stata 19.0)



                          • #14
                            Originally posted by Carlo Lazzaro
                            Didem:
                            - frisk the culprit and search for weird values (e.g., erroneous data entry) or other features that could explain Stata's choking, and deal with them (if feasible);
                            - if the above does not work, consider re-specifying your regression model.
                            Didem, related to this, I see what look like manually-created dummies for your variable yas_kat. The label for caliskat3 also makes it look like this might be categorical. The commands as you typed them would have treated these independent variables as continuous, which might not be what you wanted. You want to use Stata's factor variable syntax, e.g.

                            Code:
                            glm bki_kat ogrgrpuhskcok i.caliskat3 i.yas_kat if Cinsiyet==2, fam(bin) link(identity)
                            Your dependent variable should not be coded with factor variable syntax, because telling -glm- to use the binomial family already implies that it is binary. Factor variable syntax applies only to categorical independent variables (binary, ordered categorical, and unordered categorical).

                            If this issue applies to you, then correct it and repeat the exercise. If any variable causes Stata to choke, examine it further for data errors. In my experience, it should be hard to get GLM to choke if there are no data errors.
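
                            For what it is worth, the commands in this thread use link(identity), which fits an identity-link binomial model whose coefficients are risk differences. A log-binomial model, as described in #8, would use link(log) instead. A sketch, assuming the same variables and that the categorical coding above is what is intended; the difficult option is included only as something that sometimes helps with hard-to-maximize likelihoods, not as a guaranteed fix:

                            ```stata
                            * Log-binomial model: binomial family with a log link; eform reports risk ratios
                            glm bki_kat ogrgrpuhskcok i.caliskat3 i.yas_kat if Cinsiyet==2, family(binomial) link(log) eform difficult
                            ```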
                            Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

                            When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



                            • #15
                              Originally posted by Weiwen Ng

                              Didem, related to this, I see what look like manually-created dummies for your variable yas_kat. The label for caliskat3 also makes it look like this might be categorical. The commands as you typed them would have treated these independent variables as continuous, which might not be what you wanted. You want to use Stata's factor variable syntax, e.g.

                              Code:
                              glm bki_kat ogrgrpuhskcok i.caliskat3 i.yas_kat if Cinsiyet==2, fam(bin) link(identity)
                              Your dependent variable should not be coded with factor variable syntax, because telling -glm- to use the binomial family already implies that it is binary. Factor variable syntax applies only to categorical independent variables (binary, ordered categorical, and unordered categorical).

                              If this issue applies to you, then correct it and repeat the exercise. If any variable causes Stata to choke, examine it further for data errors. In my experience, it should be hard to get GLM to choke if there are no data errors.
                              I just reviewed your screenshot, which is a bit hard to read. We typically recommend posting your exact code in code delimiters. This can sound pedantic, but sometimes, little errors can be relevant because machines are very, very literal.

                              The exact commands you typed look more like this:

                              Code:
                              glm bki_kat i.d_mes1 i.d_mes2 i.d_mes3 i.d_mes4 i.d_mes4 i.d_mes5
                              I'm simplifying a bit because I don't want to manually type everything. If d_mes is a categorical with values 1, 2, 3, and 4, and you manually created 4 dummy variables, then the problem is that you included all 4. You can't do that. You have to omit the base category. You are not omitting the base category by specifying something like

                              Code:
                              glm bki_kat ib0.d_mes1 ib0.d_mes2 ib0.d_mes3 ib0.d_mes4 ib0.d_mes4 ib0.d_mes5
                              You should just type:

                              Code:
                              glm bki_kat i.d_mes
                              The above will automatically omit the lowest value of d_mes and treat it as the reference category. If you fail to do this, then I'm not sure if it will cause an infinite iteration log, but it will for sure render your regression meaningless (or you should see a lot of messages indicating that one or more categories were omitted due to collinearity).
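
                              If only the hand-made dummies are available, a sketch of how one could check whether they are exhaustive and then leave one out. This assumes four dummies named d_mes1 to d_mes4 and the binomial/identity specification used earlier in the thread; n_on is a hypothetical variable name introduced here.

                              ```stata
                              * Count how many of the dummies equal 1 in each observation
                              egen byte n_on = rowtotal(d_mes1 d_mes2 d_mes3 d_mes4)

                              * If n_on is 1 everywhere, the dummies sum to the constant and one must be omitted
                              tabulate n_on, missing

                              * Using the dummies directly: leave one out as the base category
                              glm bki_kat d_mes2 d_mes3 d_mes4, family(binomial) link(identity)
                              ```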

                              Also, you typed
                              Code:
                              destring caliskat, generate(caliskat3)
                              You got an error message that the variable was already numeric. Chances are, the values of that variable were assigned labels, which display when you tabulate that variable. The underlying variable is already numeric and you can use it in a regression without modification.
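
                              A quick way to confirm this, sketched under the assumption that the variable is named caliskat as in the command above:

                              ```stata
                              * Storage type and attached value label of the variable
                              describe caliskat

                              * Show the numeric codes behind the labels
                              tabulate caliskat, nolabel

                              * List the value-label definitions in the dataset
                              label list
                              ```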

