
  • error statement r(1400) "initial values not feasible" when running xtlogit

    Hi. I'm using Stata 15.1/MP and am running into an error that I've not seen before. The model is using 324,000 observations with 35 df and 45 clusters.

    Code:
    . xtset statefip
           panel variable:  statefip (unbalanced)
    
    . xtlogit y i.year $xvar $zvar if keep, vce(robust)
    
    Fitting comparison model:
    
    Iteration 0:   log pseudolikelihood = -199588.67  
    Iteration 1:   log pseudolikelihood = -190549.35  
    Iteration 2:   log pseudolikelihood = -190362.87  
    Iteration 3:   log pseudolikelihood = -190362.34  
    Iteration 4:   log pseudolikelihood = -190362.34  
    
    Fitting full model:
    
    tau =  0.0     log pseudolikelihood = -190362.34
    tau =  0.1     log pseudolikelihood = -180501.31
    tau =  0.2     log pseudolikelihood = -180570.24
    
    initial values not feasible
    r(1400); t=98.77 17:21:29
    The r(1400) code is "numerical overflow." Since the model used to run fine, I'm wondering if I screwed up the memory settings somehow.
    Code:
    . memory
    
      Memory usage
                                                used                allocated
        ---------------------------------------------------------------------
        data                             770,302,975            4,362,076,160
        strLs                                      0                        0
        ---------------------------------------------------------------------
        data & strLs                     770,302,975            4,362,076,160
    
        ---------------------------------------------------------------------
        data & strLs                     770,302,975            4,362,076,160
        var. names, %fmts, ...               246,690                  362,858
        overhead                           2,130,424                2,130,712
    
        Stata matrices                             0                        0
        ado-files                            235,518                  235,518
        stored results                        17,277                   17,277
    
        Mata matrices                        136,000                  136,000
        Mata functions                       134,368                  134,368
    
        set maxvar usage                   2,164,426                2,164,426
    
        other                                255,809                  255,809
        ---------------------------------------------------------------------
        grand total                      775,259,271            4,367,513,128
    r; t=0.00 17:31:33
    
    . q memory
    -----------------------------------------------------------------------------------------------------------------------
        Memory settings
          set maxvar           2048       2048-120000; max. vars allowed
          set matsize          800        10-11000; max. # vars in models
          set niceness         2          0-10
          set min_memory       4g         0-3200g
          set max_memory       .          4g-3200g or .
          set segmentsize      64m        1m-32g
    r; t=0.00 17:31:45
    The model runs fine as -xtreg-. Any thoughts? Thank you!

  • #2
    No, this has nothing to do with memory. The overflow referred to is numeric overflow. When Stata is trying to calculate the log-likelihood, it has to add up the values of the log-likelihood in each observation of the data set. And apparently someplace before it reaches the end of your 324,000 observation data set the resulting number is too large to fit in a double-precision floating point number. So the calculation is abandoned.

    Given the large size of your data set, but a small N, I believe that you don't have to worry about the incidental parameters problem, so try unconditional fixed-effects logistic regression:
    Code:
    logit y i.year $xvar $zvar i.statefip if keep, vce(robust)
    Or a fixed-effects linear probability model:
    Code:
    xtreg y $xvar $zvar if keep, vce(robust)



    • #3
      Originally posted by Clyde Schechter
      No, this has nothing to do with memory. The overflow referred to is numeric overflow. When Stata is trying to calculate the log-likelihood, it has to add up the values of the log-likelihood in each observation of the data set. And apparently someplace before it reaches the end of your 324,000 observation data set the resulting number is too large to fit in a double-precision floating point number. So the calculation is abandoned.

      Given the large size of your data set, but a small N, I believe that you don't have to worry about the incidental parameters problem, so try unconditional fixed-effects logistic regression:
      Thanks, Clyde. What do you mean by small N? I'm fairly certain (although the memory of my brain is declining with age!) that I've run this before and not received such an error.
      Also, did you mean to leave out i.year when you suggested
      Code:
      xtreg y $xvar $zvar if keep, vce(robust)



      • #4
        You have, you said, 45 clusters. That's relatively small. In any case, the T must, on average then, be huge. If T >> N, the incidental parameters problem is not an issue.

        However, I'm just noticing that you are running a random-effects -xtlogit-, not fixed effects. So the incidental parameters problem is not a consideration in any case! The error message is still an issue of numerical overflow. But going to the unconditional fixed-effects logit is a major change in the model and probably not what you should do. I would go for the linear probability model instead. And my omission of i.year was a mistake. You should keep i.year in the model.
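
        For concreteness, that is the #2 command with i.year restored, something like:
        Code:
        xtreg y i.year $xvar $zvar if keep, vce(robust)
        * (add the -fe- option here if you want the within-state, fixed-effects version)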



        • #5
          Originally posted by Clyde Schechter
          You have, you said, 45 clusters. That's relatively small. In any case, the T must, on average then, be huge. If T >> N, the incidental parameters problem is not an issue.

          However, I'm just noticing that you are running a random-effects -xtlogit-, not fixed effects. So the incidental parameters problem is not a consideration in any case! The error message is still an issue of numerical overflow. But going to the unconditional fixed-effects logit is a major change in the model and probably not what you should do. I would go for the linear probability model instead. And my omission of i.year was a mistake. You should keep i.year in the model.
          Running -xtlogit- with the -fe- option takes forever even if I use just 5% of the total observations (and Stata is using four cores of a computer with 32g of RAM).

          I did realize that my original model does run if I sample 20% of the observations:
          Code:
          gen rand = runiform()
          egen sample5 = cut(rand), group(5)    // five equal-frequency groups, coded 0-4
          xtlogit y i.year $xvar $zvar if keep & sample5==1, vce(robust)
          Is it possible to do something with the results of multiple such models? Or is that crazy?



          • #6
            Do you need a fixed-effects or a random-effects model? This is not a matter of convenience: it is a question of which one is a reasonable model for your real-world data-generating process. The code you show in #1 is a random-effects model. Admittedly, I confused matters in #2 by talking about the incidental parameters problem, which affects -fe-, but not -re-, models. If you need a random-effects model, running -xtlogit, fe- is not a substitute for it, nor vice versa.

            The code you show in #5 at the bottom is a random effects logistic regression on 20% of your sample. Yet you talk about running -xtlogit, fe- in the first sentence there. So I don't know which model you are trying to do.

            In any case, paring down the sample size is another reasonable solution to the problem of numerical overflow. There is no way I know of to put together the results of five analyses on five subsamples. But with a total of 324,000 observations, unless you are trying to detect some very minuscule effect, you should be just fine with a single 20% sample. That is, if you select it carefully. The code you show in #5 selects 20% of all the observations. Depending on what you are doing, it might be better to sample all observations from 20% of the clusters. Since you have not explained your context, I can't advise you, but there are situations where the latter approach would be far preferable. I cannot tell if yours is one of those.
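
            If sampling whole states does fit your situation, here is a sketch of one way to do it (assuming, as in #1, that statefip identifies the clusters; the seed value is arbitrary):
            Code:
            set seed 12345                        // arbitrary seed, for reproducibility
            preserve
            keep statefip
            duplicates drop                       // one row per state
            sample 20                             // keep a random 20% of the states
            tempfile picked
            save `picked'
            restore
            merge m:1 statefip using `picked', keep(match) nogenerate
            xtlogit y i.year $xvar $zvar if keep, vce(robust)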



            • #7
              I guess I can get confused by what people (or Stata) mean by random effects and fixed effects. See the Gelman and Hill comments on the many ways people use these terms (quoted on https://stats.stackexchange.com/ques...ed-effect-mode ).

              In my case, the observations in the model are individuals from a US Census survey, and the x variables are measured at the individual level from that survey. My data include state-level policies and the population targeted by each policy (and that population squared) for all relevant states in the US (the z variables in the macro in the code), plus the seven years and the 45 states included. If I use -logit- or -reg- (the outcome is binary with a mean of about 0.6), I include years and states as indicators (what some people call fixed effects) and I cluster the errors on states. If I use -xtreg-, I use the -fe- option and remove states from the command, since they are handled by the cluster() and i() options. Is that right? If so, it sounds as if it is impossible for Stata to run -xtlogit- with the -fe- option when the data are this large. Even at 20% of the full sample, it seems it would have to run overnight (which might work, but then running -margins- after that might take several hours more). If the model can be a random-effects model, then I could play with the sample size. So, would you say this is a fixed-effects model? Thanks again for your excellent help.
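
              Concretely, the two versions I run look something like this:
              Code:
              logit y i.year i.statefip $xvar $zvar if keep, vce(cluster statefip)
              xtreg  y i.year $xvar $zvar if keep, fe vce(cluster statefip)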



              • #8
                Yes, the terms fixed and random effects are quite overloaded, especially fixed effects. What's most relevant here is the difference between the -fe- and -re- -xtlogit- models. While you say that you use the -fe- option when you use -xtreg-, all of the -xtlogit- code you show here specifies neither -fe- nor -re-, and so, by default, is -re-. There is a very important difference between the -fe- and -re- models. The -fe- models estimate only within-state effects of the variables. The -re- models are based on the assumption that the within- and between-state effects are the same, and they estimate that common effect. If those effects are not, in fact, equal, the coefficients you get are a weighted average of the within and between effects, with weights that are not readily discernible, and so these coefficients are hard to interpret.
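
                One way to see whether that assumption is plausible in your data is a Mundlak-style decomposition: split one of your policy variables into its state mean and the deviation from that mean, and let each have its own coefficient. A sketch only, with z1 standing in for one of the $zvar variables:
                Code:
                bysort statefip: egen double z1_between = mean(z1)   // between-state component
                gen double z1_within = z1 - z1_between               // within-state deviation
                xtlogit y i.year z1_within z1_between $xvar if keep, re vce(robust)
                If the coefficients on z1_within and z1_between differ materially, the pooling assumption behind -re- is doubtful.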

                If you are looking to see the impact of policies adopted by states on outcomes measured at the individual level, then you most likely want to measure within-state effects and should use -fe- estimators. If you have both pre- and post-policy implementation data, then you should be doing a difference-in-differences analysis, and the effect you want to see is a within-state causative effect. So you should be doing that with an -fe- model, not -re-.
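
                In its simplest two-way fixed-effects form, that might look like the sketch below; policy_on is a hypothetical indicator you would construct from the implementation dates, equal to 1 from a state's implementation onward and 0 otherwise:
                Code:
                * policy_on is hypothetical: 1 once the state has implemented the policy, else 0
                xtreg y i.year policy_on $xvar if keep, fe vce(cluster statefip)
                The coefficient on policy_on is then the difference-in-differences estimate of the policy effect.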

                If, however, you don't have pre- and post-policy data, then you are reduced to comparing outcomes in states with policy implemented to states without policy implemented--this is a weak design, and running a gargantuan sample does not salvage that. If you are lucky, you have the right covariates included to identify a causal effect--but you never know if that's happened or not. Anyway, in that case you want a between-states comparison and you must use the -re- models to get at that.

                Now, multi-level logistic models (whether -fe- or -re-) can be very slow to run in large samples. I have done some that have taken several weeks to finish. So I'm not surprised at what you're seeing. -xtreg- is usually much faster, and with a mean outcome of about 0.6, the linear probability model seems like a good alternative. Another good alternative is to just pull a random sample and run that. Honestly, if the effect you are looking for is so small that you can't detect it with a sample size of 324,000/5, it's too small to be of any practical importance anyway, and cranking up the sample size all the way is just going to divert you into saying nonsensical things about "statistically significant" findings that are meaningless in real-world terms. If it's large enough to matter from a policy perspective, you can surely find it in the smaller sample. And running overnight to get the results shouldn't be a real problem, no? I mean, policy analysis like this is usually not an emergency.

                If you are running -xtlogit, fe- on the reduced sample and get results overnight, the next question is, what are you going to do with -margins-. There is nothing really important that you can get from it. You cannot get meaningful predicted margins after -xtlogit, fe- and should not try to do so. You can get meaningful marginal effects in different states, although if you are using difference-in-differences the only truly important effect is the causal effect estimate, which is the coefficient of the treatment#time interaction variable in the regression model. The state-specific marginal effects are of secondary importance, and are not good causal estimates anyway. So my point is that in the -fe- regime, you have little reason to run -margins- in the first place, so I wouldn't worry about how long it might take. Just skip it.

                If you are doing a between-states comparison of policy states vs non-policy states, then you have to go to -re- models. The considerations about -margins- are different there: the predictive margins can be validly computed and may be of some interest. The state-specific marginal effects are still not causal effects (unless you got the covariates in the model just right), but your design doesn't really give you causal effects anyway, so they're no less interesting than the overall policy effect.
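
                For example, a sketch of predictive margins by year after the -re- model (pu0 is the predicted probability with the state-level random effect set to zero):
                Code:
                xtlogit y i.year $xvar $zvar if keep, re vce(robust)
                margins year, predict(pu0)       // predictive margins by year, with u_i = 0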



                • #9
                  Thank you, Clyde Schechter. This confirms some thoughts I had and greatly clarifies others. I've toyed with a d-i-d model, but I've assumed that d-i-d models require state-aggregated data, and I prefer to work with individual-level data. I do have pre- and post-policy implementation data, as well as data from states that did not implement the policy over the seven years. However, my dependent variable is biennial (e.g., registered to vote, from a November Census survey in even-numbered years), and the policies were implemented in 2016-2020 in varying months. I assume that for a d-i-d model I need states that implemented the policy at close to the same time. Using only a couple of states in the pre- and post-policy group seems unwise given how much of what influences registration is state-specific. I may try to find a way to match untreated states to those, but I'm skeptical of matching methods for this.

                  As an aside: Using large cross-sectional time-series data is possible in policy analysis in so many important areas now, that more policy analysts need training in it. Sadly, many (or even most) of us haven't the math background or experience with how economists talk about data (they really have their own obsession with vocab and symbols) to work through some of the standard texts. We need a textbook from the perspective of--and written from questions by--researchers facing practical problems with real data.



                  • #10
                    I am analyzing US county-level migration data from 1994-2020. I have a perfectly balanced panel with 70,800 observations and 2,950 panels. When I attempted to run -spxtregress-, I got the following error message: "initial values not feasible unconcentrated optimization fails".
                    Below is my code:
                    Code:
                    spxtregress netmigpeople_pctchange pctforest ln_incomepercapita ln_unemployment ln_popdens pct65andover ln_nh_black ln_hispanic ln_other agriculture mining manufacturing government irschange, fe force

                    The Stata manual does not offer much help on how to address this problem. Could anyone please help? I have attached my data.
                    Thanks a lot
                    Ephraim



                    • #11
                      Here is the file that didn't load successfully.
                      Ephraim
