  • Generalized Linear Models (GLM) versus OLS

    Dear all,
    I have a proportion (ratio between 0 and 1) as a dependent variable in my regression. This variable has lots of zeros in the distribution. I have two general questions about it:

    1. When I use a GLM model, the AIC/BIC criteria point to family gaussian with link identity. However, I cannot run the GLM with a logit or probit family, because the function never hits the higher/lower points (concave shape). Is this a problem related to the number of observations? My data cover 1998-2010, with only 1,127 observations.

    2. How close is the GLM with the gaussian/identity choice to an OLS regression? Since the gaussian GLM assumes a linear relation between y and x, should I keep the OLS specification?

    Thanks for all.

  • #2
    GLM model, family Gaussian and link identity sounds like OLS to me.

    Logit, whether done directly or using GLM, is not fazed by exact zeros. That applies to proportions too.

    I can't follow your statement about not using logit or probit.
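
    To make the equivalence concrete, here is a minimal sketch using Stata's bundled auto dataset. The dataset and the variables price, mpg, and weight are illustrative choices, not the original poster's data. The point estimates from the two commands should agree; glm reports z statistics where regress reports t statistics.

    ```stata
    * Illustrative data only: any continuous outcome would do
    sysuse auto, clear

    * Ordinary least squares
    regress price mpg weight
    scalar b_ols = _b[mpg]

    * The same linear model fitted as a GLM
    glm price mpg weight, family(gaussian) link(identity)
    scalar b_glm = _b[mpg]

    * Relative difference between the two mpg coefficients
    display "relative difference: " reldif(b_ols, b_glm)
    ```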


    • #3
      Thank you, Mr. Cox, for your help. Actually, when I said that I could not use probit or logit, it is because when I choose family(logit) or family(probit), the software gives me the message:
      data not suitable for nonstandard family-link combination
      So, that is why I chose the gaussian/identity combination in glm. However, I would like to understand if a glm model with these characteristics (gaussian and identity) really improves the results compared to an OLS model.


      • #4
        Your descriptions are not clear. You should show the actual code you tried to run that is producing these messages, and then perhaps somebody can troubleshoot it.

        That said, there may be a clue here where you say:
        I cannot run the GLM with logit or probit family
        -logit- and -probit- are not families. They are links. I cannot reproduce the particular error message you are getting, but when I try to run -glm- specifying -logit- or -probit- in the -family()- option, it does halt with an error message (though not the one you show). These have to be specified in the -link- option.


        • #5
          Thank you, Mr. Schechter, for your answer.
          I ran the following code:
          glm y x controls i.year if subsample==1, link(logit) cluster(country) rob
          Y: a proportion, with values between 0 and 1. There are lots of zeros.
          X: explanatory variables, not proportions. Most of them are continuous; one of them is a dummy.
          When I run the code above, the message is:

          cannot compute an improvement -- discontinuous region encountered
          data not suitable for nonstandard family-link combination
          What is going on?


          • #6
            Neyla: In addition to Nick's and Clyde's comments I'd perhaps add that you appear to be trying to estimate a so-called fractional regression model. In this case the glm command
            Code:
            glm y x, family(binomial) link(probit) robust
            should give results identical to those given by the fracreg command
            Code:
            fracreg probit y x
            with analogous commands for logit links.

            As Nick points out, having zeros among your dependent variable's values is not a problem. The key thing to keep in mind is that in this glm/fracreg framework you are estimating only the conditional mean of the fractional outcome, not any features of its conditional probability structure.
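
            A minimal sketch of that equivalence on simulated data follows. The variable names, seed, and data-generating process are assumptions made purely for illustration, and fracreg requires Stata 14 or later.

            ```stata
            * Simulate a fractional outcome in [0, 1] with a spike of exact zeros
            clear
            set seed 12345
            set obs 500
            generate x = rnormal()
            generate y = invlogit(-1 + 0.5*x + rnormal())
            replace y = 0 if runiform() < 0.3

            * Fractional logit via glm, with a robust VCE because y is not binary
            glm y x, family(binomial) link(logit) vce(robust)
            scalar b_glm = _b[x]

            * The same conditional-mean model via fracreg (robust VCE by default)
            fracreg logit y x
            scalar b_frac = _b[x]

            display "glm: " b_glm "   fracreg: " b_frac
            ```

            Both commands accept the exact zeros without complaint, which is the point Nick made in #2.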


            • #7
              Thanks, Mr. Mullahy. The glm code I posted above using the link(logit) has family(binomial) as default (standard choice). That is why I did not type it.
              I did not use fracreg, because my version of Stata is not the most recent one, so I do not have access to this update.
              Also, I still do not know why the code does not fit.


              • #8
                Actually it appears that if you don't specify family, you will get Gaussian, not binomial. So at least try specifying binomial and see what happens.

                But my hunch is that this is not the main problem you are encountering. Are you able to display the results of
                Code:
                sum y, d
                sum x controls i.year if y==0
                sum x controls i.year if y>0
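
                One way to check which family glm actually used is to read the header of its output, which reports the variance function and the link function. The sketch below uses simulated data (variable names and seed are hypothetical). The first command is wrapped in capture noisily because, with no family() given, it may or may not converge on any particular dataset.

                ```stata
                * Simulated fractional outcome, for illustration only
                clear
                set seed 2468
                set obs 500
                generate x = rnormal()
                generate y = invlogit(-1 + 0.5*x + rnormal())

                * No family(): glm uses family(gaussian), paired here with a logit link
                capture noisily glm y x, link(logit)

                * Family stated explicitly: binomial variance function, logit link
                glm y x, family(binomial) link(logit) vce(robust)
                ```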


                • #9
                  Re #5: I have to say I'm stumped. You have not specified a non-standard family-link combination. The other message about a discontinuous region encountered often arises in any model that is estimated by maximum likelihood. Sometimes the likelihood function is just difficult to work with: it may contain local minima, ridges, or even discontinuities. When that happens, algorithms trying to find the maximum may have difficulty moving out of those regions. There isn't always a solution to this problem. Among the approaches that sometimes help:

                  1. Use the -difficult- option, or try a different technique (see -help maximize- and go to the technique() option to see the choices available).

                  2. Another approach is to simplify the model. Usually the first step here is to set the -iterate()- option to a number near the point where Stata is encountering difficulties. Stata will then carry out that number of iterations, then stop and show interim results. These interim results are not valid estimates of your model. But you may be able to see that some particular variable (or set of variables) is the source of the problem: the coefficients or standard errors are blatantly ridiculous, or are missing values. Removing those variables from the model usually gets you back to a well-behaved model, though perhaps one that is lacking some variables you think are important.

                  3. Sometimes approach #2 doesn't clearly identify a culprit variable. In that case, another approach to simplifying the model is to just start with a single predictor and run the model. Then add another and run again. Keep going until you find the variable that causes the estimation to fail.
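
                  The three approaches above can be sketched as follows. The simulated data, the variable names x1 and x2, the seed, and the iteration cap of 3 are all assumptions for illustration; this well-behaved example will not reproduce the original error. Commands that may stop early are wrapped in capture noisily so that a do-file keeps running if Stata exits with a convergence error.

                  ```stata
                  * Simulated fractional outcome with two predictors
                  clear
                  set seed 2024
                  set obs 500
                  generate x1 = rnormal()
                  generate x2 = rnormal()
                  generate y = invlogit(-1 + 0.5*x1 - 0.3*x2 + rnormal())

                  * Approach 1: the difficult option, or a different technique
                  glm y x1 x2, family(binomial) link(logit) vce(robust) difficult
                  glm y x1 x2, family(binomial) link(logit) vce(robust) technique(bfgs)

                  * Approach 2: cap the iterations and inspect the interim results
                  * (interim results are not valid estimates of the model)
                  capture noisily glm y x1 x2, family(binomial) link(logit) vce(robust) iterate(3)

                  * Approach 3: add one predictor at a time until estimation fails
                  local rhs
                  foreach v of varlist x1 x2 {
                      local rhs `rhs' `v'
                      display as text "Predictors: `rhs'"
                      capture noisily glm y `rhs', family(binomial) link(logit) vce(robust)
                  }
                  ```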
