  • GLM Model (log link, poisson family) predicting values > 1 with binary outcome but only with particular independent variable

    Hello,

    Thanks in advance for reviewing my question. I am trying to run a glm model (below) with a restricted cubic spline, but it is predicting values >1 for my binary outcome. I am running the exact same model with other independent continuous spline variables and am having no issues; it is only with this particular variable that I am having difficulties. I thought it was because the values are small (between 0 and 0.0013), so I multiplied by 10k, but the issue persists. I attached a graph of the predicted probabilities and a sample of the data. I cannot think of a reason why this is happening, so any help is appreciated! e2sfca30km is the variable giving me difficulties, and kmspline1-6 is the restricted cubic spline from e2sfca30km. I have tried this model with 3-7 splines.

    E2sfca30km is a measure at the census block group level, so there are only about 9k distinct values for the 580k individuals in the dataset. 1.5% of the values equal 0. Skewness = 0.555, kurtosis = 4.17.

    Code:
    glm apncu_cat2 kmspline*, fam(poisson) link(log) vce(robust)
    Code:
    clear
    input float(apncu_cat2 kmspline1 kmspline2 kmspline3 kmspline4 kmspline5 kmspline6 e2sfca30km)
    0 .00028818857  .00004457869  5.113746e-06  1.444405e-07             0             0 .00028818857
    0 .00026230657 .000033261917  2.703544e-06  7.405989e-09             0             0 .00026230657
    0 .00023844297  .00002469184  1.291617e-06             0             0             0 .00023844297
    0  .0004159302   .0001384533  .00003760846  9.969267e-06 2.1287353e-06 1.2599354e-07  .0004159302
    0  .0004119772  .00013444884 .000035937806  9.285647e-06  1.888368e-06  9.170773e-08  .0004119772
    1 .00028010557  .00004080611  4.250433e-06   7.49899e-08             0             0 .00028010557
    0 .00027958507  .00004057076  4.198444e-06   7.15074e-08             0             0 .00027958507
    0 .00025103986 .000029003764 1.9509584e-06 1.3559164e-10             0             0 .00025103986
    0 .00028010557  .00004080611  4.250433e-06   7.49899e-08             0             0 .00028010557
    0  .0002661811 .000034817243  3.001255e-06  1.457813e-08             0             0  .0002661811
    0 .00026700884  .00003515568  3.067567e-06 1.6548881e-08             0             0 .00026700884
    0  .0004536351   .0001806384   .0000561545  .00001819453  5.471376e-06  9.093207e-07  .0004536351
    1  .0005572831   .0003282382  .00012796781  .00005453524  .00002325828  7.007693e-06  .0005572831
    0  .0003217929 .000062755695  9.952177e-06  8.650636e-07  6.584936e-10             0  .0003217929
    0 .00026771924 .000035447887 3.1252505e-06 1.8375584e-08             0             0 .00026771924
    0 .00026278905 .000033453016  2.739492e-06 8.1290095e-09             0             0 .00026278905
    1    .00039412   .0001173194   .0000290054  6.584881e-06  1.025755e-06 1.1155322e-08    .00039412
    0  .0003506344  .00008182417 .000015973721  2.301527e-06  9.386544e-08             0  .0003506344
    0  .0002377031  .00002445282  1.258366e-06             0             0             0  .0002377031
    0 .00026774168 .000035457142  3.127084e-06  1.843537e-08             0             0 .00026774168
    0  .0002678029 .000035482408  3.132092e-06 1.8599193e-08             0             0  .0002678029
    0  .0005172913  .00026659502  .00009716836  .00003847566 .000015118503 4.0678246e-06  .0005172913
    0  .0004609424  .00018962205  .00006028068    .000020139  6.338406e-06 1.1614068e-06  .0004609424
    1  .0001860779 .000011311463  7.507855e-08             0             0             0  .0001860779
    1 .00023370735  .00002318857 1.0886133e-06             0             0             0 .00023370735
    1 .00018509098 .000011122442  6.856325e-08             0             0             0 .00018509098
    1  .0004410896  .00016580876  .00004946164 .000015114076  4.144417e-06  5.508027e-07  .0004410896
    1 .00040353995  .00012616147 .000032538937  7.932543e-06  1.437277e-06  4.051566e-08 .00040353995
    1  .0007652474   .0006941111   .0003179769  .00015772582  .00007791635 .000027962524  .0007652474
    0 .00026773266  .00003545342  3.126347e-06 1.8411317e-08             0             0 .00026773266
    end
    label values apncu_cat2 apncu2
    label def apncu2 0 "Inadequate or Intermediate", modify
    label def apncu2 1 "Adequate or Adequate Plus", modify
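
    For reference, spline terms like kmspline1-6 are typically generated with mkspline; a sketch (the knot count here is an assumption — nknots(7) yields six spline terms, matching kmspline1-6):

    Code:
    * restricted cubic spline of e2sfca30km; nknots(7) produces kmspline1-kmspline6
    mkspline kmspline = e2sfca30km, cubic nknots(7)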

    [Attached image: e2sfca30km issue.png — graph of the predicted probabilities]

  • #2
    What you're doing is more simply known as Poisson regression (see below for the numerical equivalence of the two commands). Two points about Poisson regression:
    1) It is not really meant for binary variables. The computer will not break if you fit it, but for binary variables logit and probit are more appropriate.
    2) There is nothing in Poisson regression to restrict the predictions to between 0 and 1, so I do not see anything unusual in what you are reporting. I do not see a problem here at all.

    Code:
    . glm apncu_cat2 kmspline*, fam(poisson) link(log) vce(robust) nolog
    
    Generalized linear models                         No. of obs      =         30
    Optimization     : ML                             Residual df     =         24
                                                      Scale parameter =          1
    Deviance         =  14.28542523                   (1/df) Deviance =   .5952261
    Pearson          =  24.01640992                   (1/df) Pearson  =   1.000684
    
    Variance function: V(u) = u                       [Poisson]
    Link function    : g(u) = ln(u)                   [Log]
    
                                                      AIC             =   1.476181
    Log pseudolikelihood = -16.14271262               BIC             =  -67.34331
    
    ------------------------------------------------------------------------------
                 |               Robust
      apncu_cat2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       kmspline1 |    1585174    1949363     0.81   0.416     -2235506     5405855
       kmspline2 |   -8458836   1.01e+07    -0.83   0.404    -2.83e+07    1.14e+07
       kmspline3 |   2.37e+07   2.79e+07     0.85   0.397    -3.11e+07    7.84e+07
       kmspline4 |  -1.84e+07   2.19e+07    -0.84   0.401    -6.14e+07    2.45e+07
       kmspline5 |    2302166    7076294     0.33   0.745    -1.16e+07    1.62e+07
       kmspline6 |    1998675    4742274     0.42   0.673     -7296011    1.13e+07
           _cons |       -201   249.9726    -0.80   0.421    -690.9374    288.9373
    ------------------------------------------------------------------------------
    
    . poisson apncu_cat2 kmspline*, vce(robust) nolog
    
    Poisson regression                              Number of obs     =         30
                                                    Wald chi2(5)      =          .
                                                    Prob > chi2       =          .
    Log pseudolikelihood = -16.142713               Pseudo R2         =     0.1862
    
    ------------------------------------------------------------------------------
                 |               Robust
      apncu_cat2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       kmspline1 |    1585174    1949363     0.81   0.416     -2235507     5405855
       kmspline2 |   -8458836   1.01e+07    -0.83   0.404    -2.83e+07    1.14e+07
       kmspline3 |   2.37e+07   2.79e+07     0.85   0.397    -3.11e+07    7.84e+07
       kmspline4 |  -1.84e+07   2.19e+07    -0.84   0.401    -6.14e+07    2.45e+07
       kmspline5 |    2302166    7076294     0.33   0.745    -1.16e+07    1.62e+07
       kmspline6 |    1998675    4742274     0.42   0.673     -7296011    1.13e+07
           _cons |       -201   249.9726    -0.80   0.421    -690.9374    288.9373
    ------------------------------------------------------------------------------
    
    .
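    To see why the predictions can exceed 1: with a log link the fitted mean is exp(xb), which is bounded below by 0 but not bounded above, so any observation with a positive linear predictor gets a fitted value greater than 1. A quick check:

    Code:
    * with a log link, the fitted mean is exp(xb); it exceeds 1 whenever xb > 0
    display exp(0)     // = 1
    display exp(.5)    // > 1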

    Comment


    • #3
      The only problem I see here is that in principle you should not be using the Poisson regression model. You should use a model designed for a binary outcome, such as probit or logit.
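
      A minimal sketch of that alternative, reusing the variable names from post #1 (phat_logit is a made-up name for the prediction):

      Code:
      * logit keeps predicted probabilities strictly inside (0,1)
      logit apncu_cat2 kmspline*, vce(robust)
      predict phat_logit, pr
      summarize phat_logit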

      Comment


      • #4
        I have used it previously to model relative risks with common outcomes, per https://stats.idre.ucla.edu/stata/fa...ohort-studies/ and Zou G. A Modified Poisson Regression Approach to Prospective Studies with Binary Data. Am J Epidemiol 2004; 159(7):702-6; the outcome in question occurs in 71.2% of the dataset. I have never had this issue with any other variable, so I was just trying to understand why changing the independent predictor would have this substantial an impact on the results of the model.
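
        For context, the modified Poisson approach of Zou (2004) is commonly run with robust standard errors and the exponentiated coefficients reported as risk ratios (a sketch using the thread's variable names):

        Code:
        * modified Poisson for binary outcomes: irr reports exp(b) as risk ratios
        poisson apncu_cat2 kmspline*, vce(robust) irr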

        Comment


        • #5
          I guess my bigger question is: what characteristics of a continuous variable would lead to the model not working as it has previously and as it is supposed to?

          Comment


          • #6
            You keep saying that Poisson regression is "supposed to" give you predictions between 0 and 1, and no, it is not supposed to do that. It is a different model; it is not designed to obey the 0 to 1 bounds.

            Otherwise you can do whatever you want; it is a free world. You can fit a Poisson regression to your binary data, or you can fit a linear probability model, and the predictions are pretty close:

            Code:
            . qui poisson apncu_cat2 kmspline*, vce(robust) nolog
            
            . predict yhatthepoison
            (option n assumed; predicted number of events)
            
            . qui reg apncu_cat2 kmspline*, vce(robust) noheader
            
            . predict yhattheremedy
            (option xb assumed; fitted values)
            
            . summ yhatthepoison yhattheremedy
            
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
            yhatthepoi~n |         30          .3    .2942041   .0549956   1.086324
            yhattherem~y |         30          .3    .2928146   .0281442   1.077846
            
            . pwcorr yhatthepoison yhattheremedy, sig
            
                         | yhatth~n yhatth~y
            -------------+------------------
            yhatthepoi~n |   1.0000 
                         |
                         |
            yhattherem~y |   0.9953   1.0000 
                         |   0.0000
            The Poisson and the Linear Probability model give you virtually the same predictions.

            As to what characteristics of a regressor can take you out of the range you feel should be obeyed, like 0 to 1, you said it yourself: continuous. Continuous regressors have a tendency to take you out of the range of your actual dependent variable. I have not really thought this through, because I do not think it is an interesting question, but I would guess that the wider the range of your continuous regressor, the more likely it is to take you out of the range of your dependent variable.

            Binary regressors, by contrast, tend not to take you out of the range of your dependent variable.

            Originally posted by sarah minion View Post
            I guess my bigger question is what characteristics of a continuous variable would lead to the model not working like it has previously and is supposed to?

            Comment


            • #7
              Ok, thank you for your responses

              Comment


              • #8
                Regarding the predictors, I conflated the continuity of the regressor with its boundedness. What I wanted to say is:
                1) if you run a regression of a 0/1 variable on predictors that are also bounded between 0 and 1, whether binary or continuous with a limited range, in my experience it is unlikely for the predictions to fall outside the range of the dependent variable, and when they do, they fall outside only by a little, as in your example.
                2) if you run a regression of a 0/1 variable on a predictor with a huge range well outside 0 to 1, then the predictions are very likely to fall outside 0 to 1, and by a lot.

                Continuity of the regressors might have something to do with it too, but I think the main factor is the range of the regressors being much larger than the range of the outcome.
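
                A toy simulation illustrates the point (all names and parameter values here are made up for illustration): a linear probability model with a wide-range continuous regressor produces many fitted values outside [0,1].

                Code:
                clear
                set seed 12345
                set obs 200
                gen x = rnormal(0, 10)               // wide-range continuous regressor
                gen y = runiform() < invlogit(.3*x)  // binary outcome related to x
                regress y x
                predict yhat
                count if yhat < 0 | yhat > 1         // counts out-of-range fitted values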

                Comment
