Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Standard error of predicted probabilities

    Using Stata 13.1 under Windows 7E. I know this has been covered many times here and at Stata, but I can't find anything that applies directly to my situation. I need to compute standard errors of probabilties. The module I'm using utilizes glm, but it does not support prediction like predict or predcitnl (returns an error). I can get regression coefficients via r(table) and the covariance matrix in e(V), but I can't quite figure out how to assemble everything. Not sure how to recover the design matrix. Can anyone point me in the right direction?

  • #2
    "The module I'm using" is left unnamed as well as unsourced. In a previous post at http://www.statalist.org/forums/foru...-probabilities you named apc_ie but did not explain where it comes from.

    It really is hard to comment given such little information. To echo an active member in another thread, why should we have to guess? There are probably of the order of a few thousand user-written programs, and it really is important to be specific.

    Comment


    • #3
      Yes, it's apc_ie (http://econpapers.repec.org/software...de/s456754.htm), but the most germane point here is that it is based on glm, as I mentioned. I can get the probabilities for each age, period and cohort category using the familiar p=e^z/(1+e^z) where z is the sum of the constant and appropriate covariate. However, getting the confidence interval for that probability is more challenging. All of the programs/commands I've found depend on prediction being available, which it's not (at least I get an error when I try saying it cannot find the first variable name in the first column of r(table)). Now, this evening I came across this post that says there is an undocumented way to obtain the predictions (http://www.stata.com/statalist/archi.../msg00029.html), so I'll try this and see what happens. If that doesn't work, I'll need guidance on how to assemble CIs from the constituent components.

      Comment


      • #4
        If you know the formula for the prediction based on xb calculations, you should be able to use
        margins with the expression() option to get a confidence interval to go with your prediction.

        Here is a simplified example use a dataset referenced in the help file for glm:

        Code:
        webuse lbw
        
        * fit a logit model via -glm- with a single predictor
        glm low age, family(binomial) link(logit)
        
        * predicted mean for binomial/logit is a probability
        predict double pr, mu
        
        * now use xb to reproduce the previous predicted probability
        predict double xb, eta
        gen double my_pr = 1/(1+exp(-xb))
        
        * verify they are equivalent
        assert reldif(my_pr, pr) < 1e-9
        
        * compute predicted probability at age=20
        margins, predict(mu) at(age=20)
        
        * reconstruct the same using an expression of xb
        margins, expression(1/(1+exp(-xb()))) at(age=20)

        Comment


        • #5
          I took a closer look at the thread I reference above and it turns out that prediction after estimation using apc_ie does not work. I confirmed this with Jeff's code which yielded:

          . predict double pr, mu
          option mu not allowed
          r(198);

          So, I'm going to need another option. Maybe this is the way: http://stats.stackexchange.com/quest...logistic-regre. The r(table) is organized with the age, period and cohort coefficients first and the intercept last. Same with the covariance matrix. The key here is to obtain .

          So it seems for the first age group I could




          (and its column vector) would shift to [0 -2.5 0 ... 0 1] for the next group, and so on. Just need to take the square root of each scalar for the standard error. Am I on firm ground theoretically? My probability would be , and the CI would be p1.96*se, but I'm not sure I have my se on the probability scale. I need help with that.

          I'll have to read up on how to save the coefficients to a matrix and matrix operations in Stata. I know how to write r(table) to a matrix, but not particular elements. Just want to make sure I'm on the right track before spending hours trying something out.

          Comment


          • #6
            What do you get when you type

            Code:
            ereturn list
            after running the command? In particular, what (if anything) is the value of e(predict)?

            Skimming through the code, it looks to me like a shell for glm with a few added ereturns. But, it may not be ereturning the value needed for the predict command. I suspect it should be the case that the ereturned list includes

            Code:
            e(predict) : "glim_p"
            If you aren't seeing something like that in the ereturns, then I would suggest editing the program code so that, shortly after the glm command, you added the line

            Code:
            ereturn local predict "glim_p"
            Obviously save the original in case I am totally wrong about this.

            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam

            Comment


            • #7
              I just downloaded the program and, as I suspected, the predict program was not being ereturned. But, after I added it, I got this error after running one of the example problems provided:

              Code:
              . predict double pr, mu
              variable age_0 not found
              r(111);
              The program is creating a bunch of temporary variables that disappear when the program finishes; ergo the predict command can't work.

              I think what you would need to do is add all the predict commands you want to the program itself; make sure you do so before it exits.

              This is obviously kind of a clunky approach and will be a pain if you want to run a bunch of different models. But, the program was clearly not written with your purposes in mind, so unless you can figure out how to rewrite the program you may have to go this route. (If it was me and my life depended on it, I might write a customized predict command that regenerated all the temporary variables as needed.)

              Incidentally, I was looking at the code for the companion program, apc_cglim.ado, when I wrote before. It seems to be much simpler. I think the glm predict command would work for it, but I am not as sure about the much more complicated apc_ie. However you do this, make sure you believe the results.
              Last edited by Richard Williams; 15 Oct 2014, 20:40.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam

              Comment


              • #8
                My mistake. apc_ie does not leave behind the variables it uses
                in the call to glm, thus predict cannot compute the linear
                prediction that you would need to reproduce the predicted probability.

                However, I image Bill can use nlcom to compute the predicted
                probability of interest, and get an SE estimate along with a CI.

                Comment


                • #9
                  I think nlcom might be helpful if I can figure out how to use it. I tried the following after looking at the help:


                  Code:
                  nlcom _b[_cons]+_b[age_1]
                  
                  _nl_1: _b[_cons]+_b[age_1]
                  
                  ------------------------------------------------------------------------------
                  X_100 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                  _nl_1 | -4.965012 .1967324 -20.15 0.000 -5.3506 -4.579423
                  ------------------------------------------------------------------------------

                  So it runs. However, I want probabilities, so the question is how to obtain that. I think I need an inverse logit, but I can't get it to run:


                  Code:
                  . nlcom(p: invlogit( _b[age1] _b[_cons]))
                  
                  [age1]_b[_cons] not found
                  r(111);
                  I'm not sure what I'm doing wrong here. Any insight appreciated.

                  A couple of other questions that come to mind:

                  1) What if my APC analysis is restricted to a subset, say a particular range of years or one set of several imputations. Is nlcom following the same restrictions or do I need to impose them as well? If so how? It's not clear in the documentation.

                  2) Suppose I want to save the resulting probabilities and SEs from nlcom to plot. I presume these can be recovered from r(b) and r(V), correct?

                  Comment


                  • #10
                    Try

                    Code:
                    nlcom (p: invlogit(_b[_cons]+_b[age_1]))
                    I assume the linear predictor you plug into invlogit() will have some value,
                    other than 1, for the age_1 variable. Suppose we want the predicted
                    probability assuming age_1=-1.1 (as in one of your previous posts),
                    then the call to nlcom would be

                    Code:
                    nlcom (p: invlogit(_b[_cons]+_b[age_1]*(-1.1)))
                    This all assumes that setting all other predictors to zero is reasonable.

                    Comment


                    • #11

                      Originally posted by Bill Smith View Post
                      1) What if my APC analysis is restricted to a subset, say a particular range of years or one set of several imputations. Is nlcom following the same restrictions or do I need to impose them as well? If so how? It's not clear in the documentation.
                      nlcom works with the fitted parameters, their corresponding variance estimates, and the formula you supply.

                      Originally posted by Bill Smith View Post
                      2) Suppose I want to save the resulting probabilities and SEs from nlcom to plot. I presume these can be recovered from r(b) and r(V), correct?
                      Yes.

                      Comment


                      • #12
                        Ok, I see how to sum inside the parentheses. Thank you. I'm not sure I understand what's being represented in the equation, however. Correct me if I'm wrong, but I thought _b[_cons] and _b[age_1] were model coefficients. The apc_ie module creates 0/1 dummies for each age, period and cohort category. So, the logit for age_1 would be:


                        Code:
                        logit=_b[_cons]+_b[age_1]*1+_b[age_2]*0+..._b[period_1]*0+....+_b[cohort]*0+...
                        An age of -1.1 would make no sense. That was a supposed coefficient-C' above is not the design matrix, but the matrix of coefficients. I don't see what the problem would be with 0/1 dummies.

                        Comment


                        • #13
                          Ran my data and did some maual checks. Everytjing looks good with this method. Just nned to write code to save the results. That shouldn't be too hard. Thanks for the help.

                          Comment


                          • #14
                            Bill, please post your final solution in case this issue ever comes up again.
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 19.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam

                            Comment


                            • #15
                              Just saw this. Solution was pretty simple. Jeff was spot on. The apc_ie module will generate a table of results, e(b), as a result of glm that contain the parameter estimates. So if you have ages 0, 5 and 10, for example, you can access those by specifying _b[age_0], _b[age_5] and _b[age_10]. Period and cohort are similar. The constant is _b[_cons]. Then, it's just applying nlcom. The apc_ie module places each age, period and cohort value into a category, so it's like a 0/1 dummy variable. To obtain the probability that the outcome is obtained when age=0 simply set the dummy for age_0=1 and all other dummies to zero. Thus

                              Code:
                              nlcom (p: invlogit(_b[_cons]+_b[age_0]))
                              will yield the desired probability for age=0 and its standard error. Likewise, for age=5

                              Code:
                              nlcom (p: invlogit(_b[_cons]+_b[age_5]))
                              And so on.

                              Comment

                              Working...
                              X