
  • intreg - how to test for normality and heteroskedasticity?

    Hello,

    Are there specific tests for normality or heteroskedasticity when using intreg?

    I found the command "tobcm" as a test for normality in a Tobit regression and the command "bctobit" as a test for heteroskedasticity in a Tobit regression. Are these tests applicable to intreg as well?

    Both tests only work for left-censoring (at zero). Is there any test that takes both left- and right-censoring into account and, ideally, is also applicable to intreg?


    Many thanks in advance!




    tobcm implements a conditional moment test for testing the null hypothesis that the disturbances in a tobit model have a normal distribution. This test was derived by Skeels and Vella (1999), who built on work by Newey (1985) and Tauchen (1985). tobcm also implements the bootstrap method described by Drukker (2002).

    bctobit computes the LM-statistic for testing the tobit specification against the alternative of a model that is non-linear in the regressors and contains an error term that can be heteroskedastic and non-normally distributed. The test is carried out by taking a Box-Cox transformation of the dependent variable, [y^(lambda)-1]/lambda, and testing whether the parameter lambda = 1. A rejection of the null suggests that the Tobit specification is unsuitable, as an alternative value for lambda would be required to restore the linearity, homoskedasticity and normality assumptions that are necessary for consistent estimation. Critical values are obtained via the parametric bootstrap, where the regressors are assumed to be stochastic.

  • #2
    The Stata manual suggests comparing the results of intreg and oprobit. If the log likelihoods are very different, normality may be a problem, but you may be able to save the intreg model by logging variables or doing other transformations.
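
    A minimal sketch of that comparison, assuming ycat holds the interval category codes, ylo/yhi the interval bounds, and x1 and x2 stand in for your regressors:

    Code:
    intreg ylo yhi x1 x2            // interval bounds as the two dependent variables
    estimates store m_int
    oprobit ycat x1 x2              // same categories, but cut points freely estimated
    estimates store m_oprob
    estimates stats m_int m_oprob   // compare the two log likelihoods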

    My brief notes on this are at https://www3.nd.edu/~rwilliam/xsoc73994/intreg2.pdf .
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Many thanks for your quick and helpful response. I hadn't understood that the comparison would give me insights into normality. Is there also a way to test for heteroskedasticity?



      • #4
        Sara: I'm also puzzling over how this comparison yields insights about normality. I think the idea is that, under normality, the estimated cut points should line up pretty well with the assigned cut points, but I'll think more about it.
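
        A rough sketch of that cut-point check (ycat is again a hypothetical name for the category codes): after oprobit, the estimated thresholds appear in e(b) as /cut1, /cut2, ..., and under normality their spacing should roughly match the assigned interval boundaries after shifting and rescaling by intreg's _cons and sigma.

        Code:
        oprobit ycat x1 x2
        matrix list e(b)    // the /cut1 ... entries are the estimated thresholds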

        You can specify a het(x1 x2 ... xk) option with intreg and estimate an exponential model of heteroskedasticity. Then you can use a likelihood ratio test: compute 2 times the difference in the log likelihoods, 2*(unrestricted_LLF - restricted_LLF), and treat it as a chi-square with k degrees of freedom.
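
        A sketch of that calculation, with hypothetical regressors x1-x3 (so k = 3 here):

        Code:
        intreg ylo yhi x1 x2 x3                  // restricted (homoskedastic) model
        scalar ll_r = e(ll)
        intreg ylo yhi x1 x2 x3, het(x1 x2 x3)   // unrestricted model
        scalar ll_u = e(ll)
        scalar lr = 2*(ll_u - ll_r)
        display "LR = " lr "   p-value = " chi2tail(3, lr)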



        • #5
          Hello Jeff,

          thanks a lot for your response.
          I am a novice in the use of Stata and in econometrics in general, which is why I would like to check whether I did this right.
          Here is what I did:
          (For better readability I have omitted the industry dummies in the following presentation.)

          Code:
          intreg lmneup ulmneup sciencepartner innoexp size compenv costs [industry-dummies]
          estimates store no1
          
           intreg lmneup ulmneup sciencepartner innoexp size compenv costs [industry-dummies], het(sciencepartner innoausgaben size innoexp costs)
          estimates store no2
          
          lrtest no1 no2
          Likelihood-ratio test                                 LR chi2(5)  =     55.37
          (Assumption: no1 nested in no2)                       Prob > chi2 =    0.0000

          As far as I understand, the significance of this test implies that the exponential model of heteroskedasticity fits better, and thus my model does not meet the assumption of homoskedasticity. Am I right?
          The test automatically used k (= 5) degrees of freedom. What do you mean by "compute 2 times the difference in the log likelihoods, 2*(unrestricted_LLF - restricted_LLF)"? Is this something I have to specify, or is this done automatically by the test as well?

          In order to understand the het() option I read the Stata manual section on heteroskedastic linear regression.
          From this I assume that the results of the exponential model of heteroskedasticity, more specifically the part under "lnsigma", already show me which variables are responsible for the heteroskedasticity, namely those that are significant. Is this right?


          I appreciate your help very much. Many thanks.



          • #6
            Never mind my comment about how to compute the LR test. You figured it out better than I did: let Stata do it. So, yes, homoskedasticity is strongly rejected. But you don't abandon intreg. It just means that you should use the estimates from the more general model. I could say more if you show the actual coefficient estimates.

            How many industry dummies are there? You could put those in the het() part, too.



            • #7
              Thanks for your support. There are 21 industry dummies. I have now run it again and also put the industry dummies in the het() part.
              Here are my results:

              Code:
                 intreg lmneup ulmneup sciencepartner innoexp size compenv costs zweig1 zweig2 zweig3 zweig4 zweig5 zweig6 zweig7 zweig8 zweig9 zweig10 zweig11 zweig12 zweig13 zweig14 zweig15 zweig16 zweig18 zweig19 zweig20 zweig21
              
              Fitting constant-only model:
              
              Iteration 0:   log likelihood = -1603.6993  
              Iteration 1:   log likelihood = -1344.6035  
              Iteration 2:   log likelihood = -1310.3811  
              Iteration 3:   log likelihood = -1310.1688  
              Iteration 4:   log likelihood = -1310.1688  
              
              Fitting full model:
              
              Iteration 0:   log likelihood = -1561.8312  
              Iteration 1:   log likelihood = -1288.4557  
              Iteration 2:   log likelihood = -1261.5641  
              Iteration 3:   log likelihood = -1261.1702  
              Iteration 4:   log likelihood = -1261.1699  
              Iteration 5:   log likelihood = -1261.1699  
              
              Interval regression                             Number of obs     =        947
                                                              LR chi2(25)       =      98.00
              Log likelihood = -1261.1699                     Prob > chi2       =     0.0000
              
              --------------------------------------------------------------------------------
                              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
               ---------------+----------------------------------------------------------------
               sciencepartner |   3.756685   .7812141     4.81   0.000     2.225534    5.287837
                      innoexp |   .1223108   .1127243     1.09   0.278    -.0986247    .3432463
                         size |  -.6006468   .6809885    -0.88   0.378     -1.93536    .7340661
                      compenv |   -1.41107   4.667241    -0.30   0.762    -10.55869    7.736554
                        costs |    1.36628   1.048698     1.30   0.193    -.6891301    3.421691
                      zweig1 |  -14.57751   9.717829    -1.50   0.134    -33.62411    4.469082
                      zweig2 |  -.6921685   5.404573    -0.13   0.898    -11.28494      9.9006
                      zweig3 |   5.087677   4.857019     1.05   0.295    -4.431905    14.60726
                      zweig4 |    2.41649   5.860157     0.41   0.680    -9.069207    13.90219
                      zweig5 |   .7620257   5.064506     0.15   0.880    -9.164223    10.68827
                      zweig6 |   2.538926   5.885897     0.43   0.666    -8.997219    14.07507
                      zweig7 |   8.892196   6.161178     1.44   0.149    -3.183491    20.96788
                      zweig8 |  -2.185709   4.943613    -0.44   0.658    -11.87501    7.503594
                      zweig9 |   4.009841   4.015889     1.00   0.318    -3.861156    11.88084
                     zweig10 |   5.981377   4.482225     1.33   0.182    -2.803623    14.76638
                     zweig11 |  -2.807874   6.056627    -0.46   0.643    -14.67865    9.062898
                     zweig12 |  -4.565757   4.850311    -0.94   0.347    -14.07219    4.940677
                     zweig13 |  -15.29615   10.66499    -1.43   0.152    -36.19914    5.606848
                     zweig14 |  -15.23866   8.375824    -1.82   0.069    -31.65498    1.177651
                     zweig15 |  -18.65266   6.830155    -2.73   0.006    -32.03952   -5.265804
                     zweig16 |  -10.62626    5.76803    -1.84   0.065    -21.93139    .6788714
                     zweig18 |  -5.920499   6.151192    -0.96   0.336    -17.97661    6.135615
                     zweig19 |   .6001012   4.891965     0.12   0.902    -8.987974    10.18818
                     zweig20 |  -12.89342   6.528487    -1.97   0.048    -25.68902   -.0978247
                     zweig21 |  -6.363586   7.130654    -0.89   0.372    -20.33941    7.612239
                       _cons |  -7.630273   6.131401    -1.24   0.213     -19.6476    4.387052
              ---------------+----------------------------------------------------------------
                    /lnsigma |   3.132421     .04385    71.43   0.000     3.046476    3.218365
              ---------------+----------------------------------------------------------------
                       sigma |   22.92942   1.005455                      21.04107    24.98724
              --------------------------------------------------------------------------------
                         591  left-censored observations
                           0     uncensored observations
                           6 right-censored observations
                         350       interval observations
              
              . estimates store m1
              
              .  intreg lmneup ulmneup sciencepartner innoexp size compenv costs zweig1 zweig2 zweig3 zweig4 zweig5 zweig6 zweig7 zweig8 zweig9 zweig10 zweig11 zweig12 zweig13 zweig14 zweig15 zweig16 zweig18 zweig19 zweig20 zweig21, het(sciencepartner innoexp size compenv costs zweig1 zweig2 zweig3 zweig4 zweig5 zweig6 zweig7 zweig8 zweig9 zweig10 zweig11 zweig12 zweig13 zweig14 zweig15 zweig16 zweig18 zweig19 zweig20 zweig21)
              
              Fitting full model:
              
              Iteration 0:   log likelihood = -10629.046  (not concave)
              Iteration 1:   log likelihood = -6497.2094  (not concave)
              Iteration 2:   log likelihood = -1505.5265  (not concave)
              Iteration 3:   log likelihood = -1362.9154  (not concave)
              Iteration 4:   log likelihood = -1292.3793  
              Iteration 5:   log likelihood = -1221.0462  
              Iteration 6:   log likelihood = -1207.4486  
              Iteration 7:   log likelihood = -1205.7615  
              Iteration 8:   log likelihood = -1205.6998  
              Iteration 9:   log likelihood = -1205.6983  
              Iteration 10:  log likelihood = -1205.6983  
              
              Interval regression                             Number of obs     =        947
                                                              Wald chi2(25)     =      90.39
              Log likelihood = -1205.6983                     Prob > chi2       =     0.0000
              
              --------------------------------------------------------------------------------
                             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              ---------------+----------------------------------------------------------------
              model          |
               sciencepartner |   3.790491   .7313526     5.18   0.000     2.357066    5.223916
                      innoexp |   .0535521    .093061     0.58   0.565     -.128844    .2359483
                         size |   1.410997   .6577003     2.15   0.032     .1219279    2.700066
                      compenv |   7.841492   6.013922     1.30   0.192    -3.945578    19.62856
                        costs |   .5267223    .970996     0.54   0.588    -1.376395    2.429839
                      zweig1 |    1.11798   8.360903     0.13   0.894    -15.26909    17.50505
                      zweig2 |   12.39252    6.19088     2.00   0.045     .2586215    24.52642
                      zweig3 |   12.52402   6.457947     1.94   0.052    -.1333229    25.18136
                      zweig4 |    5.48907    8.87373     0.62   0.536    -11.90312    22.88126
                      zweig5 |   8.350297   6.851908     1.22   0.223    -5.079196    21.77979
                      zweig6 |   13.63902   6.249389     2.18   0.029     1.390439    25.88759
                      zweig7 |   16.66807   6.491637     2.57   0.010     3.944695    29.39145
                      zweig8 |   7.787226   6.316265     1.23   0.218    -4.592426    20.16688
                      zweig9 |   9.846366   6.195789     1.59   0.112    -2.297157    21.98989
                     zweig10 |   10.63104   6.554325     1.62   0.105    -2.215205    23.47728
                     zweig11 |   3.787774   7.872584     0.48   0.630    -11.64221    19.21775
                     zweig12 |   3.159504   7.289769     0.43   0.665    -11.12818    17.44719
                     zweig13 |    13.3289   6.292103     2.12   0.034     .9966071     25.6612
                     zweig14 |  -2.494406    12.7042    -0.20   0.844    -27.39419    22.40538
                     zweig15 |  -27.58568   24.64178    -1.12   0.263    -75.88268    20.71132
                     zweig16 |   9.583642    6.50055     1.47   0.140    -3.157203    22.32449
                     zweig18 |  -19.75297   18.38035    -1.07   0.283    -55.77779    16.27185
                     zweig19 |   9.377699   6.952508     1.35   0.177    -4.248966    23.00436
                     zweig20 |  -20.00957   19.69782    -1.02   0.310    -58.61659    18.59746
                     zweig21 |   8.917929   8.085382     1.10   0.270    -6.929128    24.76499
                       _cons |  -30.69488   8.697847    -3.53   0.000    -47.74235   -13.64742
              ---------------+----------------------------------------------------------------
              lnsigma        |
               sciencepartner |  -.0420818   .0380513    -1.11   0.269    -.1166609     .032497
                      innoexp |   .0030973   .0052842     0.59   0.558    -.0072595    .0134541
                         size |  -.1805327   .0334142    -5.40   0.000    -.2460234     -.115042
                      compenv |  -.3595463   .2430017    -1.48   0.139     -.835821    .1167283
                        costs |   .0611707   .0545769     1.12   0.262     -.045798    .1681393
                      zweig1 |  -.9596453   .4486655    -2.14   0.032    -1.839014   -.0802771
                      zweig2 |   -.943934   .2582397    -3.66   0.000    -1.450075   -.4377935
                      zweig3 |  -.5268276    .227677    -2.31   0.021    -.9730663   -.0805889
                      zweig4 |   .0540816   .2900988     0.19   0.852    -.5145015    .6226647
                      zweig5 |  -.3939927   .2425254    -1.62   0.104    -.8693337    .0813483
                      zweig6 |  -.9201712   .2680818    -3.43   0.001    -1.445602   -.3947406
                      zweig7 |  -.7293387   .2737441    -2.66   0.008    -1.265867   -.1928101
                      zweig8 |  -.5349117      .2395    -2.23   0.026    -1.004323   -.0655003
                      zweig9 |  -.2824018   .1972669    -1.43   0.152    -.6690378    .1042341
                     zweig10 |  -.1195678   .2179304    -0.55   0.583    -.5467036    .3075681
                     zweig11 |  -.1933591   .3012817    -0.64   0.521    -.7838604    .3971422
                     zweig12 |  -.2877742   .2524264    -1.14   0.254    -.7825208    .2069724
                     zweig13 |   -1.94136   .7559041    -2.57   0.010    -3.422905   -.4598152
                     zweig14 |  -.6637757   .4948871    -1.34   0.180    -1.633737    .3061852
                     zweig15 |   .2014341   .4872493     0.41   0.679     -.753557    1.156425
                     zweig16 |  -1.112253   .3051469    -3.64   0.000     -1.71033   -.5141765
                     zweig18 |    .469321   .3931613     1.19   0.233     -.301261    1.239903
                     zweig19 |  -.2818065   .2404365    -1.17   0.241    -.7530533    .1894403
                     zweig20 |    .084855   .4197141     0.20   0.840    -.7377695    .9074795
                     zweig21 |  -.6711068   .3929349    -1.71   0.088    -1.441245    .0990314
                       _cons |   4.397328   .3154628    13.94   0.000     3.779033    5.015624
              --------------------------------------------------------------------------------
                         591  left-censored observations
                           0     uncensored observations
                           6 right-censored observations
                         350       interval observations
              
              . estimates store m2
              
              . lrtest m1 m2
              
              Likelihood-ratio test                                 LR chi2(25) =    110.94
              (Assumption: m1 nested in m2)                         Prob > chi2 =    0.0000

              What do you mean by "It just means that you should use the estimates from the more general model"?

              Thank you for your support.



              • #8
                When you use the het() option, you're allowing for heteroskedasticity, right? So that's the more general model; allowing heteroskedasticity is the main point of the het() option. Rejecting homoskedasticity means you should opt for the estimates that allow for it. It's no different from adding, say, z, x^2, or z*x to a regression: if the added variable is statistically significant, I would tend to include it.

                The coefficient on sciencepartner is stable in magnitude and significance. The others bounce around, but innoexp is insignificant in both cases. Which variables are you mainly interested in?



                • #9
                  Thanks for the explanation.
                  I am interested in the relation between sciencepartner and my dependent variable.
                  Is there anything I have to do about the fact that they "bounce around"? Or does this just show that the results are biased when I do not take the heteroskedasticity into account?
                  I tried the same with another independent variable instead of sciencepartner. In this case the coefficient differs: in the original model the coefficient of this independent variable is 3.297860, and in the model with the het() option it is 1.881348 (both are significant with p<0.05). Does this simply mean that I would overestimate the influence if I used the model without het()? Thus, in my thesis I should only interpret the results of the model with het(), is that right?
                  Do I have to interpret the part below lnsigma? If so, what does it tell me except that some variables are responsible for the heteroskedasticity?
                  I want to add an interaction term. Can I also add the interaction term in the het() part?

                  A more general question: why does Stata also allow one to simply add vce(robust)? As far as I understand, it is common practice to control for heteroskedasticity by using robust standard errors. However, it seems this is not really a good idea here: if the assumption of homoskedasticity is not met, the estimators may not be consistent, and adding vce(robust) does not solve this. Am I right? So I guess using the het() option is a better way to deal with heteroskedasticity?


                  Sorry for this amount of questions. Once again, thank you very much!



                  • #10
                    If sciencepartner is your variable of interest, it's hard to imagine the results could be much better. Whether you account for heteroskedasticity or not, the estimate is very similar. Plus, the standard error actually gets smaller, but not in a crazy way. What happens with the control variables is not that important. It's clear the heteroskedasticity is related to size, which can explain why size changes sign and becomes significant.
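
                    One way to see that pattern is to look at each observation's implied sigma; het() models ln(sigma_i) as a linear function of the het() variables. A sketch, assuming predict's equation() option applies here as it does for other multi-equation estimators:

                    Code:
                    * after: intreg ylo yhi ..., het(...)
                    predict double lns, xb equation(lnsigma)   // linear prediction from the lnsigma equation
                    generate double sighat = exp(lns)          // implied standard deviation per observation
                    summarize sighat                           // e.g., plot against size to see the shrinkage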

                    That's a very good observation about "robust" in this context with intreg. In my view, Stata should not allow this -- or at least provide a warning -- and I'm usually in favor of computing robust standard errors. If one has to use the "robust" option then one is admitting that the underlying assumptions for interval regression -- normality, homoskedasticity -- are wrong, and this causes inconsistency in the parameter estimates. Using the "robust" option just gives robust standard errors for inconsistent parameter estimates. If this were oprobit rather than intreg, one could argue that oprobit is just approximating response probabilities, and so using a robust variance matrix is justified. But not when there is data censoring: any kind of misspecification causes coefficient estimates to be inconsistent. So report the results that allow heteroskedasticity, as it's clearly in the data. You might mention which factors affect the variance. The variance shrinks as size increases, which may be worth noting.

                    Jeff



                    • #11
                      Thank you, your explanation helps me a lot.

                      I just wondered: do I have to include all variables in the het() option?
                      I noticed that if I leave out some of the variables that are not significant in the part below lnsigma, it sometimes has an impact on whether my variable of interest is significant or not.
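
                      (A sketch of how nested het() specifications can be compared formally, since a model with fewer het() variables is nested in one with more; x1-x3 are hypothetical:)

                      Code:
                      intreg ylo yhi x1 x2 x3, het(x1 x2 x3)
                      estimates store h_full
                      intreg ylo yhi x1 x2 x3, het(x1)
                      estimates store h_small
                      lrtest h_small h_full    // do x2 and x3 belong in het()?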

                      Have a nice weekend!


                      And of course, I would still be grateful if somebody could tell me how to test for normality.



                      • #12
                        IMO, statistical tests of normality are usually not helpful. Normality of the errors is most important when samples are small. But when samples are small, tests of normality are under-powered. As n increases, normality of the errors becomes less important (see Jeff Wooldridge's econometrics textbook, for example), but at the same time, tests of normality become increasingly powerful. In a nutshell, tests of normality fail to detect important departures from normality when samples are small, but throw up the red flag of non-normality when n is large and non-normality no longer really matters. To me, this is almost the perfect example of two things being at cross-purposes. YMMV.
                        --
                        Bruce Weaver
                        Email: [email protected]
                        Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                        Version: Stata/MP 18.0 (Windows)



                        • #13
                          Thank you for your response. I guess with more than 900 observations this would apply to my sample. I have also read something like this. However, I was uncertain, since the normality assumption seems to be more important for intreg than for OLS. But if I can argue like that, it makes things easier for me, of course.
                          Have a nice weekend.



                          • #14
                            I'll have to disagree with Bruce in cases where the data have been censored. In such cases, the assumed distribution can be important. This is not an issue of whether there are enough observations to apply the central limit theorem. The issue is that, even with a lot of observations, nonnormality can cause the estimated coefficients to be badly biased. If we were talking about OLS, probit, Tobit, and the like, where the y we want to explain is the y we observe, then the normality assumption is less critical. As we know, OLS works fine with almost any distribution (ruling out very fat tails) when you have a sufficient number of observations.

                            With interval regression, we can have a lot of data, and if the population is not normally distributed, the intreg command can produce badly biased estimators. In a sense, it gets worse with more data because you're more precisely estimating the wrong parameters. Same is true of any true data censoring scheme. The censoring is very costly: without it, we could do OLS and would not worry about normality.
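
                            A crude simulation sketch of that point, under assumed names and a skewed error (an illustration, not a formal test):

                            Code:
                            * interval-censor a skewed (chi-squared) error and fit intreg;
                            * the slope can drift from its true value of 2 even with n = 5000
                            clear
                            set seed 12345
                            set obs 5000
                            generate double x = rnormal()
                            generate double ystar = 1 + 2*x + (rchi2(2) - 2)   // mean zero but skewed
                            generate double ylo = floor(ystar/5)*5             // width-5 intervals
                            generate double yhi = ylo + 5
                            intreg ylo yhi x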

                            Having said that, testing normality is not so easy. One way is to nest the normal in the Pearson family, and then derive the score test. I don't know that this has been done with interval regression.



                            • #15
                              Thank you, I am very grateful for your explanations.
                              If I understand you correctly, normality is less important for Tobit than for intreg? I assumed that the assumptions for intreg and tobit would be the same, as intreg is a generalization of tobit, and I thought both models deal with censoring.
                              My supervisor from university seems to be of the opinion that a tobit model can also deal with interval data.
                              I stayed with intreg because my research gave me the impression that tobit is right when the censored dependent variable is continuous (apart from the censoring points), and intreg is used when the censored dependent variable is observed as interval data. My dependent variable looks like this:

                              0: x = 0
                              1: 0 < x < 5
                              2: 5 <= x < 10
                              3: 10 <= x < 15
                              4: 15 <= x < 20
                              5: 20 <= x < 30
                              6: 30 <= x < 50
                              7: 50 <= x < 75
                              8: 75 <= x <= 100
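
                              (A sketch of how such categories map into intreg's two dependent variables; xcat is a hypothetical name for the 0-8 code, and the first category is treated as left-censored at 0, matching the censoring counts in the output above:)

                              Code:
                              generate ylo = .
                              generate yhi = .
                              local cuts 0 5 10 15 20 30 50 75 100
                              forvalues k = 1/8 {
                                  local lo : word `k' of `cuts'
                                  local hi : word `=`k'+1' of `cuts'
                                  replace ylo = `lo' if xcat == `k'
                                  replace yhi = `hi' if xcat == `k'
                              }
                              replace yhi = 0 if xcat == 0   // ylo stays missing: left-censored at 0
                              intreg ylo yhi x1 x2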


                              I read a paper in which the authors deal with the same dependent variable as I do, with the difference that the information is not reduced to interval data; their dependent variable is continuous.
                              The authors justify the use of a tobit model with double censoring by the fact that the variable ranges from 0 to 100. This is why I assumed a Tobit model would be right, and because my variable is not continuous I chose intreg.

                              Would you agree that I could use a tobit model as well?
                              I have assumed that intreg is intended for exactly this kind of data.
                              However, if it is okay to use tobit, and I understood correctly from your last post that the normality assumption is less important for tobit, then maybe tobit would be a good alternative.

                              I would be very interested in your opinion about this.
                              Many thanks.

