
  • intreg - how to test for normality and heteroskedasticity?

    Hello,

    Are there specific tests for normality or heteroskedasticity when using intreg?

    I found the command "tobcm" as a test for normality in a Tobit regression and the command "bctobit" as a test for heteroskedasticity in a Tobit regression. Are these tests applicable to intreg as well?

    Both tests only work for left-censoring (at zero). Is there any test that takes both left- and right-censoring into account and, ideally, is also applicable to intreg?


    Many thanks in advance!




    tobcm implements a conditional moment test for testing the null hypothesis that the disturbances in a tobit model have a normal distribution. This test was derived by Skeels and Vella (1999), who built on work by Newey (1985) and Tauchen (1985). tobcm also implements the bootstrap method described by Drukker (2002).

    bctobit computes the LM-statistic for testing the tobit specification against the alternative of a model that is non-linear in the regressors and contains an error term that can be heteroskedastic and non-normally distributed. The test is carried out by taking a Box-Cox transformation of the dependent variable, [y^(lambda)-1]/lambda, and testing whether the parameter lambda = 1. A rejection of the null suggests that the Tobit specification is unsuitable, as an alternative value for lambda would be required to restore the linearity, homoskedasticity and normality assumptions that are necessary for consistent estimation. Critical values are obtained via the parametric bootstrap, where the regressors are assumed to be stochastic.

  • #2
    The Stata manual suggests comparing the results of intreg and oprobit. If the log likelihoods are very different, normality may be a problem, but you may be able to save the intreg model by logging variables or doing other transformations.
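
    A minimal sketch of that comparison, assuming ycat holds the interval category codes, ylo/yhi the interval bounds, and x1 and x2 stand in for your regressors:

    Code:
    intreg ylo yhi x1 x2            // interval bounds as the two dependent variables
    estimates store m_int
    oprobit ycat x1 x2              // same categories, but cut points freely estimated
    estimates store m_oprob
    estimates stats m_int m_oprob   // compare the two log likelihoods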

    My brief notes on this are at https://www3.nd.edu/~rwilliam/xsoc73994/intreg2.pdf .
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Many thanks for your quick and helpful response. I hadn't understood that the comparison would give me insights into normality. Is there also a way to test for heteroskedasticity?



      • #4
        Sara: I'm also puzzling over how this comparison yields insights about normality. I think the idea is that, under normality, the estimated cut points should line up pretty well with the assigned cut points, but I'll think more about it.
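
        A rough sketch of that cut-point check (ycat is again a hypothetical name for the category codes): after oprobit, the estimated thresholds appear in e(b) as /cut1, /cut2, ..., and under normality their spacing should roughly match the assigned interval boundaries after shifting and rescaling by intreg's _cons and sigma.

        Code:
        oprobit ycat x1 x2
        matrix list e(b)    // the /cut1 ... entries are the estimated thresholds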

        You can specify a het(x1 x2 ... xk) option with intreg and estimate an exponential model of heteroskedasticity. Then you can use a likelihood ratio test: compute 2 times the difference in the log likelihoods, 2*(unrestricted_LLF - restricted_LLF), and treat it as a chi-square with k degrees of freedom.
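
        A sketch of that calculation, with hypothetical regressors x1-x3 (so k = 3 here):

        Code:
        intreg ylo yhi x1 x2 x3                  // restricted (homoskedastic) model
        scalar ll_r = e(ll)
        intreg ylo yhi x1 x2 x3, het(x1 x2 x3)   // unrestricted model
        scalar ll_u = e(ll)
        scalar lr = 2*(ll_u - ll_r)
        display "LR = " lr "   p-value = " chi2tail(3, lr)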



        • #5
          Hello Jeff,

          thanks a lot for your response.
          I am a novice in the use of Stata and in econometrics in general, which is why I would like to check whether I did this right.
          Here is what I did:
          (For better readability I have omitted the industry dummies in the following presentation.)

          Code:
          intreg lmneup ulmneup sciencepartner innoexp size compenv costs [industry-dummies]
          estimates store no1
          
           intreg lmneup ulmneup sciencepartner innoexp size compenv costs [industry-dummies], het(sciencepartner innoausgaben size innoexp costs)
          estimates store no2
          
          lrtest no1 no2
          Likelihood-ratio test                                 LR chi2(5)  =     55.37
          (Assumption: no1 nested in no2)                       Prob > chi2 =    0.0000

          As far as I understand, the significance of this test implies that the exponential model of heteroskedasticity fits better, and thus my model does not meet the assumption of homoskedasticity. Am I right?
          The test automatically used k (= 5) degrees of freedom. What do you mean by "compute 2 times the difference in the log likelihoods, 2*(unrestricted_LLF - restricted_LLF)"? Is this something I have to specify, or is this done automatically by the test as well?

          In order to understand the het() option I read the Stata manual section on heteroskedastic linear regression.
          From this I assume that the results of the exponential model of heteroskedasticity, more specifically the part under "lnsigma", already show me which variables are responsible for the heteroskedasticity, namely those that are significant. Is this right?


          I appreciate your help very much. Many thanks.



          • #6
            Never mind my comment about how to compute the LR test. You figured it out better than I did: let Stata do it. So, yes, homoskedasticity is strongly rejected. But you don't abandon intreg. It just means that you should use the estimates from the more general model. I could say more if you show the actual coefficient estimates.

            How many industry dummies are there? You could put those in the het() part, too.



            • #7
              Thanks for your support. There are 21 industry dummies. I have now run it again and also put the industry dummies in the het() part.
              Here are my results:

              Code:
                 intreg lmneup ulmneup sciencepartner innoexp size compenv costs zweig1 zweig2 zweig3 zweig4 zweig5 zweig6 zweig7 zweig8 zweig9 zweig10 zweig11 zweig12 zweig13 zweig14 zweig15 zweig16 zweig18 zweig19 zweig20 zweig21
              
              Fitting constant-only model:
              
              Iteration 0:   log likelihood = -1603.6993  
              Iteration 1:   log likelihood = -1344.6035  
              Iteration 2:   log likelihood = -1310.3811  
              Iteration 3:   log likelihood = -1310.1688  
              Iteration 4:   log likelihood = -1310.1688  
              
              Fitting full model:
              
              Iteration 0:   log likelihood = -1561.8312  
              Iteration 1:   log likelihood = -1288.4557  
              Iteration 2:   log likelihood = -1261.5641  
              Iteration 3:   log likelihood = -1261.1702  
              Iteration 4:   log likelihood = -1261.1699  
              Iteration 5:   log likelihood = -1261.1699  
              
              Interval regression                             Number of obs     =        947
                                                              LR chi2(25)       =      98.00
              Log likelihood = -1261.1699                     Prob > chi2       =     0.0000
              
              --------------------------------------------------------------------------------
                              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
               ---------------+----------------------------------------------------------------
               sciencepartner |   3.756685   .7812141     4.81   0.000     2.225534    5.287837
                      innoexp |   .1223108   .1127243     1.09   0.278    -.0986247    .3432463
                         size |  -.6006468   .6809885    -0.88   0.378     -1.93536    .7340661
                      compenv |   -1.41107   4.667241    -0.30   0.762    -10.55869    7.736554
                        costs |    1.36628   1.048698     1.30   0.193    -.6891301    3.421691
                      zweig1 |  -14.57751   9.717829    -1.50   0.134    -33.62411    4.469082
                      zweig2 |  -.6921685   5.404573    -0.13   0.898    -11.28494      9.9006
                      zweig3 |   5.087677   4.857019     1.05   0.295    -4.431905    14.60726
                      zweig4 |    2.41649   5.860157     0.41   0.680    -9.069207    13.90219
                      zweig5 |   .7620257   5.064506     0.15   0.880    -9.164223    10.68827
                      zweig6 |   2.538926   5.885897     0.43   0.666    -8.997219    14.07507
                      zweig7 |   8.892196   6.161178     1.44   0.149    -3.183491    20.96788
                      zweig8 |  -2.185709   4.943613    -0.44   0.658    -11.87501    7.503594
                      zweig9 |   4.009841   4.015889     1.00   0.318    -3.861156    11.88084
                     zweig10 |   5.981377   4.482225     1.33   0.182    -2.803623    14.76638
                     zweig11 |  -2.807874   6.056627    -0.46   0.643    -14.67865    9.062898
                     zweig12 |  -4.565757   4.850311    -0.94   0.347    -14.07219    4.940677
                     zweig13 |  -15.29615   10.66499    -1.43   0.152    -36.19914    5.606848
                     zweig14 |  -15.23866   8.375824    -1.82   0.069    -31.65498    1.177651
                     zweig15 |  -18.65266   6.830155    -2.73   0.006    -32.03952   -5.265804
                     zweig16 |  -10.62626    5.76803    -1.84   0.065    -21.93139    .6788714
                     zweig18 |  -5.920499   6.151192    -0.96   0.336    -17.97661    6.135615
                     zweig19 |   .6001012   4.891965     0.12   0.902    -8.987974    10.18818
                     zweig20 |  -12.89342   6.528487    -1.97   0.048    -25.68902   -.0978247
                     zweig21 |  -6.363586   7.130654    -0.89   0.372    -20.33941    7.612239
                       _cons |  -7.630273   6.131401    -1.24   0.213     -19.6476    4.387052
              ---------------+----------------------------------------------------------------
                    /lnsigma |   3.132421     .04385    71.43   0.000     3.046476    3.218365
              ---------------+----------------------------------------------------------------
                       sigma |   22.92942   1.005455                      21.04107    24.98724
              --------------------------------------------------------------------------------
                         591  left-censored observations
                           0     uncensored observations
                           6 right-censored observations
                         350       interval observations
              
              . estimates store m1
              
              .  intreg lmneup ulmneup sciencepartner innoexp size compenv costs zweig1 zweig2 zweig3 zweig4 zweig5 zweig6 zweig7 zweig8 zweig9 zweig10 zweig11 zweig12 zweig13 zweig14 zweig15 zweig16 zweig18 zweig19 zweig20 zweig21, het(sciencepartner innoexp size compenv costs zweig1 zweig2 zweig3 zweig4 zweig5 zweig6 zweig7 zweig8 zweig9 zweig10 zweig11 zweig12 zweig13 zweig14 zweig15 zweig16 zweig18 zweig19 zweig20 zweig21)
              
              Fitting full model:
              
              Iteration 0:   log likelihood = -10629.046  (not concave)
              Iteration 1:   log likelihood = -6497.2094  (not concave)
              Iteration 2:   log likelihood = -1505.5265  (not concave)
              Iteration 3:   log likelihood = -1362.9154  (not concave)
              Iteration 4:   log likelihood = -1292.3793  
              Iteration 5:   log likelihood = -1221.0462  
              Iteration 6:   log likelihood = -1207.4486  
              Iteration 7:   log likelihood = -1205.7615  
              Iteration 8:   log likelihood = -1205.6998  
              Iteration 9:   log likelihood = -1205.6983  
              Iteration 10:  log likelihood = -1205.6983  
              
              Interval regression                             Number of obs     =        947
                                                              Wald chi2(25)     =      90.39
              Log likelihood = -1205.6983                     Prob > chi2       =     0.0000
              
              --------------------------------------------------------------------------------
                             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              ---------------+----------------------------------------------------------------
              model          |
               sciencepartner |   3.790491   .7313526     5.18   0.000     2.357066    5.223916
                      innoexp |   .0535521    .093061     0.58   0.565     -.128844    .2359483
                         size |   1.410997   .6577003     2.15   0.032     .1219279    2.700066
                      compenv |   7.841492   6.013922     1.30   0.192    -3.945578    19.62856
                        costs |   .5267223    .970996     0.54   0.588    -1.376395    2.429839
                      zweig1 |    1.11798   8.360903     0.13   0.894    -15.26909    17.50505
                      zweig2 |   12.39252    6.19088     2.00   0.045     .2586215    24.52642
                      zweig3 |   12.52402   6.457947     1.94   0.052    -.1333229    25.18136
                      zweig4 |    5.48907    8.87373     0.62   0.536    -11.90312    22.88126
                      zweig5 |   8.350297   6.851908     1.22   0.223    -5.079196    21.77979
                      zweig6 |   13.63902   6.249389     2.18   0.029     1.390439    25.88759
                      zweig7 |   16.66807   6.491637     2.57   0.010     3.944695    29.39145
                      zweig8 |   7.787226   6.316265     1.23   0.218    -4.592426    20.16688
                      zweig9 |   9.846366   6.195789     1.59   0.112    -2.297157    21.98989
                     zweig10 |   10.63104   6.554325     1.62   0.105    -2.215205    23.47728
                     zweig11 |   3.787774   7.872584     0.48   0.630    -11.64221    19.21775
                     zweig12 |   3.159504   7.289769     0.43   0.665    -11.12818    17.44719
                     zweig13 |    13.3289   6.292103     2.12   0.034     .9966071     25.6612
                     zweig14 |  -2.494406    12.7042    -0.20   0.844    -27.39419    22.40538
                     zweig15 |  -27.58568   24.64178    -1.12   0.263    -75.88268    20.71132
                     zweig16 |   9.583642    6.50055     1.47   0.140    -3.157203    22.32449
                     zweig18 |  -19.75297   18.38035    -1.07   0.283    -55.77779    16.27185
                     zweig19 |   9.377699   6.952508     1.35   0.177    -4.248966    23.00436
                     zweig20 |  -20.00957   19.69782    -1.02   0.310    -58.61659    18.59746
                     zweig21 |   8.917929   8.085382     1.10   0.270    -6.929128    24.76499
                       _cons |  -30.69488   8.697847    -3.53   0.000    -47.74235   -13.64742
              ---------------+----------------------------------------------------------------
              lnsigma        |
               sciencepartner |  -.0420818   .0380513    -1.11   0.269    -.1166609     .032497
                      innoexp |   .0030973   .0052842     0.59   0.558    -.0072595    .0134541
                         size |  -.1805327   .0334142    -5.40   0.000    -.2460234     -.115042
                      compenv |  -.3595463   .2430017    -1.48   0.139     -.835821    .1167283
                        costs |   .0611707   .0545769     1.12   0.262     -.045798    .1681393
                      zweig1 |  -.9596453   .4486655    -2.14   0.032    -1.839014   -.0802771
                      zweig2 |   -.943934   .2582397    -3.66   0.000    -1.450075   -.4377935
                      zweig3 |  -.5268276    .227677    -2.31   0.021    -.9730663   -.0805889
                      zweig4 |   .0540816   .2900988     0.19   0.852    -.5145015    .6226647
                      zweig5 |  -.3939927   .2425254    -1.62   0.104    -.8693337    .0813483
                      zweig6 |  -.9201712   .2680818    -3.43   0.001    -1.445602   -.3947406
                      zweig7 |  -.7293387   .2737441    -2.66   0.008    -1.265867   -.1928101
                      zweig8 |  -.5349117      .2395    -2.23   0.026    -1.004323   -.0655003
                      zweig9 |  -.2824018   .1972669    -1.43   0.152    -.6690378    .1042341
                     zweig10 |  -.1195678   .2179304    -0.55   0.583    -.5467036    .3075681
                     zweig11 |  -.1933591   .3012817    -0.64   0.521    -.7838604    .3971422
                     zweig12 |  -.2877742   .2524264    -1.14   0.254    -.7825208    .2069724
                     zweig13 |   -1.94136   .7559041    -2.57   0.010    -3.422905   -.4598152
                     zweig14 |  -.6637757   .4948871    -1.34   0.180    -1.633737    .3061852
                     zweig15 |   .2014341   .4872493     0.41   0.679     -.753557    1.156425
                     zweig16 |  -1.112253   .3051469    -3.64   0.000     -1.71033   -.5141765
                     zweig18 |    .469321   .3931613     1.19   0.233     -.301261    1.239903
                     zweig19 |  -.2818065   .2404365    -1.17   0.241    -.7530533    .1894403
                     zweig20 |    .084855   .4197141     0.20   0.840    -.7377695    .9074795
                     zweig21 |  -.6711068   .3929349    -1.71   0.088    -1.441245    .0990314
                       _cons |   4.397328   .3154628    13.94   0.000     3.779033    5.015624
              --------------------------------------------------------------------------------
                         591  left-censored observations
                           0     uncensored observations
                           6 right-censored observations
                         350       interval observations
              
              . estimates store m2
              
              . lrtest m1 m2
              
              Likelihood-ratio test                                 LR chi2(25) =    110.94
              (Assumption: m1 nested in m2)                         Prob > chi2 =    0.0000

              What do you mean by "It just means that you should use the estimates from the more general model"?

              Thank you for your support.



              • #8
                When you use the het() option, you're allowing for heteroskedasticity, right? So that's the more general model; allowing heteroskedasticity is the main point of the het() option. Rejecting homoskedasticity means you should opt for the estimates that allow for it. It's no different from adding, say, z, x^2, or z*x to a regression: if the added variable is statistically significant, I would tend to include it.

                The coefficient on sciencepartner is stable in magnitude and significance. The others bounce around, but innoexp is insignificant in both cases. Which variables are you mainly interested in?



                • #9
                  Thanks for the explanation.
                  I am interested in the relation between sciencepartner and my dependent variable.
                  Is there anything I have to do about the fact that they "bounce around"? Or does this just show that the results are biased when I do not take the heteroskedasticity into account?
                  I tried the same with another independent variable instead of sciencepartner. In this case the coefficient differs: in the original model the coefficient of this independent variable is 3.297860, and in the model with the het() option it is 1.881348 (both are significant with p<0.05). Does this simply mean that I would overestimate the influence if I used the model without het()? Thus, in my thesis I should only interpret the results of the model with het(), is that right?
                  Do I have to interpret the part below lnsigma? If so, what does it tell me except that some variables are responsible for the heteroskedasticity?
                  I want to add an interaction term. Can I also add the interaction term in the het() part?

                  A more general question: why does Stata also allow one to simply add vce(robust)? As far as I understand, it is common practice to control for heteroskedasticity by using robust standard errors. However, it seems this is not really a good idea here: if the assumption of homoskedasticity is not met, the estimators may not be consistent, and adding vce(robust) does not solve this. Am I right? So I guess using the het() option is a better way to deal with heteroskedasticity?


                  Sorry for this amount of questions. Once again, thank you very much!



                  • #10
                    If sciencepartner is your variable of interest, it's hard to imagine the results could be much better. Whether you account for heteroskedasticity or not, the estimate is very similar. Plus, the standard error actually gets smaller, but not in a crazy way. What happens with the control variables is not that important. It's clear the heteroskedasticity is related to size, which can explain why size changes sign and becomes significant.
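
                    One way to see that pattern is to look at each observation's implied sigma; het() models ln(sigma_i) as a linear function of the het() variables. A sketch, assuming predict's equation() option applies here as it does for other multi-equation estimators:

                    Code:
                    * after: intreg ylo yhi ..., het(...)
                    predict double lns, xb equation(lnsigma)   // linear prediction from the lnsigma equation
                    generate double sighat = exp(lns)          // implied standard deviation per observation
                    summarize sighat                           // e.g., plot against size to see the shrinkage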

                    That's a very good observation about "robust" in this context with intreg. In my view, Stata should not allow this -- or at least provide a warning -- and I'm usually in favor of computing robust standard errors. If one has to use the "robust" option then one is admitting that the underlying assumptions for interval regression -- normality, homoskedasticity -- are wrong, and this causes inconsistency in the parameter estimates. Using the "robust" option just gives robust standard errors for inconsistent parameter estimates. If this were oprobit rather than intreg, one could argue that oprobit is just approximating response probabilities, and so using a robust variance matrix is justified. But not when there is data censoring: any kind of misspecification causes coefficient estimates to be inconsistent. So report the results that allow heteroskedasticity, as it's clearly in the data. You might mention which factors affect the variance. The variance shrinks as size increases, which may be worth noting.

                    Jeff



                    • #11
                      Thank you, your explanation helps me a lot.

                      I just wondered: do I have to include all variables in the het() option?
                      I noticed that if I leave out some of the variables that are not significant in the part below lnsigma, it sometimes has an impact on whether my variable of interest is significant or not.
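
                      (A sketch of how nested het() specifications can be compared formally, since a model with fewer het() variables is nested in one with more; x1-x3 are hypothetical:)

                      Code:
                      intreg ylo yhi x1 x2 x3, het(x1 x2 x3)
                      estimates store h_full
                      intreg ylo yhi x1 x2 x3, het(x1)
                      estimates store h_small
                      lrtest h_small h_full    // do x2 and x3 belong in het()?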

                      Have a nice weekend!


                      And of course, I would still be grateful if somebody could tell me how to test for normality.



                      • #12
                        IMO, statistical tests of normality are usually not helpful. Normality of the errors is most important when samples are small. But when samples are small, tests of normality are under-powered. As n increases, normality of the errors becomes less important (see Jeff Wooldridge's econometrics textbook, for example), but at the same time, tests of normality become increasingly powerful. In a nutshell, tests of normality fail to detect important departures from normality when samples are small, but throw up the red flag of non-normality when n is large and non-normality no longer really matters. To me, this is almost the perfect example of two things being at cross-purposes. YMMV.
                        --
                        Bruce Weaver
                        Email: [email protected]
                        Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                        Version: Stata/MP 18.0 (Windows)



                        • #13
                          Thank you for your response. I guess with more than 900 observations this would apply to my sample. I have also read something like this. However, I was uncertain, since the normality assumption seems to be more important for intreg than for OLS. But if I can argue like that, it makes things easier for me, of course.
                          Have a nice weekend.



                          • #14
                            I'll have to disagree with Bruce in cases where the data have been censored. In such cases, the assumed distribution can be important. This is not an issue of whether there are enough observations to apply the central limit theorem. The issue is that, even with a lot of observations, nonnormality can cause the estimated coefficients to be badly biased. If we were talking about OLS, probit, Tobit, and the like, where the y we want to explain is the y we observe, then the normality assumption is less critical. As we know, OLS works fine with almost any distribution (ruling out very fat tails) when you have a sufficient number of observations.

                            With interval regression, we can have a lot of data, and if the population is not normally distributed, the intreg command can produce badly biased estimators. In a sense, it gets worse with more data because you're more precisely estimating the wrong parameters. Same is true of any true data censoring scheme. The censoring is very costly: without it, we could do OLS and would not worry about normality.
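
                            A crude simulation sketch of that point, under assumed names and a skewed error (an illustration, not a formal test):

                            Code:
                            * interval-censor a skewed (chi-squared) error and fit intreg;
                            * the slope can drift from its true value of 2 even with n = 5000
                            clear
                            set seed 12345
                            set obs 5000
                            generate double x = rnormal()
                            generate double ystar = 1 + 2*x + (rchi2(2) - 2)   // mean zero but skewed
                            generate double ylo = floor(ystar/5)*5             // width-5 intervals
                            generate double yhi = ylo + 5
                            intreg ylo yhi x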

                            Having said that, testing normality is not so easy. One way is to nest the normal in the Pearson family, and then derive the score test. I don't know that this has been done with interval regression.



                            • #15
                              Thank you, I am very grateful for your explanations.
                              If I understand you correctly, normality is less important for Tobit than for intreg? I assumed that the assumptions for intreg and tobit would be the same, as intreg is a generalization of tobit, and I thought both models deal with censoring.
                              My supervisor from university seems to be of the opinion that a tobit model can also deal with interval data.
                              I stayed with intreg because my research gave me the impression that tobit is right when the censored dependent variable is continuous (apart from the censoring points), and intreg is used when the censored dependent variable is observed as interval data. My dependent variable looks like this:

                              0: x = 0
                              1: 0 < x < 5
                              2: 5 <= x < 10
                              3: 10 <= x < 15
                              4: 15 <= x < 20
                              5: 20 <= x < 30
                              6: 30 <= x < 50
                              7: 50 <= x < 75
                              8: 75 <= x <= 100
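
                              (A sketch of how such categories map into intreg's two dependent variables; xcat is a hypothetical name for the 0-8 code, and the first category is treated as left-censored at 0, matching the censoring counts in the output above:)

                              Code:
                              generate ylo = .
                              generate yhi = .
                              local cuts 0 5 10 15 20 30 50 75 100
                              forvalues k = 1/8 {
                                  local lo : word `k' of `cuts'
                                  local hi : word `=`k'+1' of `cuts'
                                  replace ylo = `lo' if xcat == `k'
                                  replace yhi = `hi' if xcat == `k'
                              }
                              replace yhi = 0 if xcat == 0   // ylo stays missing: left-censored at 0
                              intreg ylo yhi x1 x2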


                              I read a paper in which the authors deal with the same dependent variable as I do, with the difference that the information is not reduced to interval data; their dependent variable is continuous.
                              The authors justify the use of a tobit model with double censoring by the fact that the variable ranges from 0 to 100. This is why I assumed a Tobit model would be right, and because my variable is not continuous I chose intreg.

                              Would you agree that I could use a tobit model as well?
                              I have assumed that intreg is intended for exactly this kind of data.
                              However, if it is okay to use tobit, and I understood correctly from your last post that the normality assumption is less important for tobit, then maybe tobit would be a good alternative.

                              I would be very interested in your opinion about this.
                              Many thanks.

