  • Insufficient observations error while conducting boxtid test

    Hello. I have a binary logistic regression with a single independent variable of interest, X (which takes the value 5% or 10% throughout the data), and several control variables. I ran the linktest command and found that hatsquare is significant. I want to understand which variable has the specification error, so I tried to run the boxtid test. However, I cannot obtain results (see below). What am I doing wrong? My data have over 240k observations.


    . boxtid logit Default X RiskScore ib(4).loanpurpose LoanAmount Term Currency LoanRate Country Age IS POA GG

    Iteration 0: Deviance = 216852.8
    Iteration 1: Deviance = 215421.6 (change = -1431.212)
    Iteration 2: Deviance = 213699.6 (change = -1722.006)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 3: Deviance = 211170.3 (change = -2529.258)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 4: Deviance = 210316.7 (change = -853.6188)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 5: Deviance = 210208.9 (change = -107.7756)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 6: Deviance = 210203 (change = -5.926789)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 7: Deviance = 210202.9 (change = -.1063913)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 8: Deviance = 210202.9 (change = -.0432337)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 9: Deviance = 210202.8 (change = -.0260674)
    (unprofitable step attempted, step length divided by 10)
    Iteration 10: Deviance = 210202.8 (change = -.0158125)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 11: Deviance = 210202.8 (change = -.0096134)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 12: Deviance = 210202.8 (change = -.0058551)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 13: Deviance = 210202.8 (change = -.0035711)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 14: Deviance = 210202.8 (change = -.0021805)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 15: Deviance = 210202.8 (change = -.0013325)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (step sign changed)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    (unprofitable step attempted, step length divided by 10)
    Iteration 16: Deviance = 210202.8 (change = -.0008149)
    -> gen double IRisk__1 = RiskScore^-407.4082-0 if e(sample)
    -> gen double IRisk__2 = RiskScore^-407.4082*ln(RiskScore)-0 if e(sample)
    -> gen double IInit__1 = X^-1.3901-5235.317933 if e(sample)
    -> gen double IInit__2 = X^-1.3901*ln(X)+32249.79247 if e(sample)
    (where: X = LoanAmount/100000)
    -> gen double ITerm__1 = X^-2216.5352-. if e(sample)
    -> gen double ITerm__2 = X^-2216.5352*ln(X)-. if e(sample)
    (where: X = Term/10)
    -> gen double ILoan__1 = LoanRate^91.5829-9.50365e-78 if e(sample)
    -> gen double ILoan__2 = LoanRate^91.5829*ln(LoanRate)+1.84038e-77 if e(sample)
    -> gen double IAge__1 = X^0.8537-.741059683 if e(sample)
    -> gen double IAge__2 = X^0.8537*ln(X)+.2601346182 if e(sample)
    (where: X = Age/10)
    -> gen double X__1 = X-.05 if e(sample)

    [Total iterations: 80]
    insufficient observations
    r(2001);


  • #2
    boxtid is a community-contributed command, as you are asked to explain (FAQ Advice #12). The most recent public version appears to be on SSC.

    I've never used it and can't give useful advice on it, except to express an opinion that there is a fairly simple and often effective method to identify problematic predictors which is to check for extreme skewness, including incidence of outliers.

    A predictor with just two distinct values is just what it is. Any transformation whatsoever that preserves two distinct values is equivalent to a linear rescaling and will affect nothing of importance, except possibly by reversing signs.
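
    A minimal sketch of that screening step, assuming Stata's bundled auto.dta as a stand-in for the real data (price, mpg, weight and foreign are placeholders, not the poster's predictors). It reports the skewness, range and number of distinct values of each candidate predictor, so that heavily skewed variables and two-valued variables stand out:

    ```stata
    * Sketch only: auto.dta and its variable names stand in for the real data
    sysuse auto, clear
    foreach v of varlist price mpg weight foreign {
        * r(r) from tabulate is the number of distinct values
        quietly tabulate `v'
        local ndistinct = r(r)
        * summarize, detail stores skewness, minimum and maximum in r()
        quietly summarize `v', detail
        display "`v': skewness = " %6.3f r(skewness) ", min = " r(min) ", max = " r(max) ", distinct values = `ndistinct'"
    }
    ```

    A predictor that reports only two distinct values is one for which no transformation can change the fit; a strongly skewed continuous predictor is a candidate for a closer look.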



    • #3
      Hi Nick! What type of transformation would you recommend for a predictor variable with two distinct values? About 65% of the data take the value 10% and the rest 5%. Is this what is causing hatsquare in linktest to be significant?
      I ran sktest on all the variables.
      Skewness and kurtosis tests for normality
                                                                      ----- Joint test -----
          Variable |       Obs   Pr(skewness)   Pr(kurtosis)   Adj chi2(2)   Prob>chi2
      -------------+------------------------------------------------------------------
                 X |   246,049         0.0000              .             .           .
         RiskScore |   246,049         0.0000         0.0000             .           .
        loanamount |   246,049         0.0000         0.0000             .           .
       logloanrate |   246,049         0.0000         0.0000             .           .
           logterm |   246,049         0.0000         0.0000             .           .
          Currency |   246,049         0.0000         0.0000             .           .
           Country |   246,049         0.0000              .             .           .
               Age |   246,049         0.0000         0.0000             .           .
                IS |   246,049         0.0000              .             .           .
               POA |   246,049         0.0000         0.0000             .           .
                GG |   246,049         0.0000              .             .           .

      I also used leastlikely command to check for outliers. Following are the results.

      . leastlikely

      Outcome: 0

              +----------+
              |     Prob |
              |----------|
      154728. |  .001599 |
      182696. | .0090593 |
      202167. |  .009136 |
      211366. | .0090951 |
      215258. | .0091217 |
              +----------+

      Outcome: 1

              +----------+
              |     Prob |
              |----------|
       20972. | .0170998 |
       82510. | .0110673 |
      115591. | .0069606 |
      127989. |  .019971 |
      144310. |  .019971 |
              |----------|
      144313. |  .019971 |
      149706. |  .019971 |
      238328. |  .019971 |
              +----------+
      The test suggests these observations are outliers, but they are only a tiny minority in my large dataset.

      I am not sure what I should do to ensure that the assumptions of logistic regression are not violated and, in effect, to fix the linktest results as well.



      • #4
        My point is that no transformation can help if a predictor (or outcome for that matter) has just two distinct values.

        In your case, the distribution is slightly negative skew, so that alone might suggest some possible transformations to pull values in.

        Here I used squaring and exponentiation to make the point that in practice for just two values any transformation is linear and so doesn't affect shape. Skewness and kurtosis are unchanged. I used moments from SSC for my convenience, but summarize, detail would show you the same results.

        Code:
        . clear 
        
        . set obs 100 
        Number of observations (_N) was 0, now 100.
        
        . gen twoval = cond(_n <= 65, 0.1, 0.05)
        
        . 
        . gen twoval_sq = twoval^2 
        
        . gen twoval_exp = exp(twoval)
        
        . 
        . moments twoval* 
        
        -----------------------------------------------------------
           n = 100 |       mean          SD    skewness    kurtosis
        -----------+-----------------------------------------------
            twoval |      0.083       0.024      -0.629       1.396
         twoval_sq |      0.007       0.004      -0.629       1.396
        twoval_exp |      1.086       0.026      -0.629       1.396
        -----------------------------------------------------------
        Here is a graph of 0.05 and 0.1 and their squares -- making the geometric point that the transformation is in effect only a linear rescaling for the data you have, which are all that matter. This is just school mathematics that two distinct points in the plane define a single straight line through them.

        [Figure trans.png: 0.05 and 0.1 plotted against their squares]


        Another way to see this is that your two values define a histogram with two spikes. If you transform the values, you are just changing the labels on the magnitude axis; the distribution shape and the skewness and kurtosis remain the same.

        (The only exception to all that is that if you flip the values round, the skewness changes sign if not zero, but the transformation is equally linear, therefore useless.)
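
        The same point can be checked on the model fit itself. The following is a minimal sketch on simulated data, not the poster's dataset: the sample size, the seed and the coefficients -3 and 10 are arbitrary assumptions, chosen only so that both outcomes occur often. It fits a logit on a two-valued predictor and on its square and compares the results:

        ```stata
        * Simulated data; sample size, seed and coefficients are assumptions
        clear
        set seed 2022
        set obs 5000
        generate x = cond(runiform() < 0.65, 0.10, 0.05)
        generate y = runiform() < invlogit(-3 + 10*x)
        generate x_sq = x^2

        * Fit with the raw two-valued predictor
        logit y x
        scalar ll_raw = e(ll)
        predict double p_raw, pr

        * Fit with the squared predictor
        logit y x_sq
        scalar ll_sq = e(ll)
        predict double p_sq, pr

        * Same log likelihood and fitted probabilities, up to convergence tolerance
        display "difference in log likelihood: " (ll_raw - ll_sq)
        assert reldif(p_raw, p_sq) < 1e-5
        ```

        Only the coefficient on the predictor and the constant change between the two fits; the fitted probabilities for the two groups do not, which is why no transformation of such a predictor can repair a specification problem.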

        A bigger deal is that you appear to think that it's an ideal condition that your predictors are multivariate normal. Not so. First off, that's impossible, as -- extending the point just made -- no indicator or binary variable can be Gaussian (normal), or even transformed to be nearer Gaussian. If multivariate normality (Gaussianity) were needed, then all use of indicator variables as predictors would be out of order and just about all econometrics or regression texts are in serious error, but on the contrary indicators as predictors are fine in principle and mighty helpful in practice.

        There is a condition -- over-emphasised but sometimes important -- that normal or Gaussian errors are ideal for linear regression. But you're doing logit modelling and that's irrelevant.

        Quite where the idea that predictors should be multivariate normal comes from I don't know. Can you cite a textbook to that effect?

        I am even a little guilty -- having once helped with the code for the Doornik-Hansen test -- but I would argue that tests for multivariate normality are fairly useless in practice, and it is usually more or less impossible to tell from them which variables are being awkward in a fit. There isn't really a shortcut (that I know) other than looking at each predictor in turn and considering whether a transformation (a) is possible in principle and (b) will in practice improve your results.

        I can't comment on leastlikely, which is a community-contributed command I've never used.





        • #5
          Hi Nick,


          I actually want to test for model specification error. My linktest results show that hatsquare is significant, and I just want to know how to fix that. I figured initially that linear transformation of variables would help fix the issue. I have tried including interaction terms that are in line with the theory, but that did not work. I also tried a square root transformation of all continuous variables that are right-skewed, but even that does not seem to fix the issue. I acknowledge that some omitted variables may affect my dependent variable, but I have included all the variables for which I have data. Both the hat and hatsquare terms are significant. I also want to point out that I have over 240k observations. Is linktest valid for large samples? Could the large sample size be the reason for a significant hatsquare in linktest?
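
          For readers unfamiliar with what linktest reports, here is a minimal sketch of the computation behind the hat and hatsquare terms. It assumes Stata's bundled auto.dta and an arbitrary model (foreign on mpg and weight) purely for illustration; xb_hat and xb_hat_sq are hypothetical names. The test refits the model using only the linear predictor and its square as predictors:

          ```stata
          * Illustration only: auto.dta and this model are stand-ins
          sysuse auto, clear
          logit foreign mpg weight
          linktest

          * The same idea by hand: the linear predictor and its square
          quietly logit foreign mpg weight
          predict double xb_hat, xb
          generate double xb_hat_sq = xb_hat^2
          logit foreign xb_hat xb_hat_sq
          ```

          The coefficient on xb_hat_sq plays the role of hatsquare. The sketch shows what is being tested; it does not by itself settle whether significance in a sample of this size is of practical importance.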



          • #6
            Sorry, but I can't help further. I have made just one point at length -- that attempting to transform a binary predictor is futile. I may not have made myself clear, but I don't have other ways to explain it.

            I know a fair amount about transformations, but your first idea that "linear transformation of variables would help fix the issue" is deeply puzzling. It contradicts your interest in Box-Tidwell. Linear transformation can sometimes be convenient (e.g. in producing manageable numbers), but I am not aware that it ever solves a specification problem.

            Otherwise, interaction terms, other transformations, other predictors -- all could be relevant in explaining or helping to resolve your problems.

            But I have no way of advising which direction may or may not be useful. Posting the dataset here is impractical, and its fruitful modelling would seem to demand a long conversation with a supervisor or mentor who can sit with you, look at your data thoroughly, and discuss what models make sense.
