  • Should I use the Jackknife option with linear regression?

    Dear Stata users,

    I am using linear regression in Stata 13.1 to see whether two continuous, non-normally distributed variables are significantly correlated. When I plot the variables in a two-way scatter graph, there are some significant outlying values. I read that jackknife regression is more robust than linear regression with non-normal data.

    My output with this code for linear regression

    Code:
     reg rpfdefpop stordur
    is

    Code:
     
          Source |       SS       df       MS              Number of obs =      18
    -------------+------------------------------           F(  1,    16) =    1.17
           Model |  2315.69913     1  2315.69913           Prob > F      =  0.2948
        Residual |  31578.8436    16  1973.67773           R-squared     =  0.0683
    -------------+------------------------------           Adj R-squared =  0.0101
           Total |  33894.5428    17  1993.79663           Root MSE      =  44.426
    
    ------------------------------------------------------------------------------
       rpfdefpop |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         stordur |   .6813429   .6290176     1.08   0.295    -.6521148    2.014801
           _cons |   48.69781   13.05285     3.73   0.002       21.027    76.36862
    ------------------------------------------------------------------------------
    but when I use the jackknife regression option

    Code:
      reg rpfdefpop stordur, vce(jackknife)
    the output is
    Code:
      
    Linear regression                               Number of obs      =        18
                                                    Replications       =        18
                                                    F(   1,     17)    =      5.78
                                                    Prob > F           =    0.0279
                                                    R-squared          =    0.0683
                                                    Adj R-squared      =    0.0101
                                                    Root MSE           =   44.4261
    
    ------------------------------------------------------------------------------
                 |              Jackknife
       rpfdefpop |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         stordur |   .6813429   .2833695     2.40   0.028     .0834856      1.2792
           _cons |   48.69781   12.82814     3.80   0.001     21.63279    75.76283
    ------------------------------------------------------------------------------
    Using the jackknife option appears to show that the correlation is now statistically significant (p=0.028) whereas it wasn't using the standard linear regression (p=0.295). I can't find an answer on the forums as to which is the correct approach.

    Many thanks for your help

  • #2
    Andrew:
    the (possibly oversold) issue of normality in OLS refers to residuals only.
    You do not report if you have performed OLS postestimation tests, such as -estat hettest-; hence, we cannot say anything about the dispersion of your residuals.
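    For illustration, a minimal sketch of such postestimation checks, using the same model as in #1 (the residual variable name is just an example):

    Code:
     * refit the model from #1
     regress rpfdefpop stordur

     * Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
     estat hettest

     * inspect the residuals directly
     predict double resid, residuals
     qnorm resid          // normal quantile plot of the residuals
     swilk resid          // Shapiro-Wilk normality test on the residuals
     rvfplot, yline(0)    // residuals versus fitted values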
    As a closing remark: significant or not, I would not put much trust in the outcome of an OLS regression performed on such a limited sample size.
    Last edited by Carlo Lazzaro; 25 Feb 2016, 04:20.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Why not post the data? Or minimally show a scatter plot?

      It may be that the apparent outliers suggest e.g. a suitable transformation which could be much more sensible for your data.
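      For instance, something along these lines would show the pattern (a sketch only; the log transformation is just one possible re-expression):

      Code:
       * look at the raw relationship
       twoway scatter rpfdefpop stordur

       * one possible re-expression of the skewed predictor
       generate log_stordur = ln(stordur)
       twoway scatter rpfdefpop log_stordur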

      Note that writing of "significant outlying values" may suggest to some that you carried out a significance test, but you do not say what that was.

      If you mean something like notable, prominent, striking, etc. when you say "significant", as I guess, then there are lots of good informal words that will serve your purpose better.

      Note that jackknifing gives you the same regression, but just different standard errors and P-values. If the original regression was a bad idea, jackknifing won't fix it.
      Last edited by Nick Cox; 25 Feb 2016, 04:19.

      Comment


      • #4
        Thank you very much for your help. The data are from a pilot study, hence the numbers are very small. I wasn't going to attribute too much significance to any result I obtained. Sorry for the imprecise language regarding outliers; I haven't done a significance test. I meant striking/notable rather than significant for the values for patients 16 & 18.



        [Attached scatter plot of rpfdefpop against stordur: rpf dep pop.png]
        Patient   Rpfdefpos (%)   Stordur (days)
        1              98.9              5
        2               0                7
        3              68.3              6
        4              87.4              4
        5              99.4              2
        6              33.5              8
        7               0                6
        8               0                6
        9              81.1              1
        10              0                5
        11             97.4              9
        12             99.9              7
        13              0               13
        14             98.4              7
        15             68.4             15
        16             98.4             42
        17              0                9
        18             97.4             71

        Comment


        • #5
          Thanks for posting the data, but it now seems to me that your data don't suit linear regression at all, and not just because of possible outliers. Evidently your response is bounded and there also seem to be some groups, e.g. some patients with zero whatever it is and some with almost 100%.

          http://www.stata-journal.com/sjpdf.h...iclenum=st0147

          gives some hints, but some trials with your data and a logit model for continuous proportions were, as the scatter plot suggests, not especially promising either.
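          For reference, one common way to fit a logit model for a continuous proportion in Stata is a fractional (quasi-likelihood) GLM along these lines (a sketch; not necessarily the exact specification tried here):

          Code:
           * rescale the bounded response from percent to a 0-1 proportion
           generate prop = rpfdefpop/100

           * fractional logit: binomial family, logit link, robust standard errors
           glm prop stordur, family(binomial) link(logit) vce(robust)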

          Comment


          • #6
            Setting aside the fundamental problems that Nick points out, here is the answer to your original question: the first regress model that you fit assumes a constant error standard deviation (homoskedasticity) and uses that assumption to estimate standard errors. The jackknife does not make this assumption; it estimates standard errors non-parametrically. Therefore, if, as here, the SD is not constant, one would expect a difference.
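            A quick way to see this is to fit the same model with the three variance estimators side by side (a sketch): the coefficients are identical; only the standard errors change.

            Code:
             quietly regress rpfdefpop stordur
             estimates store ols

             quietly regress rpfdefpop stordur, vce(robust)
             estimates store robust

             quietly regress rpfdefpop stordur, vce(jackknife)
             estimates store jack

             * same coefficients, different standard errors
             estimates table ols robust jack, b(%9.4f) se(%9.4f)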
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2

            Comment


            • #7
              Thank you very much for the advice.

              Comment
