
  • Diagnostic help: Why does my model perform so well with so few degrees of freedom?

    Dear List,

    I'm working to better understand why a model I've specified performs as well as it does (adjusted R-squared of 0.72, F-test p-value of 0.034) despite having only 5 degrees of freedom and 11 observations. I know that with so few observations and 5 regressors (only 3 of which are significant), I should trim my model -- however, eliminating the non-significant variables increases the AIC value, so I'm scratching my head a bit.

    Given that hettest shows heteroskedasticity, I ran the models with robust standard errors. VIF statistics are low (all regressors below 3). My outputs for the full and trimmed models are attached below -- are there any other tests I can use to ensure my results are not spurious?
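
    Roughly, this is the sequence I ran (y and x1-x5 here are just placeholders for my actual variable names):

    Code:
    * fit the full model (placeholder variable names)
    regress y x1 x2 x3 x4 x5

    * Breusch-Pagan test for heteroskedasticity
    estat hettest

    * variance inflation factors for the regressors
    estat vif

    * refit with heteroskedasticity-robust standard errors
    regress y x1 x2 x3 x4 x5, vce(robust)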

    Thanks!

    -nick
    [Attached image: OLS_SanG_robust.jpg -- full model output]

    [Attached image: OLS_SanG_robust_trimmed.jpg -- trimmed model output]


  • #2
    Have you examined the pairwise relationships of the 6 variables using graph matrix? I've always found that sort of plot helpful. Your question led me to search out how to accomplish it using Stata, so thanks for motivating me to move further up the Stata learning curve!

    Without knowing what your variables mean, it's difficult to comment, but I've gotten similar-looking results when I inadvertently had a serious trend in both my dependent variable and more than one independent variable. So the bulk of the "variance" I was explaining was due to simple growth over time. I note from your previous posts you had plots of wind power capacity against time, which is what suggested this line of thought.

    If indeed your observations have a natural time associated with them, then including time as another variable in the graph matrix might shed some light on this. But the graph matrix should be useful in any event.
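
    Something along these lines would do it (y, x1-x5, and t are placeholder names for your outcome, regressors, and time variable):

    Code:
    * scatterplot matrix of the outcome, the regressors, and time
    * (y, x1-x5, and t stand in for the actual variable names)
    graph matrix y x1 x2 x3 x4 x5 t, half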



    • #3
      William,

      Thanks for your suggestion: it turns out that yes, multicollinearity was the issue! In this case, it wasn't serial correlation -- but just correlation between two of my regressors.

      I actually used the pwcorr command, as it creates a correlation matrix with significant correlations flagged with a star. This was kind of a basic issue, but the correlation between Wind Class (class) and Distance to Sub (km2sub), combined with the small sample size and the order of my variables, ramped up the effects.
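
      For reference, the command looks roughly like this (y and the remaining regressor names are placeholders for my actual variables):

      Code:
      * pairwise correlations; star(0.05) marks those significant at the 5% level
      pwcorr y class km2sub x3 x4 x5, star(0.05)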

      Thanks again!

      -nick



      • #4
        Well, OK. But also, don't forget that with only 11 data points and 5 regressors, the null value of R2 with random data is pretty high. Here's a quick simulation of 1000 replications of linear regression where the outcome and all 5 predictors are drawn independently from a normal distribution:

        Code:
        . set seed 1234
        
        .
        . capture program drop one_sample
        
        . program define one_sample
          1.         drop _all
          2.         set obs 11
          3.         drawnorm y x1 x2 x3 x4 x5
          4.         regress y x*
          5.         exit
          6. end
        
        .
        . simulate e(r2), reps(1000): one_sample
        
              command:  one_sample
               _sim_1:  e(r2)
        
        Simulations (1000)
        ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
        ..................................................    50
        ..................................................   100
        ..................................................   150
        ..................................................   200
        ..................................................   250
        ..................................................   300
        ..................................................   350
        ..................................................   400
        ..................................................   450
        ..................................................   500
        ..................................................   550
        ..................................................   600
        ..................................................   650
        ..................................................   700
        ..................................................   750
        ..................................................   800
        ..................................................   850
        ..................................................   900
        ..................................................   950
        ..................................................  1000
        
        . summ _sim_1, detail
        
                                    e(r2)
        -------------------------------------------------------------
              Percentiles      Smallest
         1%      .085188       .0582661
         5%     .1694865       .0595024
        10%     .2267868       .0610797       Obs                1000
        25%     .3448229       .0617407       Sum of Wgt.        1000
        
        50%     .4998428                      Mean           .4993655
                                Largest       Std. Dev.      .2021811
        75%     .6517619       .9252375
        90%      .773283        .930222       Variance       .0408772
        95%      .823353       .9432014       Skewness      -.0183547
        99%     .9049431       .9905573       Kurtosis       2.192964
        So, as you can see, the expected R2 under the null is essentially 0.5 (which makes sense: with k regressors and n observations it is k/(n-1) = 5/10 here), and you get results as big as 0.65 about 1 time in 4.
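
        As a cross-check (not part of the simulation above), the exact null distribution is known: with normally distributed data, a constant, and k regressors, R2 under the null follows a Beta(k/2, (n-k-1)/2) distribution, which here is Beta(2.5, 2.5). You can get the same tail probabilities directly:

        Code:
        * exact null distribution of R2 for n = 11 observations and k = 5 regressors:
        * R2 ~ Beta(k/2, (n-k-1)/2) = Beta(2.5, 2.5)
        display "Expected R2 under the null: " 2.5/(2.5 + 2.5)
        display "P(R2 > 0.65): " 1 - ibeta(2.5, 2.5, 0.65)
        display "P(R2 > 0.72): " 1 - ibeta(2.5, 2.5, 0.72)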

