
  • Diagnostic help: Why does my model perform so well with so few degrees of freedom?

    Dear List,

    I'm working to better understand why a model I've specified performs as well as it does (adjusted R-squared of 0.72, F-test p-value of 0.034) despite having only 5 degrees of freedom and 11 observations. I know that with so few observations and 5 regressors (only 3 of which are significant), I should trim my model -- however, eliminating the non-significant variables increases the AIC value, so I'm scratching my head a bit.

    Given that hettest shows heteroskedasticity, I ran the models with robust standard errors. VIF statistics are low (all regressors below 3). My outputs for the full and trimmed models are attached below -- are there any other tests I can use to ensure my results are not spurious?
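
    Roughly, this is the sequence I ran (y and x1-x5 here are just placeholders for my actual variable names):

    Code:
    * fit the full model (placeholder variable names)
    regress y x1 x2 x3 x4 x5

    * Breusch-Pagan test for heteroskedasticity
    estat hettest

    * variance inflation factors for the regressors
    estat vif

    * refit with heteroskedasticity-robust standard errors
    regress y x1 x2 x3 x4 x5, vce(robust)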

    Thanks!

    -nick
    [Attached image: OLS_SanG_robust.jpg -- full model output]

    [Attached image: OLS_SanG_robust_trimmed.jpg -- trimmed model output]


  • #2
    Have you examined the pairwise relationships of the 6 variables using graph matrix? I've always found that sort of plot helpful. Your question led me to search out how to accomplish it using Stata, so thanks for motivating me to move further up the Stata learning curve!

    Without knowing what your variables mean, it's difficult to comment, but I've gotten similar-looking results when I inadvertently had a serious trend in both my dependent variable and more than one independent variable. So the bulk of the "variance" I was explaining was due to simple growth over time. I note from your previous posts you had plots of wind power capacity against time, which is what suggested this line of thought.

    If indeed your observations have a natural time associated with them, then including time as another variable in the graph matrix might shed some light on this. But the graph matrix should be useful in any event.
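
    Something along these lines would do it (y, x1-x5, and t are placeholder names for your outcome, regressors, and time variable):

    Code:
    * scatterplot matrix of the outcome, the regressors, and time
    * (y, x1-x5, and t stand in for the actual variable names)
    graph matrix y x1 x2 x3 x4 x5 t, half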



    • #3
      William,

      Thanks for your suggestion: it turns out that yes, multicollinearity was the issue! In this case, it wasn't serial correlation -- but just correlation between two of my regressors.

      I actually used the pwcorr command, as it creates a correlation matrix with significant correlations flagged with a star. This was kind of a basic issue, but the correlation between Wind Class (class) and Distance to Sub (km2sub), combined with the small sample size and the order of my variables, ramped up the effects.
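
      For reference, the command looks roughly like this (y and the remaining regressor names are placeholders for my actual variables):

      Code:
      * pairwise correlations; star(0.05) marks those significant at the 5% level
      pwcorr y class km2sub x3 x4 x5, star(0.05)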

      Thanks again!

      -nick



      • #4
        Well, OK. But also, don't forget that with only 11 data points and 5 regressors, the null value of R2 with random data is pretty high. Here's a quick simulation of 1000 replications of linear regression where the outcome and all 5 predictors are drawn independently from a normal distribution:

        Code:
        . set seed 1234
        
        .
        . capture program drop one_sample
        
        . program define one_sample
          1.         drop _all
          2.         set obs 11
          3.         drawnorm y x1 x2 x3 x4 x5
          4.         regress y x*
          5.         exit
          6. end
        
        .
        . simulate e(r2), reps(1000): one_sample
        
              command:  one_sample
               _sim_1:  e(r2)
        
        Simulations (1000)
        ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
        ..................................................    50
        ..................................................   100
        ..................................................   150
        ..................................................   200
        ..................................................   250
        ..................................................   300
        ..................................................   350
        ..................................................   400
        ..................................................   450
        ..................................................   500
        ..................................................   550
        ..................................................   600
        ..................................................   650
        ..................................................   700
        ..................................................   750
        ..................................................   800
        ..................................................   850
        ..................................................   900
        ..................................................   950
        ..................................................  1000
        
        . summ _sim_1, detail
        
                                    e(r2)
        -------------------------------------------------------------
              Percentiles      Smallest
         1%      .085188       .0582661
         5%     .1694865       .0595024
        10%     .2267868       .0610797       Obs                1000
        25%     .3448229       .0617407       Sum of Wgt.        1000
        
        50%     .4998428                      Mean           .4993655
                                Largest       Std. Dev.      .2021811
        75%     .6517619       .9252375
        90%      .773283        .930222       Variance       .0408772
        95%      .823353       .9432014       Skewness      -.0183547
        99%     .9049431       .9905573       Kurtosis       2.192964
        So, as you can see, the expected R2 under the null is essentially 0.5 (which makes sense: with k regressors and n observations it is k/(n-1) = 5/10 here), and you get results as big as 0.65 about 1 time in 4.
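
        As a cross-check (not part of the simulation above), the exact null distribution is known: with normally distributed data, a constant, and k regressors, R2 under the null follows a Beta(k/2, (n-k-1)/2) distribution, which here is Beta(2.5, 2.5). You can get the same tail probabilities directly:

        Code:
        * exact null distribution of R2 for n = 11 observations and k = 5 regressors:
        * R2 ~ Beta(k/2, (n-k-1)/2) = Beta(2.5, 2.5)
        display "Expected R2 under the null: " 2.5/(2.5 + 2.5)
        display "P(R2 > 0.65): " 1 - ibeta(2.5, 2.5, 0.65)
        display "P(R2 > 0.72): " 1 - ibeta(2.5, 2.5, 0.72)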

