  • OLS R^2 for large datasets inaccurate (lassogof)

    Dear community,

    I am using -lassogof- to compare the performance of different lasso, elastic-net, and standard OLS regressions on a training and a test data set.

    The data set has 450k observations, and I am using around 50 coefficients. When comparing model performance, I realised that the lassogof table reports an OLS R^2 significantly higher than the one reported by the OLS regression itself.

    To simplify and show the problem, I used dataex to save a sample of 500 observations with four random variables. In this simplified example, the OLS regression output table shows a training R^2 of 0.0618 (adj. R^2 = 0.0476), but the lassogof table shows an R^2 of 0.2946. All other models report similar R^2 values for the training sample in the model output and in the lassogof table. In my complete model the discrepancy is even stronger, with lassogof returning an R^2 of around 0.9 for OLS, while the OLS output itself is only around 0.2.

    Based on the documentation, I assumed I could also use lassogof to compare the performance of lasso and elastic-net models with standard OLS regressions. Is there a reason the R^2 values differ so significantly?

    Thank you very much for your help,

    Andrés

    Code I am running:

    Code:
    * split the data into 75% training and 25% test samples
    splitsample , generate(sample) split(.75 .25) rseed(12345)
    * OLS on the training sample
    regress open n_sib brth_or sex edu if sample == 1
    estimates store test_ols
    * linear lasso on the same training sample
    lasso linear open n_sib brth_or sex edu if sample == 1
    estimates store test_lasso
    * compare the stored models' goodness of fit, split by sample
    lassogof test_ols test_lasso, over(sample) postselection
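    For reference, the training R^2 from the regress output can be reproduced by hand. A minimal sketch, reusing the sample variable created above (the variable and scalar names res, res2, dev2, RSS, and TSS are mine, purely for illustration):

    Code:
    * manual training R^2 = 1 - RSS/TSS over the OLS estimation sample
    quietly regress open n_sib brth_or sex edu if sample == 1
    predict double res if e(sample), residuals
    generate double res2 = res^2
    quietly summarize open if e(sample)
    generate double dev2 = (open - r(mean))^2 if e(sample)
    quietly summarize res2
    scalar RSS = r(sum)
    quietly summarize dev2
    scalar TSS = r(sum)
    display "manual training R^2 = " 1 - RSS/TSS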
    Dataset:
    Code:
    clear
    
    input byte(n_sib brth_or sex edu) double open byte pick
    3 4 0 6 3.7 1
    4 1 0 2 3.2 1
    2 3 0 .   4 1
    0 1 0 . 3.7 1
    1 1 0 . 4.9 1
    . . 0 4   4 1
    2 2 0 . 4.7 1
    2 3 0 . 4.2 1
    6 2 0 . 4.1 1
    2 2 0 1 4.3 1
    2 1 0 4 3.5 1
    . . 0 . 4.4 1
    3 2 0 . 3.9 1
    1 1 0 . 4.6 1
    2 3 0 4 3.6 1
    1 2 0 . 3.7 1
    2 1 0 1 3.4 1
    2 3 0 . 3.4 1
    1 2 0 . 3.6 1
    3 1 0 . 4.6 1
    2 3 0 5 3.7 1
    2 3 0 . 3.1 1
    1 1 0 5   3 1
    1 1 0 5 4.4 1
    2 3 0 5 3.3 1
    3 4 0 1 2.6 1
    3 4 0 5   4 1
    2 1 0 . 4.6 1
    3 2 0 4 3.4 1
    3 1 0 5 4.2 1
    2 1 0 . 3.9 1
    2 1 0 .   4 1
    2 2 0 2 2.9 1
    0 1 0 5 4.2 1
    1 1 0 6 3.6 1
    1 2 0 6 3.2 1
    3 1 0 6 3.5 1
    1 2 0 5 4.6 1
    2 3 0 6 3.3 1
    1 2 0 . 4.8 1
    0 1 0 6 3.4 1
    1 2 0 . 4.5 1
    3 3 0 4   5 1
    5 6 0 6 3.6 1
    1 1 0 6   4 1
    1 1 0 3 3.5 1
    2 3 0 5 2.7 1
    1 1 0 5 3.2 1
    0 1 0 6 3.5 1
    2 3 0 6 3.6 1
    3 3 0 2 3.6 1
    1 1 0 4 4.9 1
    1 2 0 4 3.4 1
    1 1 0 6 3.9 1
    2 1 0 5 3.5 1
    1 1 0 6 4.2 1
    1 2 0 6 4.2 1
    1 1 0 6 4.1 1
    1 1 0 5   4 1
    1 1 0 5   4 1
    2 1 0 6 4.2 1
    1 2 0 5 3.5 1
    2 1 0 5 3.4 1
    1 2 0 4 2.8 1
    1 2 0 2 3.4 1
    2 2 0 6 3.1 1
    1 1 0 6   4 1
    2 2 0 2 2.1 1
    2 3 0 5 3.3 1
    2 2 0 2 3.5 1
    2 1 0 6 4.3 1
    3 3 0 2 2.2 1
    0 1 0 4 4.6 1
    6 4 0 3 2.6 1
    1 1 0 5 4.3 1
    1 1 0 6 3.1 1
    1 1 0 6 4.1 1
    5 6 0 3 4.8 1
    0 1 0 2 4.2 1
    1 2 0 6 3.9 1
    1 1 0 2 3.1 1
    2 2 0 2 3.1 1
    2 1 0 2 3.7 1
    2 2 0 . 2.6 1
    1 1 0 5 3.2 1
    2 3 0 5 4.3 1
    1 2 0 5 3.9 1
    2 1 0 5 3.6 1
    1 2 0 2 4.4 1
    0 1 0 2 3.9 1
    3 4 0 5   4 1
    2 3 0 5 2.9 1
    2 1 0 4 3.9 1
    1 2 0 3 3.7 1
    1 2 0 2 2.4 1
    4 3 0 6   4 1
    1 1 0 2 3.6 1
    4 1 0 6 3.9 1
    1 2 0 2 3.2 1
    0 1 0 4 3.6 1
    1 1 0 5 4.1 1
    1 2 0 6 3.9 1
    0 1 0 2 3.6 1
    3 4 0 4 2.7 1
    1 1 0 4 4.2 1
    0 1 0 6 4.3 1
    1 1 0 5 3.1 1
    0 1 0 6 4.4 1
    1 1 0 2 3.4 1
    1 1 0 4 3.7 1
    2 3 0 5 4.2 1
    5 2 0 5 4.6 1
    1 1 0 6 4.2 1
    2 2 0 6 4.4 1
    1 1 0 6 4.1 1
    2 2 0 5 3.5 1
    1 2 0 4 2.6 1
    2 3 0 5 3.4 1
    3 2 0 5 4.4 1
    2 2 0 5 4.9 1
    2 3 0 5 2.9 1
    1 1 0 5   4 1
    1 1 0 4 3.3 1
    0 1 0 5 3.3 1
    1 2 0 6   4 1
    6 6 0 2   3 1
    1 2 0 5 3.5 1
    2 2 0 4 4.2 1
    3 3 0 5 4.4 1
    2 1 0 2 3.3 1
    1 1 0 5 3.8 1
    3 3 0 3 4.3 1
    1 1 0 5 4.7 1
    3 1 0 2   3 1
    2 2 0 5 4.2 1
    2 2 0 2 3.5 1
    2 1 0 5 3.7 1
    2 2 0 6   4 1
    1 1 0 5 4.2 1
    0 1 0 6 3.9 1
    0 1 0 6   4 1
    3 1 0 1 4.2 1
    0 1 0 . 3.3 1
    1 2 0 2 3.1 1
    1 1 0 2 4.5 1
    4 3 0 5 3.6 1
    1 1 0 . 3.8 1
    2 3 0 . 2.7 1
    5 3 0 4 4.8 1
    1 2 0 . 4.9 1
    1 2 0 5 4.7 1
    1 2 0 4 4.3 1
    2 3 0 2 3.2 1
    1 1 0 5 2.7 1
    1 1 0 5 4.7 1
    2 3 0 . 3.9 1
    6 5 0 . 4.4 1
    1 2 0 . 4.9 1
    1 2 0 . 4.4 1
    0 1 0 5 3.9 1
    2 2 0 5 4.6 1
    4 3 0 6   5 1
    2 2 0 5 3.5 1
    4 5 0 . 4.7 1
    4 2 0 . 3.1 1
    1 2 0 5 3.4 1
    1 1 0 5 2.7 1
    2 3 0 6 3.9 1
    3 3 0 2 2.8 1
    3 2 0 2 3.3 1
    3 3 0 6 3.6 1
    1 1 0 3 3.6 1
    1 2 0 6 3.9 1
    2 1 0 6 3.5 1
    1 1 0 5 3.2 1
    1 2 0 5 2.2 1
    1 1 0 5   4 1
    0 1 0 4 3.8 1
    0 1 0 . 2.9 1
    1 2 0 5 4.6 1
    1 2 0 5 4.1 1
    1 2 0 . 4.2 1
    1 1 0 5 4.8 1
    1 2 0 4 4.8 1
    3 4 0 3 4.5 1
    1 2 0 6 4.3 1
    2 2 0 5 2.9 1
    1 2 0 4 3.3 1
    3 3 0 1 3.6 1
    1 1 0 6 3.1 1
    1 2 0 4 3.6 1
    6 1 0 . 4.3 1
    1 1 0 5 3.6 1
    4 1 0 5 4.9 1
    1 2 0 4 4.2 1
    1 2 0 . 3.6 1
    0 1 0 . 4.2 1
    2 3 0 . 4.3 1
    4 1 0 . 4.6 1
    2 3 0 4 3.4 1
    3 4 0 .   4 1
    2 2 0 . 2.9 1
    0 1 0 . 3.2 1
    1 2 0 . 3.5 1
    1 1 0 .   4 1
    2 3 0 . 3.2 1
    0 1 0 5 2.7 1
    1 1 0 . 3.5 1
    2 2 0 . 3.4 1
    1 1 0 . 3.5 1
    6 6 0 . 4.4 1
    2 3 0 . 3.5 1
    1 1 0 . 3.6 1
    1 2 0 . 2.7 1
    1 2 0 4 3.5 1
    1 2 0 . 4.3 1
    1 1 0 . 3.5 1
    1 1 0 . 3.9 1
    1 1 0 4 4.6 1
    0 1 0 6 4.4 1
    2 2 0 5 3.4 1
    3 3 0 5 3.9 1
    1 1 0 . 4.2 1
    0 1 0 . 3.6 1
    1 2 0 6 4.6 1
    1 1 0 6 2.7 1
    2 2 0 5 4.9 1
    5 1 0 6 3.8 1
    1 2 0 . 4.2 1
    2 3 0 6   4 1
    3 2 0 3 2.3 1
    . . 0 6 3.5 1
    2 1 0 . 4.6 1
    1 1 0 . 4.7 1
    1 1 0 . 3.3 1
    0 1 0 . 4.3 1
    1 2 0 . 3.7 1
    0 1 0 . 3.7 1
    1 2 0 4 3.7 1
    . . 0 . 3.9 1
    0 1 0 . 2.8 1
    2 2 0 4 4.2 1
    1 2 0 . 4.1 1
    1 1 0 . 4.3 1
    1 2 0 . 3.6 1
    2 3 0 . 3.5 1
    2 1 0 . 4.6 1
    1 2 0 . 3.3 1
    5 4 0 . 3.1 1
    0 1 0 . 3.6 1
    1 1 0 . 4.7 1
    1 2 0 . 4.9 1
    2 1 0 4 3.1 1
    1 1 0 . 3.5 1
    1 1 0 . 3.6 1
    . . 0 . 3.8 1
    1 1 0 5   5 1
    3 2 0 4   3 1
    3 4 0 4 4.4 1
    . . 0 . 4.8 1
    2 3 0 . 3.8 1
    2 1 0 . 3.1 1
    1 2 0 . 4.5 1
    2 2 0 5 3.6 1
    5 1 0 4 3.5 1
    1 1 0 4 3.8 1
    1 1 0 5 4.2 1
    2 1 0 5 3.4 1
    2 2 0 5 3.5 1
    1 2 0 6 3.6 1
    2 2 0 2 4.2 1
    4 5 0 . 4.1 1
    2 3 0 4 4.5 1
    1 2 0 5 3.1 1
    1 1 0 6 3.3 1
    1 1 0 6 3.1 1
    6 6 0 4 3.9 1
    5 1 0 4   4 1
    1 1 0 6 4.1 1
    0 1 0 5 4.7 1
    1 2 0 6 3.2 1
    1 2 0 5 2.7 1
    1 2 0 4 3.7 1
    1 2 0 5 3.5 1
    1 1 0 5 2.6 1
    2 3 0 3 2.2 1
    . . 0 6 3.8 1
    0 1 0 6 4.4 1
    6 4 0 4 2.9 1
    2 1 0 6 2.9 1
    0 1 0 4 2.4 1
    2 3 0 5 2.2 1
    . . 0 2 4.3 1
    0 1 0 4   3 1
    2 1 0 5 3.3 1
    1 2 0 2 3.6 1
    4 1 0 2 4.5 1
    1 2 0 6 4.3 1
    1 1 0 5 3.6 1
    1 1 0 6 3.9 1
    0 1 0 6 3.7 1
    1 2 0 . 4.1 1
    3 2 0 2 3.3 1
    0 1 0 5 2.9 1
    1 1 0 3 4.1 1
    2 1 0 2 3.2 1
    2 1 0 2 3.2 1
    1 2 0 5 3.2 1
    2 2 0 2 3.6 1
    2 2 0 2 3.3 1
    3 3 0 6 3.3 1
    1 2 0 6 3.6 1
    1 1 0 2 2.7 1
    . . 0 2 3.9 1
    1 1 0 2 3.5 1
    2 3 0 2 2.7 1
    . . 0 5 3.6 1
    1 1 0 2   4 1
    3 4 0 4 4.4 1
    1 2 0 5 2.2 1
    0 2 0 5 2.6 1
    2 3 0 6 2.9 1
    2 1 0 5 4.8 1
    2 1 0 3   3 1
    . . 0 2 3.8 1
    1 2 0 5   4 1
    2 1 0 2 3.8 1
    3 2 0 1 4.3 1
    3 4 0 2 4.2 1
    . . 0 5 3.6 1
    2 1 0 2   3 1
    2 1 0 2 4.3 1
    3 1 0 2 3.7 1
    0 1 0 5 4.7 1
    2 1 0 1 3.4 1
    . . 0 5   4 1
    2 2 0 3 3.5 1
    3 1 0 1 3.4 1
    5 1 0 1 3.2 1
    4 1 1 . 3.6 1
    1 2 1 2 3.6 1
    1 1 1 . 3.7 1
    1 1 1 5 3.7 1
    3 1 1 5   3 1
    6 5 1 2 3.8 1
    2 3 1 . 4.3 1
    3 3 1 3 3.7 1
    1 2 1 4 3.8 1
    3 1 1 4   3 1
    0 1 1 5 4.7 1
    3 1 1 . 3.6 1
    2 2 1 5 3.6 1
    3 3 1 . 2.6 1
    0 1 1 4   4 1
    1 1 1 4 4.4 1
    1 1 1 3 3.5 1
    3 3 1 3 3.9 1
    4 4 1 1 3.8 1
    1 1 1 2 3.1 1
    0 3 1 . 4.4 1
    2 1 1 . 3.9 1
    2 2 1 5 3.2 1
    1 1 1 5 3.1 1
    2 2 1 5 4.3 1
    2 1 1 5 3.1 1
    1 1 1 5 3.1 1
    1 2 1 6 3.9 1
    2 2 1 5 4.2 1
    1 2 1 6 4.1 1
    0 1 1 2   3 1
    2 2 1 6   3 1
    2 3 1 2 3.7 1
    1 2 1 5 3.3 1
    1 1 1 5 3.7 1
    2 1 1 5 2.1 1
    1 1 1 5   4 1
    2 3 1 2 4.2 1
    0 1 1 2 2.6 1
    3 4 1 6 4.3 1
    5 1 1 3 3.4 1
    1 2 1 5 4.6 1
    2 1 1 4 3.8 1
    0 1 1 2   4 1
    2 3 1 2 3.5 1
    3 1 1 4 3.7 1
    2 2 1 6 3.9 1
    2 3 1 2   4 1
    1 1 1 4 4.6 1
    . . 1 2 3.7 1
    0 1 1 1 2.9 1
    3 3 1 3 4.6 1
    2 1 1 5 2.9 1
    2 1 1 5 2.6 1
    1 2 1 6 3.7 1
    3 4 1 2 3.9 1
    2 2 1 6 4.6 1
    1 1 1 . 2.7 1
    1 1 1 . 3.9 1
    2 1 1 . 3.6 1
    2 2 1 2 3.3 1
    1 2 1 . 4.3 1
    1 2 1 . 3.7 1
    3 4 1 5   4 1
    0 1 1 6   5 1
    0 1 1 5 3.8 1
    2 6 1 5 4.1 1
    5 6 1 5 2.9 1
    2 1 1 6 4.1 1
    0 1 1 2 3.5 1
    1 2 1 2 3.5 1
    3 2 1 4 3.7 1
    2 2 1 5 3.6 1
    1 1 1 . 4.6 1
    6 4 1 6 4.4 1
    5 5 1 2 2.2 1
    2 2 1 6 4.6 1
    5 5 1 1 4.4 1
    1 1 1 6 4.1 1
    6 1 1 3 2.2 1
    2 2 1 . 4.2 1
    0 1 1 . 4.4 1
    0 1 1 4 4.6 1
    3 2 1 . 4.1 1
    1 1 1 .   4 1
    6 1 1 .   3 1
    2 3 1 . 3.6 1
    3 2 1 4 3.9 1
    1 1 1 . 2.9 1
    2 1 1 .   4 1
    1 1 1 5 4.2 1
    1 2 1 . 3.9 1
    3 1 1 1 3.4 1
    1 2 1 4 4.3 1
    3 2 1 6 3.5 1
    0 1 1 . 2.9 1
    1 1 1 . 3.7 1
    3 2 1 4 4.1 1
    4 3 1 . 2.8 1
    0 1 1 4 3.6 1
    1 1 1 5 3.5 1
    1 1 1 4 3.4 1
    5 4 1 5 4.2 1
    1 1 1 5 3.9 1
    1 2 1 1 3.7 1
    2 1 1 4 4.9 1
    1 2 1 . 3.6 1
    1 1 1 4 4.1 1
    3 2 1 4 3.8 1
    1 1 1 . 3.4 1
    . . 1 4 3.8 1
    3 2 1 5 4.2 1
    2 2 1 . 4.3 1
    2 2 1 . 4.1 1
    3 1 1 . 4.6 1
    1 1 1 . 2.5 1
    1 1 1 . 2.9 1
    0 1 1 5 2.6 1
    1 1 1 5 4.1 1
    1 2 1 4 3.9 1
    2 1 1 5 3.1 1
    1 2 1 2 3.7 1
    4 3 1 . 3.8 1
    3 1 1 2 3.6 1
    1 1 1 . 4.3 1
    5 6 1 6 3.7 1
    4 5 1 6 3.8 1
    1 4 1 5 4.8 1
    1 2 1 5 2.2 1
    1 2 1 6 3.8 1
    0 1 1 . 4.3 1
    6 6 1 . 3.7 1
    1 1 1 5 4.6 1
    1 2 1 3   4 1
    . . 1 5 3.8 1
    . . 1 6   3 1
    0 1 1 5 4.6 1
    1 1 1 6 3.4 1
    1 2 1 6 3.5 1
    2 3 1 6 4.3 1
    0 1 1 5 3.4 1
    0 2 1 1 2.9 1
    4 2 1 6 4.3 1
    1 2 1 2 2.6 1
    2 2 1 5 4.2 1
    2 2 1 2 3.4 1
    3 3 1 2 4.3 1
    1 2 1 2 2.2 1
    2 2 1 6 4.9 1
    1 2 1 5 4.1 1
    1 2 1 1 2.8 1
    3 1 1 5 4.5 1
    2 2 1 4 3.5 1
    1 2 1 6 4.5 1
    3 3 1 3 3.3 1
    4 4 1 2 3.3 1
    1 1 1 5   5 1
    2 1 1 6 4.6 1
    3 1 1 1 2.8 1
    1 2 1 2 3.2 1
    2 2 1 . 4.1 1
    end
    label values n_sib N_SIB
    label def N_SIB 0 " 0", modify
    label def N_SIB 1 " 1", modify
    label def N_SIB 2 " 2", modify
    label def N_SIB 3 " 3", modify
    label def N_SIB 4 " 4", modify
    label def N_SIB 5 " 5", modify
    label def N_SIB 6 " 6 or more", modify
    label values brth_or BRTH_OR
    label def BRTH_OR 1 " First born", modify
    label def BRTH_OR 2 " Second born", modify
    label def BRTH_OR 3 " Third born", modify
    label def BRTH_OR 4 " Fourth born", modify
    label def BRTH_OR 5 " Fifth born", modify
    label def BRTH_OR 6 " Sixth or subsequent born", modify
    label values sex SEX
    label def SEX 0 "female", modify
    label def SEX 1 "male", modify
    label values edu EDU
    label def EDU 1 " Did not complete GCSE / CSE / O-Levels", modify
    label def EDU 2 " Completed GCSE / CSE / O-Levels", modify
    label def EDU 3 " Completed post-16 vocational course", modify
    label def EDU 4 " A-Levels", modify
    label def EDU 5 " Undergraduate degree", modify
    label def EDU 6 " Postgraduate degree", modify
    Last edited by Andres Gvirtz; 18 Apr 2021, 17:48.

  • #2
    I am not able to find either of the commands that you are using:

    Code:
    . findit lassogof
    
    . findit splitsample
    Neither returns anything.



    • #3
      @Joro: which version of Stata do you have? I can find both. I have Stata version 16.1, recently updated.
      Both -findit- and -help- work.



      • #4
        Aaah, this means that both are native Stata commands that were introduced in Stata 16.

        I assumed that they were user-contributed commands, and for user-contributed commands it does not matter which version of Stata you have: you should see them in any version via -findit- and -search-.

        I am using Stata 15.1.

        My idea was to look at the code to see how these R-squared values are calculated.

        But if these are official commands, they probably have good documentation online as well.
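        If the commands are shipped as ado-files rather than compiled built-ins (an assumption on my part), their source can be inspected directly with -viewsource-:

        Code:
        * inspect the source, assuming lassogof and splitsample are ado-files
        viewsource lassogof.ado
        viewsource splitsample.ado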


        Originally posted by Eric de Souza View Post
        @Joro: which version of Stata do you have? I can find both. I have Stata version 16.1, recently updated.
        Both -findit- and -help- work.



        • #5
          Originally posted by Joro Kolev View Post
          Aaah, this means that both are native to Stata commands that have been introduced in Stata 16.
          Indeed! Sorry for not specifying, they are both additions to Stata 16.

          I followed the approach shown, e.g., by Liu, where OLS estimates are compared with lasso and co. using lassogof, and the OLS performance is in line with the other models' performance; the same is also shown on the Stata Blog.

          The documentation of the new package is here, and it specifies that the command can also be used after simple regressions...

          Thank you both for your responses and for looking into this!

          Andrés



          • #6
            Does somebody have an idea what could cause the lassogof R^2 for OLS to be so much higher than the R^2 from the OLS itself? In my full dataset, I am getting an R^2 of 0.94 instead of 0.17...



            • #7
              I've never worked with lasso. I just came in to point out to Joro that it was indeed a part of Stata (apparently only from version 16 on, as he discovered).



              • #8
                Hi Andrés,
                I have the same problem: lassogof reports a much higher R-squared for OLS than the OLS output itself. Were you able to figure this out?
                Best regards,
                Lukas



                • #9
                  Hi,

                  I was able to figure out the reason for the large gap between the OLS model fit and the OLS fit reported by lassogof, and just wanted to share it in case somebody faces the same problem and finds this thread... (shoutout to Miguel, a senior statistician at Stata, who was super helpful!)

                  The problem is caused by not accounting for the missing values in the data when running -lassogof-. While there are several ways to include only complete observations, a straightforward way is to run the regression on the full sample first and subsequently create a -touse- indicator from e(sample).

                  So to adjust the originally posted code:

                  Code:
                  * run the regression on the full sample; e(sample) marks
                  * the complete observations that regress actually used
                  regress open n_sib brth_or sex edu
                  generate touse = e(sample)
                  * split the data into 75% training and 25% test samples
                  splitsample, generate(sample) split(.75 .25) rseed(12345)
                  * OLS on the training sample
                  regress open n_sib brth_or sex edu if sample == 1
                  estimates store test_ols
                  * linear lasso on the same training sample
                  qui lasso linear open n_sib brth_or sex edu if sample == 1
                  estimates store test_lasso
                  * compare the models over complete observations only
                  lassogof test_ols test_lasso if touse, over(sample) postselection
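                  An equivalent way to mark complete cases without the preliminary regression, as a sketch (the variable names nmiss and touse are arbitrary), is:

                  Code:
                  * count missing values across the outcome and all predictors
                  egen nmiss = rowmiss(open n_sib brth_or sex edu)
                  * keep only rows with no missing values
                  generate touse = (nmiss == 0)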
                  Hope this is helpful,

                  Andrés

