Hello All,
This is my first post on Statalist, so please let me know if I am not following the posting guidelines.
I would like to predict the unemployment duration of U.S. citizens using the U.S. Current Population Survey. To that end, I am running the following lasso with the adaptive selection method:
lasso linear logdur1 logcpsannual1 age age2 hours1 i.multjob i.sex_new ///
i.married i.educ_new i.hispanic i.race_new i.region i.occ_new i.ind_new ///
age##i.educ_new age##hours1 age##i.sex_new multjob##i.sex_new ///
hours1##i.sex_new i.region##i.occ_new i.region##i.ind_new i.occ##i.ind_new, ///
selection(adaptive, steps(6)) rseed(1234)
This yields the following output:
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
     360 |    first lambda     .575668         0       0.0001     1.884722
     415 |   lambda before     .003451       134       0.1926     1.522009
   * 416 | selected lambda    .0031445       134       0.1926     1.521993
     417 |    lambda after    .0028651       134       0.1926     1.522009
     437 |     last lambda    .0004457       134       0.1918     1.523534
Is the out-of-sample R-squared reported here the relevant one for evaluating model performance? If so, the selected model does not seem to perform too badly. Or do I need to split the sample and use the command "lassogof" to assess the out-of-sample performance of my model? (The R-squared values reported by the two approaches differ.)
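For concreteness, the split-sample alternative mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the variable name "sample" and the 75/25 split are hypothetical choices, and the predictor list is shortened for readability rather than being the full specification shown earlier.

```stata
* Hypothetical variable name: sample (1 = training, 2 = testing)
* The 75/25 split is an assumption for illustration only
splitsample, generate(sample) split(0.75 0.25) rseed(1234)

* Fit the adaptive lasso on the training portion only
* (shortened predictor list, not the full model)
lasso linear logdur1 logcpsannual1 age age2 hours1 i.multjob i.sex_new ///
    i.married i.educ_new if sample == 1, selection(adaptive) rseed(1234)

* Report MSE and R-squared separately for the training and testing portions
lassogof, over(sample) postselection
```

The testing-portion R-squared from lassogof is computed on observations that played no part in fitting or in choosing lambda, whereas the figure in the lasso table comes from the cross-validation folds used to select lambda, which may be one reason the two numbers differ.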
Thank you very much for your help and guidance!
Best,
Chantal
---
Data example using "dataex"
clear
input float(logdur1 logcpsannual1) byte age float(age2 hours1) byte multjob float(sex_new married) long educ_new float hispanic long race_new byte region long(occ_new ind_new)
0 11.23443 52 2704 40 1 1 1 4 0 3 42 8 6
2.0794415 . 55 3025 40 1 0 0 8 0 3 42 23 18
1.3862944 . 33 1089 40 1 0 0 8 1 3 42 17 16
3.555348 10.58446 26 676 40 1 0 0 2 0 2 42 11 8
1.7917595 11.838346 35 1225 40 1 0 1 3 0 2 42 6 5
0 9.128696 36 1296 40 1 0 0 8 1 3 42 2 4
1.3862944 . 58 3364 40 1 0 1 1 0 3 42 13 13
1.0986123 8.014336 61 3721 . 1 1 1 7 0 3 41 18 10
1.3862944 . 21 441 35 1 1 0 8 0 1 11 23 17
1.3862944 . 31 961 35 1 1 0 8 0 3 31 19 13
.6931472 8.866317 50 2500 20 1 1 1 1 0 2 42 19 10
2.0794415 10.002427 37 1369 40 1 0 0 7 0 3 41 16 18
2.484907 10.651123 58 3364 40 1 0 1 7 0 3 41 1 9
1.3862944 . 25 625 . 1 0 0 2 0 1 22 6 7
1.3862944 11.145195 62 3844 40 1 1 0 4 0 1 32 16 8
end