  • #16
It's worth pointing out that since your example uses a subset of the available data, there is no guarantee that the ranking of models will be the same on the full dataset. So while you point out a set of results in #12 based on the first 23 observations, the results are not the same for the full set (indeed, they are reversed).



    • #17
      Code:
. vselect HongKong $m if _n<=23, forward aicc
      FORWARD variable selection
      Information Criteria: AICC
      
      ------------------------------------------------------------------------------
      Stage 0 reg HongKong  : AICC -79.50154
      ------------------------------------------------------------------------------
      AICC  -77.03294 :              add  Australia
      AICC  -86.81419 :              add    Austria
      AICC  -78.03164 :              add     Canada
      AICC  -77.55764 :              add    Denmark
      AICC  -86.64687 :              add    Finland
      AICC  -88.95512 :              add     France
      AICC  -78.7386  :              add    Germany
      AICC  -79.72499 :              add      Italy
      AICC  -80.52933 :              add      Japan
      AICC  -94.42247 :              add      Korea
      AICC  -76.91501 :              add     Mexico
      AICC  -82.09375 :              add Netherlands
      AICC  -95.61311 :              add NewZealand
      AICC  -84.96626 :              add     Norway
      AICC  -82.12414 :              add Switzerland
      AICC  -79.8867  :              add UnitedKingdom
      AICC  -76.959   :              add UnitedStates
      AICC  -104.7755 :              add  Singapore
      AICC  -81.99067 :              add Philippines
      AICC  -86.88326 :              add  Indonesia
      AICC  -91.14591 :              add   Malaysia
      AICC  -94.03874 :              add   Thailand
      AICC  -97.04108 :              add     Taiwan
      AICC  -92.12839 :              add      China
      ------------------------------------------------------------------------------
      Stage 1 reg HongKong Singapore : AICC -104.7755
      ------------------------------------------------------------------------------
      AICC  -101.8373 :              add  Australia
      AICC  -107.8348 :              add    Austria
      AICC  -102.6033 :              add     Canada
      AICC  -104.0572 :              add    Denmark
      AICC  -105.3381 :              add    Finland
      AICC  -104.2061 :              add     France
      AICC  -102.7059 :              add    Germany
      AICC  -104.1808 :              add      Italy
      AICC  -103.56   :              add      Japan
      AICC  -102.1829 :              add      Korea
      AICC  -108.2187 :              add     Mexico
      AICC  -102.0082 :              add Netherlands
      AICC  -102.3372 :              add NewZealand
      AICC  -109.651  :              add     Norway
      AICC  -104.0735 :              add Switzerland
      AICC  -101.825  :              add UnitedKingdom
      AICC  -103.6342 :              add UnitedStates
      AICC  -104.6456 :              add Philippines
      AICC  -108.71   :              add  Indonesia
      AICC  -103.5364 :              add   Malaysia
      AICC  -103.9806 :              add   Thailand
      AICC  -108.3669 :              add     Taiwan
      AICC  -102.5077 :              add      China
      ------------------------------------------------------------------------------
      Stage 2 reg HongKong Singapore Norway : AICC  -109.651
      ------------------------------------------------------------------------------
      AICC  -106.5823 :              add  Australia
      AICC  -113.0898 :              add    Austria
      AICC  -108.0889 :              add     Canada
      AICC  -106.8492 :              add    Denmark
      AICC  -113.2672 :              add    Finland
      AICC  -109.2639 :              add     France
      AICC  -107.1662 :              add    Germany
      AICC  -114.9335 :              add      Italy
      AICC  -106.3464 :              add      Japan
      AICC  -106.6777 :              add      Korea
      AICC  -115.6392 :              add     Mexico
      AICC  -108.6539 :              add Netherlands
      AICC  -107.1905 :              add NewZealand
      AICC  -108.1643 :              add Switzerland
      AICC  -107.0108 :              add UnitedKingdom
      AICC  -108.4776 :              add UnitedStates
      AICC  -106.4664 :              add Philippines
      AICC  -109.0035 :              add  Indonesia
      AICC  -106.5358 :              add   Malaysia
      AICC  -108.1255 :              add   Thailand
      AICC  -112.3847 :              add     Taiwan
      AICC  -113.1466 :              add      China
      ------------------------------------------------------------------------------
      Stage 3 reg HongKong Singapore Norway Mexico : AICC -115.6392
      ------------------------------------------------------------------------------
      AICC  -117.3871 :              add  Australia
      AICC  -121.3745 :              add    Austria
      AICC  -112.1584 :              add     Canada
      AICC  -112.7327 :              add    Denmark
      AICC  -115.4046 :              add    Finland
      AICC  -120.4502 :              add     France
      AICC  -112.4867 :              add    Germany
      AICC  -120.929  :              add      Italy
      AICC  -115.2615 :              add      Japan
      AICC  -112.6187 :              add      Korea
      AICC  -114.4535 :              add Netherlands
      AICC  -111.9902 :              add NewZealand
      AICC  -125.2123 :              add Switzerland
      AICC  -120.0586 :              add UnitedKingdom
      AICC  -115.1779 :              add UnitedStates
      AICC  -112.2661 :              add Philippines
      AICC  -113.5576 :              add  Indonesia
      AICC  -112.2259 :              add   Malaysia
      AICC  -116.3357 :              add   Thailand
      AICC  -113.752  :              add     Taiwan
      AICC  -118.9856 :              add      China
      ------------------------------------------------------------------------------
      Stage 4 reg HongKong Singapore Norway Mexico Switzerland : AICC -125.2123
      ------------------------------------------------------------------------------
      AICC  -122.878  :              add  Australia
      AICC  -121.8737 :              add    Austria
      AICC  -121.7456 :              add     Canada
      AICC  -121.0401 :              add    Denmark
      AICC  -121.0402 :              add    Finland
      AICC  -121.926  :              add     France
      AICC  -121.6194 :              add    Germany
      AICC  -123.4017 :              add      Italy
      AICC  -121.0409 :              add      Japan
      AICC  -121.0157 :              add      Korea
      AICC  -121.4898 :              add Netherlands
      AICC  -121.0146 :              add NewZealand
      AICC  -121.0075 :              add UnitedKingdom
      AICC  -121.2367 :              add UnitedStates
      AICC  -121.4974 :              add Philippines
      AICC  -122.6062 :              add  Indonesia
      AICC  -122.0947 :              add   Malaysia
      AICC  -121.5994 :              add   Thailand
      AICC  -123.0975 :              add     Taiwan
      AICC  -121.026  :              add      China
      
      Final Model
      
            Source |       SS           df       MS      Number of obs   =        23
      -------------+----------------------------------   F(4, 18)        =     52.41
             Model |  .034930802         4  .008732701   Prob > F        =    0.0000
          Residual |  .002999111        18  .000166617   R-squared       =    0.9209
      -------------+----------------------------------   Adj R-squared   =    0.9034
             Total |  .037929913        22  .001724087   Root MSE        =    .01291
      
      ------------------------------------------------------------------------------
          HongKong | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
         Singapore |     0.7185     0.0715    10.05   0.000       0.5683      0.8687
            Norway |     0.4170     0.1120     3.72   0.002       0.1817      0.6522
            Mexico |     0.3462     0.0674     5.14   0.000       0.2047      0.4878
       Switzerland |    -0.6630     0.1767    -3.75   0.001      -1.0341     -0.2918
             _cons |    -0.0435     0.0074    -5.89   0.000      -0.0590     -0.0280
      ------------------------------------------------------------------------------
The result indicates 4 predictors, with an AICC value of -125.2123. But that is not the smallest AICC value. As I show in #6, with 6 variables the AICC value is -131.03782 < -125.2123.
So the result above is not the model with the smallest AICC.
Best regards.

Raymond Zhang
Stata 17.0, MP



      • #18
vselect doesn't examine all the subsets, but that is because it already thinks it has found the best subset. Why would you want to keep going?

        In any event, if Stata cannot give you what you want, maybe you should just use the R routine that can.

        Or, maybe you can hack vselect so that it doesn't stop once it thinks it has a "winner."
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://academicweb.nd.edu/~rwilliam/



        • #19
Originally posted by Richard Williams
vselect doesn't examine all the subsets, but that is because it already thinks it has found the best subset. Why would you want to keep going?

          In any event, if Stata cannot give you what you want, maybe you should just use the R routine that can.

          Or, maybe you can hack vselect so that it doesn't stop once it thinks it has a "winner."
Because I want to replicate the results of the paper. If I use a different subset, the results are quite different.
The most difficult part of the paper is finding the model with the smallest AIC or AICC value.
Best regards.

Raymond Zhang
Stata 17.0, MP



          • #20
Originally posted by Raymond Zhang
But it is not the smallest AICC value. As I show in #6, when there are 6 variables the AICC value is -131.03782 < -125.2123.
So the result above is not the model with the smallest AICC.
            Neither is the 6 predictor model with AICC = -131.03782 the one with the smallest AICC. The models with 20 or more predictors all have smaller AICC values.
            Last edited by daniel klein; 28 Jul 2021, 09:31.



            • #21
Originally posted by daniel klein

              Neither is the 6 predictor model with AICC = -131.03782 the one with the smallest AICC. The models with 20 or more predictors all have smaller AICC values.
Dear @daniel klein, yes, the models with 20 or more predictors all have smaller AICC values. But according to the paper, when length(possible.ctrls) + 3 >= length(time.pretr), the maximum number of variables is capped at length(time.pretr) - 3. In the example above, 24 + 3 > 23, so at most 20 variables can be used. So I want to find the smallest AICC among models with at most 20 variables; models with more than 20 variables are not considered.
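For concreteness, that cap can be written down as a tiny function. This is only a sketch of the rule as stated above; the function name, and the behaviour when the condition does not bind, are my assumptions.

```python
def max_predictors(n_possible_ctrls, n_pretreatment):
    """Cap on the number of predictors as described for pampe:
    when possible controls + 3 >= pre-treatment periods, at most
    len(time.pretr) - 3 predictors enter the search."""
    if n_possible_ctrls + 3 >= n_pretreatment:
        return n_pretreatment - 3
    return n_possible_ctrls  # assumption: otherwise the cap never binds

# The case in this thread: 24 possible controls, 23 pre-treatment periods
print(max_predictors(24, 23))  # 20
```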
              Best regards.

              Raymond Zhang
Stata 17.0, MP



              • #22
                Is the data you posted the same data used in the paper you cite?
Richard Williams, Notre Dame Dept of Sociology



                • #23
In the time you've been puzzling through how to use existing commands, this solution could have been brute-forced for some or all subsets of variables, since you're still at that scale of computation. This isn't necessarily a "smart" approach, but when puzzling over software packages and programs, and when trying to match a published paper, it's sometimes nice to know exactly what the correct answer is. It's possible that software package authors have undetected bugs, or that their programs don't quite behave the way you want them to, or that the authors of the paper made mistakes.

In any case, for best subset regression with exactly 6 variables on your full dataset, a single model has the lowest AIC, AICC, and BIC. The do-file is attached for educational purposes.

                  Code:
                    +----------------------------------------------------------------------------------------------------+
                    |        v1       v2       v3          v4         v5       v6          aic         aicc          bic |
                    |----------------------------------------------------------------------------------------------------|
                    | Australia   Mexico   Norway   Singapore   Thailand   Taiwan   -335.82459   -333.05536   -321.04847 |
                    +----------------------------------------------------------------------------------------------------+
                  Attached Files
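For readers who want to see the brute-force mechanics without the attached do-file, here is a self-contained Python toy. The code and the synthetic data are my own, not the attachment: enumerate every subset up to a maximum size, compute the OLS residual sum of squares, and keep the subset with the smallest AICC (using vselect's AICC formula as written out in post #25).

```python
import math
from itertools import combinations

def ols_rss(y, X):
    """RSS from OLS of y on the columns of X plus a constant, via the
    normal equations and Gaussian elimination (fine for tiny p)."""
    n = len(y)
    cols = [[1.0] * n] + X                      # prepend the intercept
    p = len(cols)
    A = [[sum(cols[a][i] * cols[b][i] for i in range(n)) for b in range(p)]
         for a in range(p)]
    rhs = [sum(cols[a][i] * y[i] for i in range(n)) for a in range(p)]
    for c in range(p):                          # elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        rhs[c], rhs[piv] = rhs[piv], rhs[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for cc in range(c, p):
                A[r][cc] -= f * A[c][cc]
            rhs[r] -= f * rhs[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):              # back substitution
        beta[r] = (rhs[r] - sum(A[r][cc] * beta[cc]
                                for cc in range(r + 1, p))) / A[r][r]
    fitted = [sum(beta[a] * cols[a][i] for a in range(p)) for i in range(n)]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n))

def aicc(rss, n, k):
    # vselect's AICC with p = k + 2 parameters (k slopes, intercept, variance)
    p = k + 2
    return (n * math.log(rss / n) + 2 * (k + 1)
            + 2 * p * (p + 1) / (n - p - 1) + n + n * math.log(2 * math.pi))

def best_subset(y, xvars, max_k):
    """Enumerate all subsets of size 1..max_k; return (aicc, names) of the best."""
    n, fits = len(y), []
    for k in range(1, max_k + 1):
        for names in combinations(sorted(xvars), k):
            fits.append((aicc(ols_rss(y, [xvars[v] for v in names]), n, k), names))
    return min(fits)

# Synthetic example: y depends on x0 and x1 only, plus a small deterministic wiggle
n = 40
xvars = {f"x{j}": [math.sin(0.7 * (j + 1) * i) for i in range(n)] for j in range(5)}
y = [2 * xvars["x0"][i] + 3 * xvars["x1"][i] - 0.5 + 0.01 * math.sin(13 * i)
     for i in range(n)]
print(best_subset(y, xvars, max_k=2))  # winning pair: ('x0', 'x1')
```

Swapping in real data and a larger max_k is all that separates this from the full search, at the cost of enumerating every subset.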



                  • #24
Originally posted by Leonardo Guizzetti
In the time you've been puzzling through how to use existing commands, this solution could have been brute-forced for some or all subsets of variables, since you're still at that scale of computation. This isn't necessarily a "smart" approach, but when puzzling over software packages and programs, and when trying to match a published paper, it's sometimes nice to know exactly what the correct answer is. It's possible that software package authors have undetected bugs, or that their programs don't quite behave the way you want them to, or that the authors of the paper made mistakes.

In any case, for best subset regression with exactly 6 variables on your full dataset, a single model has the lowest AIC, AICC, and BIC. The do-file is attached for educational purposes.

                    Code:
+----------------------------------------------------------------------------------------------------+
|        v1       v2       v3          v4         v5       v6          aic         aicc          bic |
|----------------------------------------------------------------------------------------------------|
| Australia   Mexico   Norway   Singapore   Thailand   Taiwan   -335.82459   -333.05536   -321.04847 |
+----------------------------------------------------------------------------------------------------+
Dear @Leonardo Guizzetti, yes, with the full dataset the best subset regression has 6 variables. But now I want to do a placebo test.
I have to select the best subset regression when the number of observations is 23, 24, 25, ..., 44, which means I have to choose 22 best subset regressions, and these subsets may be different.
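The bookkeeping for that loop is mechanical once some best-subset routine exists. A hedged Python sketch (placebo_selections is my own name; best_subset stands in for whatever search routine is actually used):

```python
def placebo_selections(data, best_subset, cutoffs=range(23, 45)):
    """For each cutoff, restrict every series to its first `cutoff`
    observations and run the best-subset search on that window.
    `data` maps variable name -> full column of values."""
    selections = {}
    for cutoff in cutoffs:
        window = {name: col[:cutoff] for name, col in data.items()}
        selections[cutoff] = best_subset(window)
    return selections

# Dummy demo: a stand-in search that just reports the window length.
# Cutoffs 23, 24, ..., 44 give the 22 separate selections described above.
demo = placebo_selections({"y": list(range(50))}, lambda w: len(w["y"]))
print(len(demo))  # 22
```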
                    Best regards.

                    Raymond Zhang
Stata 17.0, MP



                    • #25
                      Posts #12 and #6 tell us that for
                      Code:
regress HongKong Canada France Italy Norway UnitedStates Singapore if _n<=23
                      the AIC value is -126.5042 and the AICC value is -116.2185.

                      I cannot reproduce that.
                      Code:
                      use "~/Downloads/HCW", clear
                      regress HongKong Canada France Italy Norway UnitedStates Singapore if _n<=23
                      estat ic
                      matrix S = r(S)
                      local rss = e(rss)
                      local aic_estat = S[1,5]
                      local aic  =  e(N)*ln(e(rss)/e(N)) + 2*(e(N)   ///
                                                      - e(df_r)) + (e(N) + e(N)*ln(2*_pi))
                      local aicc =  e(N)*ln(e(rss)/e(N)) + 2*(e(N) - ///
                                                      e(df_r)) + 2*(e(df_m)+2)*(e(df_m)+3)/(   ///
                                                      e(N)-(e(df_m) + 2) - 1)  + (e(N) + e(N)* ///
                                                      ln(2*_pi))
                      local aicc5 = e(N)*ln(e(rss)/e(N)) + 2*(e(N) - e(df_r)) + 2*(e(df_m)+2)*(e(df_m)+3)/(     ///
                          e(N)-(e(df_m) + 2) - 1)  + (e(N) + e(N)* ln(2*_pi))
                      
                      display _newline "RSS          from regress:" %9.6f `rss' _newline ///
                              _newline "AIC         from estat ic:" %9.3f `aic_estat'    ///
                              _newline "AIC  from vselect formula:" %9.3f `aic'          ///
                              _newline "AICC from vselect formula:" %9.3f `aicc'         ///
                              _newline "AICC  from post 5 formula:" %9.3f `aicc5'
                      Code:
                      RSS          from regress: 0.002171
                      
                      AIC         from estat ic: -133.893
                      AIC  from vselect formula: -133.893
                      AICC from vselect formula: -123.607
                      AICC  from post 5 formula: -123.607
This calls into question the assertion that the vselect results presented in posts #12 and #6 were created by the code used in post #6 applied to the dataset attached to post #6.

                      This would be easy for me to confirm were it not that vselect does not run on my copy of Stata 17.
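For anyone checking these numbers outside Stata, the two formulas in the code above carry over directly. A Python sketch (the function names are mine; k is e(df_m), the number of slope coefficients), plugging in the rounded RSS from the output:

```python
import math

def aic(rss, n, k):
    # n*ln(RSS/n) + 2*(number of mean parameters) + constant terms,
    # matching the vselect formula above, where e(N) - e(df_r) = k + 1
    return n * math.log(rss / n) + 2 * (k + 1) + n + n * math.log(2 * math.pi)

def aicc(rss, n, k):
    # vselect's small-sample correction counts p = k + 2 parameters
    # (k slopes, the intercept, and the error variance)
    p = k + 2
    return aic(rss, n, k) + 2 * p * (p + 1) / (n - p - 1)

# n = 23 observations, 6 regressors, RSS = 0.002171 (rounded, from above)
print(round(aic(0.002171, 23, 6), 3))   # -133.894 (estat ic: -133.893 from unrounded RSS)
print(round(aicc(0.002171, 23, 6), 3))  # -123.609 (post 5 formula: -123.607)
```

Both values agree with the output above up to the rounding of the RSS.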



                      • #26
                        Leonardo, if you know that you absolutely positively want 6 variables, that sounds good!

Raymond's commands always ended with if _n<=23. Should the reg command in your do-file do that too? (I don't understand why the if is in there anyway.)
Richard Williams, Notre Dame Dept of Sociology



                        • #27
Originally posted by Richard Williams
                          Leonardo, if you know that you absolutely positively want 6 variables, that sounds good!

Raymond's commands always ended with if _n<=23. Should the reg command in your do-file do that too? (I don't understand why the if is in there anyway.)
The flexibility of the program I wrote is that you can use the -min()- and -max()- options of -tuples- (SSC) to specify the number of variables under consideration, so with minor modification it could be anywhere from 1 to 23. I simply chose 6 since that was the example he was going with. I also do not understand the significance of the restriction to the first 23 observations, so I chose to ignore it, but it could easily be added to my program. I offer the program as a way to understand how one can work through the solutions by brute force, because to me it seems there is confusion over which software program/package produces the desired result, or indeed what the desired answer is. Indeed, William confirms this suspicion in post #25.



                          • #28
                            Let me try to clear up some confusion here. From a post above

There is an R package called pampe, which can solve this question. It also uses the leaps-and-bounds algorithm, and it runs very quickly.
The leaps-and-bounds algorithm is a general-purpose algorithm for finding the minimum of a function. Using two different programs that utilize leaps and bounds does not mean that they should produce the same result.

                            For a model with K independent variables, pampe apparently can find
                            • the single model with minimum value of AIC
                            • the single model with minimum value of AICC
                            • the single model with minimum value of BIC
                            and note that these will not necessarily be the same model.

                            The vselect command, with the best option, tells us

                            For each predictor size k, the best model under each of the information criterions for that predictor size k is the model that minimizes RSS. All other terms are constant for the same predictor size. So at each predictor size, we can find the best model of that size by minimizing the RSS. This remarkable result can greatly simplify the variable selection process.
So it takes a different, and I would argue better, approach. For a model with K independent variables, it finds for each value of k=1(1)K the collection of k independent variables that minimizes the RSS. It then presents for that model the values of 5 information criteria: R2ADJ, Cp, AIC, AICC, and BIC. The user can then choose their favored criterion and find the model that optimizes it. In the example presented in the SJ article, AICC and BIC are optimized by the 2-variable model, while R2ADJ and AIC are optimized by the 3-variable model, and for Cp it depends on a further choice by the user.

                            So in effect vselect finds the "best" subset for each of 5 selection criteria simultaneously. This is cool.
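The quoted remark is easy to verify numerically: for fixed n and fixed predictor count k, each criterion is n·ln(RSS/n) plus terms that depend only on n and k, so ranking same-size models by RSS ranks them identically under every criterion. A small Python check (AIC and AICC per the vselect formulas written out in post #25; the BIC penalty ln(n)·(k+1) is my assumption of the usual form):

```python
import math

n, k = 23, 4          # fixed sample size and predictor count
p = k + 2             # vselect's parameter count: k slopes, intercept, variance

def aic(rss):
    return n * math.log(rss / n) + 2 * (k + 1) + n + n * math.log(2 * math.pi)

def aicc(rss):
    return aic(rss) + 2 * p * (p + 1) / (n - p - 1)

def bic(rss):  # assumed usual form: ln(n) replaces the factor 2 on the parameters
    return n * math.log(rss / n) + math.log(n) * (k + 1) + n + n * math.log(2 * math.pi)

# Five hypothetical same-size models with different RSS values
rss_values = [0.0030, 0.0024, 0.0041, 0.0019, 0.0033]
by_rss = sorted(range(5), key=lambda i: rss_values[i])
for crit in (aic, aicc, bic):
    assert sorted(range(5), key=lambda i: crit(rss_values[i])) == by_rss
print("minimum-RSS model is best under AIC, AICC, and BIC alike")
```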

                            The forward and backward options in vselect have nothing to do with leaps and bounds methodology, and are beyond the scope of this post.

                            My post #25 discusses my concerns about the examples and data shown for vselect. From this discussion we see that vselect supposedly did not choose the 6-variable model with the minimum RSS (since had it done so, that would have yielded the smallest AICC). Without an independent confirmation of the results presented in posts #12 and #6 I remain dubious of that assertion.
                            Last edited by William Lisowski; 28 Jul 2021, 10:44.



                            • #29
                              I’m getting lost now and should probably keep quiet! Like Daniel, I am wondering if we need to use quantum computing to do what Raymond wants. Plus, if the R routine did it right, why not just use it?

                              Or, maybe write the authors and find out exactly what they did.

                              Maybe Raymond wants to extend the analysis. But if so, does he want something that can’t possibly be done unless he is willing to wait a few months for the results?
Richard Williams, Notre Dame Dept of Sociology



                              • #30
The frustrating thing for me is that vselect is a good tool, based on really interesting math, that is getting bad publicity here.

The leaps-and-bounds approach to "best subset regression" was near magic back in the 1980s when I encountered it.

It incorporates a number of features that have, I suspect, largely been bypassed by cheap computing power.
                                • A branching search of all possible subsets that goes through them in a systematic fashion that facilitates ...
• Using a slick technique for obtaining (Z'Z)^-1 from (X'X)^-1 when Z is just X with one column omitted; this significantly cuts the computational burden of inverting matrices on the way to obtaining the RSS for each subset, as does ...
• A bounding technique that recognizes that eliminating variables will increase RSS, so if you have a k-variable subset whose RSS exceeds each of the (current) RSS values for the 1-, 2-, ..., and k-variable models, then nothing is to be gained by exploring the sub-subsets of this k-variable subset; this can radically cut the number of subsets to be searched, and ...
• If you're fortunate enough to be implementing this in a recursive programming language, you can get gloriously elegant code.

