  • #16
It's worth pointing out that since your example uses a subset of the available data, there is no guarantee that the ranking of models will be the same on the full dataset. So while you point out a set of results in #12 based on the first 23 observations, the results are not the same for the full set (indeed, they are reversed).



    • #17
      Code:
. vselect HongKong $m if _n<=23, forward aicc
      FORWARD variable selection
      Information Criteria: AICC
      
      ------------------------------------------------------------------------------
      Stage 0 reg HongKong  : AICC -79.50154
      ------------------------------------------------------------------------------
      AICC  -77.03294 :              add  Australia
      AICC  -86.81419 :              add    Austria
      AICC  -78.03164 :              add     Canada
      AICC  -77.55764 :              add    Denmark
      AICC  -86.64687 :              add    Finland
      AICC  -88.95512 :              add     France
      AICC  -78.7386  :              add    Germany
      AICC  -79.72499 :              add      Italy
      AICC  -80.52933 :              add      Japan
      AICC  -94.42247 :              add      Korea
      AICC  -76.91501 :              add     Mexico
      AICC  -82.09375 :              add Netherlands
      AICC  -95.61311 :              add NewZealand
      AICC  -84.96626 :              add     Norway
      AICC  -82.12414 :              add Switzerland
      AICC  -79.8867  :              add UnitedKingdom
      AICC  -76.959   :              add UnitedStates
      AICC  -104.7755 :              add  Singapore
      AICC  -81.99067 :              add Philippines
      AICC  -86.88326 :              add  Indonesia
      AICC  -91.14591 :              add   Malaysia
      AICC  -94.03874 :              add   Thailand
      AICC  -97.04108 :              add     Taiwan
      AICC  -92.12839 :              add      China
      ------------------------------------------------------------------------------
      Stage 1 reg HongKong Singapore : AICC -104.7755
      ------------------------------------------------------------------------------
      AICC  -101.8373 :              add  Australia
      AICC  -107.8348 :              add    Austria
      AICC  -102.6033 :              add     Canada
      AICC  -104.0572 :              add    Denmark
      AICC  -105.3381 :              add    Finland
      AICC  -104.2061 :              add     France
      AICC  -102.7059 :              add    Germany
      AICC  -104.1808 :              add      Italy
      AICC  -103.56   :              add      Japan
      AICC  -102.1829 :              add      Korea
      AICC  -108.2187 :              add     Mexico
      AICC  -102.0082 :              add Netherlands
      AICC  -102.3372 :              add NewZealand
      AICC  -109.651  :              add     Norway
      AICC  -104.0735 :              add Switzerland
      AICC  -101.825  :              add UnitedKingdom
      AICC  -103.6342 :              add UnitedStates
      AICC  -104.6456 :              add Philippines
      AICC  -108.71   :              add  Indonesia
      AICC  -103.5364 :              add   Malaysia
      AICC  -103.9806 :              add   Thailand
      AICC  -108.3669 :              add     Taiwan
      AICC  -102.5077 :              add      China
      ------------------------------------------------------------------------------
      Stage 2 reg HongKong Singapore Norway : AICC  -109.651
      ------------------------------------------------------------------------------
      AICC  -106.5823 :              add  Australia
      AICC  -113.0898 :              add    Austria
      AICC  -108.0889 :              add     Canada
      AICC  -106.8492 :              add    Denmark
      AICC  -113.2672 :              add    Finland
      AICC  -109.2639 :              add     France
      AICC  -107.1662 :              add    Germany
      AICC  -114.9335 :              add      Italy
      AICC  -106.3464 :              add      Japan
      AICC  -106.6777 :              add      Korea
      AICC  -115.6392 :              add     Mexico
      AICC  -108.6539 :              add Netherlands
      AICC  -107.1905 :              add NewZealand
      AICC  -108.1643 :              add Switzerland
      AICC  -107.0108 :              add UnitedKingdom
      AICC  -108.4776 :              add UnitedStates
      AICC  -106.4664 :              add Philippines
      AICC  -109.0035 :              add  Indonesia
      AICC  -106.5358 :              add   Malaysia
      AICC  -108.1255 :              add   Thailand
      AICC  -112.3847 :              add     Taiwan
      AICC  -113.1466 :              add      China
      ------------------------------------------------------------------------------
      Stage 3 reg HongKong Singapore Norway Mexico : AICC -115.6392
      ------------------------------------------------------------------------------
      AICC  -117.3871 :              add  Australia
      AICC  -121.3745 :              add    Austria
      AICC  -112.1584 :              add     Canada
      AICC  -112.7327 :              add    Denmark
      AICC  -115.4046 :              add    Finland
      AICC  -120.4502 :              add     France
      AICC  -112.4867 :              add    Germany
      AICC  -120.929  :              add      Italy
      AICC  -115.2615 :              add      Japan
      AICC  -112.6187 :              add      Korea
      AICC  -114.4535 :              add Netherlands
      AICC  -111.9902 :              add NewZealand
      AICC  -125.2123 :              add Switzerland
      AICC  -120.0586 :              add UnitedKingdom
      AICC  -115.1779 :              add UnitedStates
      AICC  -112.2661 :              add Philippines
      AICC  -113.5576 :              add  Indonesia
      AICC  -112.2259 :              add   Malaysia
      AICC  -116.3357 :              add   Thailand
      AICC  -113.752  :              add     Taiwan
      AICC  -118.9856 :              add      China
      ------------------------------------------------------------------------------
      Stage 4 reg HongKong Singapore Norway Mexico Switzerland : AICC -125.2123
      ------------------------------------------------------------------------------
      AICC  -122.878  :              add  Australia
      AICC  -121.8737 :              add    Austria
      AICC  -121.7456 :              add     Canada
      AICC  -121.0401 :              add    Denmark
      AICC  -121.0402 :              add    Finland
      AICC  -121.926  :              add     France
      AICC  -121.6194 :              add    Germany
      AICC  -123.4017 :              add      Italy
      AICC  -121.0409 :              add      Japan
      AICC  -121.0157 :              add      Korea
      AICC  -121.4898 :              add Netherlands
      AICC  -121.0146 :              add NewZealand
      AICC  -121.0075 :              add UnitedKingdom
      AICC  -121.2367 :              add UnitedStates
      AICC  -121.4974 :              add Philippines
      AICC  -122.6062 :              add  Indonesia
      AICC  -122.0947 :              add   Malaysia
      AICC  -121.5994 :              add   Thailand
      AICC  -123.0975 :              add     Taiwan
      AICC  -121.026  :              add      China
      
      Final Model
      
            Source |       SS           df       MS      Number of obs   =        23
      -------------+----------------------------------   F(4, 18)        =     52.41
             Model |  .034930802         4  .008732701   Prob > F        =    0.0000
          Residual |  .002999111        18  .000166617   R-squared       =    0.9209
      -------------+----------------------------------   Adj R-squared   =    0.9034
             Total |  .037929913        22  .001724087   Root MSE        =    .01291
      
      ------------------------------------------------------------------------------
          HongKong | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
         Singapore |     0.7185     0.0715    10.05   0.000       0.5683      0.8687
            Norway |     0.4170     0.1120     3.72   0.002       0.1817      0.6522
            Mexico |     0.3462     0.0674     5.14   0.000       0.2047      0.4878
       Switzerland |    -0.6630     0.1767    -3.75   0.001      -1.0341     -0.2918
             _cons |    -0.0435     0.0074    -5.89   0.000      -0.0590     -0.0280
      ------------------------------------------------------------------------------
The result indicates 4 predictors, with an AICC value of -125.2123. But that is not the smallest AICC value. As I show in #6, with 6 variables the AICC value is -131.03782 < -125.2123.
So the result above is not the model with the smallest AICC.
Best regards.

Raymond Zhang
Stata 17.0, MP



      • #18
vselect doesn't examine all the subsets, but that is because it already thinks it has found the best subset. Why would you want to keep going?

        In any event, if Stata cannot give you what you want, maybe you should just use the R routine that can.

        Or, maybe you can hack vselect so that it doesn't stop once it thinks it has a "winner."
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://academicweb.nd.edu/~rwilliam/



        • #19
Originally posted by Richard Williams
vselect doesn't examine all the subsets, but that is because it already thinks it has found the best subset. Why would you want to keep going?

          In any event, if Stata cannot give you what you want, maybe you should just use the R routine that can.

          Or, maybe you can hack vselect so that it doesn't stop once it thinks it has a "winner."
Because I want to replicate the results of the paper. If I use a different subset, the results are quite different.
The most difficult part of the paper is finding the model with the smallest AIC or AICC value.
Best regards.

Raymond Zhang
Stata 17.0, MP



          • #20
Originally posted by Raymond Zhang
But it is not the smallest AICC value. As I show in #6, when there are 6 variables the AICC value is -131.03782 < -125.2123.
So the result above is not the model with the smallest AICC.
            Neither is the 6 predictor model with AICC = -131.03782 the one with the smallest AICC. The models with 20 or more predictors all have smaller AICC values.
            Last edited by daniel klein; 28 Jul 2021, 09:31.



            • #21
Originally posted by daniel klein

              Neither is the 6 predictor model with AICC = -131.03782 the one with the smallest AICC. The models with 20 or more predictors all have smaller AICC values.
Dear @daniel klein, yes, the models with 20 or more predictors all have smaller AICC values. But according to the paper, when length(possible.ctrls) + 3 >= length(time.pretr), the maximum number of variables is capped at length(time.pretr) - 3. In the example above, 24 + 3 > 23, so at most 20 variables can be used. So I want to find the smallest AICC among models with at most 20 variables; models with more than 20 variables are not considered.
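For concreteness, that cap can be written down as a tiny function. This is only a sketch of the rule as stated above; the function name, and the behaviour when the condition does not bind, are my assumptions.

```python
def max_predictors(n_possible_ctrls, n_pretreatment):
    """Cap on the number of predictors as described for pampe:
    when possible controls + 3 >= pre-treatment periods, at most
    len(time.pretr) - 3 predictors enter the search."""
    if n_possible_ctrls + 3 >= n_pretreatment:
        return n_pretreatment - 3
    return n_possible_ctrls  # assumption: otherwise the cap never binds

# The case in this thread: 24 possible controls, 23 pre-treatment periods
print(max_predictors(24, 23))  # 20
```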
              Best regards.

              Raymond Zhang
Stata 17.0, MP



              • #22
                Is the data you posted the same data used in the paper you cite?
Richard Williams, Notre Dame Dept of Sociology



                • #23
In the time you've been puzzling through how to use existing commands, this solution could have been brute-forced for some or all subsets of variables, since you're still at that scale of computation. This isn't necessarily a "smart" approach, but when puzzling over software packages and programs, and when trying to match a published paper, it's sometimes nice to know exactly what the correct answer is. It's possible that software package authors have undetected bugs, or that their programs don't quite behave the way you want them to, or that the authors of the paper made mistakes.

In any case, for best subset regression with exactly 6 variables on your full dataset, a single model has the lowest AIC, AICC, and BIC. The do-file is attached for educational purposes.

                  Code:
                    +----------------------------------------------------------------------------------------------------+
                    |        v1       v2       v3          v4         v5       v6          aic         aicc          bic |
                    |----------------------------------------------------------------------------------------------------|
                    | Australia   Mexico   Norway   Singapore   Thailand   Taiwan   -335.82459   -333.05536   -321.04847 |
                    +----------------------------------------------------------------------------------------------------+
                  Attached Files
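For readers who want to see the brute-force mechanics without the attached do-file, here is a self-contained Python toy. The code and the synthetic data are my own, not the attachment: enumerate every subset up to a maximum size, compute the OLS residual sum of squares, and keep the subset with the smallest AICC (using vselect's AICC formula as written out in post #25).

```python
import math
from itertools import combinations

def ols_rss(y, X):
    """RSS from OLS of y on the columns of X plus a constant, via the
    normal equations and Gaussian elimination (fine for tiny p)."""
    n = len(y)
    cols = [[1.0] * n] + X                      # prepend the intercept
    p = len(cols)
    A = [[sum(cols[a][i] * cols[b][i] for i in range(n)) for b in range(p)]
         for a in range(p)]
    rhs = [sum(cols[a][i] * y[i] for i in range(n)) for a in range(p)]
    for c in range(p):                          # elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        rhs[c], rhs[piv] = rhs[piv], rhs[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for cc in range(c, p):
                A[r][cc] -= f * A[c][cc]
            rhs[r] -= f * rhs[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):              # back substitution
        beta[r] = (rhs[r] - sum(A[r][cc] * beta[cc]
                                for cc in range(r + 1, p))) / A[r][r]
    fitted = [sum(beta[a] * cols[a][i] for a in range(p)) for i in range(n)]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n))

def aicc(rss, n, k):
    # vselect's AICC with p = k + 2 parameters (k slopes, intercept, variance)
    p = k + 2
    return (n * math.log(rss / n) + 2 * (k + 1)
            + 2 * p * (p + 1) / (n - p - 1) + n + n * math.log(2 * math.pi))

def best_subset(y, xvars, max_k):
    """Enumerate all subsets of size 1..max_k; return (aicc, names) of the best."""
    n, fits = len(y), []
    for k in range(1, max_k + 1):
        for names in combinations(sorted(xvars), k):
            fits.append((aicc(ols_rss(y, [xvars[v] for v in names]), n, k), names))
    return min(fits)

# Synthetic example: y depends on x0 and x1 only, plus a small deterministic wiggle
n = 40
xvars = {f"x{j}": [math.sin(0.7 * (j + 1) * i) for i in range(n)] for j in range(5)}
y = [2 * xvars["x0"][i] + 3 * xvars["x1"][i] - 0.5 + 0.01 * math.sin(13 * i)
     for i in range(n)]
print(best_subset(y, xvars, max_k=2))  # winning pair: ('x0', 'x1')
```

Swapping in real data and a larger max_k is all that separates this from the full search, at the cost of enumerating every subset.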



                  • #24
Originally posted by Leonardo Guizzetti
In the time you've been puzzling through how to use existing commands, this solution could have been brute-forced for some or all subsets of variables, since you're still at that scale of computation. This isn't necessarily a "smart" approach, but when puzzling over software packages and programs, and when trying to match a published paper, it's sometimes nice to know exactly what the correct answer is. It's possible that software package authors have undetected bugs, or that their programs don't quite behave the way you want them to, or that the authors of the paper made mistakes.

In any case, for best subset regression with exactly 6 variables on your full dataset, a single model has the lowest AIC, AICC, and BIC. The do-file is attached for educational purposes.

                    Code:
+----------------------------------------------------------------------------------------------------+
|        v1       v2       v3          v4         v5       v6          aic         aicc          bic |
|----------------------------------------------------------------------------------------------------|
| Australia   Mexico   Norway   Singapore   Thailand   Taiwan   -335.82459   -333.05536   -321.04847 |
+----------------------------------------------------------------------------------------------------+
Dear @Leonardo Guizzetti, yes, with the full dataset the best subset regression has 6 variables. But now I want to do a placebo test.
I have to select the best subset regression when the number of observations is 23, 24, 25, ..., 44, which means I have to choose 22 best subset regressions, and these subsets may be different.
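The bookkeeping for that loop is mechanical once some best-subset routine exists. A hedged Python sketch (placebo_selections is my own name; best_subset stands in for whatever search routine is actually used):

```python
def placebo_selections(data, best_subset, cutoffs=range(23, 45)):
    """For each cutoff, restrict every series to its first `cutoff`
    observations and run the best-subset search on that window.
    `data` maps variable name -> full column of values."""
    selections = {}
    for cutoff in cutoffs:
        window = {name: col[:cutoff] for name, col in data.items()}
        selections[cutoff] = best_subset(window)
    return selections

# Dummy demo: a stand-in search that just reports the window length.
# Cutoffs 23, 24, ..., 44 give the 22 separate selections described above.
demo = placebo_selections({"y": list(range(50))}, lambda w: len(w["y"]))
print(len(demo))  # 22
```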
                    Best regards.

                    Raymond Zhang
Stata 17.0, MP



                    • #25
                      Posts #12 and #6 tell us that for
                      Code:
regress HongKong Canada France Italy Norway UnitedStates Singapore if _n<=23
                      the AIC value is -126.5042 and the AICC value is -116.2185.

                      I cannot reproduce that.
                      Code:
                      use "~/Downloads/HCW", clear
                      regress HongKong Canada France Italy Norway UnitedStates Singapore if _n<=23
                      estat ic
                      matrix S = r(S)
                      local rss = e(rss)
                      local aic_estat = S[1,5]
                      local aic  =  e(N)*ln(e(rss)/e(N)) + 2*(e(N)   ///
                                                      - e(df_r)) + (e(N) + e(N)*ln(2*_pi))
                      local aicc =  e(N)*ln(e(rss)/e(N)) + 2*(e(N) - ///
                                                      e(df_r)) + 2*(e(df_m)+2)*(e(df_m)+3)/(   ///
                                                      e(N)-(e(df_m) + 2) - 1)  + (e(N) + e(N)* ///
                                                      ln(2*_pi))
                      local aicc5 = e(N)*ln(e(rss)/e(N)) + 2*(e(N) - e(df_r)) + 2*(e(df_m)+2)*(e(df_m)+3)/(     ///
                          e(N)-(e(df_m) + 2) - 1)  + (e(N) + e(N)* ln(2*_pi))
                      
                      display _newline "RSS          from regress:" %9.6f `rss' _newline ///
                              _newline "AIC         from estat ic:" %9.3f `aic_estat'    ///
                              _newline "AIC  from vselect formula:" %9.3f `aic'          ///
                              _newline "AICC from vselect formula:" %9.3f `aicc'         ///
                              _newline "AICC  from post 5 formula:" %9.3f `aicc5'
                      Code:
                      RSS          from regress: 0.002171
                      
                      AIC         from estat ic: -133.893
                      AIC  from vselect formula: -133.893
                      AICC from vselect formula: -123.607
                      AICC  from post 5 formula: -123.607
This calls into question the assertion that the vselect results presented in posts #12 and #6 were created by the code used in post #6 applied to the dataset attached to post #6.

                      This would be easy for me to confirm were it not that vselect does not run on my copy of Stata 17.
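For anyone checking these numbers outside Stata, the two formulas in the code above carry over directly. A Python sketch (the function names are mine; k is e(df_m), the number of slope coefficients), plugging in the rounded RSS from the output:

```python
import math

def aic(rss, n, k):
    # n*ln(RSS/n) + 2*(number of mean parameters) + constant terms,
    # matching the vselect formula above, where e(N) - e(df_r) = k + 1
    return n * math.log(rss / n) + 2 * (k + 1) + n + n * math.log(2 * math.pi)

def aicc(rss, n, k):
    # vselect's small-sample correction counts p = k + 2 parameters
    # (k slopes, the intercept, and the error variance)
    p = k + 2
    return aic(rss, n, k) + 2 * p * (p + 1) / (n - p - 1)

# n = 23 observations, 6 regressors, RSS = 0.002171 (rounded, from above)
print(round(aic(0.002171, 23, 6), 3))   # -133.894 (estat ic: -133.893 from unrounded RSS)
print(round(aicc(0.002171, 23, 6), 3))  # -123.609 (post 5 formula: -123.607)
```

Both values agree with the output above up to the rounding of the RSS.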



                      • #26
                        Leonardo, if you know that you absolutely positively want 6 variables, that sounds good!

Raymond's commands always ended with if _n<=23. Should the reg command in your do-file do that too? (I don't understand why the if is in there anyway.)
Richard Williams, Notre Dame Dept of Sociology



                        • #27
Originally posted by Richard Williams
                          Leonardo, if you know that you absolutely positively want 6 variables, that sounds good!

Raymond's commands always ended with if _n<=23. Should the reg command in your do-file do that too? (I don't understand why the if is in there anyway.)
The flexibility of the program I wrote is that you can use the -min()- and -max()- options of -tuples- (SSC) to specify the number of variables under consideration, so with minor modification it could be anywhere from 1 to 23. I simply chose 6 since that was the example he was going with. I also do not understand the significance of the restriction to the first 23 observations, so I chose to ignore it, but it could easily be added to my program. I offer the program as a way to understand how one can work through the solutions by brute force, because to me it seems there is confusion over which software program/package produces the desired result, or indeed what the desired answer is. Indeed, William confirms this suspicion in post #25.



                          • #28
                            Let me try to clear up some confusion here. From a post above

There is an R package called pampe, which can solve this question. It also uses the leaps-and-bounds algorithm, and it runs very quickly.
The leaps-and-bounds algorithm is a general-purpose algorithm for finding the minimum of a function. Using two different programs that utilize leaps and bounds does not mean that they should produce the same result.

                            For a model with K independent variables, pampe apparently can find
                            • the single model with minimum value of AIC
                            • the single model with minimum value of AICC
                            • the single model with minimum value of BIC
                            and note that these will not necessarily be the same model.

                            The vselect command, with the best option, tells us

                            For each predictor size k, the best model under each of the information criterions for that predictor size k is the model that minimizes RSS. All other terms are constant for the same predictor size. So at each predictor size, we can find the best model of that size by minimizing the RSS. This remarkable result can greatly simplify the variable selection process.
So it takes a different, and I would argue better, approach. For a model with K independent variables, it finds for each value of k=1(1)K the collection of k independent variables that minimizes the RSS. It then presents for that model the values of 5 information criteria: R2ADJ, Cp, AIC, AICC, and BIC. The user can then choose their favored criterion and find the model that optimizes it. In the example presented in the SJ article, AICC and BIC are optimized by the 2-variable model, while R2ADJ and AIC are optimized by the 3-variable model, and for Cp it depends on a further choice by the user.

                            So in effect vselect finds the "best" subset for each of 5 selection criteria simultaneously. This is cool.
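The quoted remark is easy to verify numerically: for fixed n and fixed predictor count k, each criterion is n·ln(RSS/n) plus terms that depend only on n and k, so ranking same-size models by RSS ranks them identically under every criterion. A small Python check (AIC and AICC per the vselect formulas written out in post #25; the BIC penalty ln(n)·(k+1) is my assumption of the usual form):

```python
import math

n, k = 23, 4          # fixed sample size and predictor count
p = k + 2             # vselect's parameter count: k slopes, intercept, variance

def aic(rss):
    return n * math.log(rss / n) + 2 * (k + 1) + n + n * math.log(2 * math.pi)

def aicc(rss):
    return aic(rss) + 2 * p * (p + 1) / (n - p - 1)

def bic(rss):  # assumed usual form: ln(n) replaces the factor 2 on the parameters
    return n * math.log(rss / n) + math.log(n) * (k + 1) + n + n * math.log(2 * math.pi)

# Five hypothetical same-size models with different RSS values
rss_values = [0.0030, 0.0024, 0.0041, 0.0019, 0.0033]
by_rss = sorted(range(5), key=lambda i: rss_values[i])
for crit in (aic, aicc, bic):
    assert sorted(range(5), key=lambda i: crit(rss_values[i])) == by_rss
print("minimum-RSS model is best under AIC, AICC, and BIC alike")
```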

                            The forward and backward options in vselect have nothing to do with leaps and bounds methodology, and are beyond the scope of this post.

                            My post #25 discusses my concerns about the examples and data shown for vselect. From this discussion we see that vselect supposedly did not choose the 6-variable model with the minimum RSS (since had it done so, that would have yielded the smallest AICC). Without an independent confirmation of the results presented in posts #12 and #6 I remain dubious of that assertion.
                            Last edited by William Lisowski; 28 Jul 2021, 10:44.



                            • #29
                              I’m getting lost now and should probably keep quiet! Like Daniel, I am wondering if we need to use quantum computing to do what Raymond wants. Plus, if the R routine did it right, why not just use it?

                              Or, maybe write the authors and find out exactly what they did.

                              Maybe Raymond wants to extend the analysis. But if so, does he want something that can’t possibly be done unless he is willing to wait a few months for the results?
Richard Williams, Notre Dame Dept of Sociology



                              • #30
The frustrating thing for me is that vselect is a good tool, based on really interesting math, that is getting bad publicity here.

The leaps-and-bounds approach to "best subset regression" was near magic back in the 1980s when I encountered it.

It incorporates a number of features that have, I suspect, largely been bypassed by cheap computing power.
                                • A branching search of all possible subsets that goes through them in a systematic fashion that facilitates ...
• Using a slick technique for obtaining (Z'Z)^-1 from (X'X)^-1 when Z is just X with one column omitted; this significantly cuts the computational burden of inverting matrices on the way to obtaining the RSS for each subset, as does ...
• A bounding technique that recognizes that eliminating variables will increase RSS, so if you have a k-variable subset whose RSS exceeds each of the (current) RSS values for the 1-, 2-, ..., and k-variable models, then nothing is to be gained by exploring the sub-subsets of this k-variable subset; this can radically cut the number of subsets to be searched, and ...
• If you're fortunate enough to be implementing this in a recursive programming language, you can get gloriously elegant code.

