  • #1

    Dear all, I have 24 possible predictors, x1-x24. How can I select the model with the lowest AIC?
    I tried the command -vselect-, but the result is not what I want.
    Any help will be appreciated.
    Best regards.

    Raymond Zhang
    Stata 17.0,MP

  • #2
    Raymond, we have covered this at length in private email, here, and recently also here.

    Nothing stops you from asking the same question over and over again, but please tell people about relevant information. If this is another question, please explain how it differs from the ones that I have linked to.


    Your continuous search for programs (first tuples, then selectvars, now vselect) makes me believe that you might still not fully understand the problem. I will give this one more try: Your problem is not that you have the wrong command; your problem is not inefficient code; your problem is that you are looking at more than 16 million(!) combinations. If you really want the exact answer, then you either need to get into quantum computing or wait for the code to finish in about half a year. I suggest you look for an approximate answer instead. Define some "good enough" AIC value, given the substantive problem, and/or look into optimal stopping strategies to cut down on the number of models you need to run.
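    The "more than 16 million" figure can be checked directly in Stata. This is only a back-of-the-envelope sketch: the one second per model is an assumed, hypothetical timing, not something measured in this thread, and the scalar names are made up for the illustration.

    ```stata
    * Number of subsets of 24 candidate predictors (each one is either in or out)
    scalar n_models = 2^24
    display %12.0fc n_models

    * Rough total run time, ASSUMING a hypothetical 1 second per regression
    scalar secs_per_model = 1
    display %6.1f n_models*secs_per_model/86400 " days"
    ```

    Under that assumed timing the total comes to roughly 194 days, which is in line with the half-year estimate above.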


    • #3
      I don't really understand lasso, but could this be a good use of it? To me, lasso sounds like stepwise selection on steroids, but various people have assured me it is not Satanic.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://academicweb.nd.edu/~rwilliam/


      • #4
        On refreshing my memory of Mallows Cp, which is the criterion minimized by vselect, I encountered the statement in Wikipedia that it has been shown to be "equivalent to AIC for Gaussian linear regression". This is potentially useful because the leaps and bounds algorithm used by vselect for "all subsets regression" efficiently avoids needing to evaluate all possible subsets. So in theory vselect could perhaps work for the problem posed in post #1, although even just 1% of the nearly 17 million subsets would be a lot of subsets.

        Whether vselect would work in a reasonable amount of time is an empirical question. Nothing was said in post #1 to describe the problem, the precise vselect command used, the results produced by the command, or why those results were not what was wanted.

        And in fact, the current version of vselect in SSC no longer successfully runs the example shown in the help file.
        Code:
        . vselect mpg weight trunk length foreign, best
        
        Response :             mpg
        Selected predictors:   weight foreign length trunk
        variable npred not found
                       st_view():  3598  Stata returned error
                  leaps_bounds():     -  function returned error
                         <istmt>:     -  function returned error
        r(3598);
        The similar gvselect also found in SSC fails in a similar fashion.

        Something of a shame; I encountered all subsets regression back in the day and thought it might be fun to mess about with, but I'm not about to debug a substantial chunk of Stata and Mata code in order to do so.

        Bottom line is, this looks like a dead end for the stated problem of minimizing AIC.

        The solution remains, as Daniel stated in post #2, to move beyond the very narrow problem statement in this and other topics: to find the subset of the 24 predictors that minimizes AIC. Lasso minimizes something else that doesn't seem to be related to AIC.


        • #5
          Reviewing Richard's and William's answers, I have realized that I have misinterpreted what vselect actually does. Skimming through Lindsey and Sheather's (2010) paper, the authors state that they use algorithms that "examine at most m(m+1)/2 of the 2^m possible models" (p.654). The leaps and bounds algorithm, which vselect implements, seems to examine only m models. This approach reduces the brute-force 16 million models (which we have discussed repeatedly) to a maximum of 300 models! That is entirely feasible, probably within minutes or less.

          It is now up to me to apologize for misinterpreting the vselect command, and for getting stuck in Raymond's initially proposed brute-force solution instead of focussing on his more basic problem. It is up to Raymond to provide more details (e.g., why he needs to find the minimum AIC of a model, which model's AIC is to be minimized, etc.) and to clarify why the results of vselect are not what he wants.


          Additional notes:

          vselect (from SSC) works for me on Windows 10, Stata 16.1.

          lasso seems to minimize BIC in Stata 17; BIC is pretty similar to AIC.


          Lindsey, C., Sheather, S. 2010. Variable selection in linear regression. The Stata Journal, 10(4), pp. 650–669.


          • #6
            @daniel klein @William Lisowski @Richard Williams Dear all, thank you for your replies. I now want to replicate the paper by Hsiao et al. (2012), which is attached below. The data are also attached.
            In this paper, the most important step is to find the model with the lowest AIC or AICC. There are 24 variables that can be selected.
            Code:
            * Load the HCW data and store the 24 candidate predictors in a global macro
            use HCW, clear
            global m Australia Austria Canada Denmark Finland France Germany ///
            Italy Japan Korea Mexico Netherlands NewZealand Norway ///
            Switzerland UnitedKingdom UnitedStates Singapore Philippines ///
            Indonesia Malaysia Thailand Taiwan China
            * Best-subsets search using the first 23 observations only
            vselect HongKong $m if _n<=23, best
            
            Optimal models:
            
            # Preds     R2ADJ         C        AIC       AICC        BIC
                  1   .321052  1.04e+12  -88.07735  -86.81419  -85.80636
                  2  .5167523  7.08e+11  -95.01988  -92.79765  -91.61339
                  3  .6855855  4.37e+11  -104.0854   -100.556  -99.54344
                  4  .8202188  2.37e+11  -116.1853  -110.9353  -110.5078
                  5  .8385315  2.01e+11  -117.9708  -110.5042  -111.1579
                  6  .8914771  1.27e+11  -126.5042  -116.2185  -118.5558
                  7  .9159515  9.23e+10  -131.8667  -118.0205  -122.7827
                  8   .931198  7.05e+10  -136.0572  -117.7239  -125.8378
                  9  .9392075  5.79e+10  -138.6083  -114.6083  -127.2534
                 10  .9400224  5.27e+10  -138.7597  -107.5597  -126.2693
                 11  .9416132  4.70e+10  -139.3792  -98.93479  -125.7533
                 12  .9466096  3.91e+10  -141.6289  -89.12894  -126.8675
                 13  .9505877  3.26e+10  -143.8331  -75.26171  -127.9362
                 14  .9507881  2.88e+10  -144.6356  -53.96896  -127.6032
                 15  .9481881  2.66e+10  -144.5227   -22.1227  -126.3548
                 16  .9594252  1.78e+10   -151.691   19.30897  -132.3876
                 17  .9613106  1.42e+10  -154.9788    98.3545  -134.5399
                 18  .9601842  1.17e+10   -157.451    262.549  -135.8766
                 19   .957426  9.35e+09  -160.5272   763.4728  -137.8173
                 20  .9607083  5.75e+09  -169.6982          .  -145.8528
                 21  .9758035  1.77e+09  -194.7911  -1298.791  -169.8102
                 22         .  3.65e+08  -229.1342  -829.1342  -203.0178
                 23  1.000928  6.79e+07  -265.7998  -699.1331  -238.5479
                 24         1        25          .          .          .
            In the results above, the row for Preds=6 reports AICC=-116.2185, but this is not the smallest AICC value among the models with 6 predictors.
            There is another combination of 6 predictors whose AICC value is lower:
            Code:
            reg HongKong Australia Canada Finland France Norway Thailand if _n<=23
            dis e(N)*ln(e(rss)/e(N)) + 2*(e(N) - e(df_r)) + 2*(e(df_m)+2)*(e(df_m)+3)/(     ///
                e(N)-(e(df_m) + 2) - 1)  + (e(N) + e(N)* ln(2*_pi))
            
            -131.03782
            The formula used to compute the AICC value is taken from the ado-file of vselect.
            So with 6 variables (Preds=6), the smallest AICC value is not -116.2185; it is -131.03782.
            What I want is the smallest AICC value for each number of predictors. There is an R package called pampe that can solve this problem.
            It also uses the leaps and bounds algorithm, and it runs very quickly. I think Stata should also be able to find that the smallest AICC value for Preds==6 is -131.03782, not -116.2185.
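            The AICC expression quoted in this post can be split into an AIC part and a small-sample correction. The sketch below does that on Stata's shipped auto data (an arbitrary illustration, not the HCW data) and compares the AIC part with what estat ic reports. The scalar names are made up for this example, and the formula is simply the one quoted above; it has not been checked independently against the vselect ado-file here.

            ```stata
            sysuse auto, clear
            regress mpg weight foreign

            * Pieces of the expression quoted in this post
            scalar n_obs  = e(N)
            scalar n_coef = e(N) - e(df_r)      // coefficients incl. constant
            scalar n_corr = e(df_m) + 2         // count used in the correction term
            scalar aic_part  = n_obs*ln(e(rss)/n_obs) + 2*n_coef + n_obs + n_obs*ln(2*_pi)
            scalar corr_part = 2*n_corr*(n_corr+1)/(n_obs - n_corr - 1)
            display "AIC part = " aic_part
            display "AICC     = " aic_part + corr_part

            * The AIC part should agree with the AIC column of estat ic
            estat ic
            matrix S = r(S)
            display "estat ic AIC = " el(S,1,5)

            * Same correction applied to the Preds=6 row of the table (N = 23)
            display -126.5042 + 2*(6+2)*(6+3)/(23-(6+2)-1)
            ```

            The last line should reproduce the tabulated AICC of -116.2185 up to rounding, which suggests that the AICC column is the AIC column plus this correction term.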

            Sorry for my poor English. I hope I have described the question clearly.
            Attached Files
            Last edited by Raymond Zhang; 28 Jul 2021, 05:52.
            Best regards.

            Raymond Zhang
            Stata 17.0,MP


            • #7
              The smallest AICC value for Preds==6 comes from the results of the R package pampe.
              Best regards.

              Raymond Zhang
              Stata 17.0,MP


              • #8
                I am reading this on my phone so I might miss something right now. I will get back to this later, but for starters:

                Do you have missing values in any of your variables?
                Do you want to force HongKong into all the models?


                • #9
                  To be clear, you think the R package is right, but vselect is wrong?

                  Nothing was attached in your posts.
                  -------------------------------------------
                  Richard Williams, Notre Dame Dept of Sociology
                  StataNow Version: 19.5 MP (2 processor)

                  EMAIL: [email protected]
                  WWW: https://academicweb.nd.edu/~rwilliam/


                  • #10
                    Originally posted by Richard Williams
                    To be clear, you think the R package is right, but vselect is wrong?

                    Nothing was attached in your posts.
                    Sorry, I have now attached the data and the paper in #6.
                    Best regards.

                    Raymond Zhang
                    Stata 17.0,MP


                    • #11
                      Originally posted by daniel klein
                      I am reading this on my phone so I might miss something right now. I will get back to this later, but for starters:

                      Do you have missing values in any of your variables?
                      Do you want to force HongKong into all the models?
                      @daniel klein There are no missing values in the data, and HongKong is the dependent variable in all models.
                      Best regards.

                      Raymond Zhang
                      Stata 17.0,MP


                      • #12
                        Originally posted by Richard Williams
                        To be clear, you think the R package is right, but vselect is wrong?

                        Nothing was attached in your posts.
                        The smallest AICC value that vselect reports for 6 variables is -116.2185; the variables are "Canada France Italy Norway UnitedStates Singapore". But we can find another set of 6 variables
                        (Australia Canada Finland France Norway Thailand) whose AICC value is -131.03782.
                        That is smaller than the result from vselect.

                        The results of vselect are below:
                        Code:
                        predictors for each model:
                        
                        1  :  Austria
                        2  :  Austria Canada
                        3  :  Finland France Korea
                        4  :  Austria France Korea Mexico
                        5  :  Austria Canada France NewZealand Norway
                        6  :  Canada France Italy Norway UnitedStates Singapore
                        7  :  Austria Canada France Italy Norway UnitedStates Singapore
                        8  :  Finland France Italy Mexico Norway Switzerland UnitedStates Singapore
                        9  :  Finland France Italy Mexico Norway Switzerland UnitedStates Singapore Philippines
                        10 :  Finland France Italy Japan Mexico Norway Switzerland UnitedStates Singapore Philippines
                        11 :  Austria Finland France Italy Japan Mexico Norway UnitedStates Singapore Philippines Thailand
                        12 :  Australia Denmark Finland France Germany Italy Mexico Norway Switzerland UnitedStates Singapore Philippines
                        13 :  Austria Canada Denmark Finland France Germany Italy Mexico Norway Switzerland UnitedStates Singapore Philippines
                        14 :  Canada Denmark Finland France Germany Italy Mexico NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines
                        15 :  Australia Canada Denmark Finland France Germany Italy Mexico Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines China
                        16 :  Australia Canada Denmark Finland France Germany Italy Mexico NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines China
                        17 :  Australia Canada Denmark Finland France Germany Italy Korea Mexico NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines China
                        18 :  Australia Canada Denmark Finland France Germany Italy Korea Mexico NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Thailand China
                        19 :  Australia Canada Denmark Finland France Germany Italy Korea Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Thailand China
                        20 :  Australia Austria Canada Denmark Finland France Germany Italy Korea Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Thailand China
                        21 :  Australia Austria Canada Denmark Finland France Germany Italy Japan Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Indonesia Taiwan China
                        22 :  Australia Austria Canada Denmark Finland France Germany Italy Japan Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Indonesia Malaysia Taiwan China
                        23 :  Australia Austria Canada Denmark Finland France Germany Italy Japan Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Indonesia Malaysia Thailand Taiwan China
                        24 :  Australia Austria Canada Denmark Finland France Germany Italy Japan Korea Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Indonesia Malaysia Thailand Taiwan China
                        Best regards.

                        Raymond Zhang
                        Stata 17.0,MP


                        • #13
                          You do not tell us what is so magical about the model with exactly 6 predictors; perhaps that is in the paper, which I did not read.

                          Anyway,
                          vselect has a nmodel() option that lets you see more than 1 model for each number of predictors. Whether this is helpful, I cannot tell. There are still 134,596 possible combinations of 6 predictors, and specifying nmodel(134596) will try to find the 134,596 best models with 1 predictor, 2 predictors, ..., 24 predictors. That is not what you want.
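                          The 134,596 figure is the binomial coefficient "24 choose 6". A minimal sketch of the counts, using Stata's comb() function; the scalar name n_total is made up for this illustration:

                          ```stata
                          * Number of distinct 6-predictor subsets of 24 candidates
                          display %9.0fc comb(24,6)

                          * Summing over every subset size from 1 to 24 gives all
                          * non-empty subsets, that is 2^24 - 1
                          scalar n_total = 0
                          forvalues p = 1/24 {
                              scalar n_total = n_total + comb(24,`p')
                          }
                          display %12.0fc n_total
                          ```

                          Looking only at the 6-predictor models therefore means about 135 thousand regressions rather than almost 17 million.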


                          By the way, I find it odd to have n=23 observations and try 24 predictors; perhaps this is just an example.


                          • #14
                            You keep referring to AICC but you did not tell vselect to use AICC when selecting variables. Do you maybe want this command instead?

                            Code:
                            vselect HongKong $m if _n<=23 , forward aicc


                            If possible, I would double-check the criteria used by the R routine, and make sure that it and vselect are using the same criteria.

                            The winning model is

                            Code:
                            . reg HongKong Singapore Norway Mexico Switzerland if _n<=23
                            
                                  Source |       SS           df       MS      Number of obs   =        23
                            -------------+----------------------------------   F(4, 18)        =     52.41
                                   Model |  .034930802         4  .008732701   Prob > F        =    0.0000
                                Residual |  .002999111        18  .000166617   R-squared       =    0.9209
                            -------------+----------------------------------   Adj R-squared   =    0.9034
                                   Total |  .037929913        22  .001724087   Root MSE        =    .01291
                            
                            ------------------------------------------------------------------------------
                                HongKong | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                            -------------+----------------------------------------------------------------
                               Singapore |   .7185142    .071504    10.05   0.000     .5682899    .8687384
                                  Norway |   .4169773   .1119801     3.72   0.002     .1817157    .6522388
                                  Mexico |   .3462425   .0673809     5.14   0.000     .2046804    .4878046
                             Switzerland |  -.6629759   .1766581    -3.75   0.001    -1.034121    -.291831
                                   _cons |  -.0434963     .00739    -5.89   0.000    -.0590222   -.0279704
                            ------------------------------------------------------------------------------
                            
                            . estat ic
                            
                            Akaike's information criterion and Bayesian information criterion
                            
                            -----------------------------------------------------------------------------
                                   Model |          N   ll(null)  ll(model)      df        AIC        BIC
                            -------------+---------------------------------------------------------------
                                       . |         23   41.05077   70.23115       5  -130.4623  -124.7848
                            -----------------------------------------------------------------------------
                            Note: BIC uses N = number of observations. See [R] BIC note.
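                            As a cross-check, the ll(model) and AIC shown by estat ic can be rebuilt from the residual sum of squares in the regression table, and the correction term from the expression quoted in #6 can then be added to put this model on the same AICC scale as the vselect table. This is a sketch with made-up scalar names; that the #6 expression matches what vselect computes internally is taken from that post, not verified here.

                            ```stata
                            * Values read off the regression output above
                            scalar n_obs = 23
                            scalar rss   = .002999111     // residual sum of squares
                            scalar n_par = 5              // 4 slopes + constant (df in estat ic)

                            * Gaussian log likelihood with the ML variance estimate rss/n_obs
                            scalar ll_hat  = -0.5*n_obs*(ln(2*_pi) + ln(rss/n_obs) + 1)
                            scalar aic_hat = -2*ll_hat + 2*n_par
                            display "ll(model) = " ll_hat
                            display "AIC       = " aic_hat

                            * Small-sample correction in the form quoted in #6, with 4 predictors
                            scalar aicc_hat = aic_hat + 2*(4+2)*(4+3)/(n_obs - (4+2) - 1)
                            display "AICC      = " aicc_hat
                            ```

                            Comparing this value with the AICC column of the table in #6 is only meaningful if both use the same formula and the same 23 observations.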
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 19.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://academicweb.nd.edu/~rwilliam/


                            • #15
                              @Richard Williams Dear Richard, if we use the options forward aicc, we cannot examine all subsets.
                              Best regards.

                              Raymond Zhang
                              Stata 17.0,MP
