
Richard Williams

Join Date: Apr 2014

Posts: 5012
#31

28 Jul 2021, 12:43

When in doubt, I find that it is usually best just to believe whatever William Lisowski says. ;-) In this case, regardless of what the original authors did, why can’t you use something better if you have it? It could even be a selling point for the paper.

For that matter, maybe lasso is worth checking into.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/

daniel klein

Join Date: Mar 2014
Posts: 3876

#32

28 Jul 2021, 13:29

Here is how I understand the situation:

A brute-force approach, using tuples (SSC), is not feasible. You might be able to run the roughly 135,000 models with exactly 6 predictors in a somewhat reasonable time; you will not be able to run the full 16+ million models with every combination of 24 predictors.
If Raymond wants that one specific method/algorithm that the original authors used, and if that specific method/algorithm is implemented in R, Raymond should use R.
If Raymond wants that one specific method/algorithm in Stata, Raymond probably needs to implement it; vselect might or might not be a good starting point.
vselect is [Edit: supposed to be] great at what it does; what vselect does might arguably be "better" in some sense than what the original authors and/or the R package do.
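
The subset counts behind the brute-force point can be checked directly in Stata. This is only a back-of-the-envelope sketch; it uses the built-in comb() function and needs no data in memory.

```stata
* Number of models with exactly 6 of the 24 candidate predictors
display %12.0fc comb(24, 6)

* Number of subsets of the 24 predictors (this count includes the empty model)
display %12.0fc 2^24
```

The first count is 134,596 and the second is 16,777,216, which is where the "16+ million" figure comes from.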

At this point, there is nothing left for me to contribute. I am bailing out.

Edit:

Because William Lisowski was wondering, there seems to be something wrong with vselect (*! version 1.2.0 21nov2014, from SSC). The code in #6 applied to the dataset in #6 yields the claimed results:

Code:

. global m Australia Austria Canada Denmark Finland France Germany ///
> Italy Japan Korea Mexico Netherlands NewZealand Norway ///
> Switzerland UnitedKingdom UnitedStates Singapore Philippines ///
> Indonesia Malaysia Thailand Taiwan China

. 
. vselect HongKong $m if _n<=23 ,best

Response :             HongKong
Selected predictors:   Australia Austria Canada Denmark Finland France Germany Italy Japan Korea Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines I
> ndonesia Malaysia Thailand Taiwan China

Optimal models: 

   # Preds     R2ADJ         C       AIC      AICC       BIC
         1   .321052  1.05e+12 -88.07735 -86.81419 -85.80636
(output omitted)
         6  .8914771  1.27e+11 -126.5042 -116.2185 -118.5558
         7  .9159515  9.25e+10 -131.8667 -118.0205 -122.7827
(output omitted)
        24         1        25         .         .         .

predictors for each model:

1  :  Austria
2  :  Austria Canada
(output omitted)
6  :  Canada France Italy Norway UnitedStates Singapore
7  :  Austria Canada France Italy Norway UnitedStates Singapore
(output omitted)
24 :  Australia Austria Canada Denmark Finland France Germany Italy Japan Korea Mexico Netherlands NewZealand Norway Switzerland UnitedKingdom UnitedStates Singapore Philippines Indonesia Malaysia
>  Thailand Taiwan China

However, starting the process with the selected predictors from step 6, we get

Code:

. vselect HongKong Canada France Italy Norway UnitedStates Singapore if _n<=23 , best

Response :             HongKong
Selected predictors:   Norway Canada Singapore France UnitedStates Italy

Optimal models: 

   # Preds     R2ADJ         C       AIC      AICC       BIC
         1  .6890542   63.9664 -106.0386 -104.7755 -103.7677
(output omitted)
         6  .9212951         7 -133.8931 -123.6074 -125.9447

predictors for each model:

1  :  Singapore
(output omitted)
6  :  Norway Canada Singapore France UnitedStates Italy

I do not know anything about the underlying algorithm, but it seems strange that the same models produce different information criteria. Also, those C values in the first run look suspiciously large to me.
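
One possible reading of the C column, assuming vselect reports the standard Mallows's C_p (this is not confirmed anywhere in the thread): for a candidate model with p parameters, counting the constant, fitted to n observations,

```latex
C_p = \frac{\mathrm{RSS}_p}{\hat\sigma^2_{\text{full}}} - n + 2p,
\qquad
\hat\sigma^2_{\text{full}} = \frac{\mathrm{RSS}_{\text{full}}}{n - p_{\text{full}}}
```

For the full model, RSS_p equals RSS_full, so C_p = (n - p_full) - n + 2 p_full = p_full. That matches C = 7 for the 6-predictor model in the second run and C = 25 for the 24-predictor model in the first run. In the first run, however, n = 23 is smaller than p_full = 25, so the full model fits perfectly (R2ADJ is reported as 1) and the full-model variance estimate is not well defined; dividing by a value near zero would be consistent with C values of order 1e+11. This does not by itself explain why R2ADJ, AIC, AICC, and BIC differ between the two runs.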

Last edited by daniel klein; 28 Jul 2021, 14:16.


Raymond Zhang

Join Date: Jan 2021

Posts: 349
#33

28 Jul 2021, 14:06

Originally posted by Richard Williams

Leonardo, if you know that you absolutely positively want 6 variables, that sounds good!

Raymond's commands always ended with

if _n<=23

Should the reg command in your do file do that too? (I don't understand why the if is in there anyway.)

The reason I use the if qualifier is that I want to run placebo tests, so I will also estimate other models. In the paper, period 44 is the policy date. A placebo test means running the analysis as if the policy had happened in an earlier pre-treatment period, such as period 24, period 25, ..., period 43. In #12, I showed just one of them.

Code:

regress HongKong $m if _n<=24
regress HongKong $m if _n<=25
regress HongKong $m if _n<=26
......
regress HongKong $m if _n<=43
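
The sequence of placebo regressions above can also be written as a loop. A minimal sketch, assuming the dataset from #6 is in memory with one row per period in time order, and that the global m holds the 24 predictor names as defined in #32:

```stata
* Assumes: data from #6 in memory, rows in time order, and
* global m defined as in #32 (both are assumptions of this sketch).
forvalues t = 24/43 {
    display as text "Placebo cutoff: periods 1 to `t'"
    regress HongKong $m if _n <= `t'
}
```

With 24 predictors plus a constant, the cutoffs at 24 and 25 periods leave no residual degrees of freedom, so the full regressions at those cutoffs are only meaningful after variable selection.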

Last edited by Raymond Zhang; 28 Jul 2021, 14:08.

Best regards.

Raymond Zhang
Stata 17.0,MP
William Lisowski

Join Date: Dec 2014

Posts: 10150
#34

28 Jul 2021, 15:58

daniel klein at #32 -

Many thanks for un-bailing long enough to take the time to test vselect. I wish I could have tested this earlier; it would have saved a lot of time.

I agree that there is a problem in vselect 1.2.0 (to the best of my knowledge the latest version).

This doesn't blunt my enthusiasm for the theory of leaps-and-bounds best subset regression.
Raymond Zhang

Join Date: Jan 2021

Posts: 349
#35

28 Jul 2021, 22:09

Originally posted by Richard Williams

Is the data you posted the same data used in the paper you cite?

Yes, the data are the same as in the paper.

Raymond Zhang

Join Date: Jan 2021

Posts: 349
#36

29 Jul 2021, 04:42

Originally posted by Leonardo Guizzetti

In the time you've been puzzling through how to use existing commands, this solution could have been brute-forced for some or all subsets of variables since you're still in that "scale" of computation. This isn't necessarily a "smart" approach, but when puzzling over software packages and programs and when trying to match with a published paper, sometimes it's nice to know exactly what the correct/real answer is. It's possible that some software packages have undetected bugs or don't quite behave the way you want them to, or that the authors of the paper made mistakes.

In any case, for the case of best subset regression with only 6 variables with your full dataset, a single model has the lowest AIC, AICC and BIC. The do-file is attached for educational purposes.

Code:

+----------------------------------------------------------------------------------------------------+
|        v1       v2       v3          v4         v5       v6          aic         aicc          bic |
|----------------------------------------------------------------------------------------------------|
| Australia   Mexico   Norway   Singapore   Thailand   Taiwan   -335.82459   -333.05536   -321.04847 |
+----------------------------------------------------------------------------------------------------+

Dear @Leonardo Guizzetti,
Thank you for your code. Your code uses the tuples command. But, just as Daniel said, with 24 variables it would take a very long time to select the best model: it would have to run 2^24 = 16,777,216 regressions. It is impossible for Stata to run them.

Richard Williams

Join Date: Apr 2014

Posts: 5012
#37

29 Jul 2021, 06:56

Originally posted by Raymond Zhang

Dear @Leonardo Guizzetti,
Thank you for your code. Your code uses the tuples command. But, just as Daniel said, with 24 variables it would take a very long time to select the best model: it would have to run 2^24 = 16,777,216 regressions. It is impossible for Stata to run them.

Which brings us back to a question that has been asked several times: Are you trying to do the impossible? If the R program you have referred to does what you want, why not use it?

If you can't brute force it, why not use vselect or maybe lasso?

It remains unclear to me why you think you can do this. But can't you come close enough with some of the other routines for variable selection?

Or, write to the authors and ask them how they did it. Maybe they will send you their code or at least say what software they used.
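
For the lasso route, Stata 16 and later ship a built-in lasso command. A minimal sketch under the same assumptions as earlier in the thread (data from #6 in memory, global m holding the 24 predictors). BIC-based selection is only an illustrative choice here, not something the paper is known to use, and selection(bic) requires Stata 17 or later:

```stata
* Assumes: data from #6 in memory and global m defined as in #32.
* selection(bic) needs Stata 17+; the default is selection(cv).
lasso linear HongKong $m if _n <= 23, selection(bic)

* List the predictors with nonzero coefficients at the selected lambda
lassocoef
```

Lasso does not by default target a fixed number of predictors, so its selection need not coincide with a best-subset search restricted to 6 variables.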

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#38

29 Jul 2021, 07:59

I'm checking out of this thread as I can't help further. Some quick math suggests that one full iteration through all possible combinations (up to 20 selected variables, specified in #20) can be feasibly run in just under 48 hours on my machine with that code example from earlier. However, if you insist on re-running the whole procedure with one more observation added in each iteration, you're looking at an achievable yet impractical solution. At least, I can think of better things to do than fully occupy my computer for a few months.
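
The size of one such pass can be sketched in Stata. The 48-hour figure is the estimate quoted above; the throughput derived from it is only an implied number for that machine, not a measurement.

```stata
* Count the nonempty subsets of 24 predictors that use at most 20 of them
local total = 0
forvalues k = 1/20 {
    local total = `total' + comb(24, `k')
}
display %12.0fc `total'

* Implied throughput if one full pass takes 48 hours
display %6.1f (`total' / (48*60*60))
```

This gives 16,774,890 models per pass, or roughly 97 regressions per second over 48 hours.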

Raymond Zhang

Join Date: Jan 2021
Posts: 349

#39

29 Jul 2021, 09:00

@Leonardo Guizzetti @William Lisowski @Richard Williams @daniel klein Thank you very much for all of your discussion of this issue; I have benefited a lot.
At the upcoming 2021 Chinese Stata Conference, a Chinese student will present a Stata command to replicate the paper attached in #6.
I am very curious about which method he will use to select the best subset regression.

Code:

2021 Chinese Stata Conference | Stata

Code:

3:20–4:20
Regression control method and Stata application
Abstract: The regression control method (Hsiao, Ching, and Wan 2012) has become an important method for evaluating policy effects using panel data. This presentation will introduce the basic principles of the regression control method, including the use of information criteria or lasso to select cross-sectional units, and the addition of covariates to the regression control method. Then, through the community-contributed command, the specific operation of the regression control method is introduced in detail with classic cases, including the perfect drawing function and placebo test.
Yan Guanpeng
Shandong University


Richard Williams

Join Date: Apr 2014

Posts: 5012
#40

29 Jul 2021, 09:26

The abstract says "This presentation will introduce the basic principles of the regression control method, including the use of information criteria or lasso to select cross-sectional units, and the addition of covariates to the regression control method." The conference is less than a month away, so I suggest you wait or else ask for an advance copy. We're just kind of spinning our wheels here, and the author may have come up with a better solution.
