How to find optimal bandwidth?

Kathrin Me

Join Date: Sep 2021

Posts: 54
#1

How to find optimal bandwidth?

19 Oct 2021, 01:08

Hi all,

when plotting

Code:

twoway (lpoly logexptot bin if D==0, degree (1) kernel(epanechnikov))(lpoly logexptot bin if D==1, degree (1) kernel(epanechnikov))(scatter logexptot bin if D==0)(scatter logexptot bin if D==1)

Stata determines the bandwidth endogeneously, correct? Can I see for what bandwidth Stata decided and does someone know how to apply either the cross-validation method or where in Stata I find all relevant information for using Silverman's rule of thumb?

Thank you in advance
Kathrin
Tags: bandwidth

Nick Cox

Join Date: Mar 2014
Posts: 35698

19 Oct 2021, 11:49

See the manual entry [R] lpoly for default bandwidth.

I was disappointed in finding that lpoly by default usually undersmoothed to my taste, indeed extraordinarily so. Hence I fooled around with various examples to come up with a different default. The result is localp on SSC, which is just a wrapper for lpoly. The advice is not to use my default but to follow suit in thinking up your own defaults.

I know that sometimes people need to look at lots and lots of these and if so they need some automated (please: not objective; please, not optimal) bandwidth selection method. That's not been my need so my practice is outrageous. Have a default. Smooth more -- or less -- if it helps thinking about the data in the light of what you know about underlying processes.

Silverman's rule of thumb is for density estimation, not scatter plot smoothing, is it not?

localp was announced at https://www.statalist.org/forums/for...ial-regression

There is a known bug in the SSC version so here is the code from my machine.

Code:

*! 1.0.3 NJC 13 November 2020
*! 1.0.2 NJC 11 April 2015
* 1.0.1 NJC 9 April 2015
* 1.0.0 NJC 19 March 2015
program localp
    version 10
    syntax varlist(min=2 max=2 numeric) [fweight aweight] [if] [in] ///
    [, Kernel(str) BWidth(str) DEGree(int 1) AT(str) Msymbol(str)   ///
    GENerate(passthru) * ]

    tokenize "`varlist'"
    args yvar xvar

    if "`kernel'" == "" local kernel "biweight"
    
    if "`bwidth'" == "" {
        su `xvar' `if' `in', meanonly
        local bwidth = 0.2 * (r(max) - r(min))
        local factor = 10^floor(log10(`bwidth'))
        local bwidth = floor(`bwidth' / `factor') * `factor'
    }

    if "`at'" == "" local at `xvar'

    quietly {
        tempvar ypred

        lpoly `yvar' `xvar' `if' `in' [`weight' `exp'], ///
        k(`kernel') bw(`bwidth') at(`at') deg(`degree') ///
        nograph gen(`ypred')

        reg `yvar' `ypred' [`weight' `exp'] `if' `in'
        local rsq  : di %04.3f `e(r2)'
        local rmse : di %5.4g `e(rmse)'
        if c(stata_version) < 11 {
            local text "R-sq = `rsq', RMSE = `rmse'"
        }
        else local text "{it:R}{sup:2} = `rsq', RMSE = `rmse'"
    }

    if "`msymbol'" == "" local msymbol "Oh"

    local word : word `= `degree' + 1' of ///
    mean linear quadratic cubic quartic quintic
    if "`word'" != "" local title "local `word' smooth"
    else local title "local polynomial smooth"

    lpoly `yvar' `xvar' `if' `in' [`weight' `exp'],           ///
    k(`kernel') bw(`bwidth') at(`at') deg(`degree')           ///
    msymbol(`msymbol') `generate'                             ///
    subtitle("`text'", size(medsmall) place(w))               ///
    title(`title', size(medium) place(w)) yla(, ang(h)) `options'
end

So my default is range / 5 rounded down to a nice number. My default does not depend on sample size, which I want to call lacking in subtlety.

I default to biweight, for which I have an irrational fondness.

Reasons for not liking the Epanechnikov: kernel:

1. Whenever a procedure is named after somebody, I want to be able to tell a story about that person. I've never found anything on E.

2. I don't trust my students to spell the name correctly. Heck, I don't trust myself to spell it correctly.

3. Technicalities. I prefer the way that the biweight behaves at the kernel edge.

Comment

FernandoRios

Join Date: Apr 2014

Posts: 2469
#3

19 Oct 2021, 14:20

for cross validation, you have the utilities i explain here:https://www.stata-journal.com/articl...article=st0613
all within package -vc_pack- available from SSC.

Take a look into vc_bw y, vcoeff(x)
This assumes a local linear model.

You could, also use Stata official npregress kernel, making sure you select degree(1) or degree(0), for local linear and local constant models.
Comment
Kathrin Me

Join Date: Sep 2021

Posts: 54
#4

20 Oct 2021, 02:02

Thanks to both of you for your help!

My main regression is:

Code:

egen double_cluster=group(id year) reg logexptot D logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vce(cluster double_cluster)

When using vc-pack, would I then write:

Code:

vc_bwalt logexptot logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vcoeff(D)

and what of the outcome would give me the by cross-validation determined bandwidth?

How can I specify the kernel? The following is not allowed.

Code:

vc_bwalt logexptot logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vcoeff(D) kernel(epanechnikov)

Lastly, I wondered whether --- in general --- it is "allowed" to estimate the model non-parametrically as the running variable (of my RD-regression) is discrete (log population)?

Thanks again
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

20 Oct 2021, 02:31

So, now your question has morphed into one on vc_pack (please explain its provenance: FAQ Advice #12) and as it turns out FernandoRios will be the right person to answer.

Your general question may well be fielded by folks who do this kind of thing.
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2469
#6

20 Oct 2021, 03:52

Hi Kathrin,
couple of thoughts
1) while i suggest vc_bwalt as a more robust method to find optimal bandwidth, vc_bw will be far better.
2) in this command, the running variable goes to -vcoeff()-. So I'm guessing D is that variable.
3) log population doesn't sound discreet. if it works with lpoly, should work here too.
4) you cannot use interactions of the vcoeff() variable as part of explanatory variables.
5) see the helpfile for kernel specification. the key word i use for this is epan.
6) D has o be continuos. otherwise it will not work well (or semi continuous)

HTH
Comment
Kathrin Me

Join Date: Sep 2021

Posts: 54
#7

26 Oct 2021, 10:05

Hi again,

my question moved vc_pack as I understood it as one way how to find the optimal bandwidth using the cross-validation method and this was one of my preferred approaches how to find the optimal bandwidth.

FernandoRios: My running variable is logwpop, D is the treatment variable (i.e. D=1 if logwpop >0 and zero otherwise). So I now changed the code into:

Code:

vc_bw logexptot logwpop_2 logwpop_3 if D == 0, vcoeff(logwpop) kernel(epan) display $opbw_ vc_bw logexptot logwpop_2 logwpop_3 if D == 1, vcoeff(logwpop) kernel(epan) display $opbw_

and I get an optimal bandwidth of .0014289 and .00967457 respectively. I am wondering a bit, because when doing the local polynomial regression plot, i.e.

Code:

twoway (lpoly logexptot bin if D==0, degree (1) lwidth(1) kernel(epanechnikov))(lpoly logexptot bin if D==1, degree (1) lwidth(1) kernel(epanechnikov))(scatter logexptot bin if D==0)(scatter logexptot bin if D==1)

Stata determines .02337655 and .03403092 to be optimal which is quite far away from what I get when using vc_bw.

Thanks again for your help!
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2469
#8

26 Oct 2021, 10:42

Hi Kathrin,
If you are using logwpop as your running variable, you cannot use "logwpop_2 logwpop_3" (which I'm guessing are the squared and cubed of lopwpop). Or any other transformation of logwpop as explanatory variable.

If you want to compare it with lpoly, you will need to do the following:
vc_bw logexptot if D == 0, vcoeff(logwpop) kernel(epan) vc_bw logexptot if D == 1, vcoeff(logwpop) kernel(epan) The BW you find there is the one you could use with lpoly HTH
Comment
Kathrin Me

Join Date: Sep 2021

Posts: 54
#9

28 Oct 2021, 03:40

Hi Fernando,

alright!

When using your specification I get 2.385829 if D == 0 and .0410291 if D == 1. The latter is somewhat close to what I find with lpoly. However 2.39 is still quite far away from 0.023. Do you have any idea for why I find so different results?

Thanks again

Best, Kathrin
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2469
#10

28 Oct 2021, 06:33

Probably because some sparse distribution of the running variable for d=0
Bw is sensitive to outliers (specially near the boundaries)

since you are using epan I would double check the bw using vc_bwalt
epan cross validation plot tends to be highly nonlinear and that could be a problem
Comment
Kathrin Me

Join Date: Sep 2021

Posts: 54
#11

28 Oct 2021, 06:56

I tried already, but this gives me the following error:

"unable to allocate matrix;
You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables.

You are using Stata/IC which supports matrices with up to 800 rows or columns. See limits for how many more rows and columns Stata/SE and Stata/MP can support.

If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set
emptycells.

If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such
variables.
r(915);"
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2469
#12

28 Oct 2021, 11:15

mmm, I see. I will check my code to see if I could bypass that. But it seems that you go through too many iterations. One thing you could do is use the following:
vc_bwalt y, vcoeff(x) bwi(2.3858) kernel(epan)

That way uses bwi as initial value.

The other thing you could do, if it ends in error (as you show) type
matrix list cvbw

This will give show you all the bandwidths and Crossvalidations. You can plot them to see how stable was the solution towards the optimal bandwidth.

Hope this helps

F
Comment
Kathrin Me

Join Date: Sep 2021

Posts: 54
#13

29 Oct 2021, 02:18

So when using vc_bwalt logexptot if D == 1, vcoeff(logwpop) kernel(epan) bwi(2.39) I get a bandwidth of 51099.896.

I now also tried your other suggestion, npregress.

Given my main regression,

Code:

egen double_cluster=group(id year) reg logexptot D logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vce(cluster double_cluster)

how would I specify the command? Do I still use logwpop or do I have to switch back to D (an indicator variable for my treatment)? Can I use logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3 in the specification?

So far I tried:

Code:

npregress kernel logexptot logwpop if D == 0

where I get this as an output:

What of the output gives me the optimal bandwidth? Would it be -4.19 in this case?

Thanks again for your help!
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2469
#14

29 Oct 2021, 04:15

i should have asked for a plot first!
so the bandwidth with npregress is 97953
in other words the function between logextpop and logwpop Is basically linear.
it could be because your sample size is so small.
Comment
Kathrin Me

Join Date: Sep 2021

Posts: 54
#15

30 Oct 2021, 04:48

I see!
Do I need to include logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3 in the specification?
Comment

Announcement