Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to find optimal bandwidth?

    Hi all,

    when plotting
    Code:
    twoway (lpoly logexptot bin if D==0,  degree (1) kernel(epanechnikov))(lpoly logexptot bin if D==1,  degree (1) kernel(epanechnikov))(scatter logexptot bin if D==0)(scatter logexptot bin if D==1)
    Stata determines the bandwidth endogeneously, correct? Can I see for what bandwidth Stata decided and does someone know how to apply either the cross-validation method or where in Stata I find all relevant information for using Silverman's rule of thumb?

    Thank you in advance
    Kathrin

  • #2
    See the manual entry [R] lpoly for default bandwidth.

    I was disappointed in finding that lpoly by default usually undersmoothed to my taste, indeed extraordinarily so. Hence I fooled around with various examples to come up with a different default. The result is localp on SSC, which is just a wrapper for lpoly. The advice is not to use my default but to follow suit in thinking up your own defaults.

    I know that sometimes people need to look at lots and lots of these and if so they need some automated (please: not objective; please, not optimal) bandwidth selection method. That's not been my need so my practice is outrageous. Have a default. Smooth more -- or less -- if it helps thinking about the data in the light of what you know about underlying processes.

    Silverman's rule of thumb is for density estimation, not scatter plot smoothing, is it not?

    localp was announced at https://www.statalist.org/forums/for...ial-regression

    There is a known bug in the SSC version so here is the code from my machine.

    Code:
    *! 1.0.3 NJC 13 November 2020
    *! 1.0.2 NJC 11 April 2015
    * 1.0.1 NJC 9 April 2015
    * 1.0.0 NJC 19 March 2015
    program localp
        version 10
        syntax varlist(min=2 max=2 numeric) [fweight aweight] [if] [in] ///
        [, Kernel(str) BWidth(str) DEGree(int 1) AT(str) Msymbol(str)   ///
        GENerate(passthru) * ]
    
        tokenize "`varlist'"
        args yvar xvar
    
        if "`kernel'" == "" local kernel "biweight"
        
        if "`bwidth'" == "" {
            su `xvar' `if' `in', meanonly
            local bwidth = 0.2 * (r(max) - r(min))
            local factor = 10^floor(log10(`bwidth'))
            local bwidth = floor(`bwidth' / `factor') * `factor'
        }
    
        if "`at'" == "" local at `xvar'
    
        quietly {
            tempvar ypred
    
            lpoly `yvar' `xvar' `if' `in' [`weight' `exp'], ///
            k(`kernel') bw(`bwidth') at(`at') deg(`degree') ///
            nograph gen(`ypred')
    
            reg `yvar' `ypred' [`weight' `exp'] `if' `in'
            local rsq  : di %04.3f `e(r2)'
            local rmse : di %5.4g `e(rmse)'
            if c(stata_version) < 11 {
                local text "R-sq = `rsq', RMSE = `rmse'"
            }
            else local text "{it:R}{sup:2} = `rsq', RMSE = `rmse'"
        }
    
        if "`msymbol'" == "" local msymbol "Oh"
    
        local word : word `= `degree' + 1' of ///
        mean linear quadratic cubic quartic quintic
        if "`word'" != "" local title "local `word' smooth"
        else local title "local polynomial smooth"
    
        lpoly `yvar' `xvar' `if' `in' [`weight' `exp'],           ///
        k(`kernel') bw(`bwidth') at(`at') deg(`degree')           ///
        msymbol(`msymbol') `generate'                             ///
        subtitle("`text'", size(medsmall) place(w))               ///
        title(`title', size(medium) place(w)) yla(, ang(h)) `options'
    end


    So my default is range / 5 rounded down to a nice number. My default does not depend on sample size, which I want to call lacking in subtlety.

    I default to biweight, for which I have an irrational fondness.

    Reasons for not liking the Epanechnikov: kernel:

    1. Whenever a procedure is named after somebody, I want to be able to tell a story about that person. I've never found anything on E.

    2. I don't trust my students to spell the name correctly. Heck, I don't trust myself to spell it correctly.

    3. Technicalities. I prefer the way that the biweight behaves at the kernel edge.

    Comment


    • #3
      for cross validation, you have the utilities i explain here:https://www.stata-journal.com/articl...article=st0613
      all within package -vc_pack- available from SSC.

      Take a look into vc_bw y, vcoeff(x)
      This assumes a local linear model.

      You could, also use Stata official npregress kernel, making sure you select degree(1) or degree(0), for local linear and local constant models.

      Comment


      • #4
        Thanks to both of you for your help!

        My main regression is:
        Code:
        egen double_cluster=group(id year)
        reg logexptot D logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vce(cluster double_cluster)
        When using vc-pack, would I then write:
        Code:
        vc_bwalt logexptot logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vcoeff(D)
        and what of the outcome would give me the by cross-validation determined bandwidth?
        Click image for larger version

Name:	bandw.PNG
Views:	1
Size:	9.4 KB
ID:	1632510


        How can I specify the kernel? The following is not allowed.
        Code:
        vc_bwalt logexptot logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vcoeff(D) kernel(epanechnikov)
        Lastly, I wondered whether --- in general --- it is "allowed" to estimate the model non-parametrically as the running variable (of my RD-regression) is discrete (log population)?

        Thanks again

        Comment


        • #5
          So, now your question has morphed into one on vc_pack (please explain its provenance: FAQ Advice #12) and as it turns out FernandoRios will be the right person to answer.

          Your general question may well be fielded by folks who do this kind of thing.

          Comment


          • #6
            Hi Kathrin,
            couple of thoughts
            1) while i suggest vc_bwalt as a more robust method to find optimal bandwidth, vc_bw will be far better.
            2) in this command, the running variable goes to -vcoeff()-. So I'm guessing D is that variable.
            3) log population doesn't sound discreet. if it works with lpoly, should work here too.
            4) you cannot use interactions of the vcoeff() variable as part of explanatory variables.
            5) see the helpfile for kernel specification. the key word i use for this is epan.
            6) D has o be continuos. otherwise it will not work well (or semi continuous)

            HTH

            Comment


            • #7
              Hi again,

              my question moved vc_pack as I understood it as one way how to find the optimal bandwidth using the cross-validation method and this was one of my preferred approaches how to find the optimal bandwidth.

              FernandoRios: My running variable is logwpop, D is the treatment variable (i.e. D=1 if logwpop >0 and zero otherwise). So I now changed the code into:

              Code:
              vc_bw logexptot logwpop_2 logwpop_3 if D == 0, vcoeff(logwpop) kernel(epan) 
              display $opbw_
              vc_bw logexptot logwpop_2 logwpop_3 if D == 1, vcoeff(logwpop) kernel(epan) 
              display $opbw_
              and I get an optimal bandwidth of .0014289 and .00967457 respectively. I am wondering a bit, because when doing the local polynomial regression plot, i.e.

              Code:
              twoway (lpoly logexptot bin if D==0,  degree (1) lwidth(1) kernel(epanechnikov))(lpoly logexptot bin if D==1,  degree (1) lwidth(1) kernel(epanechnikov))(scatter logexptot bin if D==0)(scatter logexptot bin if D==1)
              Stata determines .02337655 and .03403092 to be optimal which is quite far away from what I get when using vc_bw.

              Thanks again for your help!

              Comment


              • #8
                Hi Kathrin,
                If you are using logwpop as your running variable, you cannot use "logwpop_2 logwpop_3" (which I'm guessing are the squared and cubed of lopwpop). Or any other transformation of logwpop as explanatory variable.

                If you want to compare it with lpoly, you will need to do the following:
                vc_bw logexptot if D == 0, vcoeff(logwpop) kernel(epan) vc_bw logexptot if D == 1, vcoeff(logwpop) kernel(epan) The BW you find there is the one you could use with lpoly HTH

                Comment


                • #9
                  Hi Fernando,

                  alright!

                  When using your specification I get 2.385829 if D == 0 and .0410291 if D == 1. The latter is somewhat close to what I find with lpoly. However 2.39 is still quite far away from 0.023. Do you have any idea for why I find so different results?

                  Thanks again

                  Best, Kathrin

                  Comment


                  • #10
                    Probably because some sparse distribution of the running variable for d=0
                    Bw is sensitive to outliers (specially near the boundaries)

                    since you are using epan I would double check the bw using vc_bwalt
                    epan cross validation plot tends to be highly nonlinear and that could be a problem

                    Comment


                    • #11
                      I tried already, but this gives me the following error:

                      "unable to allocate matrix;
                      You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables.

                      You are using Stata/IC which supports matrices with up to 800 rows or columns. See limits for how many more rows and columns Stata/SE and Stata/MP can support.

                      If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set
                      emptycells.

                      If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such
                      variables.
                      r(915);"

                      Comment


                      • #12
                        mmm, I see. I will check my code to see if I could bypass that. But it seems that you go through too many iterations. One thing you could do is use the following:
                        vc_bwalt y, vcoeff(x) bwi(2.3858) kernel(epan)

                        That way uses bwi as initial value.

                        The other thing you could do, if it ends in error (as you show) type
                        matrix list cvbw

                        This will give show you all the bandwidths and Crossvalidations. You can plot them to see how stable was the solution towards the optimal bandwidth.

                        Hope this helps

                        F

                        Comment


                        • #13
                          So when using vc_bwalt logexptot if D == 1, vcoeff(logwpop) kernel(epan) bwi(2.39) I get a bandwidth of 51099.896.

                          I now also tried your other suggestion, npregress.

                          Given my main regression,

                          Code:
                          egen double_cluster=group(id year)
                          reg logexptot D logwpop logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3, vce(cluster double_cluster)
                          how would I specify the command? Do I still use logwpop or do I have to switch back to D (an indicator variable for my treatment)? Can I use logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3 in the specification?

                          So far I tried:

                          Code:
                          npregress kernel logexptot logwpop if D == 0
                          where I get this as an output:

                          Click image for larger version

Name:	bwi.JPG
Views:	1
Size:	75.9 KB
ID:	1633784



                          What of the output gives me the optimal bandwidth? Would it be -4.19 in this case?

                          Thanks again for your help!

                          Comment


                          • #14

                            i should have asked for a plot first!
                            so the bandwidth with npregress is 97953
                            in other words the function between logextpop and logwpop Is basically linear.
                            it could be because your sample size is so small.

                            Comment


                            • #15
                              I see!
                              Do I need to include logwpop_2 logwpop_3 logwpop_D1 logwpop_D2 logwpop_D3 in the specification?

                              Comment

                              Working...
                              X