  • Plot (kernel) density estimates as areas

    This is a brief puff for an idea that has become standard in some quarters, but seems to deserve a bigger push until everyone who might care knows about it. Here is a reproducible example, which as always is indicative, not definitive.

    * Load the example data
    sysuse auto, clear
    * Grid of evaluation points 5(1)49, wider than the observed range of mpg
    gen where = _n + 4 in 1/45
    * Same kernel, bandwidth and grid for both groups
    local choices kernel(biweight) bw(5) at(where)
    kdensity mpg if foreign, `choices' gen(x1 d1)
    kdensity mpg if !foreign, `choices' gen(x0 d0)
    * Vertical positions for the two rugs, just below zero
    gen rug1 = -0.004
    gen rug0 = -0.008
    * Transparent areas for the densities; pipe markers for the rugs
    twoway area d1 d0 where, xtitle("`: var label mpg'") color(orange%40 blue%40) ///
    || scatter rug1 mpg if foreign, ms(|) mc(orange) msize(medlarge) ///
    || scatter rug0 mpg if !foreign, ms(|) mc(blue) msize(medlarge) ///
    legend(order(1 "Foreign" 2 "Domestic") pos(1) ring(0) col(1)) ///
    ytitle(Probability density) yla(, ang(h)) xla(10(10)40)
    [Figure: kdensity.png]

    Kernel density estimates are plotted by default in Stata as lines, meaning curves. It is elementary (meaning, fundamental) that area under the curve has an interpretation as probability.
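
    As a quick check of that interpretation, here is a minimal sketch, assuming the same kernel, bandwidth and grid as in the example above, that integrates one estimate numerically with integ. The result should be close to 1, because the grid extends beyond the support of the estimate. It will not be exactly 1, as the trapezoidal rule is only an approximation.

```stata
* Recreate the grid and one density estimate (same choices as the example above)
sysuse auto, clear
gen where = _n + 4 in 1/45
kdensity mpg if foreign, kernel(biweight) bw(5) at(where) gen(x1 d1) nograph

* Numerical area under the estimated curve, by the trapezoidal rule
integ d1 where, trapezoid
display "Area under the curve: " %6.4f r(integral)
```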

    Often area-based graphs say in a complicated way what could be said much more simply. Bad examples include bars with arbitrary bases that could just be replaced by point symbols for the values in question, or bars that start at zero, when not being zero is banal or irrelevant.

    However, area graphs can be helpful when comparing two or more distributions. (Histograms work that way.) But then transparency becomes vital to see overlap clearly.
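
    The same point can be sketched with histograms. The bin width and start below are arbitrary choices for illustration, not recommendations, and the colour%opacity syntax assumes Stata 15 or later.

```stata
sysuse auto, clear

* Overlaid histograms on a common set of bins; transparency shows the overlap
twoway histogram mpg if foreign, width(5) start(10) color(orange%40) ///
    || histogram mpg if !foreign, width(5) start(10) color(blue%40) ///
    legend(order(1 "Foreign" 2 "Domestic") pos(1) ring(0) col(1)) ///
    ytitle(Probability density) yla(, ang(h))
```

    Without the %40, whichever histogram is drawn second would hide part of the first.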

    You can do something like this directly with kdensity or twoway kdensity with the option recast(area). There is no special rationale for coding as above, although the default of truncating the density at the observed extremes can be unfortunate, so I typically work a little harder at setting up a wider grid on which to calculate estimates.
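
    For comparison, here is a rough sketch of the direct route with twoway kdensity and recast(area). The range(5 49) option is my assumption for matching the wider grid used above. The kernel and bandwidth are carried over from the example, and the colour%opacity syntax assumes Stata 15 or later.

```stata
sysuse auto, clear

* Same kernel and bandwidth for both groups; range() widens the grid
* beyond the observed extremes; recast(area) draws areas, not lines
local choices kernel(biweight) bwidth(5) range(5 49) recast(area)

twoway kdensity mpg if foreign, `choices' color(orange%40) ///
    || kdensity mpg if !foreign, `choices' color(blue%40) ///
    legend(order(1 "Foreign" 2 "Domestic") pos(1) ring(0) col(1)) ///
    ytitle(Probability density) yla(, ang(h)) xtitle("`: var label mpg'")
```

    This is shorter, but it leaves no density variables behind, so adding rugs or reusing the estimates means going back to kdensity with generate().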

    The immediate inspiration for this came from an excellent book by Claus Wilke. This is a link to a review I wrote with several detailed comments:

  • #2
    Thank you once more, Nick, for your incredible contributions. That code has been remarkably useful for our academic projects.


    • #3
      Thanks very much for #2. Anyone interested in this thread might find interesting or even useful.


      • #4
        Hi Nick

        Can you advise how to subset the matched cohort after kernel matching?



        • #5
          Kernel matching is not something I know anything about. I'd advise a separate question.


          • #6
            Extremely useful. Nice and easy to use. I will implement it in my paper. Thanks a lot!


            • #7
              Thank you!!!!!!