Distributional graphing - Kernel density

Zuhumnan Dapel

Join Date: Sep 2014

Posts: 392
#1

Distributional graphing - Kernel density

11 Feb 2015, 14:33

Dear All,
I have a cross-sectional survey dataset with weights. To plot the distribution of income (not logged), I used

Code:

kdensity income

The result was a graph of the kernel density estimate, with the density on the y-axis. How do I show the actual population size on the y-axis instead of the proportion?

Thank you,
Dapel
David Radwin

Join Date: Mar 2014

Posts: 368
#2

11 Feb 2015, 15:57

You can't. Kernel density graphs show densities, not counts. Try histogram income, frequency instead, e.g.,

Code:

. sysuse nlsw88
(NLSW, 1988 extract)

. histogram wage, frequency
(bin=33, start=1.0049518, width=1.2042921)


David Radwin
Senior Researcher, California Competes
californiacompetes.org
Pronouns: He/Him
Zuhumnan Dapel

Join Date: Sep 2014

Posts: 392
#3

11 Feb 2015, 18:01

Thanks, Dave. Please have a look at the graphs on page 13 of this paper: http://www.columbia.edu/~xs23/papers...2006.121.2.pdf. They are similar to what I have been wanting. How can I plot this?
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

12 Feb 2015, 00:58

The units of density are probability per unit of measurement, dollars or log dollars or whatever. So, the area under the curve integrates to 1, namely the total probability in the distribution. If you want to show something else, work out what you think the distribution integrates to (the population?) and just modify your y axis labels accordingly. You can use e.g. mylabels (SSC) as a convenience command. It's just a multiplication. As long as the axis information is what you need, nothing else need be changed.
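
A minimal sketch of that multiplication, not based on the poster's data: auto.dta has no survey weights, so the weight variable wt below is made up, and the label positions are built by hand (rather than with mylabels) so that nothing needs installing. Each label is a count per mpg, placed at the count divided by the weighted population size.

```stata
sysuse auto, clear
* Hypothetical weight: pretend each car stands for 1000 cars in the population
generate double wt = 1000

* With aweights, r(sum_w) is the sum of the weights, i.e. the population size
summarize mpg [aw=wt], meanonly
local N = r(sum_w)

* A label showing count c belongs at density c/N, because count = density * N
local yl
forvalues c = 0(2000)6000 {
    local pos = `c'/`N'
    local yl `yl' `pos' "`c'"
}

kdensity mpg [aw=wt], ylabel(`yl', angle(horizontal)) ytitle("Cars per mpg")
```

The curve itself is unchanged; only the axis labels are rescaled, so the y-axis title still needs the "per mpg" part.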
Zuhumnan Dapel

Join Date: Sep 2014

Posts: 392
#5

12 Feb 2015, 08:52

Thanks Prof. How do I modify this

Code:

mylabels 0(25)100, myscale(@/100) local(labels)
kdensity income [aw=weight]

to suit what I want? I want a distribution graph with a transformed y-axis: from probability density to actual values, in this case the population size.
Zuhumnan Dapel

Join Date: Sep 2014

Posts: 392
#6

12 Feb 2015, 08:52

Thanks Prof. How do I modify this

Code:

mylabels 0(25)100, myscale(@/100) local(labels)
kdensity income [aw=weight]

to suit what I want? I want a kernel density graph with a transformed y-axis: from probability density to actual values, in this case the population size. Once weighted, the observations represent millions of people. Alternatively, what code would transform the y-axis to percentages?

Last edited by Zuhumnan Dapel; 12 Feb 2015, 08:56.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#7

12 Feb 2015, 09:22

(Thanks for the implied compliment, but I am not a Professor)

Whatever you do, the density scale cannot be recast as a percent scale. The units are always something per unit of measurement. You can't get rid of the unit of measurement. It's the difference between a function and its integral.

But you can show percent per unit.

Code:

sysuse auto, clear
kdensity mpg
* ssc inst mylabels for next command to work
mylabels 0(2)8, myscale(@/100) local(what)
kdensity mpg, yla(`what') ytitle(Percent per mpg)

David Radwin

Join Date: Mar 2014

Posts: 368
#8

12 Feb 2015, 11:47

I do not recommend this approach, but you could fake the frequency as the y-axis scale in this manner, continuing the example above:

Code:

mylabels 0(2)8, myscale(@/100) local(what)
kdensity mpg, yla(`what') ytitle(Percent per mpg) name(graph1)

summarize mpg
mylabels 0(2)8, myscale(@/`r(N)') local(freq)
kdensity mpg, yla(`freq') ytitle(Frequency) name(graph2)

graph combine graph1 graph2

The problem is that the plot appears to show that the modal value (the one with the largest number of cases) occurs in about 5 observations, but in truth it occurs in 9 observations.

Code:

. tab mpg, sort

    Mileage |
      (mpg) |      Freq.     Percent        Cum.
------------+-----------------------------------
         18 |          9       12.16       12.16
         19 |          8       10.81       22.97
         14 |          6        8.11       31.08
         21 |          5        6.76       37.84
         22 |          5        6.76       44.59
         25 |          5        6.76       51.35
         16 |          4        5.41       56.76
         17 |          4        5.41       62.16
         24 |          4        5.41       67.57
         20 |          3        4.05       71.62
         23 |          3        4.05       75.68
         26 |          3        4.05       79.73
         28 |          3        4.05       83.78
         12 |          2        2.70       86.49
         15 |          2        2.70       89.19
         30 |          2        2.70       91.89
         35 |          2        2.70       94.59
         29 |          1        1.35       95.95
         31 |          1        1.35       97.30
         34 |          1        1.35       98.65
         41 |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00
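
To put a number on "about 5", one possible check (a sketch using only built-in commands) is to save the estimated density, take its maximum, and multiply by the sample size. The variable names gx and fx are arbitrary choices for this illustration.

```stata
sysuse auto, clear
* Store the evaluation points in gx and the density estimates in fx, no graph
kdensity mpg, generate(gx fx) nograph
summarize fx, meanonly
* Peak of the rescaled curve, in cars per mpg
local peak = r(max) * _N
display "peak density x N = " %4.1f `peak'
* Counts at each single value of mpg, for comparison
tabulate mpg, sort
```

The rescaled peak falls well short of the 9 cars observed at 18 mpg, which is the discrepancy described above.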


Zuhumnan Dapel

Join Date: Sep 2014

Posts: 392
#9

12 Feb 2015, 11:53

Thank you Sir.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#10

12 Feb 2015, 13:10

I don't think anyone ever claimed that kernel density estimation estimates the density at the mode accurately. Loosely, a smoother will always reduce peaks and fill valleys. That's (a side-effect of) its job!

I think there is a standard issue of over-use. I suspect too many regard the density estimate produced by the default kernel and default width as canonical. It was ever thus. I don't find that students exploit the scope for tuning the details of a histogram as often as they should.
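
A small sketch of how much the picture depends on the width, using the auto data again. The three bandwidths are arbitrary values chosen for illustration, not recommendations; the /// line continuations assume the code is run from a do-file.

```stata
sysuse auto, clear
* Same data, same default kernel, three different bandwidths (in mpg)
twoway (kdensity mpg, bwidth(1))                          ///
       (kdensity mpg, bwidth(2))                          ///
       (kdensity mpg, bwidth(4)),                         ///
       legend(order(1 "bw = 1" 2 "bw = 2" 3 "bw = 4"))    ///
       ytitle("Density (per mpg)") xtitle("Mileage (mpg)")
```

The narrowest bandwidth gives the tallest, bumpiest curve; the widest flattens the peak and fills the dips.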

Last edited by Nick Cox; 12 Feb 2015, 13:12.
David Radwin

Join Date: Mar 2014

Posts: 368
#11

13 Feb 2015, 10:10

I agree completely on all points, but on the first point I suspect many readers interpret kernel density plots as essentially equivalent to histograms.

It's not incumbent on authors to anticipate and preempt every possible type of misinterpretation by readers, but it's good practice to avoid obvious pitfalls.

David

Zahid Khan

Join Date: Mar 2019

Posts: 16
#12

15 Mar 2020, 06:21

Hi all,
I have gone through all of this and I am trying something similar: I want a percentage on the y-axis instead of density. I tried the suggestion above, but then my kdensity graph becomes flat.

I am trying to produce a kernel density plot for wages, restricted to a certain range of wages.

Code:

mylabels 0(2)8, myscale(@/100) local(what)
kdensity formalfwage if formalfwage<25000, yla(`what') ytitle(Percent per formalfwage)

I do get a kernel density curve when using the following code

Code:

mylabels 0(200)8, myscale(@/100) local(what)
kdensity formalfwage if formalfwage<25000, yla(`what') ytitle(Percent per formalfwage)

but then the y-axis only shows zero, as in the attached graph.

I am sorry that the question is not well written.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#13

15 Mar 2020, 07:29

As your range of wages is about 25000 (in whatever currency units you are using), it follows that the probability density is typically of the order of 1/25000, with the peak density, let's say, about 5 times that, of the order of 1/5000. If you multiply that scale by 100, the peak density, with units now percent per unit of currency, is still only 100/5000 or 1/50 or 0.02, and thus much less than 2 and much, much less than any other label you want to see. Those labels are way out of range for the curve, so the axis has to stretch to reach them and the curve is squashed flat against the x-axis. Now multiply by another 1000 and 0.02 becomes 20, and you can then show, say, 0 5 10 15 20, but the units are now percent per 1000 units of currency.
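
A sketch of that arithmetic with simulated data. The wage variable below is made up (lognormal, mostly below 25000) and merely stands in for formalfwage; the labels are typed by hand so that nothing needs installing. The /// continuations assume a do-file.

```stata
clear
set seed 12345
set obs 2000
* Simulated wages (hypothetical stand-in for the poster's variable)
generate double wage = exp(rnormal(8.5, 0.6))

* Density d shown as "percent per 1000 currency units" is d * 100 * 1000,
* so a label reading 20 sits at density 20/100000 = 1/5000
kdensity wage if wage < 25000,                                        ///
    ylabel(0 "0" 0.00005 "5" 0.0001 "10" 0.00015 "15" 0.0002 "20")    ///
    ytitle("Percent per 1000 currency units")
```

Following the pattern used earlier in the thread, mylabels 0(5)20, myscale(@/100000) local(what) should produce the same label positions.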

I find it easier to explain what probability density is than to choose unusual units in the vague hope that readers will find them congenial.

Either way, calling your vertical units percent is strictly meaningless without a statement of the units on the horizontal axis.

Not the question, but I guess most people in the field would work with log wage and/or smooth more than the default. In your graph minor modes around 10000 12000 15000 20000 seem all too likely to be side-effects of coarseness in reporting data. If the intent is to graph the data faithfully a spikeplot would be an honest graph; conversely, if you decide minor modes are not of inherent interest or concern you need to smooth more.
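
A sketch of those alternatives on the nlsw88 data shipped with Stata (hourly wages, not the poster's data). Doubling the default bandwidth and rounding to 0.5 are arbitrary choices made only for illustration.

```stata
sysuse nlsw88, clear
generate double lnwage = ln(wage)

* Spike plot of the wages themselves, rounded to the nearest 0.5
spikeplot wage, round(0.5) name(spikes, replace)

* Kernel density of log wage with the default bandwidth ...
kdensity lnwage, name(kd_default, replace)
* ... and again with twice that bandwidth, to smooth away minor modes
local bw = 2 * r(bwidth)
kdensity lnwage, bwidth(`bw') name(kd_smooth, replace)
```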

Over several years on Statalist and in other forums I've seen many variants of this question.

I imagine the pedagogic problem arises like this. People meet histograms early in their statistical education and often see frequency scales. On a histogram the bin width is evident and it would, I guess, be thought unnecessary or pedantic to spell out the units beyond "Frequency". Equally, percent is hardly more difficult to understand, and it's tacit that strictly the units are percent given a certain bin width.

Kernel density estimation often is encountered much later and it may be that many writers or teachers assume that readers or students already understand from elsewhere about probability density. In my experience that is often not the case, or people can't make the jump between probability theory and applied statistics.

And it doesn't seem customary with kernel density estimation to label the density axis with any units. A good reason for this is to suppose that it is evident that the area under the curve shows probability, so the units of the height of the curve are a step towards that fundamental and fairly easy fact. Another good, or defensible, reason not to show units is that the units of probability density are often awkward to think about directly. Thus (and it is easy to think up more bizarre examples) I often deal with river discharges which are volume of water per unit time, so their probability density has units the reciprocal of cubic metres per second (or cubic feet per second in certain countries), which doesn't have much meaning even to those used to hydrological data.

Not the question either, but I've seen comments that probability densities above 1 show a bug in the command. Those saying this are always ignoring the horizontal range on their graph, which, if less than 1 in the units shown, obliges at least some of the density to be more than 1 in the reciprocal of those units.
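
A quick illustration with simulated data: a variable confined to an interval of length 0.5 must have density above 1 somewhere, because the area under the curve is still 1. The variable names are arbitrary.

```stata
clear
set seed 2020
set obs 1000
* Uniform on (0, 0.5): the true density is 2 everywhere on that interval
generate double x = 0.5 * runiform()

* Save the estimate rather than graphing it, then look at its maximum
kdensity x, generate(gx fx) nograph
summarize fx, meanonly
local peak = r(max)
display "largest estimated density = " %5.2f `peak'
```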

Last edited by Nick Cox; 15 Mar 2020, 07:31.
Zuhumnan Dapel

Join Date: Sep 2014

Posts: 392
#14

25 Mar 2020, 07:58

Please permit me to jump in with a question: how can I plot a density graph with time on the horizontal axis and the raw values of the variables (in levels) on the vertical axis?

Thanks,
Dapel