  • Distributional graphing - Kernel density

    Dear All,
    I have a cross-sectional survey dataset with weights. To plot the distribution of income (not logged), I used
    Code:
    kdensity income
    The result was a graph of the kernel density estimate, with density on the y-axis. How do I show the actual population size on the y-axis instead of the proportion?

    Thank you,
    Dapel

  • #2
    You can't. Kernel density graphs show densities, not counts. Try histogram income, frequency instead, e.g.,

    Code:
    . sysuse nlsw88
    (NLSW, 1988 extract)
    
    . histogram wage, frequency
    (bin=33, start=1.0049518, width=1.2042921)
    [Image: wage.png]
    David Radwin
    Senior Researcher, California Competes
    californiacompetes.org
    Pronouns: He/Him


    • #3
      Thanks, Dave. Please have a look at the graphs on page 13 of this paper: http://www.columbia.edu/~xs23/papers...2006.121.2.pdf. They are similar to what I want. How can I plot this?


      • #4
        The units of density are probability per unit of measurement, dollars or log dollars or whatever. So, the area under the curve integrates to 1, namely the total probability in the distribution. If you want to show something else, work out what you think the distribution integrates to (the population?) and just modify your y axis labels accordingly. You can use e.g. mylabels (SSC) as a convenience command. It's just a multiplication. As long as the axis information is what you need, nothing else need be changed.
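A minimal sketch of that multiplication, assuming the curve should integrate to the number of observations. It uses the auto dataset shipped with Stata as a stand-in; the variable names kd_x, kd_d and scaled are illustrative, and with survey data the multiplier would be whatever population total you decide on (for example, the sum of the weights).

```stata
* Sketch only: rescale a kernel density so that it integrates to a chosen total.
sysuse auto, clear

* Save the evaluation points and the density estimates as new variables.
kdensity mpg, generate(kd_x kd_d) nograph

* Assumed total: the number of nonmissing observations.
quietly count if !missing(mpg)
generate scaled = kd_d * r(N)

* The curve has the same shape as before; only the vertical scale has changed.
twoway line scaled kd_x, ytitle("Cars per mpg") xtitle("Mileage (mpg)")
```

The units on the vertical axis are still "per mpg"; the rescaling changes what the area under the curve adds up to, not the nature of the scale.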


        • #5
          Thanks Prof. How do I modify this
          Code:
           mylabels 0(25)100, myscale(@/100) local(labels)
          kdensity  income [aw=weight]
          to suit what I want? I want a distribution graph with the y-axis transformed from probability density to actual values, in this case the population size.


          • #6
            Thanks Prof. How do I modify this
            Code:
             mylabels 0(25)100, myscale(@/100) local(labels)
            kdensity  income [aw=weight]
            to suit what I want? I want a kernel density graph with the y-axis transformed from probability density to actual values, in this case the population size. Once weighted, the observations represent millions of people. Alternatively, what code would transform the y-axis to percentages?
            Last edited by Zuhumnan Dapel; 12 Feb 2015, 08:56.


            • #7
              (Thanks for the implied compliment, but I am not a Professor)

              Whatever you do, the density scale cannot be recast as a percent scale. The units are always something per unit of measurement. You can't get rid of the unit of measurement. It's the difference between a function and its integral.

              But you can show percent per unit.

              Code:
               
              sysuse auto, clear
              kdensity mpg
              * ssc inst mylabels for next command to work 
              mylabels 0(2)8, myscale(@/100) local(what)
              kdensity mpg, yla(`what') ytitle(Percent per mpg)


              • #8
                I do not recommend this approach, but you could fake the frequency as the y-axis scale in this manner, continuing the example above:

                Code:
                mylabels 0(2)8, myscale(@/100) local(what)
                kdensity mpg, yla(`what') ytitle(Percent per mpg) name(graph1)
                
                summarize mpg
                mylabels 0(2)8, myscale(@/`r(N)') local(freq)
                kdensity mpg, yla(`freq') ytitle(Frequency) name(graph2)
                
                graph combine graph1 graph2
                The problem is that the plot appears to show that the modal value (the one with the largest number of cases) occurs in about 5 observations, but in truth it occurs in 9 observations.

                Code:
                . tab mpg, sort
                
                    Mileage |
                      (mpg) |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                         18 |          9       12.16       12.16
                         19 |          8       10.81       22.97
                         14 |          6        8.11       31.08
                         21 |          5        6.76       37.84
                         22 |          5        6.76       44.59
                         25 |          5        6.76       51.35
                         16 |          4        5.41       56.76
                         17 |          4        5.41       62.16
                         24 |          4        5.41       67.57
                         20 |          3        4.05       71.62
                         23 |          3        4.05       75.68
                         26 |          3        4.05       79.73
                         28 |          3        4.05       83.78
                         12 |          2        2.70       86.49
                         15 |          2        2.70       89.19
                         30 |          2        2.70       91.89
                         35 |          2        2.70       94.59
                         29 |          1        1.35       95.95
                         31 |          1        1.35       97.30
                         34 |          1        1.35       98.65
                         41 |          1        1.35      100.00
                ------------+-----------------------------------
                      Total |         74      100.00


                • #9
                  Thank you Sir.


                  • #10
                    I don't think anyone ever claimed that kernel density estimation estimates the density at the mode accurately. Loosely, a smoother will always reduce peaks and fill valleys. That's (a side-effect of) its job!

                    I think there is a standard issue of over-use. I suspect too many regard the density estimate produced by the default kernel and default width as canonical. It was ever thus. I don't find that students exploit the scope for tuning the details of a histogram as often as they should.
                    Last edited by Nick Cox; 12 Feb 2015, 13:12.


                    • #11
                      I agree completely on all points, but on the first point I suspect many readers interpret kernel density plots as essentially equivalent to histograms.

                      It's not incumbent on authors to anticipate and preempt every possible type of misinterpretation by readers, but it's good practice to avoid obvious pitfalls.

                      David


                      • #12
                        Hi all,
                        I have gone through all of this and am trying something similar: I want a percentage on the y-axis instead of density. I tried the suggestion above, but then my kdensity graph becomes flat.

                        I am trying to generate kdensity for wages but for a certain range of wages.
                        Code:
                        mylabels 0(2)8, myscale(@/100) local(what)
                        
                        kdensity formalfwage if formalfwage<25000, yla(`what') ytitle(Percent per formalfwage)
                        I do get a kdensity graph when using the following code
                        Code:
                        mylabels 0(200)8, myscale(@/100) local(what)
                        
                        kdensity formalfwage if formalfwage<25000, yla(`what') ytitle(Percent per formalfwage)
                        but then it only shows zero as in the following
                        [Image: Graph.png]

                        I am sorry that the question is not well written.


                        • #13
                          As your range of wage is about 25000 -- in whatever currency units you are using -- it follows that the probability density is typically of the order of 1/25000 with the peak density, let's say, about 5 times that, of the order of 1/5000. If you multiply that scale by 100 the peak density -- with units now percent per unit currency -- still only is 100/5000 or 1/50 or 0.02, and thus much less than 2 and similarly much, much less than any other label you want to see. Hence Stata just ignores that label as being way out of range. Now multiply by another 1000 and 0.02 becomes 20 and you can then show say 0 5 10 15 20 but the units are now percent per 1000 units of currency.
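The arithmetic above can be sketched on a stand-in variable, since the formalfwage data are not available here. This assumes price in the auto dataset (measured in dollars) as a substitute and picks the label values 0(10)30 for illustration; mylabels must be installed from SSC first.

```stata
* ssc install mylabels   // needed once; mylabels is a community-contributed command
sysuse auto, clear

* price is in dollars, so the default density scale is probability per dollar.
* A label L meaning "L percent per 1000 dollars" belongs at density L/(100*1000).
mylabels 0(10)30, myscale(@/100000) local(what)

* Same curve as plain -kdensity price-; only the axis labels and title differ.
kdensity price, ylabel(`what') ytitle("Percent per 1000 dollars")
```

The divisor in myscale() is the product of the two conversions (100 for percent, 1000 for the currency unit), so the axis title has to name both.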

                          I find it easier to explain what probability density is than to choose unusual units in the vague hope that readers will find them congenial.

                          Either way, calling your vertical units percent is strictly meaningless without a statement of the units on the horizontal axis.

                          Not the question, but I guess most people in the field would work with log wage and/or smooth more than the default. In your graph minor modes around 10000 12000 15000 20000 seem all too likely to be side-effects of coarseness in reporting data. If the intent is to graph the data faithfully a spikeplot would be an honest graph; conversely, if you decide minor modes are not of inherent interest or concern you need to smooth more.
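Both options can be tried quickly. The sketch below assumes the nlsw88 dataset shipped with Stata as a stand-in for the wage data in question; the variable name lnwage and the choice to double the default bandwidth are arbitrary illustrations, not recommendations.

```stata
sysuse nlsw88, clear

* Faithful display: one spike per distinct value, height equal to its frequency.
spikeplot wage

* Smoother display: work on the log scale and widen the bandwidth.
generate lnwage = ln(wage)
quietly kdensity lnwage, nograph
local wider = 2 * r(bwidth)      // doubling the default is an illustrative choice
kdensity lnwage, bwidth(`wider')
```

Comparing graphs across a few bandwidths is usually more informative than trusting any single setting.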

                          Over several years on Statalist and in other forums I've seen many variants of this question.

                          I imagine the pedagogic problem arises like this. People meet histograms early in their statistical education and often see frequency scales. On a histogram the bin width is evident and it would, I guess, be thought unnecessary or pedantic to spell out the units beyond "Frequency". Equally, percent is hardly more difficult to understand, and it's tacit that strictly the units are percent given a certain bin width.

                          Kernel density estimation often is encountered much later and it may be that many writers or teachers assume that readers or students already understand from elsewhere about probability density. In my experience that is often not the case, or people can't make the jump between probability theory and applied statistics.

                          And it doesn't seem customary with kernel density estimation to label the density axis with any units. A good reason for this is to suppose that it is evident that the area under the curve shows probability, so the units of the height of the curve are a step towards that fundamental and fairly easy fact. Another good, or defensible, reason not to show units is that the units of probability density are often awkward to think about directly. Thus (and it is easy to think up more bizarre examples) I often deal with river discharges which are volume of water per unit time, so their probability density has units the reciprocal of cubic metres per second (or cubic feet per second in certain countries), which doesn't have much meaning even to those used to hydrological data.

                          Not the question either, but I've seen comments that probability densities above 1 show a bug in the command. Those saying this are always ignoring the horizontal range on their graph, which, if less than 1 in the units shown, obliges at least some of the density to be more than 1 in the reciprocal of those units.
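A small simulated check of that point: when all the values fall within a range much narrower than 1, a curve with total area 1 has to rise above 1 somewhere. The seed, sample size and variable names below are arbitrary assumptions.

```stata
clear
set seed 2020
set obs 1000

* Values between 0 and 0.1, so the horizontal range is well below 1.
generate x = runiform() / 10

* Save the density estimates instead of drawing the graph.
kdensity x, generate(gx gd) nograph

* The maximum of gd is well above 1, which is correct behaviour, not a bug.
summarize gd
```

Rescaling x, say to x*100, would divide every density value by 100 without changing the shape of the curve.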
                          Last edited by Nick Cox; 15 Mar 2020, 07:31.


                          • #14
                            Please permit me to jump in with a question: I want to plot a density graph with time on the horizontal axis and the raw values of the variables (in levels) on the vertical axis.

                            Thanks,
                            Dapel
