  • Continuing lines at top and bottom of RCD plot

    I am using the user-written command -distplot- to graph reverse cumulative distribution curves of a continuous log-transformed variable, -over- a categorical variable.

    My code is similar to the following:

    sysuse auto, clear
    gen lnmpg = log(mpg)
    distplot lnmpg, over(foreign) reverse(ge) midpoint

    This code graphs two lines with different start and end points (see attached .png). For example, in the "foreign" category, the top of the curve is not graphed until approximately lnmpg = 2.6, even though in theory the probability of a value at or above x is 1 for every x below 2.6.

    [Figure: mpg_example.png]

    I would like to know if there is an option in -distplot- to graph the lines where the plotted probability is exactly 0 or 1, so that I can then set the scale of the X-axis manually. My apologies if I have missed this information in the -distplot- help file or elsewhere on the Stata forum.

  • #2
    distplot is user-written as you have pointed out, but what you're asked to say is where it comes from (FAQ Advice #12).

    That was spelled out in a different thread a few hours ago.

    distplot works with cumulative probabilities. Setting aside the possibility of ties, the choice is clear-cut.

    For n data points you can plot cumulative probabilities that run from 1/n to 1 or you can choose to plot 0 to (n - 1)/n or you can choose the midpoint option which will plot 0.5/n to (n - 0.5) / n.
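
    The three conventions can be checked by hand on a small sample without ties. This is only a minimal sketch of the arithmetic, not distplot's own code; the variable names p_le, p_lt and p_mid are made up for illustration.

    ```stata
    clear
    set obs 5
    gen y = _n
    sort y
    * cumulative probabilities running from 1/n to 1
    gen p_le = _n / _N
    * cumulative probabilities running from 0 to (n - 1)/n
    gen p_lt = (_n - 1) / _N
    * midpoint convention: 0.5/n to (n - 0.5)/n
    gen p_mid = (_n - 0.5) / _N
    list y p_le p_lt p_mid, noobs
    ```

    None of the three columns contains both 0 and 1, which is the point being made here.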

    distplot offers no option that will show both 0 and 1. It's not in the software and I don't see it as a request that makes obvious sense. Let's take 1/n as lowest probability for lowest value: precisely where would you put 0 for values below the lowest observed? And conversely for the opposite convention.

    If you happen to know that there is a hard minimum and/or a hard maximum beyond which the function can't go, you could clone the code and add your own options to insist on either or both. For log of mpg, that's utterly implausible. Your real example may be different in this respect.

    This code segment and resulting graph perhaps make the choices clearer than your example. With five distinct values the probabilities will run 0.2 to 1 or 0 to 0.8 or 0.1 to 0.9, but there are no other choices.

    * start from an empty dataset, otherwise -set obs 5- may fail
    clear
    set obs 5
    gen y = _n
    distplot y, name(a, replace) recast(connected) subtitle(a)
    distplot y, reverse name(b, replace) recast(connected) subtitle(b)
    distplot y, reverse(ge) name(c, replace) recast(connected) subtitle(c)
    distplot y, reverse(ge) midpoint name(d, replace) recast(connected) subtitle(d)
    graph combine a b c d, ycommon

    [Figure: distplot.png]


    • #3
      Thanks very much for your reply and for illustrating my question more clearly than I could; my apologies for failing to mention the provenance of -distplot-, of course written by yourself and first mentioned in Stata Technical Bulletin, September 1999.

      I do understand your comment about how a distribution could include 0 or 1, but not both. My original wish was to use the -midpoint- option in order to better illustrate differences over the categories that I'd chosen. Below are graphs that illustrate all categories and an isolation of the second category (my apologies in advance if the figures are too large, my browser seems to be having some trouble uploading):

      distplot log2sba, reverse(ge) midpoint over(category)
      distplot log2sba if category==2, reverse(ge) midpoint

      [Figure: graphexample3.png]

      [Figure: graphexample.png]

      Or, alternatively, without the midpoint option:

      distplot log2sba if category==2, reverse(ge)

      [Figure: graphexample2.png]

      I am supposing that the graphed RCD curve begins at X=3 because for my dataset, there are no circumstances of X < 3. However, wouldn't it then follow that the proportion of individuals in this category with 0 < X < 3 is 1.0? Supposing I wish to include calculated y-axis values in the RCD plot from X=0 to X=3, is there a way I might achieve this within -distplot- for each category? Again, my apologies if I am missing something obvious (a likely scenario). Perhaps the answer is, as you suggested, to clone the code and force it to equal 1 for these values.


      • #4
        Clearly the whole support has probability 1, but the programmer's need is to know what to put where on a graph given finite data.

        As a user-programmer I sometimes write for myself and sometimes try to provide what others are asking for, but my altruism is strictly limited and I won't deliberately support what I think is poor statistics, or at least not to my taste. My own preference is for what is coded by the midpoint option, not least because that is consistent with quantile plot practice, which to me is the bigger deal.

        In your last paragraph you seem to be asking again how to stretch curves so that they touch 0 and 1, but the main point of #2 is that that is not supported by distplot. But Stata is programmable and what I think you want is something like

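        sysuse auto, clear
        set scheme s1color
        * within each group, a cumulative probability running from exactly 0 to exactly 1
        bysort foreign (mpg) : gen cump = (_n - 1) / (_N - 1)
        * reverse it, so that each curve runs from 1 down to 0
        gen negcump = 1 - cump
        line negcump mpg if foreign, lc(blue) || line negcump mpg if !foreign, lc(red) ///
        legend(order(1 "Foreign" 2 "Domestic")) ytitle(Probability)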


        • #5
          It's really great to hear this from you--as a relative Stata newbie it is reassuring to know that use of the midpoint option is supported by others much more experienced than I.

          Thanks very much for your insight and your code suggestions!


          • #6
            I wouldn't phrase the matter quite like that. There are two literatures that overlap less than one might guess.

            Quantile plots plot ordered values versus what is often called a plotting position, which is often (rank - a) / (#values - 2a + 1) for some constant a. So in particular a = 0.5 yields (rank - 0.5) / #values. That idea goes back at least to Galton in 1883! The main reason is that, for comparison with any brand-name distribution, the quantile function may be undefined at either or both of 0 and 1, so pulling the plotting positions in from 0 and 1 is a good idea. Quantile plots are hard to combine with weights. However, a = 1 yields a plotting position, (rank - 1) / (#values - 1), that can attain 0 and 1, which some people desire.
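
            The plotting-position formula can be tried out directly. What follows is a rough sketch on the auto data, with ties simply ranked in sort order; the variable names rank, pp_half and pp_one are made up for illustration.

            ```stata
            sysuse auto, clear
            sort mpg
            * ranks 1 to n; tied values get distinct ranks in sort order
            gen rank = _n
            * a = 0.5 gives (rank - 0.5) / n, strictly inside (0, 1)
            gen pp_half = (rank - 0.5) / (_N - 2 * 0.5 + 1)
            * a = 1 gives (rank - 1) / (n - 1), which attains 0 and 1
            gen pp_one = (rank - 1) / (_N - 2 * 1 + 1)
            summarize pp_half pp_one
            * a simple quantile plot: ordered values against plotting position
            scatter mpg pp_half, xtitle("Plotting position, a = 0.5")
            ```

            The summarize output should show pp_one spanning 0 to 1, while pp_half stays strictly between them.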

            (Empirical (cumulative)) distribution plots plot cumulative probability versus value. It's easy to combine these with weights. It's my impression that most people calculate probability >= or <= and don't use a midpoint approximation at all. But the difference is hard to spot unless the sample size is quite small, and people are rarely explicit about the precise rules (and may not know them if they don't read documentation carefully or inspect the code).
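
            As a rough illustration of why weights are easy here, a weighted cumulative probability is just a running sum of weights divided by their total. In this sketch the variable weight from the auto data stands in for a weight variable purely for illustration, and cumw and p_le are made-up names; it is not a description of how distplot handles weights.

            ```stata
            sysuse auto, clear
            sort mpg
            * running sum of the (illustrative) weights in order of mpg
            gen double cumw = sum(weight)
            * weighted probability of a value <= each observed value
            gen double p_le = cumw / cumw[_N]
            * tied values share the largest cumulative probability for that value
            bysort mpg (p_le) : replace p_le = p_le[_N]
            line p_le mpg, sort connect(stairstep) ytitle(Probability)
            ```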


            • #7
              Cross-reference to update on SSC

              A revised help file tries to explain more clearly what is being plotted.