  • Truncate Y-axis in histogram

    Hi,

    I want to draw a histogram, but the distribution is very skewed. For comparative purposes, I want to truncate the Y-axis at 25% instead of 80% (the maximum that Stata uses).

    I used:

    hist Yvar, percent yscale(range(0 25) noextend ) ylabel(0(5)25)

    but it doesn't work.

    Any help?

    Thanks in advance,

    Adriana



  • #2
    Stata won't omit representations of data just because you specify narrower limits using xscale() or yscale().

    You could omit the mode separately, but then you'd need to calculate the percents separately.

    Show us the results of

    Code:
     
    tab Yvar
    to give a precise idea of what the data look like.



    • #3
      Thanks Nick.

      This is the distribution:

      Yvar        Freq.   Percent     Cum.

      0             868     51.06    51.06
      .0011111        1      0.06    51.12
      .0019444       27      1.59    52.71
      .0022222        9      0.53    53.24
      .0038889       19      1.12    54.35
      .0041667       31      1.82    56.18
      .0044444        2      0.12    56.29
      etc

      But I don't want to omit 0 from the graph, because I want to compare this histogram with those of other, less skewed distributions. I was just thinking of mentioning on the graph or in a footnote that the first bar reaches 80%...



      • #4
        Stata makes it difficult to falsify (choose different word if you wish) a histogram and that seems much more right than wrong. I can think of ways to do what you ask but I would rather not publish code to do something I regard as a bad idea.

        You won't like this idea, probably, but to me a defensible solution is to show frequencies (in your case percents) on a square root scale. There are good reasons for this statistically, including the fact that square roots stabilise variability in counts. The fact that zeros map to zeros is also often pertinent.

        A trick to do this fairly easily is to call up spikeplot and recast it as a bar chart, here a histogram. Here is an example with simple data. I ensure that the axis labels do not lie.

        Code:
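        clear
        set obs 11
        gen x = _n - 1
        * exponentially decaying frequencies as simple example data
        gen freq = floor(10000 * exp(-0.9*_n))
        * after summarize, r(sum) holds the total frequency
        sum freq, meanonly
        gen percent = 100 * freq/r(sum)
        list
        * mylabels is user-written: ssc inst mylabels
        * it places the labels 0 1 4 ... 64 at their square-root positions
        mylabels 0 1 4 9 16 25 36 49 64, myscale(sqrt(@)) local(labels)
        * root plots square roots; recast(bar) draws bars instead of spikes
        spikeplot  x [aw=percent], bfcolor(none) scheme(s1color) root ///
        yla(`labels')  recast(bar) yla(, ang(h)) ytitle(Percent (root scale))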
        [Figure: spikeplot.png]


        To write code to do this for your data, I would need to know the bin width you wanted.
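
        A minimal sketch of how such code might look once a bin width is chosen. The width of 0.005 and the simulated stand-in for Yvar are assumptions for illustration only, not values from this thread. The variable names k, mid, freq, pct, tag and rtpct are likewise hypothetical. Percents are computed by hand, square-rooted, and drawn with twoway bar, with axis labels showing the untransformed percents.

        ```stata
        * Hypothetical stand-in data: roughly half zeros plus a right-skewed remainder
        clear
        set seed 2016
        set obs 1700
        gen Yvar = cond(runiform() < 0.5, 0, -0.02 * ln(runiform()))

        * Assumed bin width: replace with the width you actually want
        local w = 0.005
        gen long k = floor(Yvar / `w')        // integer bin index
        gen mid = `w' * (k + 0.5)             // bin midpoint, used as bar position
        bysort k : gen freq = _N              // count in each bin
        gen pct = 100 * freq / _N             // percent of all observations
        gen rtpct = sqrt(pct)                 // square root scale
        egen tag = tag(k)                     // one observation per bin

        * Bars are on the root scale; labels show the original percents
        twoway bar rtpct mid if tag, barw(`w') ///
        yla(0 "0" 1 "1" 2 "4" 3 "9" 4 "16" 5 "25" 6 "36" 7 "49" 8 "64", ang(h)) ///
        ytitle("Percent (root scale)") xtitle("Yvar")
        ```

        Because the percents are calculated before plotting, the zero bin stays in the graph and the labels remain honest about its height.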


        Last edited by Nick Cox; 31 Mar 2016, 11:06.



        • #5
          I like the idea, but I am not sure that a reviewer of a medical journal thinks the same way.
          Thanks anyway!



          • #6
            John Wilder Tukey, no less, encouraged us to plot frequencies using a root scale. A key word is rootogram.

            Indeed, see references to the same idea in the work of Sir Ronald Fisher, no less, in http://www.stata-journal.com/sjpdf.h...iclenum=gr0052
            Last edited by Nick Cox; 01 Apr 2016, 18:49.



            • #7
              Dear Nick,

              I have a similar problem - most of my data are zeros, so patterns in the non-zero data are hard to see. I wish to plot histograms of the distributions in both cases and controls as per http://www.ats.ucla.edu/stat/stata/f...am_overlay.htm. Because I have more controls than cases, I wish to plot density/percentage rather than frequencies. Is it possible to overlay two spikeplots on top of one another as it is with histograms?

              Thanks,
              Tom



              • #8
                Superimposed spikeplots aren't going to look good unless you arrange for an offset, which itself is easy enough. For a write-up of the main idea, see http://www.stata-journal.com/sjpdf.h...iclenum=gr0026

                Here's some technique, which is easy to take further by calculating densities or percents.

                This doesn't address the question of a large spike of zeros. If you are plotting frequencies, I would tend just to omit the zero spike or bar and put the number(s) in a subtitle ("6 trillion zeros").

                As you are inclined to plot densities or percents, you would need to take the calculation further. Note that if you choose a histogram command, densities and percents are always for the values plotted, so if you omit zeros from the plot the percents shown are changed correspondingly.

                Code:
                set scheme s1color 
                sysuse auto, clear 
                bysort foreign rep78 : gen freq = _N 
                gen rtfreq = sqrt(freq) 
                gen rep78_1 = rep78 + 0.15 
                gen rep78_2 = rep78 - 0.15 
                egen tag = tag(foreign rep78) 
                levelsof rep78 
                
                twoway spike freq rep78_1 if foreign & tag || ///
                spike freq rep78_2 if !foreign & tag , xla(`r(levels)') ///
                xtitle(Repair record) ytitle(Frequency) legend(order(1 "Foreign" 2 "Domestic")) 
                
                twoway bar freq rep78_1 if foreign & tag , barw(0.25) || ///
                bar freq rep78_2 if !foreign & tag , barw(0.25) xla(`r(levels)') ///
                xtitle(Repair record) ytitle(Frequency) legend(order(1 "Foreign" 2 "Domestic"))
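
                One way the calculation might be taken further, sketched under assumptions: percents are computed within each group (so unequal group sizes do not matter) and then plotted on a square root scale with the same offset. The variable names pct and rtpct are hypothetical, and dropping observations with missing rep78 is a choice made here so that the within-group percents add to 100.

                ```stata
                set scheme s1color
                sysuse auto, clear
                drop if missing(rep78)                      // keep percents summing to 100
                bysort foreign rep78 : gen freq = _N        // count per group and category
                bysort foreign : gen pct = 100 * freq / _N  // percent within each group
                gen rtpct = sqrt(pct)                       // square root scale
                gen rep78_1 = rep78 + 0.15
                gen rep78_2 = rep78 - 0.15
                egen tag = tag(foreign rep78)

                * Bars are on the root scale; labels show the original percents
                twoway bar rtpct rep78_1 if foreign & tag , barw(0.25) || ///
                bar rtpct rep78_2 if !foreign & tag , barw(0.25) xla(1/5) ///
                yla(0 "0" 2 "4" 4 "16" 6 "36" 8 "64", ang(h)) ///
                xtitle(Repair record) ytitle("Percent within group (root scale)") ///
                legend(order(1 "Foreign" 2 "Domestic"))
                ```

                The same pattern should carry over to binned continuous data once a bin variable replaces rep78.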



                • #9
                  Thanks Nick,

                  Is there a way of rescaling the y axis in a similar way using twoway and histogram? As I have continuous data, I would prefer to overlay the two distributions (as per http://www.ats.ucla.edu/stat/stata/f...am_overlay.htm) rather than use an offset. Here is an example of what the untransformed data look like...



                  However, the code below (trying a logarithmic y axis) yields a plot that does not reflect the distribution of the data...

                  foreach var of varlist adeno e229 flua flub fluc hmpv nl63 oc43 piv1 piv2 piv3 piv4 rhino rsva rsvb {
                  twoway (histogram `var' if spneumo==1, start(0) width(1) color(red) yscale(log)) ///
                  (histogram `var' if spneumo==0, start(0) width(1) ///
                  fcolor(none) lcolor(black) yscale(log)), legend(order(1 "Severe Pneumonia" 2 "Controls" )) ///
                  name(hist`var', replace)
                  }



                  I suspect I am making a fairly basic mistake...

                  With best wishes,
                  Tom



                  • #10
                    I see two image icons that don't display properly, so I can't easily tell what you are trying to show. A histogram with logarithmic scale makes little sense to me in principle. The whole conceit is that bar areas are directly informative, which is compromised with a log scale. In any case, as densities integrate to 1, two histograms could be identical, but otherwise one will necessarily occlude the other over part of its support.

                    I advise reading http://www.statalist.org/forums/help#stata, as your presentation of images here doesn't work, there is no data example that anyone can work with, and the code would be clearer if presented as such.

                    I'd advise some kind of quantile plot here personally.
                    Last edited by Nick Cox; 09 May 2016, 02:43.



                    • #11
                      Apologies, I have attached the two images. By using blocks of colour for the cases with only an outline for the controls, I think I have avoided the problem with one distribution occluding the other. However, this type of overlay may not be possible for the nice rootograms you presented earlier in this thread?

                      Best wishes,
                      Tom
                      [Attached files: two images, not reproduced here]


                      • #12
                        In fact, the second graph is Stata's way of telling you that a log scale makes no sense here. (Personally I would prefer an error message for the sake of my colleagues and students.) Look carefully to see that the y axis labels are all scrunched up together.

                        Otherwise put, it should seem absurd that the bars are all of approximately equal height in the second graph when they manifestly aren't in the first graph.

                        You don't seem to believe me that log scale is a bad idea, but it's more important that Stata won't play with that anyway.

                        To overlay rootograms you would need to calculate your own frequency variable and produce the graph yourself, as in my last post.



                        • #13
                          Hi Nick,

                          Understood. Might a rootogram be a good option here? Don't the same theoretical problems, with the area no longer approximating the amount of data, also occur with rootograms?

                          With best wishes,
                          Tom



                          • #14
                            Not to the same extent. For one, zero roots to zero, so at least there is a visible baseline which makes sense. On top of that, you do sacrifice the area interpretation, but the square root of a count is more stable than the count, and the highest counts are pulled in.
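
                            The stability claim can be made concrete with a delta-method sketch. It assumes, which the thread does not state, that a bin count X is roughly Poisson with mean \mu, and the approximation is only reasonable when \mu is not too small:

                            ```latex
                            \operatorname{Var}(X) = \mu,
                            \qquad
                            \operatorname{Var}\!\left(\sqrt{X}\right)
                              \approx \left(\frac{d\sqrt{\mu}}{d\mu}\right)^{2} \operatorname{Var}(X)
                              = \frac{1}{4\mu}\,\mu
                              = \frac{1}{4}.
                            ```

                            Under that assumption the variability of the raw count grows with its mean, while that of the root is roughly constant, so tall and short bars are about equally noisy on a root scale.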



                            • #15
                              I have managed to produce rootograms with an offset, which I am very happy with.

                              Many thanks,
                              Tom
