Raincloud plot in Stata

Edgar Kausel

Join Date: Jul 2015

Posts: 13
#1

Raincloud plot in Stata

20 Nov 2022, 13:03

Hi:

Is anyone aware of a way to create a raincloud plot in Stata, or has anyone tried to recreate one?

As in here:
https://wellcomeopenresearch.org/articles/4-63
https://shilaan.rbind.io/post/visual...incloud-plots/

I'm looking for a graph as below.

Thanks.
Edgar
Nick Cox

Join Date: Mar 2014

Posts: 35699
#2

20 Nov 2022, 13:40

It's a hybrid box plot, density trace and parallel coordinates (profile, slope) plot. You need twoway for that if it has not been written up as a single command before.
1 like
Max Burger

Join Date: Feb 2023

Posts: 1
#3

27 Feb 2023, 09:35

I was looking for a way to produce a raincloud plot in Stata and came across this post. As there seems to be no user-written package yet, I followed the steps laid out by Nick Cox above to produce the graph below. I did not find another post on the matter, so I thought it best to place my response here. As suggested, twoway was used with the boxplot workaround, a scatter plot (raindrops), and a density plot (cloud). The individual steps are below. Happy for suggestions/improvements.

Step 1: Setup for boxplot in twoway
For the boxplot in a twoway graph the procedure suggested in Cox (2009) was followed (https://stats.oarc.ucla.edu/stata/co...twoway-graphs/).

Code:

sysuse auto, clear

* Use egen to generate the median, quartiles, interquartile range (IQR), and mean.
egen med = median(mpg)
egen lqt = pctile(mpg), p(25)
egen uqt = pctile(mpg), p(75)
egen iqr = iqr(mpg)
egen mean = mean(mpg)

* Find the lowest value that is more than lqt - 1.5 iqr (Lower whisker)
gen l = mpg if(mpg >= lqt-1.5*iqr)
egen ls = min(l)

* Find the highest value that is less than uqt + 1.5 iqr (Upper whisker)
gen u = mpg if(mpg <= uqt+1.5*iqr)
egen us = max(u)

Step 2: Set vertical position of "raindrops" (observations)
The individual observations below the density plot (i.e., the raindrops under the cloud) are displayed with a scatter plot. This requires a further variable that sets their vertical position in the graph. Each observation receives a random negative y-value via the runiform() function. The interval must be adjusted to the scale of the corresponding density plot. The bound closer to zero (here 0.005) determines the gap between the density plot and the scatter plot. The boxplot is placed in the “middle” of the observations, at the midpoint of the interval.

Code:

* Set vertical position for observations in an interval
gen position_drops= runiform(0.005,0.02) * (-1) // adjust interval

* Set vertical position of boxplot in the middle of the interval of the observations
gen position_box = (0.005 + 0.02) / (-2)

Step 3: Combine box, scatter, and density plot
Because the box is drawn horizontally, its height is controlled by the barw() option, and that value must be adjusted to the scale of the density plot.

Code:

twoway ///
kdensity mpg , recast(area) color(purple*1.5%50) || /// Distribution (Cloud)
rbar lqt med position_box, barw(.013) fcolor(white%0) lcolor(gs5) lwidth(thin) horizontal || /// Start: Boxplot
rbar med uqt position_box, barw(.013) fcolor(white%0) lcolor(gs5) lwidth(thin) horizontal || ///
rspike lqt ls position_box, lcolor(gs5) lwidth(thin) horizontal || ///
rspike uqt us position_box, lcolor(gs5) lwidth(thin) horizontal || /// End: Boxplot
scatter position_drops mpg , color(purple*1.5%25) /// Single observations (Rain)
, legend(off) xtitle("Mileage per Gallon") yscale(off) xlab(15 20 25 30 35 40)

Graph produced

Last edited by Max Burger; 27 Feb 2023, 09:39.
2 likes
Nick Cox

Join Date: Mar 2014

Posts: 35699
#4

27 Feb 2023, 12:21

The term raincloud plot in the original paper (first link in #1) refers to a combination of strip, box, and violin plot, but I guess it can be whatever you want it to be, so long as there is a scatter of data points somewhere.

The multiplication of terms for univariate distribution plots in several literatures is anywhere between amusing and annoying. Many are mentioned and cited in the help for stripplot from SSC. Raincloud plot is in my collection, but any terms not mentioned there would be welcome news to me, following the principle that nonsense is always nonsense but the history of nonsense is a branch of scholarship.

My own inclination is, depending on caprice and circumstance,

1. to write up favourite choices as documented commands

2. to help explain how to code your own if you don't much like canned choices.

I don't find the classic Tukey rule of plotting separately those points more than 1.5 IQR from the nearer quartile especially compelling for current analyses. He got there after some experimentation with alternatives while in search for a simple rule of thumb for which data points should be plotted separately by hand with pen and paper, so that they could be thought about. This was 50 or so years ago, and plotting by pen and paper is not what anyone I know either prefers or practises. Now we can plot them all, at least in a trial run.

I don't find jittering more helpful than stacking.
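The two displays can be compared side by side. The following is only a minimal sketch, assuming stripplot has been installed from SSC (ssc install stripplot) and that jitter() is passed through to scatter. The values jitter(5) and height(0.4) and the graph names are arbitrary choices for illustration, not recommendations made in this thread.

```stata
sysuse auto, clear

* Jittered strip plot: points are displaced at random to reduce overplotting
* (jitter(5) is an arbitrary illustrative value)
stripplot mpg, over(foreign) jitter(5) name(jittered, replace)

* Stacked strip plot: tied values are stacked systematically instead
* (height(0.4) is an arbitrary illustrative value)
stripplot mpg, over(foreign) stack height(0.4) name(stacked, replace)
```

With stacking, the height of each pile is a count of ties, so the display is reproducible without setting a random-number seed.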

More common plot types don't need much publicity, but here is what appears to be a less common plot type: a hybrid quantile-box plot with the box in Tufte form as a point symbol for the median and whiskers stretching between quartiles and extremes. A sales pitch would run that the quantile plot shows a great deal of detail, so the Tufte variant may be complementary in giving a big picture summary. Leaving the box implicit may seem odd, but equally I'd like to stress that in many problems the tails beyond the quartiles are just as important, scientifically or practically, as the half of the data between the quartiles.

Code:

sysuse auto, clear
stripplot mpg, over(foreign) cumul cumprob tufte vertical xla(, tlength(0)) xsc(alt titlegap(*3)) height(0.7) yla(, ang(h))

Putting x axis stuff at the top is just playing around in this example.
2 likes
Stuart Leske

Join Date: Apr 2023

Posts: 2
#5

07 Apr 2023, 21:59

Dear Edgar Kausel, I'm sure you have done your raincloud plot by now, but just to add to this discussion, Asjad Naqvi has recently added a guide on generating raincloud plots to the Stata Guide on Medium: https://medium.com/the-stata-guide/s...s-577473033c11
Nick Cox

Join Date: Mar 2014

Posts: 35699
#6

08 Apr 2023, 03:30

I like density estimation (here kernel density estimation) as much as most people, but I think its use needs special care, especially in exploratory graphs which need sceptical distrust as much as appreciative trust. Two points among several:

1. The use of density estimation carries some obligation to choose kernel width carefully and to explain how that was done. Most uses of raincloud and the loosely similar violin plots don't seem to do this and/or to trust program defaults naively.

2. Linked to that is that kernel density estimates can be misleading when a variable is skewed and/or bounded -- both of which are naturally common. #3 here is a case in point as the density estimate is just truncated at the sample extremes. Quite what should be done instead is often difficult to say, but this is a crucial question that should not be ignored. Not visible in #3 but often seen is kernel density estimation smearing probability mass into impossible regions, e.g. negative values for variables that can't be. It doesn't know better, at least in default form, but researchers should. There are known work-arounds, but they are rarely used.

Naturally, naive or careless uses (or users) don't indict good ideas, but what else is new?
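Both points can be explored with official commands. The following is a minimal sketch, not a recommendation: the bandwidths 1 and 4 are arbitrary illustrative values, and estimating on the log scale and back-transforming is only one possible work-around for a strictly positive variable, assumed here for illustration and not necessarily the one meant above. The variable names log_mpg, x_log, dens_log, x_mpg and dens_mpg are hypothetical.

```stata
sysuse auto, clear

* Point 1: compare the default bandwidth with two explicit (arbitrary) choices
twoway kdensity mpg || kdensity mpg, bwidth(1) || kdensity mpg, bwidth(4) ///
    ||, legend(order(1 "default" 2 "bwidth(1)" 3 "bwidth(4)")) name(bw_compare, replace)

* Point 2: for a strictly positive variable, estimate the density of log(mpg)
* and back-transform, so that no probability mass is placed at or below zero
gen log_mpg = ln(mpg)
kdensity log_mpg, generate(x_log dens_log) nograph

* If Y = ln(X), then f_X(x) = f_Y(ln x) / x
gen x_mpg = exp(x_log)
gen dens_mpg = dens_log / x_mpg

line dens_mpg x_mpg, sort xtitle("Mileage (mpg)") ytitle("Density") name(log_scale, replace)
```

The back-transformed estimate is defined only for positive values, but it inherits whatever bandwidth was used on the log scale, so that choice still needs to be reported.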
1 like
Asjad Naqvi

Join Date: Oct 2014

Posts: 93
#7

13 Apr 2023, 05:36

Thank you Stuart Leske for mentioning the raincloud post! Even though I wrote the guide out of curiosity, I do not much like raincloud plots. Nick above states several good reasons for this.
Stuart Leske

Join Date: Apr 2023

Posts: 2
#8

03 May 2023, 21:46

Thanks for clarifying, Asjad Naqvi, it's helpful to know that Stata Medium guides != endorsements. Regarding Nick's concerns above, there is no guidance on the bandwidth to use in the introductory article to raincloud plots: https://wellcomeopenresearch.org/articles/4-63. I like the boxplot and raw data points, but I find a histogram more informative than the kernel density estimate. Or, as Nick argues elsewhere in 2016 (https://www.stata.com/meeting/uk16/slides/cox_uk16.pdf), the quantile-quantile plots discussed above (I like the superimposed example), for reasons in Nick's slides (no binning choices required). Essential Medical Statistics 2nd Edition also discusses this advantage on page 21.
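For comparison, both alternatives are available as official commands. This is a minimal sketch on the auto data used earlier in the thread. The bin width of 2 and the graph names are arbitrary choices for illustration, and quantile here draws a plain quantile plot, which may differ from the exact displays in the slides.

```stata
sysuse auto, clear

* Quantile plot: every observation is shown and no binning or bandwidth is chosen
quantile mpg, name(qplot, replace)

* Histogram: the bin width is an explicit, visible choice (2 is arbitrary here)
histogram mpg, width(2) frequency name(hist, replace)
```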

Last edited by Stuart Leske; 03 May 2023, 22:10. Reason: Accidentally posted before finishing.