Dear Statalist,
I am trying to determine if there is a good way to plot the probability density functions using joyplots/ridgeline plots with across two grouping variables. Both kdensity and community contributed command joyplot make it possible to do this by one grouping variable (in the Joy Division album cover style). But I was hoping to add another grouping variable that separates these groups of distributions. I was able to concoct an example of what I am looking for in Python in the below graph, but because I am a Stata user first and foremost, I am wondering how I can do this same type of thing in Stata. This graph shows the (purely simulated) distributions of student debt for students in a given field of study at a given college, so each density is a set of students in the same program, and they are arrayed vertically so we see those distributions for students at all programs within the same field:

To me the key challenges of adding this second layer of grouping are
Here's my current example code using Stata 18.0 on Windows 10:
Is it possible to modify this approach to achieve this nested, recentered structure when adding a grouping for the race variable (or any other categorical variable in the dataset)? I'm open to alternative approaches that could create a similar visualization. Any guidance, suggestions, or example code would be greatly appreciated.
Note: This code uses the community-contributed commands joyplot, winsor, and schemepack. The schemepack provides additional color schemes, including the white_tableau scheme used in this example. Each can be installed using:
ssc install winsor
ssc install joyplot
ssc install schemepack
Thank you for your help!
CJ
PS - in case it is helpful, here is the Python code that produces the first graph:
I am trying to determine if there is a good way to plot the probability density functions using joyplots/ridgeline plots with across two grouping variables. Both kdensity and community contributed command joyplot make it possible to do this by one grouping variable (in the Joy Division album cover style). But I was hoping to add another grouping variable that separates these groups of distributions. I was able to concoct an example of what I am looking for in Python in the below graph, but because I am a Stata user first and foremost, I am wondering how I can do this same type of thing in Stata. This graph shows the (purely simulated) distributions of student debt for students in a given field of study at a given college, so each density is a set of students in the same program, and they are arrayed vertically so we see those distributions for students at all programs within the same field:
To me the key challenges of adding this second layer of grouping are
- being able to just specify the second group in any of the current density plot graphing commands currently available and
- being able to produce it in the way I did in the above graph, where I had two goals:
- The placement of the groups on the x-axis had no numerical meaning, they are just evenly spaced categorical variables, placed so that they more or less do not overlap
- The placement of the distributions on top of the vertical lines for their group does have meaning: specifically, the vertical line is treated as though it is the value of the overall mean for that group (in my above example, the mean student loan amount for all students in that field of study), and it intersects with each distribution where the group-level mean falls in the program-level distribution
Here's my current example code using Stata 18.0 on Windows 10:
Code:
sysuse nlsw88, clear
// Recode industry for top industries and 'Other'
gen industry_group = industry
replace industry_group = 99 if inlist(industry, 1, 2, 3, 10)
label copy indlbl indgrp_lbl
label def indgrp_lbl 99 "Other", modify
label values industry_group indgrp_lbl
// Winsorize wage at 5th and 95th percentiles to simplify visualization
winsor wage, gen(wage_p99) p(0.05)
// Create joyplot
joyplot wage_p99, by(industry_group) scheme(white_tableau) bwidth(1.5) offset(-1.5) ///
xtitle("Hourly Wage") overlap(3) title("Wage Distribution by Industry") ///
note("Source: NLSW 1988 data. Wages winsorized at 5th and 95th percentiles by industry.")
Note: This code uses the community-contributed commands joyplot, winsor, and schemepack. The schemepack provides additional color schemes, including the white_tableau scheme used in this example. Each can be installed using:
ssc install winsor
ssc install joyplot
ssc install schemepack
Thank you for your help!
CJ
PS - in case it is helpful, here is the Python code that produces the first graph:
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats as scipy_stats
# Set random seed for reproducibility
np.random.seed(42)
# Define fields of study
fields = ["Law", "Medicine", "Engineering", "Business", "Education"]
field_means = {
"Law": 100000,
"Medicine": 120000,
"Engineering": 80000,
"Business": 90000,
"Education": 60000
}
data = []
for field in fields:
n_programs = np.random.randint(5, 10) # Random number of programs per field
field_mean = field_means[field]
field_std = field_mean * 0.15 # 15% of the mean as standard deviation
for program in range(n_programs):
n_students = np.random.randint(30, 201)
program_mean = np.random.normal(field_mean, field_std/3)
program_std = np.random.uniform(field_mean * 0.05, field_mean * 0.1)
loans = np.random.normal(program_mean, program_std, n_students)
data.extend([(field, f"{field}_Program_{program}", loan) for loan in loans])
df = pd.DataFrame(data, columns=['Field', 'Program', 'Loan'])
# Truncate the data at 5th and 95th percentiles within each field
df['Loan'] = df.groupby('Field')['Loan'].transform(lambda x: x.clip(x.quantile(0.05), x.quantile(0.95)))
# Calculate field statistics
field_stats = df.groupby('Field')['Loan'].agg(['mean', 'count']).sort_values('mean', ascending=False)
# Create a new column for the centered and scaled loan amounts within each field
df['Scaled_Loan'] = df.groupby('Field')['Loan'].transform(lambda x: (x - x.mean()) / x.std())
# Set up the plot
fig, ax = plt.subplots(figsize=(16, 8)) # Further reduced height from 10 to 8
colors = plt.cm.Set2(np.linspace(0, 1, len(fields)))
# Create the ridgeline plot
field_width = 1.5
joyplot_scale = 0.4 # Increased to make distributions taller
y_base = 0 # Start from the bottom of the plot
max_density = 0
for i, (field, field_data) in enumerate(df.groupby('Field')):
programs = field_data['Program'].unique()
field_max_density = 0
program_densities = []
for program in programs:
program_data = field_data[field_data['Program'] == program]
density = scipy_stats.gaussian_kde(program_data['Scaled_Loan'])
xs = np.linspace(-3, 3, 200)
ys = density(xs)
program_densities.append((xs, ys))
field_max_density = max(field_max_density, ys.max())
# Update overall max_density
max_density = max(max_density, field_max_density)
# Plot densities for this field
x_center = i * field_width
for j, (xs, ys) in enumerate(program_densities):
y_offset = j * field_max_density * joyplot_scale * 0.5 # Adjust vertical spacing
ax.fill_between(xs * 0.3 + x_center, y_base + y_offset, y_base + y_offset + ys * joyplot_scale,
alpha=0.6, color=colors[i])
ax.plot(xs * 0.3 + x_center, y_base + y_offset + ys * joyplot_scale, color='black', linewidth=0.5)
# Adjust y-axis limits to reduce white space even further
ax.set_ylim(0, max_density * joyplot_scale * (len(programs) * 0.8)) # Reduced multiplier
# Customize the plot
ax.set_title("Distribution of Student Loans Across Fields of Study", fontsize=16, pad=20)
ax.set_xlabel("Fields of Study", fontsize=12)
ax.set_yticks([]) # Remove y-axis ticks
ax.set_xticks([i * field_width for i in range(len(fields))])
ax.set_xticklabels([f"{field}\n(Mean: ${field_stats.loc[field, 'mean']:,.0f})" for field in fields], fontsize=10)
# Add vertical lines for each field, aligned with the center of labels
for i in range(len(fields)):
ax.axvline(i * field_width, color='red', linestyle='--', linewidth=1, alpha=0.7, ymin=0, ymax=1)
# Add legend
legend_elements = [plt.Line2D([0], [0], color=colors[i], lw=4, label=field) for i, field in enumerate(fields)]
ax.legend(handles=legend_elements, title="Fields of Study", loc='upper right', bbox_to_anchor=(1.15, 1))
# Use tight_layout with parameters
plt.tight_layout(rect=[0, 0, 0.95, 1]) # Adjusted to leave space for legend
# Save the figure
plt.savefig('Figure 1.png', dpi=300, bbox_inches='tight')
plt.show()

Comment