Dear Statalist,
I am trying to determine if there is a good way to plot the probability density functions using joyplots/ridgeline plots with across two grouping variables. Both kdensity and community contributed command joyplot make it possible to do this by one grouping variable (in the Joy Division album cover style). But I was hoping to add another grouping variable that separates these groups of distributions. I was able to concoct an example of what I am looking for in Python in the below graph, but because I am a Stata user first and foremost, I am wondering how I can do this same type of thing in Stata. This graph shows the (purely simulated) distributions of student debt for students in a given field of study at a given college, so each density is a set of students in the same program, and they are arrayed vertically so we see those distributions for students at all programs within the same field:

To me the key challenges of adding this second layer of grouping are
Here's my current example code using Stata 18.0 on Windows 10:
Is it possible to modify this approach to achieve this nested, recentered structure when adding a grouping for the race variable (or any other categorical variable in the dataset)? I'm open to alternative approaches that could create a similar visualization. Any guidance, suggestions, or example code would be greatly appreciated.
Note: This code uses the community-contributed commands joyplot, winsor, and schemepack. The schemepack provides additional color schemes, including the white_tableau scheme used in this example. Each can be installed using:
ssc install winsor
ssc install joyplot
ssc install schemepack
Thank you for your help!
CJ
PS - in case it is helpful, here is the Python code that produces the first graph:
I am trying to determine if there is a good way to plot the probability density functions using joyplots/ridgeline plots with across two grouping variables. Both kdensity and community contributed command joyplot make it possible to do this by one grouping variable (in the Joy Division album cover style). But I was hoping to add another grouping variable that separates these groups of distributions. I was able to concoct an example of what I am looking for in Python in the below graph, but because I am a Stata user first and foremost, I am wondering how I can do this same type of thing in Stata. This graph shows the (purely simulated) distributions of student debt for students in a given field of study at a given college, so each density is a set of students in the same program, and they are arrayed vertically so we see those distributions for students at all programs within the same field:
To me the key challenges of adding this second layer of grouping are
- being able to just specify the second group in any of the current density plot graphing commands currently available and
- being able to produce it in the way I did in the above graph, where I had two goals:
- The placement of the groups on the x-axis had no numerical meaning, they are just evenly spaced categorical variables, placed so that they more or less do not overlap
- The placement of the distributions on top of the vertical lines for their group does have meaning: specifically, the vertical line is treated as though it is the value of the overall mean for that group (in my above example, the mean student loan amount for all students in that field of study), and it intersects with each distribution where the group-level mean falls in the program-level distribution
Here's my current example code using Stata 18.0 on Windows 10:
Code:
sysuse nlsw88, clear // Recode industry for top industries and 'Other' gen industry_group = industry replace industry_group = 99 if inlist(industry, 1, 2, 3, 10) label copy indlbl indgrp_lbl label def indgrp_lbl 99 "Other", modify label values industry_group indgrp_lbl // Winsorize wage at 5th and 95th percentiles to simplify visualization winsor wage, gen(wage_p99) p(0.05) // Create joyplot joyplot wage_p99, by(industry_group) scheme(white_tableau) bwidth(1.5) offset(-1.5) /// xtitle("Hourly Wage") overlap(3) title("Wage Distribution by Industry") /// note("Source: NLSW 1988 data. Wages winsorized at 5th and 95th percentiles by industry.")
Note: This code uses the community-contributed commands joyplot, winsor, and schemepack. The schemepack provides additional color schemes, including the white_tableau scheme used in this example. Each can be installed using:
ssc install winsor
ssc install joyplot
ssc install schemepack
Thank you for your help!
CJ
PS - in case it is helpful, here is the Python code that produces the first graph:
Code:
import numpy as np import pandas as pd import matplotlib.pyplot as plt from scipy import stats as scipy_stats # Set random seed for reproducibility np.random.seed(42) # Define fields of study fields = ["Law", "Medicine", "Engineering", "Business", "Education"] field_means = { "Law": 100000, "Medicine": 120000, "Engineering": 80000, "Business": 90000, "Education": 60000 } data = [] for field in fields: n_programs = np.random.randint(5, 10) # Random number of programs per field field_mean = field_means[field] field_std = field_mean * 0.15 # 15% of the mean as standard deviation for program in range(n_programs): n_students = np.random.randint(30, 201) program_mean = np.random.normal(field_mean, field_std/3) program_std = np.random.uniform(field_mean * 0.05, field_mean * 0.1) loans = np.random.normal(program_mean, program_std, n_students) data.extend([(field, f"{field}_Program_{program}", loan) for loan in loans]) df = pd.DataFrame(data, columns=['Field', 'Program', 'Loan']) # Truncate the data at 5th and 95th percentiles within each field df['Loan'] = df.groupby('Field')['Loan'].transform(lambda x: x.clip(x.quantile(0.05), x.quantile(0.95))) # Calculate field statistics field_stats = df.groupby('Field')['Loan'].agg(['mean', 'count']).sort_values('mean', ascending=False) # Create a new column for the centered and scaled loan amounts within each field df['Scaled_Loan'] = df.groupby('Field')['Loan'].transform(lambda x: (x - x.mean()) / x.std()) # Set up the plot fig, ax = plt.subplots(figsize=(16, 8)) # Further reduced height from 10 to 8 colors = plt.cm.Set2(np.linspace(0, 1, len(fields))) # Create the ridgeline plot field_width = 1.5 joyplot_scale = 0.4 # Increased to make distributions taller y_base = 0 # Start from the bottom of the plot max_density = 0 for i, (field, field_data) in enumerate(df.groupby('Field')): programs = field_data['Program'].unique() field_max_density = 0 program_densities = [] for program in programs: program_data = field_data[field_data['Program'] == program] density = scipy_stats.gaussian_kde(program_data['Scaled_Loan']) xs = np.linspace(-3, 3, 200) ys = density(xs) program_densities.append((xs, ys)) field_max_density = max(field_max_density, ys.max()) # Update overall max_density max_density = max(max_density, field_max_density) # Plot densities for this field x_center = i * field_width for j, (xs, ys) in enumerate(program_densities): y_offset = j * field_max_density * joyplot_scale * 0.5 # Adjust vertical spacing ax.fill_between(xs * 0.3 + x_center, y_base + y_offset, y_base + y_offset + ys * joyplot_scale, alpha=0.6, color=colors[i]) ax.plot(xs * 0.3 + x_center, y_base + y_offset + ys * joyplot_scale, color='black', linewidth=0.5) # Adjust y-axis limits to reduce white space even further ax.set_ylim(0, max_density * joyplot_scale * (len(programs) * 0.8)) # Reduced multiplier # Customize the plot ax.set_title("Distribution of Student Loans Across Fields of Study", fontsize=16, pad=20) ax.set_xlabel("Fields of Study", fontsize=12) ax.set_yticks([]) # Remove y-axis ticks ax.set_xticks([i * field_width for i in range(len(fields))]) ax.set_xticklabels([f"{field}\n(Mean: ${field_stats.loc[field, 'mean']:,.0f})" for field in fields], fontsize=10) # Add vertical lines for each field, aligned with the center of labels for i in range(len(fields)): ax.axvline(i * field_width, color='red', linestyle='--', linewidth=1, alpha=0.7, ymin=0, ymax=1) # Add legend legend_elements = [plt.Line2D([0], [0], color=colors[i], lw=4, label=field) for i, field in enumerate(fields)] ax.legend(handles=legend_elements, title="Fields of Study", loc='upper right', bbox_to_anchor=(1.15, 1)) # Use tight_layout with parameters plt.tight_layout(rect=[0, 0, 0.95, 1]) # Adjusted to leave space for legend # Save the figure plt.savefig('Figure 1.png', dpi=300, bbox_inches='tight') plt.show()
Comment