Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating nested ridgeline/joy plots with two grouping variables in Stata

    Dear Statalist,

    I am trying to determine if there is a good way to plot the probability density functions using joyplots/ridgeline plots with across two grouping variables. Both kdensity and community contributed command joyplot make it possible to do this by one grouping variable (in the Joy Division album cover style). But I was hoping to add another grouping variable that separates these groups of distributions. I was able to concoct an example of what I am looking for in Python in the below graph, but because I am a Stata user first and foremost, I am wondering how I can do this same type of thing in Stata. This graph shows the (purely simulated) distributions of student debt for students in a given field of study at a given college, so each density is a set of students in the same program, and they are arrayed vertically so we see those distributions for students at all programs within the same field:



    To me the key challenges of adding this second layer of grouping are
    1. being able to just specify the second group in any of the current density plot graphing commands currently available and
    2. being able to produce it in the way I did in the above graph, where I had two goals:
      1. The placement of the groups on the x-axis had no numerical meaning, they are just evenly spaced categorical variables, placed so that they more or less do not overlap
      2. The placement of the distributions on top of the vertical lines for their group does have meaning: specifically, the vertical line is treated as though it is the value of the overall mean for that group (in my above example, the mean student loan amount for all students in that field of study), and it intersects with each distribution where the group-level mean falls in the program-level distribution
    To give a reproducible example in Stata, I've created a basic plot using the nlsw88 dataset, but I'm wondering how we might add a second grouping variable (say race) to this approach. I am using joyplot here for ease, but I don't think this is mostly a question about the syntax of joyplot (otherwise I'd be asking it on the github for that command), but a more general question about how to approach this type of visualization in Stata.

    Here's my current example code using Stata 18.0 on Windows 10:

    Code:
    sysuse nlsw88, clear
    
    // Recode industry for top industries and 'Other'
    gen industry_group = industry
    replace industry_group = 99 if inlist(industry, 1, 2, 3, 10)
    
    label copy indlbl indgrp_lbl 
    
    label def indgrp_lbl 99 "Other", modify 
    
    label values industry_group indgrp_lbl 
    
    // Winsorize wage at 5th and 95th percentiles to simplify visualization
    winsor wage, gen(wage_p99) p(0.05)
    
    // Create joyplot
    joyplot wage_p99, by(industry_group) scheme(white_tableau) bwidth(1.5) offset(-1.5) ///
        xtitle("Hourly Wage") overlap(3) title("Wage Distribution by Industry") ///
        note("Source: NLSW 1988 data. Wages winsorized at 5th and 95th percentiles by industry.")
    Is it possible to modify this approach to achieve this nested, recentered structure when adding a grouping for the race variable (or any other categorical variable in the dataset)? I'm open to alternative approaches that could create a similar visualization. Any guidance, suggestions, or example code would be greatly appreciated.

    Note: This code uses the community-contributed commands joyplot, winsor, and schemepack. The schemepack provides additional color schemes, including the white_tableau scheme used in this example. Each can be installed using:
    ssc install winsor
    ssc install joyplot
    ssc install schemepack

    Thank you for your help!
    CJ


    PS - in case it is helpful, here is the Python code that produces the first graph:

    Code:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats as scipy_stats
    
    # Set random seed for reproducibility
    np.random.seed(42)
    
    # Define fields of study
    fields = ["Law", "Medicine", "Engineering", "Business", "Education"]
    field_means = {
        "Law": 100000,
        "Medicine": 120000,
        "Engineering": 80000,
        "Business": 90000,
        "Education": 60000
    }
    
    data = []
    
    for field in fields:
        n_programs = np.random.randint(5, 10)  # Random number of programs per field
        field_mean = field_means[field]
        field_std = field_mean * 0.15  # 15% of the mean as standard deviation
    
        for program in range(n_programs):
            n_students = np.random.randint(30, 201)
            program_mean = np.random.normal(field_mean, field_std/3)
            program_std = np.random.uniform(field_mean * 0.05, field_mean * 0.1)
            loans = np.random.normal(program_mean, program_std, n_students)
            data.extend([(field, f"{field}_Program_{program}", loan) for loan in loans])
    
    df = pd.DataFrame(data, columns=['Field', 'Program', 'Loan'])
    
    # Truncate the data at 5th and 95th percentiles within each field
    df['Loan'] = df.groupby('Field')['Loan'].transform(lambda x: x.clip(x.quantile(0.05), x.quantile(0.95)))
    
    # Calculate field statistics
    field_stats = df.groupby('Field')['Loan'].agg(['mean', 'count']).sort_values('mean', ascending=False)
    
    # Create a new column for the centered and scaled loan amounts within each field
    df['Scaled_Loan'] = df.groupby('Field')['Loan'].transform(lambda x: (x - x.mean()) / x.std())
    
    # Set up the plot
    fig, ax = plt.subplots(figsize=(16, 8))  # Further reduced height from 10 to 8
    colors = plt.cm.Set2(np.linspace(0, 1, len(fields)))
    
    # Create the ridgeline plot
    field_width = 1.5
    joyplot_scale = 0.4  # Increased to make distributions taller
    y_base = 0  # Start from the bottom of the plot
    
    max_density = 0
    for i, (field, field_data) in enumerate(df.groupby('Field')):
        programs = field_data['Program'].unique()
        field_max_density = 0
        program_densities = []
        
        for program in programs:
            program_data = field_data[field_data['Program'] == program]
            density = scipy_stats.gaussian_kde(program_data['Scaled_Loan'])
            xs = np.linspace(-3, 3, 200)
            ys = density(xs)
            program_densities.append((xs, ys))
            field_max_density = max(field_max_density, ys.max())
        
        # Update overall max_density
        max_density = max(max_density, field_max_density)
        
        # Plot densities for this field
        x_center = i * field_width
        for j, (xs, ys) in enumerate(program_densities):
            y_offset = j * field_max_density * joyplot_scale * 0.5  # Adjust vertical spacing
            ax.fill_between(xs * 0.3 + x_center, y_base + y_offset, y_base + y_offset + ys * joyplot_scale, 
                            alpha=0.6, color=colors[i])
            ax.plot(xs * 0.3 + x_center, y_base + y_offset + ys * joyplot_scale, color='black', linewidth=0.5)
    
    # Adjust y-axis limits to reduce white space even further
    ax.set_ylim(0, max_density * joyplot_scale * (len(programs) * 0.8))  # Reduced multiplier
    
    # Customize the plot
    ax.set_title("Distribution of Student Loans Across Fields of Study", fontsize=16, pad=20)
    ax.set_xlabel("Fields of Study", fontsize=12)
    ax.set_yticks([])  # Remove y-axis ticks
    ax.set_xticks([i * field_width for i in range(len(fields))])
    ax.set_xticklabels([f"{field}\n(Mean: ${field_stats.loc[field, 'mean']:,.0f})" for field in fields], fontsize=10)
    
    # Add vertical lines for each field, aligned with the center of labels
    for i in range(len(fields)):
        ax.axvline(i * field_width, color='red', linestyle='--', linewidth=1, alpha=0.7, ymin=0, ymax=1)
    
    # Add legend
    legend_elements = [plt.Line2D([0], [0], color=colors[i], lw=4, label=field) for i, field in enumerate(fields)]
    ax.legend(handles=legend_elements, title="Fields of Study", loc='upper right', bbox_to_anchor=(1.15, 1))
    
    # Use tight_layout with parameters
    plt.tight_layout(rect=[0, 0, 0.95, 1])  # Adjusted to leave space for legend
    
    # Save the figure
    plt.savefig('Figure 1.png', dpi=300, bbox_inches='tight')
    
    plt.show()

  • #2
    Short answer: I have not used this method or the joyplot command in Stata or any equivalent in other software. I think you need an answer from Asjad Naqvi So I don't have comments directly relevant to what you seek.

    The rest of this reply is just a series of marginal comments.

    These plots often look quite attractive but I suspect partly for the wrong reasons, because they are reminiscent of rolling topography or even human anatomy.

    The name ridgeline plot is greatly preferable, as often explained.

    The easy but hard question is what can you see statistically (economically, or whatever) in these plots that you can't see as well or better in other displays. I typically see overlap, so a mix of juxtaposition and superimposition that has the downsides of both with perhaps the benefits of neither.

    multidensity from SSC is a convenience command for getting several density estimates. (Its name is unfortunate if misread as offering multivariate or even bivariate density estimates.) https://www.statalist.org/forums/for...lable-from-ssc

    36 or so distributions are hard to grasp either in totality or in detail.

    Inspired however by looking at wage from the nlsw88 dataset, I fired up stripplot from SSC for 4 groups.

    I always worry that the density estimation that gives smoother displays leaves out detail that I should care about as interesting or important.

    This wouldn't work well for 36!

    FWIW, I only ever Winsorize myself to get a Winsorized mean or variance.

    Code:
    sysuse nlsw88, clear 
    
    egen group = group(race married), label 
    
    gen ln_wage = ln(wage)
    
    * mylabels from Stata Journal 
    mylabels 1 2 5 10 20 50, myscale(ln(@)) local(yla)
    
    * gmean requires your own _ggmean.ado or installation of egenmore from SSC 
    
    stripplot ln_wage if group <= 4, over(group) centre cumul cumprob vertical refline reflevel(gmean) box(barw(0.08)) boffset(-0.5) pctile(5) xtitle("") ytitle(Wage (log scale)) xla(, noticks) height(0.7) yla(`yla') note("box plot shows median, quartiles, 5 and 95% points" "longer lines show geometric means")
    Click image for larger version

Name:	ln_wage.png
Views:	1
Size:	59.9 KB
ID:	1763188

    Comment


    • #3
      Dear CJ Libassi, this is a very specific use-case scenario and doesn't make much sense to extent joyplot for this. I would recommend doing a custom script for this which is fairly straightforward.

      For each variable, you can extract the kernel density coordinates:

      Code:
      kdensity variable if xxxx, generate(xdens ydens) bwidth(xxx) n(XXX) nograph
      You can loop it over the variables and their categories. Displace the y and x variables and then you can generate the graph shown above.

      As Nick also points above, it is not so clear what the storyline is from the figure so maybe it needs more annotations (e.g. y-axis labels) or even some experimentation with kernels and bandwidths? Are the red dotted lines the medians for each category shown in the label below? This makes it even more difficult to compare across categories so stacking them all on top of each other (that can already be done with the current joyplot command). But these are random points without knowing the background to the figure.

      Nick Cox the joyplot command is now also mirrored as "ridgeline" based on the many discussions earlier on the issue with joyplot.



      Comment


      • #4
        Thanks all for these comments and ideas. I totally take the point both that this threatens to be a busy plot and that the current mock up I have made isn't well labeled. For my use case, my thinking was that it would help me understand a few things. I am trying to understand the distribution of student debt across different graduate fields of study. The difference in average amounts across fields is fairly easy to understand and plot by just showing means by field. But student debt is one of those things where outliers are quite common and matter to how you understand the nature of the problem: if a few students have a lot of debt that has a different policy solution than if a lot of students have a moderate amount of debt. In any case, my data has students nested in programs which are nested in fields of study, and my hope was to try to get an overview of the dispersion of debt both across students within programs and across programs within a field. The idea is that I would like to be able to quickly visually assess if programs in a field generally have their students leaving with the same amount of debt or if it is bimodal or skewed in some interesting way, and if we see a lot of different patterns across programs in a given field. For example, I can law programs, where I have the sense that there is a fair amount of endowed scholarship money available at some elite programs showing a lot of variety in the debt their students leave with but there being other programs where all students take on a lot of debt. But I can also imagine some fields systematically charge similar amounts and have similar price structures for all programs (if, say, nursing programs throughout the country have a fairly consistent business model and price, without much tuition discounting through scholarships), while others have a lot more variety. My hope was just to get a sense of those patterns in one picture.

        I certainly agree that my idea for the vertical lines in the plot isn't clear at the moment, but my idea for it was to think of it as a sort of re-centered mean. That is, while the vertical line's placement on the x-axis is meaningless (though in a revision I think I would be inclined to array the fields in ascending order of the average debt amount per field, though still not make the distance between the lines meaningful, just keep it evenly spaced for visual clarity), the placement of the distributions around the line would be meaningful. If the vertical line were to be treated as the overall field of study mean, the placement of the distributions around the line is meaningful. The vertical line intersects each program's distribution at the point representing the field's average debt. So for example, if the field average is at the 50th percentile of a program's distribution, the curve will be centered on the line. If the field average is at the 75th percentile of a program's distribution, 75% of the curve's area will be to the left of the line, and 25% to the right.

        Comment

        Working...
        X