Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Overlapping Histogram Showing descriptive statistics and lines

    Dear All,

    I am creating an overlapping histogram to show the difference in log wages for males and females. I want to add descriptive statistics to the graph to show the standard errors and median and a separate box showing mean, median, and sd values.

    I have tried to follow several threads but I have failed to get the lines and text. When the text appears, the histogram only shows the male graph. Data sample using dataex:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int RO3 float log_hourly_wage
    2         3.212141
    2         1.213983
    1  3.657131
    1  3.267666
    1  3.218876
    2         2.198173
    1  4.527209
    1 4.0734115
    2 2.2118309
    1 1.8345364
    1  4.121373
    1  3.669928
    1  2.995732
    1 3.2834144
    1  2.716349
    1  3.715908
    2 3.6105475
    1  4.775681
    1  3.624341
    1  3.267666
    1  3.405753
    1   2.70805
    1  4.467549
    1  3.657131
    1 3.7784915
    end
    label values RO3 RO3
    label def RO3 1 "Male 1", modify
    label def RO3 2 "Female 2", modify
    The codes that I have tried:

    Code:
    summarize log_hourly_wage if RO3 == 1
    local group1_mean = round(r(mean), 0.01)
    local group1_median = round(r(p50), 0.01)
    local group1_sd = round(r(sd), 0.01)
    
    summarize log_hourly_wage if RO3 == 2
    local group2_mean = round(r(mean), 0.01)
    local group2_median = round(r(p50), 0.01)
    local group2_sd = round(r(sd), 0.01)
    
    // Display the summary statistics in the graph
    gr twoway ///
        (histogram log_hourly_wage if RO3 == 1, bin(10) percent xlabel(, nogrid) ylabel(, nogrid) color(red) fcolor(gs12)) ///
        (histogram log_hourly_wage if RO3 == 2, bin(10) percent xlabel(, nogrid) ylabel(, nogrid) color(black) fcolor(gs12)) ///
        (rcap `group1_mean' 5 `group1_mean' `group1_mean' `group1_sd', horizontal lc(blue) lw(medthick)) ///
        (rcap `group2_mean' 5 `group2_mean' `group2_mean' `group2_sd', horizontal lc(red) lw(medthick)) ///
        (scatteri `group1_mean' 5 1, ms(Oh) mla(`group1_mean') mlabposition(0) mlabgap(0) mlabsize(small) mlabcolor(blue)) ///
        (scatteri `group2_mean' 5 1, ms(Oh) mla(`group2_mean') mlabposition(0) mlabgap(0) mlabsize(small) mlabcolor(red)) ///
        (scatteri `group1_median' 5 1, ms(Oh) mla(`group1_median') mlabposition(0) mlabgap(0) mlabsize(small) mlabcolor(blue)) ///
        (scatteri `group2_median' 5 1, ms(Oh) mla(`group2_median') mlabposition(0) mlabgap(0) mlabsize(small) mlabcolor(red)) ///
        legend(order(1 "Male" 2 "Female")) ///
        ytitle("Percent", size(small)) ///
        text(0.6 `group1_mean' "Male Mean: `group1_mean'", size(small) placement(ne) color(blue)) ///
        text(0.6 `group2_mean' "Female Mean: `group2_mean'", size(small) placement(ne) color(red)) ///
        text(0.4 `group1_median' "Male Median: `group1_median'", size(small) placement(e) color(blue)) ///
        text(0.4 `group2_median' "Female Median: `group2_median'", size(small) placement(e) color(red))
    It gives an error 13 invalid name.

    The next code gives me the values but when I try to add lines it does not work.

    Code:
    summarize log_hourly_wage if RO3 == 1
    local group1_mean = r(mean)
    local group1_median = r(p50)
    local group1_sd = r(sd)
    
    summarize log_hourly_wage if RO3 == 2
    local group2_mean = r(mean)
    local group2_median = r(p50)
    local group2_sd = r(sd)
    
    // Display the summary statistics in the graph
    gr twoway hist log_hourly_wage if RO3==1, bin(10) percent xlabel(, nogrid) ylabel(, nogrid) color(red)  fcolor(%50)|| hist log_hourly_wage if RO3==2, bin(10) percent xlabel(, nogrid) ylabel(, nogrid)  color(black) fcolor(%50) legend(order(1 "Male" 2 "Female" )) text(20 30 "Male Mean: `group1_mean'" ///
               40 20 "Male Median: `group1_median'" ///
               41 20  "Male SD: `group1_sd'", size(small)  placement(ne) color(blue)) ///
         text(10 8  "Female Mean: `group2_mean'" ///
               10 7  "Female Median: `group2_median'" ///
               10 6  "Female: `group2_sd'", size(small) placement(e) color(red))
    Any leads and clues on how to solve it will be appreciated.

  • #2
    Thank you for providing some example data so we can replicate your plot, like:
    Click image for larger version

Name:	Case_Overlapping_Histograms_w_Stats_1.png
Views:	1
Size:	8.7 KB
ID:	1721341






    My first observation is that Stata's twoway does what it is asked to do: put text way outside of the plot as your code states:
    40 20 "Male Median: `group1_median'" ///
    41 20 "Male SD: `group1_sd'", size(small) placement(ne) color(blue)) ///
    text(10 8 "Female Mean: `group2_mean'" ///
    10 7 "Female Median: `group2_median'" ///
    10 6 "Female: `group2_sd'", size(small) placement(e) color(red))

    Note that plot coordinates are read as y (coordinate) x (coordinate). Clearly, in your code that is way outside of the plot/graph region as it is set, which explains why we now can see a snippet of text (Fe) but nothing else.
    We can control both these coordinates and/or the size proportion of the graph region and the plot region to improve this plot.
    Although the help file provides ample explanation of functionality and the multitude of options to control how a graph/plot will be generated by Stata, should you consider doing more work with graphic output, then I can recommend Michael N. Mitchell's book A Visual Guide to Stata Graphics.

    Now, let see what we can do to improve your plot and make some code changes:
    Code:
    * Example 2
    summarize log_hourly_wage if RO3 == 1
    local group1_mean = r(mean)
    local group1_median = r(p50)
    local group1_sd = r(sd)
    
    summarize log_hourly_wage if RO3 == 2
    local group2_mean = r(mean)
    local group2_median = r(p50)
    local group2_sd = r(sd)
    
    // Display the summary statistics in the graph
    gr twoway hist log_hourly_wage if RO3==1, bin(10) percent xlab(, nogrid) ylab(, nogrid) color(red)  fc(%50) || hist log_hourly_wage if RO3==2, bin(10) percent xlab(, nogrid) ylab(, nogrid) color(black) fc(%50) , ///
        xsize(12) ysize(8) xsca(range(1(1)5) noextend)    ///
        legend(order(1 "Male" 2 "Female" )) text(40 5 "Male Mean: `group1_mean'" ///
        38 5 "Male Median: `group1_median'" ///
        36 5 "Male SD: `group1_sd'", size(small)  placement(ne) color(black)) ///
        text(10 5 "Female Mean: `group2_mean'" ///
        8 5 "Female Median: `group2_median'" ///
        6 5 "Female SD: `group2_sd'", size(small) placement(e) color(red))
    which results in:
    Click image for larger version

Name:	Case_Overlapping_Histograms_w_Stats_2.png
Views:	1
Size:	13.0 KB
ID:	1721342






    You might be happy with the fact that we now do see the required text and related summary statistics.
    Note that there is no statistic being reported for the Male and Female Median. This is because you did not ask Stata to calculate this statistic by using:
    summarize log_hourly_wage if RO3 == 1
    instead of:
    summarize log_hourly_wage if RO3 == 1, detail
    or:
    sum log_hourly_wage if RO3 == 1, detail
    So, our next step is to change that in your code and do something about the seemingly endless string of decimals that are reported.
    Therefore, we try to further improve with a much used option which is to round the values, like:
    Code:
    * Example 3
    qui sum log_hourly_wage if RO3 == 1, detail
    local group1_mean = round(r(mean),.01)
    local group1_median = round(r(p50),.01)
    local group1_sd = round(r(sd),.01)
    
    qui sum log_hourly_wage if RO3 == 2, detail
    local group2_mean = round(r(mean),.01)
    local group2_median = round(r(p50),.01)
    local group2_sd = round(r(sd),.01)
    dis `group2_sd'
    
    // Display the summary statistics in the graph
    gr twoway hist log_hourly_wage if RO3==1, bin(10) percent xlab(, nogrid) ylab(, nogrid) color(red)  fc(%50) || hist log_hourly_wage if RO3==2, bin(10) percent xlab(, nogrid) ylab(, nogrid) color(black) fc(%50) , ///
        xsize(12) ysize(8) xsca(range(1(1)5) noextend)    ///
        legend(order(1 "Male" 2 "Female" )) text(40 5 "Male Mean: `group1_mean'" ///
        38 5 "Male Median: `group1_median'" ///
        36 5 "Male SD: `group1_sd'", size(small)  placement(ne) color(black)) ///
        text(10 5 "Female Mean: `group2_mean'" ///
        8 5 "Female Median: `group2_median'" ///
        6 5 "Female SD: `group2_sd'", size(small) placement(e) color(red))
    which results in:
    Click image for larger version

Name:	Case_Overlapping_Histograms_w_Stats_3.png
Views:	1
Size:	12.6 KB
ID:	1721343






    Now, indeed the Mean and the Median are reported in the plot rounded to two decimals. But, maybe you are puzzled by the fact that the SD is not rounded as requested.
    This is because Stata calculates in binary and in this case that is reported with unrounded values.
    Again we have to consider another alternative and, indeed, there is a better alternative for using round(,01) in graphs (plots), as was explained by Nick Winter back in 2005 on the now retired, but accessible stalist archive here (I do hope that the stalist archive is never ever disconnected from the Internet!).
    However, it does require an additional line of code for each statistic and the code now reads as:
    Code:
    * Example 4
    qui sum log_hourly_wage if RO3 == 1, detail
    local group1_mean = round(r(mean),.01)
    local group1_mean_Rd : di %3.2f `group1_mean'
    local group1_median = r(p50)
    local group1_median_Rd : di %3.2f `group1_median'
    local group1_sd = round(r(sd),.01)
    local group1_sd_Rd : di %3.2f `group1_sd'
    
    qui sum log_hourly_wage if RO3 == 2, detail
    local group2_mean = round(r(mean),.01)
    local group2_mean_Rd : di %3.2f `group2_mean'
    local group2_median = r(p50)
    local group2_median_Rd : di %3.2f `group2_median'
    local group2_sd = round(r(sd),.01)
    local group2_sd_Rd : di %3.2f `group2_sd'
    
    // Display the summary statistics in the graph
    gr twoway hist log_hourly_wage if RO3==1, bin(10) percent xlab(, nogrid) ylab(, nogrid) color(red)  fc(%50) || hist log_hourly_wage if RO3==2, bin(10) percent xlab(, nogrid) ylab(, nogrid) color(black) fc(%50) , ///
        xsize(12) ysize(8) xsca(range(1(1)5) noextend)    ///
        legend(order(1 "Male" 2 "Female" )) text(40 5 "Male Mean: `group1_mean_Rd'" ///
        38 5 "Male Median: `group1_median_Rd'" ///
        36 5 "Male SD: `group1_sd_Rd'", size(small)  placement(ne) color(black)) ///
        text(10 5 "Female Mean: `group2_mean'" ///
        8 5 "Female Median: `group2_median_Rd'" ///
        6 5 "Female SD: `group2_sd_Rd'", size(small) placement(e) color(red))
    which results in:
    Click image for larger version

Name:	Case_Overlapping_Histograms_w_Stats_4.png
Views:	1
Size:	12.5 KB
ID:	1721344






    Here we are where we want to be: statistics reported in a Stata graph, plot. that are set to round to two decimals!
    Maybe you dislike the effort it took to get to this point. I agree that it sometimes is somewhat tedious to find code that delivers what you are looking for.
    However, in my opinion is certainly worth the effort as we can provide reproducable code and control about any element of our graph or plot.
    Last edited by ericmelse; 21 Jul 2023, 03:05.
    http://publicationslist.org/eric.melse

    Comment


    • #3
      Noting the very helpful post by ericmelse I want to make a quite different suggestion, that histograms are vastly oversold in this territory. The vital question of how best to show overlapping distributions so that they can be interpreted easily and effectively is complicated further by the usual issues of choosing bin width and origin.

      A quantile plot is one strong competitor. Here I use qplot from the Stata Journal. It is just a wrapper for twoway so text annotations are equally possible.

      I want to flag that the mean log wage can be backtransformed as the geometric mean. Geometric means should be in every introductory statistics text; in practice it is a challenge to find a mention in any introductory text!


      Code:
       qplot log_hourly_wage, over(RO3) aspect(1)
      Click image for larger version

Name:	wage.png
Views:	1
Size:	27.5 KB
ID:	1721353

      Last edited by Nick Cox; 21 Jul 2023, 04:03.

      Comment

      Working...
      X