  • Overlaying normal distribution graphs to compare outcomes between genders

    Hello,

    In order to compare the distribution of math scores between female and male students, we aim to plot bell curves of the results for each gender and overlay them in a single graph with a single set of axes. However, the closest we have come to accomplishing this is to get two separate graphs beside each other in one file. The "twoway" command seems unable to provide normal distribution graphs and smooth lines. How do we achieve our aim?

    This is our current code:


    Code:
    clear
    input float(mathScoreGirl mathScoreBoy)
    -.16130325          .
             . -1.8561664
             .  -.3401746
    -1.4119294          .
    -.06797387          .
     1.1607436          .
             .  .01306791
      .3450717          .
     -3.188828          .
    -1.4759908          .
    end
    
    histogram mathScoreGirl, percent normal /*
    */ color(none) lwidth(none) saving(DistrGirls,  replace)
    
    histogram mathScoreBoy, percent normal /*
    */ color(none) lwidth(none) saving(DistrBoys,  replace)
    
    graph combine DistrGirls.gph DistrBoys.gph, ycommon /*
    */ saving(DistrComp, asis replace)
    Thank you in advance!
    Last edited by Stat Olivia; 04 Apr 2017, 03:31.

  • #2
    Stat Olivia: The request to re-register using a full real name still stands. http://www.statalist.org/forums/foru...t-observations #4

    You can get what I think you are asking for. It's just very hard to optimise:

    1. Superimposing histograms means that one set of bars may often occlude the other. Making the bars transparent solves this problem only to create others.

    2. Although still very popular, binning is arbitrary (which width? which start?) and obscures any fine structure of interest and importance as well as noise.

    Here is some sample code. I won't show the graph, as it isn't interesting.

    Code:
    clear
    input float(mathScoreGirl mathScoreBoy)
    -.16130325          .
             . -1.8561664
             .  -.3401746
    -1.4119294          .
    -.06797387          .
     1.1607436          .
             .  .01306791
      .3450717          .
     -3.188828          .
    -1.4759908          .
    end
    
    twoway histogram mathScoreGirl, ///
    bfc(none) blc(orange) width(0.5) start(-3.5) ///
    || function normalden(x), ra(mathScoreGirl) lc(orange) ///
    || histogram mathScoreBoy, bfc(none) blc(blue) width(0.5) start(-3.5) ///
    || function normalden(x + 0.5),  lcolor(blue) ra(mathScoreGirl) ///
    legend(order(1 "Girls" 3 "Boys")) xtitle(Mathematics scores) ytitle(Probability density)
    Many people will have different favourites for this problem. Kernel density estimation might be one. I was once a fan, now am not so keen. You need not care about that, and in any case others are free to push it. My favourite would be a quantile plot with a transformed probability scale such that a normal distribution shows as a straight line. qnorm will do separate graphs, but superimposition is likely to work better for a problem like yours and for that you could use qplot from the Stata Journal. See also http://stats.stackexchange.com/quest...-sample-t-test and/or search qplot in Stata for references.

    Code:
    gen female = mathScoreGirl < . 
    label def female 0 Boys 1 Girls 
    label val female female 
    gen mathScore = max(mathScoreGirl, mathScoreBoy) 
    qplot mathScore, trscale(invnormal(@)) over(female) aspect(1) 
    
    sysuse auto, clear 
    qplot mpg, trscale(invnormal(@)) over(foreign) aspect(1) ///
    xtitle(standard normal deviate) legend(col(1) order(2 1) pos(11) ring(0))
    As before, I won't show the graph for your data: your example served its purpose of showing us your variables and data structure, but the sample is too small to be interesting. I do show the last example for mpg from the auto dataset.

    The key principles are

    1. A (perfect) sample from a normal distribution falls exactly on a straight line.

    2. Samples with different means have different intercepts.

    3. Samples with different SDs have different slopes.

    4. Curvature may be suggestive of the need for transformation.

    5. Fine structure may be discernible as each data point is plotted separately. (Here, it's the granularity whereby mpg is reported as integers.)



    [Figure qnormplot3.png: quantile plot of mpg by foreign (auto dataset) on a normal probability scale]
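
    Principles 1 to 3 can be checked on simulated data using only built-in commands. What follows is a minimal sketch, not part of the original reply: the seed, the sample size, the group parameters and the variable names group, y, p and z are all assumptions chosen for illustration, and the plotting position (i - 0.5)/n is just one common choice.

    ```stata
    clear
    set seed 12345                      // arbitrary seed, for reproducibility only
    set obs 200
    gen byte group = mod(_n, 2)         // two hypothetical groups of 100 each

    * group 0: mean 0, SD 1; group 1: mean 1, SD 2 (illustrative values)
    gen y = cond(group, rnormal(1, 2), rnormal(0, 1))

    * plotting positions within each group, then standard normal deviates
    bysort group (y): gen p = (_n - 0.5) / _N
    gen z = invnormal(p)

    * each group should fall close to its own straight line
    twoway scatter y z if group == 0 || scatter y z if group == 1, ///
        legend(order(1 "mean 0, SD 1" 2 "mean 1, SD 2")) ///
        xtitle(standard normal deviate) ytitle(simulated value) aspect(1)

    * rough check: intercept near the group mean, slope near the group SD
    regress y z if group == 0
    regress y z if group == 1
    ```

    The two regressions are only a quick way of reading intercepts and slopes off the plot; they are not a formal test of normality.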



    • #3
      Plotting normal distributions as if they were empirical results (which is what you seem to want to do) is often a bad idea. It suggests a structure that is not actually observed. I suspect that this is why it is not implemented as a standard graph. That does not mean that it is impossible. Below is an example. But this does not mean that I recommend doing this (I don't).

      Code:
      // prepare example data
      sysuse nlsw88, clear
      gen lnwage = ln(wage)
      
      // collect means and standard deviations
      sum lnwage if race == 1
      local m_w = r(mean)
      local sd_w = r(sd)
      sum lnwage if race == 2
      local m_b = r(mean)
      local sd_b = r(sd)
      
      // plot various smooth lines
      twoway function white = normalden(x,`m_w',`sd_w') , range(0 4) || ///
             function black = normalden(x,`m_b',`sd_b') , range(0 4)
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------
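
      The same recipe can be applied to the variables in #1. This is a minimal sketch under assumptions, not code from the thread: the local macro names m_g, sd_g, m_b and sd_b and the plotting range of -4 to 4 are chosen here for illustration, and with only seven girls and three boys the fitted curves say very little. The caveat in #3 applies in full: the curves show fitted normal densities, not observed distributions.

      ```stata
      clear
      input float(mathScoreGirl mathScoreBoy)
      -.16130325          .
               . -1.8561664
               .  -.3401746
      -1.4119294          .
      -.06797387          .
       1.1607436          .
               .  .01306791
        .3450717          .
       -3.188828          .
      -1.4759908          .
      end

      // collect means and standard deviations for each gender
      summarize mathScoreGirl
      local m_g = r(mean)
      local sd_g = r(sd)
      summarize mathScoreBoy
      local m_b = r(mean)
      local sd_b = r(sd)

      // overlay the two fitted normal densities on one set of axes
      twoway function girls = normalden(x, `m_g', `sd_g'), range(-4 4) || ///
             function boys = normalden(x, `m_b', `sd_b'), range(-4 4) ///
             xtitle(Mathematics scores) ytitle(Probability density)
      ```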



      • #4
        Maarten's post seems very consistent with mine. We have the advantage of knowing each other's views on many issues over several years. He picked up what I did not, that you are suppressing the actual histograms and just want the normal curves. I have to sign up to not recommending that at all.



        • #5
          Here's an example revisiting #2 for paired data. I've used one of Stata's own datasets so that anyone interested can reproduce this easily after installing qplot from the Stata Journal website.

          The example is trivial but the ideas here of plotting the differences so that the data can be checked and the mean compared with zero are more general.



          Code:
          webuse fuel, clear
          list 
          gen diff = mpg2 - mpg1
          su diff
          qplot diff, trscale(invnormal(@)) aspect(1) ///
              xtitle(standard normal deviate) yla(, ang(h)) ///
              yli(`r(mean)') ytitle("difference, mpg2 - mpg1") ///
              ymtick(0, grid glc(gs12)) text(`r(mean)' -1.5 "mean", place(12))
          [Figure ttest.png: normal quantile plot of diff = mpg2 - mpg1 (fuel dataset), with the mean marked]
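
          The graphical comparison of the mean difference with zero has a numerical counterpart in a paired t test. The lines below are a hedged sketch rather than part of the original post: they assume internet access for webuse, and the local names p_diff and p_pair are made up here. Both forms test the same hypothesis; the second reports mpg1 - mpg2, so the sign of the difference is reversed relative to diff.

          ```stata
          webuse fuel, clear
          gen diff = mpg2 - mpg1

          * one-sample t test of the differences against zero
          ttest diff == 0
          local p_diff = r(p)

          * equivalent paired t test on the two original variables
          ttest mpg1 == mpg2
          local p_pair = r(p)

          * the two-sided P-values should agree
          display "P-value from differences: " `p_diff'
          display "P-value from paired test: " `p_pair'
          ```

          The plot remains worth drawing first, since the test alone does not show outliers or granularity in the differences.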
