  • Alternatives to scatter/jitter: problems with stack command in stripplot


    I would like to display data on medicine samples that fall outside specific testing limits (continuous variable dev_all_gr), by medicine type (categorical variable molnum). The data cluster close to the limits, and I would like to show that. The data look like this (sorry, you need quite a bit of it to recreate the problem):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte molnum float dev_all_gr
    4    -5.26
    5    -7.37
    3    -7.75
    2    18.57
    5    13.97
    3   -11.09
    5   -10.95
    3    -9.05
    5     -9.2
    1      -54
    3    -5.77
    3    -5.06
    4    -5.52
    3    -9.35
    4     9.85
    1   -53.54
    5    -5.32
    3    -6.23
    1    -5.08
    5     -6.3
    3     -8.9
    4   -10.57
    3     -5.3
    4    -7.41
    5     -8.3
    3    -5.34
    1     -7.6
    3    -5.53
    5    -5.02
    4     15.4
    5    -9.55
    5    -6.18
    3   -32.96
    3    -7.34
    5    -6.75
    3   -12.05
    5    -6.13
    1    -5.02
    4    -10.8
    1    11.53
    5       -8
    1     -9.7
    5     5.83
    4    -8.83
    4     7.39
    4     5.19
    3    -6.08
    1    -7.54
    2 5.336169
    5    -8.58
    5   -10.17
    3     -8.5
    2     7.64
    5    -6.21
    1    -7.47
    1   -17.32
    1   -44.07
    4   -11.15
    4    -7.92
    4    10.78
    3   -12.85
    4    -7.13
    5    -6.54
    1    85.61
    1    87.26
    1    40.13
    3    -5.02
    3    -8.91
    3   -13.23
    4    -7.72
    3     -5.8
    2    18.48
    3   -15.32
    5   -12.64
    5   -13.71
    5   -10.39
    3    -7.92
    3    -5.29
    3    -8.08
    3   -12.03
    1    -5.56
    3   -11.57
    5   -11.99
    3    -5.64
    4     7.97
    5    -5.52
    5    -8.99
    1     8.53
    5    -7.56
    5    -10.4
    3    -8.52
    5     -7.7
    5    -22.6
    1    -6.62
    1   -49.75
    5    -9.52
    5   -11.71
    5   -14.92
    5   -13.58
    5    -9.58
    end
    label values molnum molnum
    label values dev_all_gr graph_scale

    I have tried various options, including a twoway scatter with jitter. The problem with the scatter/jitter approach is that the randomness it introduces pushes some data points across the acceptability limits, so that it looks as though there are failed samples within the acceptability zone, which is confusing to readers. [With my real data, of which there is more, there is more overlap.]

    Code:
    twoway scatter dev_all_gr molnum, jitter(3) jitterseed(2) msymbol(diamond) msize(tiny) ///
    yline(-5 5, lpattern(dash) lcolor(orange))
    statlist1.png

    Then I explored the stripplot option, using the very helpful example do-files posted by Nick Cox and colleagues in the help files.
    I am using stripplot, newly installed from SSC, on Stata/MP 18.0 (revision 13/07/2023) for Mac Silicon.


    Code:
    stripplot dev_all_gr, over(molnum) stack height(0.8) ms(Sh)  vertical  bar(level(95)) yla(, ang(h)) ///
    yline(-5 5, lpattern(dash) lcolor(orange))
    statlist1_stripplot.png


    Though I use the stack option, the data do not stack. I tried the same command line on the auto dataset, and it works just fine:


    Code:
    sysuse auto, clear
    stripplot mpg, over(foreign) stack height(0.8) ms(Sh)  vertical  bar(level(95)) yla(, ang(h))
    statlist2.png

    From this I conclude that it must be something to do with the dev_all_gr variable itself, but I am at a loss to guess what.

    Is it possible to restrict the jitter option to jittering only horizontally (so that data points do not get jittered vertically into the acceptability zone)? I guess I could come up with some other kludge involving drawing the limit lines (which are in any case simply a representation) in a different place. But that would be harder to do for other graph formats where I am having a similar problem, for example this (again, the data points should not overlap the rbar):

    statalist_bar.png

    (The code for the above, but not the data, is below, FYI.)


    Code:
    twoway rbar p25 med  inn_graph if inn==1, barw(.35) color(purple) horizontal || ///
    rbar  med p75 inn_graph if inn==1, barw(.35) color(purple*0.5) horizontal || ///
    scatter inn_graph price_ratio_med if inn==1 & (price_ratio_med <p25 | price_ratio_med >p75), jitter(5) jitterseed(3) msize(tiny) mcolor(purple) msymbol(circle) || ///
    rbar p25 med  inn_graph if inn==0, barw(.35) color(orange) horizontal || ///
    rbar  med p75 inn_graph if inn==0, barw(.35) color(orange*0.5) horizontal || ///
    scatter inn_graph price_ratio_med if inn==0 & (price_ratio_med <p25 | price_ratio_med >p75) & price!=0, jitter(5) jitterseed(3) msize(tiny) mcolor(orange) msymbol(circle) ///
    ytitle("") ylabel(1 2, valuelabel labsize(small) angle(0)) yscale(r(0.5 2.5)) ///
    legend(off)

    For the above twoway rbar/scatter, I spent a while trying to find a way to avoid the mess of overlapping data points completely by overlaying a kernel distribution on each of the values of inn (branded/unbranded) without drawing separate graphs (for other purposes, we want to do this over a variable with 6 categories). However, I was defeated.

    All suggestions gratefully received.

  • #2
    The stack option in stripplot stacks identical values. From the help:

    stack specifies that data points with identical values are to be stacked, as in dotplot, except that by default there is no binning of data.
    For a variable like mpg, that is easy because values are reported as integers and there are many repetitions. For measured variables, you need to bin as well using a width() option. Values in the same bin will stack.
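
    A quick way to see the point about binning is to count repeated values before and after rounding. This is only a sketch: it assumes the data example from #1 is in memory, dev_bin is a made-up variable name, and round(varname, width) is my understanding of how width() forms its classes by default, so check the stripplot help before relying on that detail.

    Code:
    * Assumes the data example from #1 is in memory
    * Raw values: few or no repeats within a medicine type, so stack has nothing to pile up
    duplicates report molnum dev_all_gr
    * dev_bin is a hypothetical variable: values rounded to multiples of 2
    generate dev_bin = round(dev_all_gr, 2)
    * After rounding there are many ties within each medicine type
    duplicates report molnum dev_bin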

    With your data example (thanks!) these seem better:

    Code:
    stripplot dev_all_gr, over(molnum) stack height(0.6) ms(Sh)  vertical  bar(level(95)) yla(, ang(h)) ///
    yline(-5 5, lpattern(dash) lcolor(orange)) width(2) name(G1, replace)
    
    stripplot dev_all_gr, over(molnum)  height(0.8) ms(Sh)  vertical  bar(level(95)) boffset(-0.5) yla(, ang(h)) ///
    yline(-5 5, lpattern(dash) lcolor(orange)) cumul centre name(G2, replace) 
    I am not a great fan of jittering, partly for the reason you report: it doesn't always work as you hope.
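
    If jittering is still wanted, one way to keep it strictly horizontal is to add the noise to the category variable by hand and leave the measured variable untouched, so that no point can cross the limit lines. A minimal sketch, assuming the data example from #1 is in memory; molnum_j, the seed and the spread of 0.2 either side are arbitrary choices for illustration.

    Code:
    * Assumes the data example from #1 is in memory
    set seed 2
    * molnum_j is a hypothetical variable: category position plus uniform horizontal noise
    generate molnum_j = molnum + runiform(-0.2, 0.2)
    * dev_all_gr is plotted as it stands, so nothing moves vertically
    twoway scatter dev_all_gr molnum_j, msymbol(diamond) msize(tiny) ///
        yline(-5 5, lpattern(dash) lcolor(orange)) xlabel(1/5) xtitle("")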

    I am a great fan of quantile plotting, which is what cumul vertical does here.
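
    To show the idea behind a cumulative (quantile) display with official commands only, here is a rough sketch on the auto data. It is not stripplot's own code: the plotting position (i - 0.5)/n and the horizontal spread of 0.8 are assumptions made for the illustration, and p and x are made-up variable names.

    Code:
    sysuse auto, clear
    * Cumulative probability within each group, using plotting positions (i - 0.5)/n
    bysort foreign (mpg): generate p = (_n - 0.5) / _N
    * Spread each group horizontally around its category position (0 or 1)
    generate x = foreign + 0.8 * (p - 0.5)
    * Each value is at its own height; the order statistics fan out left to right
    scatter mpg x, ms(Sh) xla(0 "Domestic" 1 "Foreign") xtitle("") yla(, ang(h))

    Every point keeps its exact value on the vertical axis, so nothing can drift across a reference line, and no random noise is involved.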



    • #3
      Bless you (and Kit Baum). Just FYI, here's what I managed to make of the distribution graph after your help, and after playing around a bit with some of the suggestions in the helpful example do-files. Idle, entirely aesthetic wish list for Kit: the ability to vary the colours of the box plots (across the "over" variable).

      price_variation_stripplot.png



      • #4
        You want different colours for the two boxes? If so, that is on the wish list, but odds are that it will get done when I want it myself. Sorry. It's programmable, but more than a simple couple of extra lines.



        • #5
          Fair enough! One further question. I am having trouble using value labels in stripplot.

          I have a numerical variable (dev_all_gr) which is offset from the original variable (dev_all) by either plus or minus five points, for visual purposes. [I want to add a notional zone of acceptability in the middle of the graph.] I created value labels for dev_all_gr to reflect the actual (not offset) values at key points, and want to use those rather than the offset values to label the graph (only at the values for which I assigned labels). This worked fine when I was using twoway, but does not seem to work in stripplot. labmask works on the over() variable, and I tried a couple of labmask-related workarounds for dev_all_gr, but got error messages related to non-integers. Any other possible solutions?

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input float(dev_all dev_all_gr) byte inn str13 mol_upper
            -.26  -5.26 0 "Dexamethasone"
           -2.37  -7.37 1 "Cefixime"     
           -2.75  -7.75 0 "Amoxicillin"  
           13.57  18.57 1 "Amlodipine"   
           -6.09 -11.09 0 "Amoxicillin"  
            8.97  13.97 0 "Cefixime"     
           -5.95 -10.95 1 "Cefixime"     
           -4.05  -9.05 1 "Amoxicillin"  
            -4.2   -9.2 0 "Cefixime"     
             -49    -54 1 "Allopurinol"  
            -.77  -5.77 0 "Amoxicillin"  
            -.06  -5.06 0 "Amoxicillin"  
            -.52  -5.52 0 "Dexamethasone"
           -4.35  -9.35 0 "Amoxicillin"  
            4.85   9.85 0 "Dexamethasone"
            -.32  -5.32 0 "Cefixime"     
          -48.54 -53.54 1 "Allopurinol"  
           -1.23  -6.23 0 "Amoxicillin"  
            -.08  -5.08 0 "Allopurinol"  
            -1.3   -6.3 0 "Cefixime"     
            -3.9   -8.9 1 "Amoxicillin"  
           -5.57 -10.57 0 "Dexamethasone"
             -.3   -5.3 0 "Amoxicillin"  
           -2.41  -7.41 0 "Dexamethasone"
            -3.3   -8.3 1 "Cefixime"     
            -.34  -5.34 0 "Amoxicillin"  
            -2.6   -7.6 0 "Allopurinol"  
            -.53  -5.53 0 "Amoxicillin"  
            -.02  -5.02 1 "Cefixime"     
            10.4   15.4 0 "Dexamethasone"
          end
          label values dev_all_gr graph_scale
          label values inn inn
          label def inn 0 "0 branded", modify
          label def inn 1 "1 generic", modify

          Code:
          lab def graph_scale 5 "None" -5 "None" 10 "5" -10 "-5" 15 "10" -15 "-10" 55 "50" -55 "-50" 105 "100"
          lab val dev_all_gr graph_scale
          
          labmask molnum, value(mol_upper)
          
          **GRAPH OF DEVIATION WITH STRIPPLOT
          
          gen inn_disp = . //generate display variable
          replace inn_disp = 1 if fail_det==2 & inn==0 // orange circle
          replace inn_disp = 2 if fail_det==2 & inn==1 // purple circle
          replace inn_disp = 3 if fail_det==3 & inn==0 // orange diamond
          replace inn_disp = 4 if fail_det==3 & inn==1 // purple diamond
          replace inn_disp = 5 if fail_det==4 & inn==0
          replace inn_disp = 6 if fail_det==4 & inn==1
          replace inn_disp = 7 if fail_det==5 & inn==0
          replace inn_disp = 8 if fail_det==5 & inn==1
          
          
          stripplot dev_all_gr, over(molnum) stack height(0.6) ///
          ms(circle circle diamond diamond triangle triangle square square) ///
          vertical yla(, ang(h)) separate(inn_disp) ///
          mlcolor(orange purple orange purple orange purple orange purple) ///
          mcolor(orange*0.4 purple*0.4 orange*0.4 purple*0.4 orange*0.4 purple*0.4 orange*0.4 purple*0.4) ///
          msize(vsmall vsmall vsmall vsmall vsmall vsmall vsmall vsmall) ///
          ylabel(-55 -15 -10 -5 5 10 15 55 105, valuelabel angle(0) labsize(vsmall)) ///
          ytitle("Deviation from permitted limits, in percentage points" "(or points, for uniformity)", size(vsmall) margin(r=2))  ///
          yline(-5 5, lpattern(dash) lcolor(green)) width(2) ///
          xscale(r(0.5 5.5)) xlabel(, valuelabel labsize(vsmall)) xtitle("") ///
          text(1 2 "Notional range of acceptability", color(green) size(tiny)) ///
          legend(order(1 "Failed assay" 3 "Failed dissolution" 5 "Failed uniformity" 7 "Failed assay and dissolution") rows(1) size(vsmall) region(lcolor(gray*.3))) ///
          note("Orange markers: Branded; Purple markers: Unbranded", size(vsmall)) ///
          plotregion(lstyle(none))
          
          graph export "$results/estimates_output/deviation_stripplot.png", replace



          • #6
            The code in #5 fails for me without data on molnum. Looking ahead there will be other problems from variables not in the data example.
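
            One workaround that does not depend on value labels at all is to spell out the label text inside the axis label option. This is a sketch only: it assumes the data example from #5 is in memory, uses inn as the over() variable because molnum is not in that example, and copies the label text from the lab def in #5. It does not explain why valuelabel fails to carry through.

            Code:
            * Assumes the data example from #5 is in memory
            * Label text is typed directly next to each tick position, so no value labels are needed
            stripplot dev_all_gr, over(inn) stack width(2) height(0.6) vertical ///
                yla(-55 "-50" -15 "-10" -10 "-5" -5 "None" 5 "None" 10 "5" 15 "10" 55 "50", ang(h)) ///
                yline(-5 5, lpattern(dash) lcolor(green))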

