  • Scatter plot

    How can I create a scatter plot of two categorical variables by assigning one to marker shape and the other to color, as in the attached example?

  • #2
    .


    • #3
      Stata data example please (https://www.statalist.org/forums/help#stata)

      If your question is really about R, I suggest asking on Stack Overflow or R-help.


      • #4
        My question, Mr. Nick, is how I can create a scatter plot with categorical variables in Stata, differentiating them by color and shape. Attachment #1 is an example constructed in R. Can you please give me an assist?


        • #5
          Stata data example please.


          • #6
            I'm actually curious to see the solutions proposed. I'd argue that the displayed R graph is really not clear, as it's difficult to make out the differences in the TOTAL values according to the classes of "rep" and "fraction". The circles and triangles of different colors don't really cut it for me, unless the message is that the TOTAL values are not that different between the different classes. To make it slightly worse, there seem to be circles of different sizes, which are not explained in the legend and which I simply can't interpret at all.

            I took the time to recreate a data set similar to the one Ian should have posted. Please find it below.

            Code:
            clear
            input str3 location str5 fraction str3 rep float total_af_ppb
            "BMT" "heavy" "one"   .634899
            "BMT" "heavy" "one"  .9944572
            "BMT" "heavy" "one"  .7497677
            "BMT" "heavy" "one"  .1736788
            "BMT" "heavy" "one"  .6107705
            "BMT" "heavy" "two" 2.5754216
            "BMT" "heavy" "two" 2.3678162
            "BMT" "heavy" "two" 2.3005245
            "BMT" "heavy" "two"  2.007538
            "BMT" "heavy" "two"  2.670137
            "BMT" "light" "one" 1.4241406
            "BMT" "light" "one"  1.953762
            "BMT" "light" "one" 1.0867478
            "BMT" "light" "one" 1.8949648
            "BMT" "light" "one" 1.5890286
            "BMT" "light" "two"  3.400583
            "BMT" "light" "two"   3.66549
            "BMT" "light" "two"  3.419839
            "BMT" "light" "two"  3.747205
            "BMT" "light" "two" 3.7190144
            "BSA" "heavy" "one" 104.08141
            "BSA" "heavy" "one"  106.1555
            "BSA" "heavy" "one" 101.74577
            "BSA" "heavy" "one" 103.61765
            "BSA" "heavy" "one"   101.339
            "BSA" "heavy" "two" 300.01364
            "BSA" "heavy" "two"   302.571
            "BSA" "heavy" "two"  306.5174
            "BSA" "heavy" "two"  309.2521
            "BSA" "heavy" "two"  308.2334
            "BSA" "light" "one"  209.2294
            "BSA" "light" "one" 207.48042
            "BSA" "light" "one" 205.21414
            "BSA" "light" "one" 204.02216
            "BSA" "light" "one"   208.682
            "BSA" "light" "two"  402.7266
            "BSA" "light" "two"  407.2395
            "BSA" "light" "two"  407.9555
            "BSA" "light" "two"  408.9251
            "BSA" "light" "two"  407.0788
            "HBY" "heavy" "one"  53.65269
            "HBY" "heavy" "one"   59.3105
            "HBY" "heavy" "one"  56.21681
            "HBY" "heavy" "one"  58.00435
            "HBY" "heavy" "one"  54.79837
            "HBY" "heavy" "two" 151.42947
            "HBY" "heavy" "two" 158.34344
            "HBY" "heavy" "two" 153.43124
            "HBY" "heavy" "two" 155.90686
            "HBY" "heavy" "two"  153.2953
            "HBY" "light" "one"  106.9963
            "HBY" "light" "one" 108.14297
            "HBY" "light" "one" 103.42973
            "HBY" "light" "one" 101.07968
            "HBY" "light" "one" 109.67175
            "HBY" "light" "two"  201.2855
            "HBY" "light" "two" 207.57854
            "HBY" "light" "two"  202.5003
            "HBY" "light" "two" 209.26913
            "HBY" "light" "two" 207.11842
            "KKA" "heavy" "one" 203.59343
            "KKA" "heavy" "one" 209.01303
            "KKA" "heavy" "one" 203.89214
            "KKA" "heavy" "one"  214.2715
            "KKA" "heavy" "one" 204.90623
            "KKA" "heavy" "two"  415.3449
            "KKA" "heavy" "two"  407.3071
            "KKA" "heavy" "two"  405.4138
            "KKA" "heavy" "two"  419.8226
            "KKA" "heavy" "two"  413.7026
            "KKA" "light" "one"  310.0553
            "KKA" "light" "one" 313.80765
            "KKA" "light" "one"   317.272
            "KKA" "light" "one" 300.80927
            "KKA" "light" "one" 303.68445
            "KKA" "light" "two"  508.3976
            "KKA" "light" "two"   512.951
            "KKA" "light" "two"  518.2063
            "KKA" "light" "two" 513.61847
            "KKA" "light" "two" 517.13763
            "KSM" "heavy" "one" 150.96303
            "KSM" "heavy" "one"   162.586
            "KSM" "heavy" "one"  159.3123
            "KSM" "heavy" "one" 156.06264
            "KSM" "heavy" "one" 164.67955
            "KSM" "heavy" "two"  355.4415
            "KSM" "heavy" "two"  354.7074
            "KSM" "heavy" "two"  359.5711
            "KSM" "heavy" "two"  352.8049
            "KSM" "heavy" "two"  357.5802
            "KSM" "light" "one" 257.91446
            "KSM" "light" "one" 261.78012
            "KSM" "light" "one" 257.07602
            "KSM" "light" "one" 253.44977
            "KSM" "light" "one" 261.96524
            "KSM" "light" "two"  452.4741
            "KSM" "light" "two"  463.9942
            "KSM" "light" "two"   455.999
            "KSM" "light" "two"   464.823
            "KSM" "light" "two"  463.9318
            "SYA" "heavy" "one"  786.4038
            "SYA" "heavy" "one"  723.6804
            "SYA" "heavy" "one"   753.365
            "SYA" "heavy" "one"   798.243
            "SYA" "heavy" "one"  721.7005
            "SYA" "heavy" "two"  2342.782
            "SYA" "heavy" "two" 2377.8247
            "SYA" "heavy" "two"  2395.336
            "SYA" "heavy" "two" 2405.7007
            "SYA" "heavy" "two" 2413.2275
            "SYA" "light" "one" 1552.4447
            "SYA" "light" "one"  1536.377
            "SYA" "light" "one"   1615.82
            "SYA" "light" "one"  1580.086
            "SYA" "light" "one" 1617.3347
            "SYA" "light" "two"  3143.638
            "SYA" "light" "two"  3187.648
            "SYA" "light" "two"  3134.591
            "SYA" "light" "two"  3149.664
            "SYA" "light" "two" 3202.1956
            end
            One way I would approach illustrating this is to break the graph in two, according to the data repetition. That way I would get rid of the circles/triangles. Different colors would just show the values of TOTAL for the different fractions, by repetition, within classes of location. One way to construct such a graph is with stripplot (downloaded from http://fmwww.bc.edu/RePEc/bocode/s ), written by Nick Cox:

            Code:
            stripplot total_af_ppb, over(location) separate(fraction) by(rep) jitter(2 2) msymbol(o o) vertical
            This graph could use some tweaks to the coloring, font sizes, etc., but I'm curious what other approaches people would use to illustrate similar data.

            Cheers


            • #7
              I think R is a bit more flexible with assigning colors and symbols to different categories. This is the closest I can get to the graph in #1.
              Code:
              separate total, by(rep) veryshortlabel
              encode location, gen(n_location)
              twoway (scatter total_af_ppb? n_location if fraction == "light", xlabel(1 2 3 4 5 6, val) msymbol(o d) mcolor(red red) jitter(5)) ///
                     (scatter total_af_ppb? n_location if fraction == "heavy", xlabel(1 2 3 4 5 6, val) msymbol(o d) mcolor(blue blue) jitter(5)), ///
                     legend(label(1 "One (Light)") label(2 "Two (Light)") label(3 "One (Heavy)") label(4 "Two (Heavy)"))
              I don't find the graph very clear though so I probably would put it into two graphs side by side anyway, as in Igor Paploski's example.


              • #8
                Thanks to Igor Paploski and Wouter Wakker for their contributions. As in a well-conducted examination, I worked out my answer in detail first -- with many iterations not shown here -- before looking at theirs closely. For the convenience of anyone interested, code for the data and all three suggestions is bundled together here.

                Code:
                clear
                input str3 location str5 fraction str3 rep float total_af_ppb
                "BMT" "heavy" "one"   .634899
                "BMT" "heavy" "one"  .9944572
                "BMT" "heavy" "one"  .7497677
                "BMT" "heavy" "one"  .1736788
                "BMT" "heavy" "one"  .6107705
                "BMT" "heavy" "two" 2.5754216
                "BMT" "heavy" "two" 2.3678162
                "BMT" "heavy" "two" 2.3005245
                "BMT" "heavy" "two"  2.007538
                "BMT" "heavy" "two"  2.670137
                "BMT" "light" "one" 1.4241406
                "BMT" "light" "one"  1.953762
                "BMT" "light" "one" 1.0867478
                "BMT" "light" "one" 1.8949648
                "BMT" "light" "one" 1.5890286
                "BMT" "light" "two"  3.400583
                "BMT" "light" "two"   3.66549
                "BMT" "light" "two"  3.419839
                "BMT" "light" "two"  3.747205
                "BMT" "light" "two" 3.7190144
                "BSA" "heavy" "one" 104.08141
                "BSA" "heavy" "one"  106.1555
                "BSA" "heavy" "one" 101.74577
                "BSA" "heavy" "one" 103.61765
                "BSA" "heavy" "one"   101.339
                "BSA" "heavy" "two" 300.01364
                "BSA" "heavy" "two"   302.571
                "BSA" "heavy" "two"  306.5174
                "BSA" "heavy" "two"  309.2521
                "BSA" "heavy" "two"  308.2334
                "BSA" "light" "one"  209.2294
                "BSA" "light" "one" 207.48042
                "BSA" "light" "one" 205.21414
                "BSA" "light" "one" 204.02216
                "BSA" "light" "one"   208.682
                "BSA" "light" "two"  402.7266
                "BSA" "light" "two"  407.2395
                "BSA" "light" "two"  407.9555
                "BSA" "light" "two"  408.9251
                "BSA" "light" "two"  407.0788
                "HBY" "heavy" "one"  53.65269
                "HBY" "heavy" "one"   59.3105
                "HBY" "heavy" "one"  56.21681
                "HBY" "heavy" "one"  58.00435
                "HBY" "heavy" "one"  54.79837
                "HBY" "heavy" "two" 151.42947
                "HBY" "heavy" "two" 158.34344
                "HBY" "heavy" "two" 153.43124
                "HBY" "heavy" "two" 155.90686
                "HBY" "heavy" "two"  153.2953
                "HBY" "light" "one"  106.9963
                "HBY" "light" "one" 108.14297
                "HBY" "light" "one" 103.42973
                "HBY" "light" "one" 101.07968
                "HBY" "light" "one" 109.67175
                "HBY" "light" "two"  201.2855
                "HBY" "light" "two" 207.57854
                "HBY" "light" "two"  202.5003
                "HBY" "light" "two" 209.26913
                "HBY" "light" "two" 207.11842
                "KKA" "heavy" "one" 203.59343
                "KKA" "heavy" "one" 209.01303
                "KKA" "heavy" "one" 203.89214
                "KKA" "heavy" "one"  214.2715
                "KKA" "heavy" "one" 204.90623
                "KKA" "heavy" "two"  415.3449
                "KKA" "heavy" "two"  407.3071
                "KKA" "heavy" "two"  405.4138
                "KKA" "heavy" "two"  419.8226
                "KKA" "heavy" "two"  413.7026
                "KKA" "light" "one"  310.0553
                "KKA" "light" "one" 313.80765
                "KKA" "light" "one"   317.272
                "KKA" "light" "one" 300.80927
                "KKA" "light" "one" 303.68445
                "KKA" "light" "two"  508.3976
                "KKA" "light" "two"   512.951
                "KKA" "light" "two"  518.2063
                "KKA" "light" "two" 513.61847
                "KKA" "light" "two" 517.13763
                "KSM" "heavy" "one" 150.96303
                "KSM" "heavy" "one"   162.586
                "KSM" "heavy" "one"  159.3123
                "KSM" "heavy" "one" 156.06264
                "KSM" "heavy" "one" 164.67955
                "KSM" "heavy" "two"  355.4415
                "KSM" "heavy" "two"  354.7074
                "KSM" "heavy" "two"  359.5711
                "KSM" "heavy" "two"  352.8049
                "KSM" "heavy" "two"  357.5802
                "KSM" "light" "one" 257.91446
                "KSM" "light" "one" 261.78012
                "KSM" "light" "one" 257.07602
                "KSM" "light" "one" 253.44977
                "KSM" "light" "one" 261.96524
                "KSM" "light" "two"  452.4741
                "KSM" "light" "two"  463.9942
                "KSM" "light" "two"   455.999
                "KSM" "light" "two"   464.823
                "KSM" "light" "two"  463.9318
                "SYA" "heavy" "one"  786.4038
                "SYA" "heavy" "one"  723.6804
                "SYA" "heavy" "one"   753.365
                "SYA" "heavy" "one"   798.243
                "SYA" "heavy" "one"  721.7005
                "SYA" "heavy" "two"  2342.782
                "SYA" "heavy" "two" 2377.8247
                "SYA" "heavy" "two"  2395.336
                "SYA" "heavy" "two" 2405.7007
                "SYA" "heavy" "two" 2413.2275
                "SYA" "light" "one" 1552.4447
                "SYA" "light" "one"  1536.377
                "SYA" "light" "one"   1615.82
                "SYA" "light" "one"  1580.086
                "SYA" "light" "one" 1617.3347
                "SYA" "light" "two"  3143.638
                "SYA" "light" "two"  3187.648
                "SYA" "light" "two"  3134.591
                "SYA" "light" "two"  3149.664
                "SYA" "light" "two" 3202.1956
                end
                
                egen group = group(rep fraction), label
                niceloglabels total_af_ppb , style(13) local(show)
                egen median = median(total), by(location)
                egen order = group(median)
                labmask order, values(location)
                set scheme s1color
                
                stripplot total_af_ppb, over(order) vertical box cumul cumprob centre separate(group) ysc(log)  ms(Oh + Oh +) mc(red red blue blue) yla(`show', ang(h)) xla(, noticks) xtitle("") legend(col(1) order(7 6 5 4) pos(3)) refline reflevel(gmean) name(NJC, replace)
                
                stripplot total_af_ppb, over(location) separate(fraction) by(rep) jitter(2 2) msymbol(o o) vertical name(IP, replace)
                
                separate total, by(rep) veryshortlabel
                encode location, gen(n_location)
                twoway (scatter total_af_ppb? n_location if fraction == "light", xlabel(1 2 3 4 5 6, val) msymbol(o d) mcolor(red red) jitter(5)) ///
                       (scatter total_af_ppb? n_location if fraction == "heavy", xlabel(1 2 3 4 5 6, val) msymbol(o d) mcolor(blue blue) jitter(5)), ///
                       legend(label(1 "One (Light)") label(2 "Two (Light)") label(3 "One (Heavy)") label(4 "Two (Heavy)")) name(WW, replace)
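                The bundled code above depends on several community-contributed commands. A minimal sketch for installing them, assuming internet access and assuming that all four packages can be fetched from SSC under these names (the thread itself only says that niceloglabels and labmask come from the Stata Journal and that egenmore is on SSC):

```stata
* Assumption: each package is available from SSC under the name used here.
* stripplot draws the strip plots used in the NJC and IP graphs.
ssc install stripplot, replace
* niceloglabels suggests axis labels for the log scale.
ssc install niceloglabels, replace
* labmask is assumed to be distributed inside the labutil package.
ssc install labutil, replace
* egenmore is assumed to supply the egen function needed by reflevel(gmean).
ssc install egenmore, replace
```

                If a package is already installed, the replace option simply refreshes it, so the lines can be rerun safely.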
                Here is my graph first:
                [Figure: stripplot_Ian_NJC.png]



                Comments on that:

                1. The original data are not explained but look like concentrations of something or other (parts per billion?) and typically such data are better shown on a logarithmic scale. In fact, regardless of what the data are, a glance at #1 made me think of using a log scale, assuming that there aren't zeros in the original data. There is perhaps a signal in the graph just above that logarithms over-transform slightly. Note that I tried a cube root scale and it wasn't an improvement overall. (You really need a strong case to use cube roots: even people who know what a cube root is can be highly resistant to such an unfamiliar idea.)

                2. With a log scale Stata isn't smart about axis labels: I used niceloglabels from the Stata Journal to help, which you must install before you can use it. If some readers are of the view that labels 1 10 100 1000 would be simpler, then I agree with them.

                3. I would want a nice variable label on the y axis, if only I knew what it should be.

                4. On the x axis, I don't need "Location" as an axis title. (My favourite small edit with students is to urge that you really don't need "Year" on the x axis if your axis is showing say 2000 2005 2010 2015. Behind every such decision I imagine high school teachers, fearsome in your own best interests, insisting that you label your axes! Excellent advice, but rules are just guidance for the wise in this instance.)

                5. I don't need ticks on the x axis either.

                6. I can't see any need to follow alphabetical order slavishly on the x axis. I don't know what the names expand to, but regardless, ordering locations by their medians is a not quite arbitrary choice, as I know that in principle taking medians commutes with taking logarithms, so that median(log) = log(median). There is some very small print in practice, which doesn't usually bite. Similarly, locations could tie on median, which can be resolved arbitrarily. I used labmask from the Stata Journal to help, which you must install before you can use it.

                7. Jittering is a wonderful idea to shake apart points that are identical or otherwise very close on a scatter plot, but it's overused (perhaps especially by some influential R users) when there are different ways to separate points. Here I use quantile-box plots in Parzen's sense. I add reference lines for geometric means: for that to work, you need an egen function to calculate geometric means to be accessible, such as that from egenmore (SSC). (Recall that medians equal geometric means for any distribution symmetric on logarithmic scale, such as the lognormal.)

                8. Separation of points for each location really isn't difficult. I first wrote separate to make this easier. Wouter is using the undocumented veryshortlabel option, which has often been mentioned on Statalist. The story is that after writing separate and the code being adopted by StataCorp, I realised that an extra option veryshortlabel was often a good idea, so I sold this to StataCorp with a suggestion that they add the option to the code without needing to hit the help files or the manuals, both of which were a bigger deal back then. But the option is written up within https://www.stata-journal.com/articl...article=gr0023

                9. I used red and blue and open circles and pluses for markers. In R circles there seems to be a perverse reluctance to use open symbols, despite the Bell Labs heritage of emphasising their use from the 1980s as better tolerating overlap and occlusion. To be sure, we have transparency now too. I have to hand Yule, G.U. 1911. An Introduction to the Theory of Statistics. London: Griffin who recommends small circles and crosses for scatter plots on p.180. (My personal copy comes originally from a medical library: its excellent, indeed almost pristine, condition suggests that it was rarely borrowed.)

                10. I put the legend outside. I was tempted to put it inside in a corner, but it didn't work quite as well, at least according to my taste. I am not dogmatic about keeping legends out of the plot region if there is space for them somewhere.
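                Point 6 can be checked directly on the example data. The sketch below assumes the data set above is in memory; the new variable names (log_total, med_cell, gap_cell and so on) are made up for illustration. It also assumes one reading of the "very small print": with an odd number of values the median is an observed value, so taking logs commutes exactly, whereas with an even number the median averages the two middle values and log(median) can differ slightly from median(log).

```stata
* Assumes the example data (location, fraction, rep, total_af_ppb) are in memory.
generate double log_total = ln(total_af_ppb)

* Odd group size: 5 values per location-fraction-rep cell,
* so each median is one of the observed values.
egen double med_cell     = median(total_af_ppb), by(location fraction rep)
egen double med_log_cell = median(log_total),    by(location fraction rep)
generate double gap_cell = med_log_cell - ln(med_cell)

* Even group size: 20 values per location,
* so each median is the mean of the two middle values.
egen double med_loc     = median(total_af_ppb), by(location)
egen double med_log_loc = median(log_total),    by(location)
generate double gap_loc = med_log_loc - ln(med_loc)

* gap_cell should be zero up to rounding; gap_loc should be zero or slightly negative.
summarize gap_cell gap_loc
```

                The sign of gap_loc follows from the mean of two logarithms never exceeding the logarithm of their mean. Either way the ordering of locations is hardly affected, which is presumably why the small print rarely bites.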

                Here are the graphs from Igor and Wouter. I used scheme s1color; they may have been using different schemes for all I know.

                Igor:
                [Figure: stripplot_Ian_IP.png]


                Wouter:
                [Figure: stripplot_Ian_WW.png]

                Last edited by Nick Cox; 14 Aug 2019, 04:41.


                • #9
                  Very, very interesting, and a very nice discussion. I'll save this topic for future consultation! I find it particularly useful to see how people would go about solving something, because that's one of the ways I actually improve how I do things. Statalist is awesome in this regard, because with the commands posted it's very easy to see the steps and re-create them in the future if needed. Thank you for your time on this, Wouter Wakker and Nick Cox.

                  Cheers


                  • #10
                    Thank you for your contributions, Nick Cox, Igor Paploski and Wouter Wakker.
