  • Scatterplots with weighted marker size revisited

    Hello everybody,

    this is not strictly a technical question, but more one about how to find an appropriate visualization for multidimensional data.

    I found that one way to approach this in Stata is to use weights in scatterplots to adjust marker size.
    However, the result looked rather odd, and the actual marker sizes did not seem to be a proportional representation of the underlying weights.
    Apparently the algorithm behind it uses some kind of smoothing so that marker sizes do not get out of control in the presence of outliers.

    This is what the manual suggests. In some cases this may be misleading, however. Now, Nick Cox also brought up this point in this older post:

    He also mentioned there are better ways to display trivariate data. But I couldn't really come up with a better idea for myself.
    So, I thought maybe the statalisters would have suggestions how to approach such a graphics problem?

    Maybe it's easier to reason about this using an example, so here is the one from the manual:

    sysuse census, clear
    generate drate = divorce / pop18p
    label var drate "Divorce rate"
    scatter drate medage [w=pop18p] if state!="Nevada", msymbol(Oh) ///
            note("Stata data excluding Nevada" ///
            "Area of symbol proportional to state's population aged 18+")
    [Image: Clipboard02.jpg]


    Last edited by Boris Ivanov; 25 Feb 2020, 09:12. Reason: typo

  • #2
    Bubble charts worked for Hans Rosling in a justly famous TED talk. That's partly because of the examples he used. In most other cases, to me they look a useless mess.

    A canonical example is country populations, which vary by a factor of about 1 billion. Do you really want the biggest circle to be 1 billion times the area of the smallest?
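
To put a number on that: if marker area is proportional to weight, marker diameter scales with the square root of the weight. Taking the factor of 1 billion quoted above at face value, the largest circle would be roughly 31,600 times as wide as the smallest, whereas a weight range of a factor of 10 implies a diameter ratio of only about 3.2. A minimal check, assuming nothing beyond Stata's built-in sqrt():

```stata
* Diameter ratio implied by a given area (weight) ratio,
* assuming area is strictly proportional to weight
display sqrt(1e9)    // weight ratio of 1 billion
display sqrt(10)     // weight ratio of 10
```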

    If the question is what else did I have in mind in 2008, goodness knows, except I think scatter plot matrices or dot or bar charts in parallel.

    Now other answers are possible. Here's one. Use different colour intensities for population on an approximately logarithmic scale. Some experimentation not shown here indicated that log base 2 of population rounded down gives 7 classes, and I don't want more. I didn't cheat by omitting Nevada, but I did cheat by using a logarithmic scale for divorce rate too.

    Suppressing the legend is deliberate. If someone likes this, the story is just stronger marker colours mean larger populations on a stepped logarithmic scale. I identify only states on the convex hull on these scales.

    sysuse census, clear
    generate drate = divorce / pop18p
    label var drate "Divorce rate"
    gen toshow = floor(log(pop18p)/ log(2))
    separate drate , by(toshow)
    gen tolabel = state2 if inlist(state2, "PA", "ND", "NV", "FL", "UT", "AK")
    local mlabel = 7 * "tolabel " 
    set scheme s1color 
    scatter drate?? medage, mfc(blue*0.03 blue*0.06 blue*0.12 blue*0.25 blue*0.5 blue blue*2) ///
    mlc(blue ..) mla(`mlabel') legend(off) ytitle(Divorce rate) mlabc(blue ..) ysc(log) xla(24/35, format(%2.0f))
    [Image: notabubble.png]


    • #3
      First of all, let me thank you for bringing up the TED talk of Hans Rosling which I wasn't aware of, so far.
      Ok, I see that bubble charts are not very useful when there are many data points and the range of the weight is very large.
      In this case it's also understandable that Stata somehow adjusts the weights, but this still affects the interpretation in a way that is not transparent.

      I guess bubble plots could still be appropriate when the range of the weights is limited (say the factor is not a billion but 10 or so), so that keeping the original proportions is feasible, and when there aren't too many data points, so that the plot does not become a mess (say, averages of a few groups at a few discrete points in time, weighted by the number of observations in each group).

      But then, do you think there would be any way to make Stata keep the original, proportional weights?

      Thank you also for the idea with the colour shading - I will try that!


      • #4
        I don't know any way of controlling how graph twoway ... , ms() uses weights. You would need to go deep, deep into the code to find out what it does and then code around it.

        But you can add your circles wherever you like to a scatter plot with repeated calls to twoway function, by drawing top and bottom halves of a circle separately. That is documented at but I learned of the trick from Vince Wiggins.
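
A minimal sketch of that idea, not the documented example referred to above: each circle is drawn as two twoway function plots, one for the upper half and one for the lower half. The data, the variable names (cx, cy, w, r) and the scaling constant 0.3 are made up for illustration. Radius is set proportional to the square root of the weight so that area is proportional to weight, and the circles only look circular because both axes use the same range with aspectratio(1).

```stata
* Sketch only: circles with area proportional to a weight, drawn with
* twoway function. All data and names here are hypothetical.
clear
input cx cy w
1 1 1
3 2 4
5 1 9
end

* Area proportional to w means radius proportional to sqrt(w);
* 0.3 is an arbitrary scaling constant in axis units
generate r = 0.3 * sqrt(w)

* Build one pair of half-circle plots per observation
local circles
forvalues i = 1/`=_N' {
    local a  = cx[`i']
    local b  = cy[`i']
    local r  = r[`i']
    local lo = `a' - `r'
    local hi = `a' + `r'
    local circles `circles' ///
        (function y = `b' + sqrt(max(0, `r'^2 - (x - `a')^2)), range(`lo' `hi') lcolor(blue)) ///
        (function y = `b' - sqrt(max(0, `r'^2 - (x - `a')^2)), range(`lo' `hi') lcolor(blue))
}

* Invisible scatter sets up the axes; equal ranges plus aspectratio(1)
* keep the circles from being stretched into ellipses
twoway (scatter cy cx, msymbol(none)) `circles', ///
    legend(off) aspectratio(1) ///
    xscale(range(0 6)) yscale(range(0 6)) ///
    xlabel(0(1)6) ylabel(0(1)6)
```

When the two axes are in different units, as in the divorce rate example, the radius would have to be rescaled separately in each direction, which this sketch does not attempt.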


        • #5
          Thanks again! Alright, then I guess playing with twoway function is more straightforward than such a surgical intervention.


          • #6
            The sizes on offer are displayed in the following graphs. No matter how many points you have, there is no adjustment for a fixed number of markers. It is easy to imagine why this would be the case as things would quickly get out of control if you require size to be proportional to weight.

            clear
            set obs 100
            * x, y and w all hold the observation number
            foreach letter in x y w {
                    gen `letter' = _n
            }
            tw scatter x y if _n<6 [aw=w], msymbol(Oh) saving(one, replace)
            tw scatter x y if _n<11 [aw=w], msymbol(Oh) saving(two, replace)
            tw scatter x y if _n<16 [aw=w], msymbol(Oh) saving(three, replace)
            gr combine one.gph two.gph three.gph, col(1)
            [Image: Graph.png]


            • #7
              Thank you for the illustrative example. I definitely see that point, so I guess adjusting the size proportionally to the weights really only makes sense when the range of the weight is limited.


              • #8
                Yes. You could group units within a certain range and if they are few, get away with size being proportional to weight. Notice that if the weight consists of consecutive integers, the second marker has twice the diameter of the first, the fourth has twice the diameter of the second, the eighth twice the diameter of the fourth and so on. How many potential groups do you have?


                • #9
                  "Notice that if the weight consists of consecutive integers..." Ok, that's how stata adjusts the diameters. Good to know what's going on.
                  In my case, there is a treatment and a control group and each is split into two subgroups. So, that makes 4 groups. There are also three points in time.
                  I would like to plot the subgroup means at each point in time and illustrate their relative size by the marker size. Since these do not differ much across subgroups (by no more than a factor of 10), marker size could just as well be proportional to the weights.
                  I will try the workaround suggested by Nick Cox and see how it goes.


                  • #10
           is a relevant reference from 2005. It doesn't add much to what is evident so far, but it may be helpful.


                    • #11
                      Here is an illustrative example based on data from the Penn World Table v7.1. I graph trade openness (imports and exports as a share of GDP) against government size (government expenditure as a share of GDP). There is some economic theory underlying this relationship, see for example, I have chosen countries in two regions of similar size (defined in terms of population). Several concepts are illustrated here, including creating weight groups and comparing markers between groups in a scatter plot with weighted markers. A tip on the latter is forthcoming in the Stata Journal.

                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input str3 isocode int year float(POP GS OPEN region)
                      "ZAR" 2010  69851.29 13.759258 145.65611 1
                      "EGY" 2010  80471.87  8.503781  47.48052 1
                      "ETH" 2010  88013.49  7.837685  47.75925 1
                      "FRA" 2010  64768.39  7.410363  53.27246 2
                      "GER" 2010  81644.45  5.933906  88.18435 2
                      "KEN" 2010  40046.57  5.995493  63.22924 1
                      "NLD" 2010 16783.092  10.34205 148.63039 2
                      "ESP" 2010  46505.96  8.171445  54.67729 2
                      "GBR" 2010  62348.45  7.886595  62.55183 2
                      end
                      label values region region
                      label def region 1 "Africa", modify
                      label def region 2 "Europe", modify
                      *The Smallest country is the Netherlands, we assign a weight of 1.
                      *Others, a proportional factor of this weight.
                      gen weight= floor(POP/10000)
                      *Some weights are missing,
                      tab isocode weight
                      *Make sure we have consecutive weights from smallest to largest.
                      *Create extra isocode (named "unknown" below) and assign to 1 region
                      sum weight
                      set obs `=_N+`r(max)''
                      replace isocode= "UNKNOWN" if missing(isocode)
                      replace region=2 if missing(region)
                      replace weight=sum(1) if missing(weight)
                      *Ensure weights in one region are also represented in others
                      *Necessary to achieve between group comparisons
                      fillin weight region
                      tw (scatter OPEN GS [aw=weight], by(region, leg(off) ///
                      note("Marker sizes represent size of population")) ///
                      msymbol(Oh)) (scatter OPEN GS, by(region) mcolor(none) ///
                       mlabel(isocode)), ytitle("Trade Openness") ///
                      xtitle("Government Size") scheme(s1color)
                      [Image: Graph.png]

                      In the graph, the Netherlands is the smallest country and is about 1/8 the size of Germany, Ethiopia and Egypt. Kenya is about the size of Spain and the UK and France about the size of South Africa. I have not tidied up on the labels, but see #4 of the following for some technique.

                      Last edited by Andrew Musau; 06 Mar 2020, 10:22.


                      • #12
                        Thanks again for the illustrative example!


                        • #13
                          With marker labels the default position is 3, meaning 3 o'clock on a clock face. Changing that to 0 will put marker labels in the same position as the marker's centre. This is worth a try sometimes whenever, as in #11, the markers are hollow.
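
As a minimal sketch of that option, here applied to the census data from #1 rather than to the graph in #11: mlabposition(0) centres each label inside its hollow marker. The marker and label sizes (vhuge, vsmall) are arbitrary choices for illustration.

```stata
* Sketch: hollow markers with the state abbreviation centred inside
sysuse census, clear
generate drate = divorce / pop18p
label var drate "Divorce rate"
scatter drate medage, msymbol(Oh) msize(vhuge) ///
    mlabel(state2) mlabposition(0) mlabsize(vsmall)
```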

                          My position on bubbles remains somewhere between agnostic and unbeliever.