All roads lead to Rome? The connection between zmap and heatplot

Chen Samulsion

Join Date: Jan 2018
Posts: 923

14 Feb 2025, 19:30

Dear Stata users,

A simple question here: is there any connection between Nick Cox's -zmap- and Ben Jann's -heatplot- (both from SSC)? The two commands can generate plots that are very much alike (see the plots attached below). heatplot is clearly designed to create heat plots (maps). However, it seems to me that Nick's initial purpose in writing zmap was not to produce any kind of heat map.

Code:

webuse nlswork
egen mean = mean(ln_wage), by(age grade)
egen tag = tag(age grade)
label var mean "mean ln wage"
su ln_wage if !missing(age, grade), detail

zmap mean grade age if tag, breaks(.993 1.166 1.361 1.641 1.964 2.275 2.456) ms(S ..) ysc(on) xsc(on) legend(on pos(3) col(1)) yla(0/18, ang(h)) ytitle(`: var label grade') title("") xla(15(5)45) note("") mcolor(stc1) name(zmap, replace)
heatplot mean grade age, discrete levels(8) color(st) ylabel(0/18, ang(h)) xlabel(15(5)45) name(heatplot, replace)
heatplot mean grade age, discrete levels(8) color(st) ylabel(0/18, ang(h)) xlabel(15(5)45) scatter(S) name(heatplot2, replace)
graph combine zmap heatplot, name(g1)
graph combine zmap heatplot2, name(g2)

[Attached image: g1.png (zmap and heatplot combined)]

[Attached image: g2.png (zmap and heatplot2 combined)]

Nick Cox

Join Date: Mar 2014

Posts: 35711
#2

15 Feb 2025, 01:47

This needs a cross-reference to its context: https://www.statalist.org/forums/for...mmands-related

There is no connection, just some overlap of goals. I wrote zmap in 2010 and last tweaked it in 2012.

Code:

*! 1.1.1 NJC 10 December 2012
*! 1.1.0 NJC 3 December 2012
*! 1.0.0 NJC 12 March 2010

So, that's why it doesn't cite heatplot

Code:

*! version 1.1.1 24aug2021 Ben Jann

which I think was first released in 2019 or perhaps a bit earlier.

I can't think there is good reason for Ben to cite zmap. His command is much more versatile. I suspect that everything zmap does could be done by heatplot directly or with a little extra programming, but the converse is not true. It's entirely possible that he didn't know about zmap or had seen the Statalist post in 2010 and forgotten about it. None of that matters one bit.

The description of intent in the help for zmap stands for how I think of it now, but I can add a little context.

zmap graphs (or maps) binned values of a variable z with respect to two variables x and y treated as Cartesian coordinates. In geographical or cartographical terms
x defines distance east and y defines distance north. The range of z is divided into two or more bins or classes and points in each bin are shown distinctly. The
resulting plot is thus a composite scatter plot.

The main intended application is that z is a spatial series measured at numerous points or for numerous small areas with respect to planar coordinates x and y.
However, nothing ties this command to spatial data. Users may wish to use the command on other trivariate data. The marker symbols used may then be better set to
something larger than points.

By default binning is into 8 classes with 7 breaks determined by the 5 10 25 50 75 90 95% points or percentiles of the distribution of z. Alternatively, the user
may specify other percentile breaks, or a set of breaks on the scale of z. The number of classes in general is naturally one more than the number of breaks.

By default with between 1 and 8 breaks, points falling into different bins are shown with different gray scale colours, darker meaning higher values. If more than 8
breaks are specified, default colours are just those of the prevailing graph scheme. In either case users may specify their own colour choices to override defaults.

Lower limits are inclusive, so that each bin contains points >= its lower limit and < its upper limit.
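
As a rough illustration of the binning rule just described, the default breaks could be reproduced by hand along the following lines. This is a sketch under stated assumptions, not zmap's own code: the variable name bin and the scalars brk1 to brk7 are invented for the example, and ln_wage from the nlswork data stands in for z.

```stata
* Sketch only: hand-rolled version of the default binning described above.
* This is NOT zmap's source code; bin and brk1-brk7 are hypothetical names.
webuse nlswork, clear

* 7 breaks at the 5 10 25 50 75 90 95% points of z (here ln_wage)
_pctile ln_wage, percentiles(5 10 25 50 75 90 95)
forvalues j = 1/7 {
    scalar brk`j' = r(r`j')
}

* 8 classes with inclusive lower limits: after the loop, bin j+1 holds
* values >= break j and < break j+1; bin 1 holds values below break 1
generate byte bin = 1 if !missing(ln_wage)
forvalues j = 1/7 {
    replace bin = `j' + 1 if ln_wage >= brk`j' & !missing(ln_wage)
}

tabulate bin
```

The explicit !missing() condition matters because Stata treats missing values as larger than any number, so without it missing z would land in the top class.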

Remarks

If the y variable follows a row or matrix or Southern latitude convention so that it increases downwards, then use the ysc(reverse) option.

The following limitations may be noted.

1. zmap is not smart about tied values. Higher values of z for the same x and y values will just overplot lower values. If this is important, consider
averaging z for each distinct combination of x and y in some way. An example appears below.

2. zmap does not apply any special intelligence to ensure appropriate aspect ratios to maintain equal scales on both x and y axes. The presumption is that most
uses will be exploratory or that, if this is important, xsize(), ysize() or aspect() options may be used according to taste.

3. zmap can do nothing about the limitations of your monitor, or indeed any other monitor.

The personal context is that I wrote zmap for teaching for a course in which students learn about statistics as a complement to geographical information systems. I just wanted quick and dirty plots in Stata that produced crude but serviceable exploratory maps. The students can do much better cartography in any geographical information system. The usual application in my teaching is to gridded topographic data with enough data points for the scatter plot to approximate a choropleth map. We then move on to some more serious statistics.

I certainly knew about heatplots at the time. To spell it out, I am not a fan of heatplots when the axes are arbitrary categories that aren't even ordered. In a nutshell, I find most such applications unconvincing and -- even more important -- I usually think that there is a better way to show the data OR that almost any visualization of very complicated data seems unlikely to help. If you don't share this view, you're likely to regard it as prejudiced, but in general let's discuss examples! And I have logic too: I don't think people are very good at reading off patterns when encoding is from numeric values to colours, even with a very carefully chosen colour scheme, and when the rows and columns are of unordered categories, and this is a standard point in statistical visualization.

I set aside data art and all that lies in that direction when people don't seem interested in interpreting the underlying data at all.
For an example of such literature, see https://www.amazon.com/Questions-Dat...iews/103214620 and in particular my review there given. (It may be one click away.)

There is a simple chicken-and-egg problem. All kinds of graph are novel when first met -- from histograms and scatter plots onwards to any new or more complicated design. Experience is needed to discover what any kind of graph can show (well), and what it can't, and experience and experiment are needed to get (for example) a good histogram that isn't showing artefacts of bin width and origin, or to move on from a histogram when something else will work better.

Other way round, having seen several heatplots that seemed disappointingly messy, I haven't often persevered to get a strong sense of what to look for. The popularity of heatplots for *nomics data needs to be explained by people who use them often. I don't need to be especially cynical to wonder how far there is an element of routine or ritual, that supervisors, reviewers and others expect heat plots because that is what "everyone" includes in their papers. That certainly happens elsewhere: Box plots that don't even show means are often used in support of analysis of variance, which is to me puzzling if not bizarre. Dynamite, detonator and plunger plots that suppress most detail about the data are all over several literatures, despite repeated dissections of how poor they are at showing the data they supposedly show.
Nick Cox

Join Date: Mar 2014

Posts: 35711
#3

15 Feb 2025, 02:52

If the link in #2 doesn't work for you, you should be able to find those reviews on amazon.com with a search for Questions in Dataviz: A Design-Driven Process for Data Visualisation by Neil Richards.

Chen's examples in #1 are a tweaking of a token example in the help for zmap and an equivalent with heatplot. I won't try to sell the zmap example as good and in both cases I think the colour schemes (Chen's choice) could be improved. But the bigger deal is that I suggest that you can do much better. I spent about five minutes experimenting and came up with this and my point is made (much) stronger if anyone has (much) better ideas.

(I am very sympathetic to the idea that you should fit a model first, then show fit, and so on.)

Even a reduction to mean of log wage (which is a one-to-one transform of geometric mean) by age and grade is noisy because many cells have few observations. Hence two takes here, one filtering out smaller subsamples.
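
To make the parenthetical concrete: exponentiating the mean of log wage gives the geometric mean on the wage scale, so the two summaries carry the same information. A minimal sketch, assuming the same nlswork data; gmean is a name invented for this example.

```stata
* Sketch only: gmean is a hypothetical variable name.
webuse nlswork, clear
egen mean = mean(ln_wage), by(age grade)

* exp() of the mean of logs is the geometric mean on the wage scale
generate gmean = exp(mean)
label var gmean "geometric mean wage"
```

Because exp() is monotone, ranking cells by mean or by gmean gives the same order; only the scale of the axis changes.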

fabplot is from the Stata Journal. The idea? fabplot = front-and-back plot, and in turn each group is shown at the front, with all the others as background or backdrop.

Code:

webuse nlswork, clear
egen mean = mean(ln_wage), by(age grade)
egen N = count(ln_wage), by(age grade)
egen tag = tag(age grade)
label var mean "mean ln wage"
fabplot scatter mean age if tag, by(grade) frontopts(msize(medlarge) ms(O)) name(G1)
fabplot scatter mean age if tag & N > 10, by(grade) frontopts(msize(medlarge) ms(O)) name(G2)
Nick Cox

Join Date: Mar 2014

Posts: 35711
#4

15 Feb 2025, 04:11

Yet further, my comments about heat plots when rows and columns are unordered categories are largely aimed elsewhere. Age and grade are ordered categories -- so my point in a nutshell is that heatplots aren't ruled out for that reason, but nevertheless you can still do better.

I used *nomics in #2 as a wildcard to include genomics and similar fields in modern biology in which heat plots appear popular. By accident it includes economics, for example, but that is what it is.

Last edited by Nick Cox; 15 Feb 2025, 04:21.
Chen Samulsion

Join Date: Jan 2018

Posts: 923
#5

15 Feb 2025, 20:04

Dear Nick Cox, thank you so much for your reply; it is greatly appreciated. I have used your -fabplot- command in some of my work. What interests me is that you seem to prefer -fabplot- to -subsetplot-: I have observed that you use fabplot much more than subsetplot. By the way, please allow me to quote your review of "Questions in Dataviz: A Design-Driven Process for Data Visualisation"; I think it reflects a long-standing view of yours and is worth reading. I also noticed that a new book, "Graphs Everyone Should Know and How to Create Them in Stata" by Franz Buscha, has been published by Stata Press. Have you read it, and do you have any comments on it? Thank you very much!

Nick Cox: Mostly for fans of data art, much less that helps data analysis
2023.04.11

Long story short: this book has more to offer fans of data art than people with a major interest in data analysis.

Neil Richards is author of a lively blog on data visualization (for a URL, Google using the book title). He has now turned many of his recent posts into this book. It is a manifesto encouraging people to be creative and unconventional in visualization, delivered with enthusiasm and passion, with many stories and reflections from his experiences and engagement with the field. Examples -- most presented as being indeed exemplary -- show a great variety of content and form.

Data visualization naturally extends across many different worlds. Richards' own world is centred on use or awareness of Tableau as software and Twitter as a medium for following or influencing visualization fads, fashions and fame. Although readers can amuse themselves with data on say sport or music, the presumption is that their day jobs are mostly to do with working with clients, especially in business intelligence. Oddly, like several other books, this one mentions Tableau again and again without divulging any detailed tips or techniques on how it could be used to produce the visuals here, let alone similar visuals with different data. Like the Tardis, this world no doubt appears much bigger when one is inside, as is true about the community of anyone's favourite software, including mine. Outsiders can only be bemused by mentions of Tableau Zen Masters or Hall of Fame members, but every group has ways to acclaim its most accomplished practitioners. More seriously, there is not so much recognition of data visualization as a scientific or statistical activity.

The visualizations themselves should be the stars of the show. Richards understandably recycles some historic and recent classics (Florence Nightingale, C.J. Minard, W.E.B. Du Bois, Hans Rosling), but his own work is often somewhere between puzzling and bizarre. Many visualizations are impossible to read in detail, as a matter of page size, through common use of many small multiples, or because a reader cannot interact with the printed image as would be natural on a web page. Several times I could not follow precisely what was being encoded and what was supposed to be clear from the visual. Discussion of each section usually rounds off with a partly defensive, partly assertive summary that Richards works to amuse himself and to extend his skills, and anyway is interested mostly in producing data art. That's all a strongly personal stance, but a little tedious when the same attitudes are expressed many times over. Notable detail: on p.19 there is standard advice to avoid combining red and green in the same visual, but several later examples do precisely that.

I was most interested in the examples of tile maps. These are either special cases of cartograms, as produced over the last century and more, or (if you use that term more strictly) relatives of them. The idea is just that areas on maps don't have to be represented as such with detailed echoing of boundaries and coastlines. Countries and states and other areal units can be represented by tiles of constant size and shape so long as they are identifiable, sometimes through the reader's knowledge, but ideally by explanatory text. Examples from the United States work well to show both the need for such tile maps and the scope for them to be useful. Even foreigners should find it easy to work out what is meant by CA, NY and TX. It's not fatal that contiguity, what borders what else, has to be sacrificed to some extent. In this book discussion of tile maps goes further than most accounts in experimenting provocatively with different tile shapes. After the maps in personal interest come various mathematical and musical visualizations, but not all are successful, and these sections suffered from ignoring much previous work. On p.287 it seems to be implied that irrational and transcendental numbers are one and the same; as a mathematician Richards will know that is not the case. In the same vein, pedantically if you will, 'rectangles' means oblongs on pp.86 and 257 and 'tetrahedra' means trapezia on p.255.

Changing successfully from one genre to another -- here packaging blog content as a book -- often requires more work than anyone is willing to undertake. On his blog Richards is his own boss, and can make or break rules at will, but to appeal to readers a book author should apply higher standards. He has been let down by his own occasional lapses and limited peer review, copy editing and proof-reading. The index is poor: some authorities mentioned repeatedly are not included at all. Enthusiastic readers will feel some compulsion to create their own index. There is a fascinating breadth of allusion to work in many different fields, but authors' names are often spelled incorrectly and there are several other minor errors in referencing. There are many awkward or even garbled sentences and too many comma splices and sentence fragments. Those might count as unconventional style but didn't make my reading more comfortable. Words like non-conventional, non-necessary, non-serious, non-used and non-useful are again perhaps part of the author's playful and non-conventional style, but 'at pace' is a mishearing of 'apace'. A generally digressive and conversational tone, down to 'OK', 'Well', and similar tics, was for me unnecessarily difficult to follow. Other way round, minor pomposities such as 'aforementioned', 'per se', 'prior to' or 'as to whether' would have been better edited out.

More serious mistakes or puzzles:

In the United States there are "435 senators across all 50 states ... all of whom are Republican or Democrat" (p.4). 50 states is correct, but the Senate has 100 senators and some strictly are independent.

Hans Rosling was a physician, not a physicist (p.77), a personal detail that is vital to understand his concerns.

On p.138 objective and subjective are the wrong way round.

'It's probably safe to say that almost nobody in the data visualization world likes bubble charts' (p.329) Is that so? It doesn't square with numerous appreciative references to Hans Rosling's TED talk, including earlier in this book. My own experience is that while Rosling's application worked very well, most other bubble charts are just complicated messes that are hard to interpret.

Details aside, the main question raised by this book is encapsulated by a distinction on p.115 between 'sticklers for analytical best practices' and 'fans of artistic less functional projects'. The wording is perhaps loaded, but it is a good place to start. If your goal is data art, then beyond personal gratification, the first and key question is whether other people like it as art. Other way round, I can't see that most of the novel visualizations here tell you anything comprehensible about the data, let alone anything that could not have been presented more easily and effectively otherwise. And that doesn't mean that I line up with the absurd view (mocked twice here, but which I have never seen expressed for real) that everything should be a bar chart!

As with art in any sense, appreciation of the value of creativity needs to be matched in visualization by appreciation of the need for criticism, meaning judgment of what works well in any sense, and quite why. As for being unconventional, that is easy too. Even in the narrow world of statistical graphics there are conventions I think over-sold or even wrong and refuse to follow, but what matters is a reasoned argument for something different, not being different for difference's sake.
Nick Cox

Join Date: Mar 2014

Posts: 35711
#6

16 Feb 2025, 01:28

Thanks for this.

I have not yet read Franz Buscha's book. The Stata Journal has already commissioned a review. Please note that as a Stata Journal Editor I never review books published by Stata Press, either in the SJ or anywhere else.

I'd recommend anyone interested in my review of Neil Richards' book to check out other reviews on Amazon. If anyone knows of or finds a review in a journal with some academic or professional standing, I would be interested to hear of it. The reviews posted on https://www.routledge.com/Questions-.../9781032139449 are just puffs solicited by the publisher, and having been on the other side of the fence I appreciate them for what they are.

fabplot is subsetplot renamed and rewritten. It supersedes subsetplot, although the original code for the latter remains on SSC, just in case someone needs to run code that uses it.

https://journals.sagepub.com/doi/epu...6867X211025838 discusses the command name. There is a delicate trade-off. subsetplot as a name didn't and doesn't make explicit what is distinctive about the command. fabplot does, but only if you read the help and find out that it stands for front-and-back. I hope that reading the help is regarded as a reasonable expectation.

The help file njc_stuff on SSC makes explicit what I regard as good, current, superseded or obsolete about commands I've written (and in turn the public version of that is not absolutely up-to-date in other respects). The list includes only commands written up to have a help file and posted somewhere public.

Sorry if any of that was or is confusing.
Chen Samulsion

Join Date: Jan 2018

Posts: 923
#7

16 Feb 2025, 01:54

Dear Nick Cox, thank you very much. I have read the help file of -fabplot-, and I know what fab stands for. I was just curious about your preference for fabplot over subsetplot, and you have answered that. I have also installed njc_stuff and njc_best_stuff, so I know that some commands are superseded or obsolete. Actually, I checked them yesterday for -hillplot- and -mexplot-, and for some other plot commands that I rarely use. I found that some commands provide examples without attached data, -diplot- (double interval plot) for example. I think commands like this are discipline-specific, and maybe I will never use them until I get into those particular disciplines.
Nick Cox

Join Date: Mar 2014

Posts: 35711
#8

16 Feb 2025, 03:08

Some commands are better documented than others!
Chen Samulsion

Join Date: Jan 2018

Posts: 923
#9

16 Feb 2025, 03:58

Absolutely! You provided detailed descriptions in some recent commands, including the origin and history of certain plots, and they are good learning materials. Thanks.