Visualizing many (continuously scaled) realisations for discrete independent variables?

Guest
#1

22 Mar 2017, 11:29

I have 2 discrete independent variables (6 different years, 10 different deltas) and a continuous dependent variable (h1). Assume I have about 10 realisations for each of the 50 combinations of independent variables. My goal is to visualize the structure of the data, possibly within one plot (not necessarily showing all data points individually). I am struggling to choose an adequate strategy, since every chart I try is either not intuitive or hides data points that are plotted on top of each other.

So far I have identified the following options to visualize the data:

1. Scatter plot, using colors to label one dimension of independent variables. The problem: Some data points are not shown, because there are too many values with exactly the same x-value, whilst there is a lot of white space in between the discrete x-values.

2. Scatter plot using "jitter". Now all data points are shown. The problem: It doesn't really look correct, because the realisations do not match their position on the x-axis (SEE CHART HERE). What do you think?

3. Scatter plot using by-groups, thereby cutting the number of realisations in one plot. The problem: Many plots are necessary.

4. Box plot: All values are on the plot (however not shown individually, which does NOT matter in my case). However, it doesn't really look intuitive, because the two dimensions of independent variables are not marked off well (SEE CHART2 HERE).

To make things worse, it seems to me that there is no way to automatically format the outliers ("outsides") in the plot (see my other post on this). Do you have any ideas how I could improve that?

Does anyone have an idea #5 how I could best visualize the data (or improve my ideas #1-#4)? Thanks in advance!
Last edited by Guest; 22 Mar 2017, 12:22.
Tags: categorical, graph, graphics, label, visualization
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

22 Mar 2017, 11:33

You've opened two related threads at almost the same time. http://www.statalist.org/forums/foru...-every-outlier

In any case, your description will make sense to you, but it's much harder work for anyone else to read it, absorb it and imagine your situation.

One good data example is worth a volume of exegesis.
Guest
#3

22 Mar 2017, 12:24

Dear Nick, thank you for your reply and your recommendation to use PNG - this way it's much easier to grasp, I agree. Thanks!

Regarding the two posts, they discuss two completely different topics: one is about the choice of plot type for a certain situation, and the other concerns the very detailed question of how to automatically format outliers in box plots. However, I appreciate your reply very much.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

23 Mar 2017, 07:21

I count 66 groups here (not 60 or 50!!!), but the overall problem is clear. That's a very fine subdivision of the data. I think you'll need to sacrifice something. Quite what is difficult to advise on. I have no idea what the deltas are, or whether it's more important to compare different deltas for each year, or vice versa. Perhaps you need to reduce your dataset to a dataset of summaries. In principle I am very interested in this area. In practice I find non-trivial advice is difficult here.
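
As a rough illustration of what reducing to a dataset of summaries could look like, here is a minimal sketch on simulated data. Nothing here comes from the original dataset: the variable names year and delta, the year values, the grid of 11 deltas, and the ramped shape of h1 are all assumptions made purely so the example runs.

```stata
* Hypothetical data: 6 years x 11 deltas x 10 realisations
* (names, values and the shape of h1 are assumed, not from the thread's data)
clear
set seed 2017
set obs 6
gen year = 2010 + _n
expand 11
bysort year: gen delta = (_n - 6) / 10
expand 10
gen h1 = 1 / (1 + exp(-10 * delta)) + rnormal(0, 0.1)

* Reduce to one row per year-delta group: median and quartiles of h1
collapse (median) med=h1 (p25) lo=h1 (p75) hi=h1, by(year delta)

* Medians joined by lines, with quartile ranges as capped spikes, one panel per year
twoway rcap lo hi delta || connected med delta, by(year, legend(off)) ///
    ytitle("h1") xtitle("delta")
```

Which summaries to keep (medians and quartiles here) is a judgment call; the point is only that 66 rows are far easier to plot legibly than several hundred.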
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

23 Mar 2017, 08:31

Here's a thought experiment. If I had that data and knew nothing extra about it, I would have a separate graph for each year, and in each smooth the response (you say "dependent variable"; why won't that term lie down and die quietly?) as a function of delta. There seems, on average, to be a ramped relationship steepest near zero.

Without the data, I can't do that, but here's an example with different data from recent playing around with rangestat (SSC). Any fans of rangestat who stumble upon this may like to know that a version of this will be added shortly to the help file.

Here there are 30-something distinct ages, not 11: the point is that seeing the wood for the trees can be possible.

Code:

webuse nlswork, clear
set scheme s1color

* ssc inst moremata needed for -mm_quantile()-
mata:
real rowvector myquantile(real vector X) {
    return(mm_quantile(X, 1, (0.1, 0.25, 0.5, 0.75, 0.9)))
}
end

rangestat (myquantile) ln_wage, interval(age -2 2)

local P 10 25 50 75 90
forval j = 1/5 {
    gettoken p P : P
    label var myquantile`j' "p`p'"
}

scatter ln_wage age, ms(oh) mc(gs8) || ///
line myquantile? age, sort legend(order(6 5 4 3 2) col(1) pos(3)) ///
ytitle("`: var label ln_wage'") yla(, ang(h)) xla(15(5)45)
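
For the thought experiment at the start of this post (a separate graph for each year, with the response smoothed as a function of delta), a minimal sketch on simulated data might look like the following. The variable names year and delta, the year values, the delta grid and the ramped shape of h1 are assumptions for illustration only, and lowess is just one possible smoother.

```stata
* Hypothetical data: 6 years x 11 deltas x 10 realisations
* (names, values and the shape of h1 are assumed, not from the thread's data)
clear
set seed 2017
set obs 6
gen year = 2010 + _n
expand 11
bysort year: gen delta = (_n - 6) / 10
expand 10
gen h1 = 1 / (1 + exp(-10 * delta)) + rnormal(0, 0.1)

* One panel per year: raw points in grey, lowess smooth of h1 on delta on top
twoway scatter h1 delta, ms(oh) mc(gs8) || ///
    lowess h1 delta, by(year, legend(off)) ///
    ytitle("h1") xtitle("delta")
```

With only 11 distinct deltas, the smooth is a guide to the overall shape within each year rather than a fitted model.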