I've stopped using box plots (2021)
The author critiques box plots for being unintuitive and prone to misinterpretation, advocating for simpler alternatives like strip plots to effectively communicate distribution insights without confusing audiences.
The article discusses the author's decision to stop using box plots in data visualization due to their perceived shortcomings in conveying insights effectively. The author argues that box plots are often unintuitive, hard to grasp, and prone to misinterpretation compared to alternative chart types. They highlight issues with the visual design of traditional box plots, such as the misleading perception of segment quantities and the complexity of understanding quartiles. The author suggests that simpler and more intuitive chart types, like strip plots, can effectively communicate distribution-based insights without requiring audiences to understand abstract concepts like quartiles. Additionally, the article points out that box plots can misrepresent distributions by making them appear bell-shaped, potentially leading to incorrect interpretations. While some experts propose modifications to improve the accuracy of representing distributions, the author emphasizes the primary concern of box plots being challenging to interpret. In conclusion, the author advocates for using more intuitive distribution chart types, like strip plots, for clearer and more accessible data visualization.
Related
The Delusion of Advanced Plastic Recycling
The plastics industry promotes pyrolysis as a solution for plastic recycling, but investigations reveal drawbacks. Pyrolysis yields little reusable plastic, relies on fossil fuels, and uses deceptive marketing practices.
Potatoes Are the Perfect Vegetable–But You're Eating Them Wrong
Potato consumption in the US has dropped by 30%, favoring frozen over fresh options. Debates arise on reclassifying potatoes as a vegetable, impacting health and nutrition. Despite being nutrient-rich, concerns persist over unhealthy associations with deep-fried products. Challenges in breeding productive varieties for climate change and disease are noted, highlighting the historical importance of potatoes.
FreeBSD Bhyve Companion Tools
The author details transitioning from VirtualBox to FreeBSD Bhyve, praising Bhyve's benefits in a FreeBSD setting. Tools like VNC connection and pause/resume scripts optimize Bhyve operations, simplifying VM management.
Start all of your commands with a comma (2009)
The article discusses creating a ~/bin/ directory in Unix to store custom commands, avoiding name collisions with system commands by prefixing custom commands with a comma. This technique ensures unique, easily accessible commands.
Getting 100% code coverage doesn't eliminate bugs
Achieving 100% code coverage doesn't ensure bug-free software. A blog post illustrates this with a critical bug missed despite full coverage, leading to a rocket explosion. It suggests alternative approaches and a 20% coverage minimum.
I feel like this is where the confusion stems from for the author and everyone else here. Box plots don't make anything bell shaped (they don't change the distribution), they assume that your data follows a bell/gaussian shape. This is correct in cases where the central limit theorem can be applied (which is almost everywhere) - but when that is not the case, the assumption is wrong and you shouldn't use a box plot anyways, because the values it shows have no real use. There are very real use cases for box plots, but people need to understand the basics of statistics before they can use them.
Violin plots and bee swarm plots are better. Jittered strip plots can be okay if you're careful to avoid saturation (or more points added in the saturated region will disappear as they can't make it any darker).
Similarly, the distribution represented by a box plot itself is often the distribution of "just one sample". When viewed as such, a distro has its own uncertainty[1] and that uncertainty is not represented in a violin plot, for example. As with every "right tool for the job" debate, people will vary based on experience with the tools, including how to simplify/explain them to others.
What I don't see is anyone saying "box plots are useful because they're the best kind of chart for [specific use case]". I can't off-hand think of any situation where I'd rather see a box plot than a strip plot or violin plot. When and why would you want to summarise the data so coarsely and visualize it so un-intuitively?
But after having read these comments, it really drives home his point that you can get a room full of lots of very smart people who all know what they're talking about, and they'll all disagree about the understanding and interpretation of box plots.
It's a little surprising, but the evidence in these threads pretty much clinches the argument for me.
They are mostly useful for comparing batches not analysing an individual batch.
The author doesn’t know what they are talking about and is telling people as if they do. If he read any of Tukey’s material he might know. But no name dropping is enough clearly…
There are generalizations and 'specific situations' that the author considers worthy of some plots, and other specific situations that the author doesn't consider worthy of other plots. At best, my takeaway is: don't use box plots if your distributions do not have a single mode and are likely to be misinterpreted. Here's a rant against violin plots by my fave physicist ranter[0] (not Sabine), so maybe never use them.
IMO a good takeaway might be to always use a plot that fairly represents the underlying distribution.
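To illustrate the single-mode caveat, here is a minimal sketch using made-up numbers (both data sets and the helper names are assumptions for illustration). It shows two samples whose five-number summaries, and therefore their basic box plots, are identical even though one has a gap exactly where the other has its cluster.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a basic box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

def count_near(values, center, radius):
    """Count the values lying within `radius` of `center`."""
    return sum(1 for v in values if abs(v - center) <= radius)

# Hypothetical data: one central cluster versus two clusters with a gap between them.
single_peak = [0, 1, 1.5, 2, 3.5, 3.8, 4, 4.2, 4.5, 6, 6.5, 7, 8]
two_peaks = [0, 1.8, 1.9, 2, 2.1, 2.2, 4, 5.8, 5.9, 6, 6.1, 6.2, 8]

summary_single = five_number_summary(single_peak)
summary_double = five_number_summary(two_peaks)
near_single = count_near(single_peak, 4, 1)
near_double = count_near(two_peaks, 4, 1)

print("summaries equal:", summary_single == summary_double)
print("points within 1 of the median:", near_single, "vs", near_double)
```

A strip plot of the same two lists would show the gap directly; the box plot cannot, because the quartiles are all it draws.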
These are ok but it’s hard to differentiate the density of points when they’re randomly offset. Try a swarm plot (seaborn) / bee swarm plot (R).
It’s the same concept but the points are strategically placed across the x axis to show the width of the distribution at each point. It generally looks much cleaner.
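As a rough sketch of that idea: the greedy layout below is a simplification assumed for illustration, not the algorithm seaborn or the R packages actually use. It pushes each point sideways only as far as needed to avoid overlapping points already placed, so the swarm is widest where the data are densest.

```python
import math

def swarm_offsets(values, diameter):
    """Assign each value a sideways offset so that no two markers overlap.

    Simplified greedy layout (an assumption of this sketch): visit values
    in sorted order and take the offset closest to the centre line that
    keeps every marker at least one diameter away from those already placed.
    """
    placed = []  # (offset, value) pairs already laid out
    for value in sorted(values):
        step = 0
        while True:
            # Candidate offsets: 0, then +/- half a diameter, +/- one diameter, ...
            shift = step * diameter / 2
            candidates = [0.0] if step == 0 else [shift, -shift]
            free = [
                x for x in candidates
                if all(math.hypot(x - px, value - pv) >= diameter
                       for px, pv in placed)
            ]
            if free:
                placed.append((free[0], value))
                break
            step += 1
    return placed

# Made-up data: a tight cluster near 5 and a lone point at 9.
points = swarm_offsets([5, 5, 5, 5.2, 9], diameter=1.0)
for offset, value in points:
    print(f"value={value:<4} offset={offset:+.1f}")
```

The lone point stays on the centre line, while the clustered points fan out to both sides.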
This is why it's good to have a really competent visual designer around. Their sole purpose is visual communication, and that very much includes dealing with the subconscious connotations and unintended messages hidden within data visualizations. Yes, you've probably encountered designers that would not be good at that, you imagine. You've also probably encountered developers that would not be good at the sort of data munging that scientists, et al do; that doesn't mean developers, generally, aren't best equipped to handle the related coding problems.
The author has found that compared to other types of plots, people struggle to learn how to interpret box plots.
The author proposes some alternatives that they believe to be easier for people to interpret:
- Strip plots (for few data points)
- Jittered strip plots (for more data points)
- Distribution heatmap (for even more data points)
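For the last of these, a distribution heatmap boils down to counting values per bin and shading each bin by its count. The sketch below is only an illustration: the salary figures, the bin edges and the text "shades" are all made up, and a real chart would use colour instead of characters.

```python
def bin_counts(values, low, high, bins):
    """Count how many values fall in each of `bins` equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` goes into the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

SHADES = " .:*#"  # lightest to darkest; stands in for a colour scale

def heatmap_row(counts):
    """Map each bin count to a shade, relative to the fullest bin."""
    peak = max(counts) or 1
    return "".join(SHADES[round(c / peak * (len(SHADES) - 1))] for c in counts)

# Hypothetical salaries in thousands of dollars.
salaries = [42, 45, 47, 48, 49, 50, 51, 52, 55, 78, 80, 81]
counts = bin_counts(salaries, low=40, high=90, bins=5)
row = heatmap_row(counts)
print(counts)
print("[" + row + "]")
```

The empty bin shows up as a blank cell, which is the kind of gap a box plot would hide.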
----
This aligns with my experience of trying to convey information to non-technical or moderately technical people; box plots are a struggle for them. To me it does seem like the proposed alternatives would be more accessible.
Sure, we could try to better educate people about box plots (as the author has done professionally), or we could consider using something that requires less effort for people to comprehend.
[0] https://davidbaranger.com/2018/03/05/showing-your-data-scatt...
As well as using line charts on the average, I've used a box plot (with the edges of the box being the mean +/- 1 standard deviation) to get an idea of whether a given change is significant or not. I.e. if the boxes are close together I will ignore a change I've made, only committing changes that provide a significant jump in performance. The box plot is a useful way of visualizing that.
They can help with seeing highly variable performance (long box) from consistent performance (narrow box).
I can see this in the data (mean, standard deviation) but having it represented visually can help -- especially looking at the data over several iterations, or when looking for patterns from changing a variable (like the number of items in the data being processed).
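A minimal sketch of that comparison, assuming hypothetical timing samples and helper names: each "box" spans mean ± 1 sample standard deviation, and a change is only treated as interesting when its box does not overlap the baseline's. This is a rough rule of thumb rather than a formal significance test.

```python
import statistics

def sd_box(samples):
    """Return (low, mean, high), with box edges at mean -/+ 1 sample standard deviation."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    return (mean - sd, mean, mean + sd)

def boxes_overlap(a, b):
    """True if the two (low, mean, high) boxes share any part of their range."""
    return a[0] <= b[2] and b[0] <= a[2]

# Hypothetical run times in milliseconds.
baseline = [102.0, 98.0, 101.0, 99.0, 100.0]
tweak = [99.0, 101.0, 97.0, 100.0, 98.0]
rewrite = [80.0, 82.0, 79.0, 81.0, 78.0]

for name, samples in (("tweak", tweak), ("rewrite", rewrite)):
    verdict = "ignore" if boxes_overlap(sd_box(baseline), sd_box(samples)) else "keep"
    print(name, verdict)
```

A long box (large standard deviation) also flags inconsistent performance before any comparison is made.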
I've also used linear regression calculations when data has looked linear or quadratic to check/confirm that assumption. You can overlay that on top of the data by computing the fitted value for each value of n alongside the actual data average, and then including both the average and the calculated values in a line chart.
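The overlay described above can be sketched as follows, assuming made-up averages per input size and an ordinary least-squares straight-line fit (the quadratic case is not shown). The fitted column is what would be drawn as the second line in the chart.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical benchmark: number of items processed vs. average time in ms.
sizes = [1000, 2000, 3000, 4000, 5000]
avg_ms = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(sizes, avg_ms)
fitted = [slope * n + intercept for n in sizes]
fit_quality = r_squared(sizes, avg_ms, slope, intercept)

for n, actual, predicted in zip(sizes, avg_ms, fitted):
    print(n, actual, round(predicted, 2))
print("R^2:", round(fit_quality, 3))
```

An R² close to 1 supports the "looks linear" assumption; a clearly lower value suggests trying another model.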
My fundamental concern with box plots is that no one has ever shown me a single scenario in which a given insight was clearer in a box plot than it would be in a simpler chart type (i.e., strip plot, distribution heatmap, or stacked histograms). If someone can show me even a hand-crafted, cherry-picked scenario with the same data shown as a (well-designed) box plot AND a strip plot, distribution heatmap and stacked histograms, and in which a potentially useful insight is clearer in the box plot than in the other chart types, I’ll happily change my opinion. I’m still waiting for someone to show me such a scenario, though.
In the meantime, I’m not sure why one would use box plots when simpler chart types are available that say the same thing about the data or, in many cases, say more about the data (show gaps, multi-modal distributions, etc.). Even if the audience is very used to reading box plots, they’ll still find strip plots, distribution heatmaps and stacked histograms to be simpler to read (and will actually see gaps, clusters, etc.)
How do I know that other distribution chart types are simpler to read than box plots? Because I’ve taught these chart types to literally thousands of people of all skill levels all over the world. Quartiles are just inherently less intuitive than bins or, in the case of strip plots, no delimiters to understand at all.
Like I said, if someone can show me a scenario like the one that I described above, though, I’ll happily change my mind…
Before people jump all over me, I should clarify what I mean by a “potentially useful insight.” For example, “showing the interquartile range” is not an “insight” in this context, it’s an “observation” because it doesn’t point to any kind of action or conclusion, in and of itself. A potentially useful insight would be something like, “The employee salaries in Company A are generally higher than those in Company B.” or “Most people make close to $80K in Company A, but the salaries are much more spread out in Company B.” Basically, an “insight” in this context is a piece of information that would point directly to some kind of action or conclusion.
https://en.wikipedia.org/wiki/Sina_plot
https://cran.r-project.org/web/packages/sinaplot/vignettes/S...
Adding on a representation of mean in a different style (like a black bar) can be helpful. So can a boxplot-style indication of variance, in some cases.
On gnu-R:
install.packages('ggplot2')
?ggplot2::geom_violin
- Bar
- Scatter
- Line
- Histogram
You can tell 90% of your stories with these plots. (If you pay attention to professional viz groups, Economist, NY Times, etc, they use these.)
Don't waste your time with other plots unless you have mastered these. When you master these, you will realize you don't need other charts.
The above is among the reasons I prefer remote meetings over in-person ones when data are to be shown: everyone attending should have a computer ready to use, and IF the data are shared and readily usable, I can tweak a plot live and reason about it while I listen, and eventually pose relevant questions or show something when my turn comes. Surely not all presentations are meant to be interactive sessions, but being able to interact even asynchronously (reading a journal article, playing with the data, and eventually dropping a mail to the author) is a nice thing. Today that is typically needlessly hard, whereas in technical terms it could be extremely simple.
That's another reason I hate presentation software and office suites compared with plain org-mode, Jupyter, R Studio, etc.: changing things is hard when it should be easy. Org-mode is excellent for presenting but not really interactive, since I have to regenerate plots to see changes or push data to external software; Jupyter is not really meant for presenting; R Studio offers nice LaTeX integration and a tabular view but no nice means to present. Still, they are all FAR better than presentation software, and even though there are some safety aspects to take into account, I would much rather receive an active document (org-mode, Jupyter notebook, etc.) than a PDF or, even worse, some office format.
Just plotting points will lead to saturation in high density areas that depends on point size and opacity.
Making bin color proportional to point density will require normalization to make the plot readable in many cases.
While I like these plots too in certain situations, I would argue they're actually less elegant than the boxplots for those reasons.
And come on, boxplots aren't that hard to explain to someone who already is used to working with percentiles.
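The saturation and normalization points above can be made concrete with a small sketch. It assumes standard alpha compositing for markers drawn exactly on top of each other and an 8-bit grey scale; the log scaling is one possible normalization, not the only one.

```python
import math

def stacked_opacity(alpha, n):
    """Opacity of n markers of opacity `alpha` drawn exactly on top of each other."""
    return 1 - (1 - alpha) ** n

def gray_level(alpha, n, levels=255):
    """Darkness on an 8-bit scale (0 = background, 255 = fully dark)."""
    return round(levels * stacked_opacity(alpha, n))

def linear_shade(count, max_count):
    """Bin shade in [0, 1], directly proportional to the count."""
    return count / max_count

def log_shade(count, max_count):
    """Bin shade in [0, 1] on a log scale (an assumed normalization choice)."""
    return math.log1p(count) / math.log1p(max_count)

# With 20% opacity, dense regions soon become indistinguishable.
for n in (1, 5, 20, 40, 80):
    print(n, "stacked points ->", gray_level(0.2, n))

# A bin with 10 points next to one with 1000 points.
print(linear_shade(10, 1000), round(log_shade(10, 1000), 2))
```

Once the grey level hits the maximum, adding more points changes nothing on screen. With linear shading, a bin holding 10 points is nearly invisible beside one holding 1000, which is why some rescaling is usually needed.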
Isn't that a bit like banning cars because some people can't drive?
Some diagrams are simply not for mass consumption and this is one, particularly because it is designed to illustrate an interpretation of ranges instead of the direct/linear representation of the raw data.
Of course I'd illustrate this fact as a Venn diagram comparing "box diagram" Vs "people" (intersection those who understand it) but I'm afraid the universal set may be mistaken as "those people who don't have eyes" rather than literally everything else.
Perhaps we should stop using that too, since it's non obvious what the universal set is.
All diagrams have some ambiguity and can be misinterpreted. Sometimes it's deliberate (e.g. a bar chart's vertical axis not starting at 0, or the scale not being linear), and that's why there's the saying "There are lies, damned lies, and statistics." That doesn't mean some diagrams are not useful, just that they're not suitable for some audiences who may misinterpret the data.