June 23rd, 2024

I've stopped using box plots (2021)

The author critiques box plots for being unintuitive and prone to misinterpretation, advocating for simpler alternatives like strip plots to effectively communicate distribution insights without confusing audiences.

The article discusses the author's decision to stop using box plots in data visualization due to their perceived shortcomings in conveying insights effectively. The author argues that box plots are often unintuitive, hard to grasp, and prone to misinterpretation compared to alternative chart types. They highlight issues with the visual design of traditional box plots, such as the misleading perception of segment quantities and the complexity of understanding quartiles. The author suggests that simpler and more intuitive chart types, like strip plots, can effectively communicate distribution-based insights without requiring audiences to understand abstract concepts like quartiles. Additionally, the article points out that box plots can misrepresent distributions by making them appear bell-shaped, potentially leading to incorrect interpretations. While some experts propose modifications to improve the accuracy of representing distributions, the author emphasizes the primary concern of box plots being challenging to interpret. In conclusion, the author advocates for using more intuitive distribution chart types, like strip plots, for clearer and more accessible data visualization.

41 comments
By @sigmoid10 - 5 months
>box plots always make distributions look bell shaped

I feel like this is where the confusion stems from for the author and everyone else here. Box plots don't make anything bell shaped (they don't change the distribution), they assume that your data follows a bell/gaussian shape. This is correct in cases where the central limit theorem can be applied (which is almost everywhere) - but when that is not the case, the assumption is wrong and you shouldn't use a box plot anyways, because the values it shows have no real use. There are very real use cases for box plots, but people need to understand the basics of statistics before they can use them.

By @mkl - 5 months
The only advantage box plots had is that they can be drawn by hand. Now that computers are ubiquitous this is no longer valuable.

Violin plots and bee swarm plots are better. Jittered strip plots can be okay if you're careful to avoid saturation (or more points added in the saturated region will disappear as they can't make it any darker).
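
The saturation effect described above can be put into numbers. This is a minimal sketch, assuming every point is drawn in the same colour with the same opacity `alpha` and blended with standard "over" alpha compositing, so that `n` overlapping points reach a combined opacity of 1 - (1 - alpha)^n. The function names and the 0.99 "visually saturated" threshold are illustrative assumptions, not anything stated in the comment.

```python
import math

def stacked_opacity(alpha, n):
    """Combined opacity of n identical, fully overlapping points (0 < alpha < 1)."""
    return 1.0 - (1.0 - alpha) ** n

def points_to_saturate(alpha, threshold=0.99):
    """Smallest n whose combined opacity reaches the assumed threshold."""
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - alpha))

# Each extra point adds less darkness than the one before it.
for n in (1, 5, 20, 50):
    print(n, round(stacked_opacity(0.1, n), 3))
```

Under these assumptions, lowering `alpha` or shrinking the markers pushes the saturation point further out, but any fixed setting still has a density beyond which extra points stop being visible.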

By @cb321 - 5 months
People have conflicting goals. On the one hand they long to compress many numbers into one or a few summary statistics. On the other hand, the moment such lusted after summaries mislead in some way they regret the data compression. What's really going on is that people want a simplicity (often in the form of definite conclusions) which may just not exist. This is really a common malaise of the human condition.

Similarly, the distribution represented by a box plot itself is often the distribution of "just one sample". When viewed as such, a distro has its own uncertainty[1] and that uncertainty is not represented in a violin plot, for example. As with every "right tool for the job" debate, people will vary based on experience with the tools, including how to simplify/explain them to others.

[1] https://github.com/c-blake/bu/blob/main/doc/edplot.md

By @iainmerrick - 5 months
Lots of people defending box plots here -- a lot more than I expected!

What I don't see is anyone saying "box plots are useful because they're the best kind of chart for [specific use case]". I can't off-hand think of any situation where I'd rather see a box plot than a strip plot or violin plot. When and why would you want to summarise the data so coarsely and visualize it so un-intuitively?

By @Falkon1313 - 5 months
I was not entirely convinced by the article, being used to box plots myself for several decades. I've used them in school, college, and at work.

But after having read these comments, it really drives home his point that you can get a room full of lots of very smart people who all know what they're talking about, and they'll all disagree about the understanding and interpretation of box plots.

It's a little surprising, but the evidence in these threads pretty much clinches the argument for me.

By @cjk2 - 5 months
No you shouldn’t stop using box plots. You should use them for when they are appropriate - showing location and spread. And not shape! There’s absolutely no information on modality or distribution presented past quartiles and limits.
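
The point that a box plot carries no shape information beyond quartiles and limits can be illustrated with two made-up samples. This is a sketch under assumptions: the data are invented, the whiskers are taken to be the plain minimum and maximum, and quartiles use the "inclusive" method of Python's `statistics.quantiles` (other tools use slightly different quartile rules).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a basic box plot draws."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])

# Hypothetical samples: one evenly spread, one piled up around 2 and 8.
spread_out = [0, 1, 2, 3, 5, 7, 8, 9, 10]
two_clusters = [0, 2, 2, 2, 5, 8, 8, 8, 10]

print(five_number_summary(spread_out))
print(five_number_summary(two_clusters))
```

Both calls print the same summary, so the two samples would produce identical box plots even though one of them is clearly clustered.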

They are mostly useful for comparing batches not analysing an individual batch.

The author doesn’t know what they are talking about and is telling people as if they do. If he read any of Tukey’s material he might know. But no name dropping is enough clearly…

By @karmakaze - 5 months
> There are other distribution chart types that can be useful in specific situations, such as frequency polygons, violin plots, cumulative distribution plots, and bee swarm plots, but the three types that I described above are the easiest ones to grasp, and are able to communicate most of the insights that are needed for day-to-day decision-making in most organizations. (I’m not mentioning histograms here because they’re generally only useful for visualizing a single set of values, whereas box plots and their alternatives are for visualizing multiple sets of values, which is a different use case.)

There are generalizations and 'specific situations' which the author considers worthy of some plots, and other specific situations that the author doesn't consider worthy of other plots. My takeaway: at best, don't use box plots if your distributions do not have a single mode and are likely to be misinterpreted. Here's a rant against violin plots by my fave physicist ranter[0] (not Sabine), so maybe never use them.

[0] https://youtu.be/_0QMKFzW9fw?si=4VM4DT9Q1zEnV93A

By @CuriouslyC - 5 months
Box plots are a relic of a time when we couldn't print really nice charts. You can just display the distribution in line like a scrolling oscilloscope/topographic display, or you can do a density plot over time (look at gaussian processes) and overlay shaded regions for important time periods.
By @psyklic - 5 months
Box plots make distributions easier to reason about by oversimplifying them. In a similar way, the mean can be very misleading (but we likely won't forbid its use!).

IMO a good takeaway might be to always use a plot that fairly represents the underlying distribution.

By @benrapscallion - 5 months
Do it the way Nature journals now require it to be done: show the underlying data points overlaid on the box plot. Best of both worlds.
By @jncfhnb - 5 months
The author showed jittered strip plots where you plot each point correctly on the y axis and randomly offset the x axis.

These are ok but it’s hard to differentiate the density of points when they’re randomly offset. Try a swarm plot (seaborn) / bee swarm plot (R).

It’s the same concept but the points are strategically placed across the x axis to show the width of the distribution at each point. It generally looks much cleaner.
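
The difference between the two layouts can be sketched without any plotting library. The first function below is the jitter described above: keep y, randomise x. The second is a deliberately crude stand-in for a swarm layout, and is not the algorithm seaborn or R actually use: it groups points into y bins and fans them out symmetrically, so a row's width reflects how many points share that bin. All names and parameters are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

def jitter_positions(values, center=0.0, width=0.4, seed=0):
    """Jittered strip plot: exact y, uniformly random x offset around center."""
    rng = random.Random(seed)
    return [(center + rng.uniform(-width / 2, width / 2), y) for y in values]

def stacked_positions(values, center=0.0, bin_size=1.0, step=0.1):
    """Crude swarm-like layout: points in the same y bin fan out left and right."""
    seen = defaultdict(int)  # how many points each bin already holds
    points = []
    for y in values:
        k = seen[math.floor(y / bin_size)]
        seen[math.floor(y / bin_size)] += 1
        # Offsets go 0, +step, -step, +2*step, -2*step, ...
        offset = ((k + 1) // 2) * step * (1 if k % 2 else -1)
        points.append((center + offset, y))
    return points

sample = [1.0, 1.2, 1.4, 5.0]  # hypothetical data
print(jitter_positions(sample))
print(stacked_positions(sample))
```

In the second layout the three points near 1 occupy three distinct x positions while the lone point at 5 stays on the centre line, which is the "width shows density" property the comment describes.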

By @chefandy - 5 months
Just like anything else in design, the first question should be "how can I convey this most clearly to the audience I'm addressing" not "hmm, I wonder if there are any problems with the technique I chose because it's what everyone seems to use for this." Use the right tool for the job. There's even a good chance that juxtaposing these elements differently or adding another element could clear this up entirely.

This is why it's good to have a really competent visual designer around. Their sole purpose is visual communication, and that very much includes dealing with the subconscious connotations and unintended messages hidden within data visualizations. Yes, you've probably encountered designers that would not be good at that, you imagine. You've also probably encountered developers that would not be good at the sort of data munging that scientists, et al do; that doesn't mean developers, generally, aren't best equipped to handle the related coding problems.

By @These335 - 5 months
Sure there are alternatives and I agree with the author's criticisms overall. But boxplots are a staple in statistics, and if your audience can reasonably be assumed to have some level of statistical training then boxplots are perfectly reasonable in my opinion.
By @wodenokoto - 5 months
I’m a big fan of the jittered strip plot and I often add special logic to color dots at the edges of a largish gap. This is super useful if you are plotting the distribution of daily messages and just plotting dots will hide that there are days without messages.
By @montebicyclelo - 5 months
The author has experience of teaching box plots in various organisations.

The author has found that compared to other types of plots, people struggle to learn how to interpret box plots.

The author proposes some alternatives that they believe to be easier for people to interpret:

- Strip plots (for few data points)

- Jittered strip plots (for more data points)

- Distribution heatmap (for even more data points)
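
As a rough illustration of the last item, a distribution heatmap boils down to counting values per bin for each group and mapping each count (or share) to a colour intensity. The sketch below covers only the counting step; the group names, salary figures (in thousands) and bin edges are made up, and normalising each row to shares is one possible choice rather than something the article prescribes.

```python
from bisect import bisect_right

def bin_counts(values, edges):
    """Count values per bin; bins are [e0, e1), [e1, e2), ..., last bin closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            i = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    return counts

def heatmap_rows(groups, edges):
    """One row of per-bin shares (0..1) per group, ready to map to colour."""
    rows = {}
    for name, values in groups.items():
        counts = bin_counts(values, edges)
        total = sum(counts)
        rows[name] = [c / total if total else 0.0 for c in counts]
    return rows

# Hypothetical salaries in thousands for two companies.
groups = {
    "Company A": [78, 79, 80, 81, 82, 95],
    "Company B": [40, 55, 70, 85, 100, 115],
}
edges = [40, 60, 80, 100, 120]
print(heatmap_rows(groups, edges))
```

With these invented numbers, Company A's shares concentrate in the two middle bins while Company B's are spread across all four, which is the kind of difference a reader would see as dark versus evenly shaded cells.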

----

This aligns with my experience of trying to convey information to non-technical or moderately technical people; box plots are a struggle for them. To me it does seem like the proposed alternatives would be more accessible.

Sure, we could try to better educate people about box plots, (as the author has done professionally); or we could consider using something that requires less effort for people to comprehend.

By @riedel - 5 months
Actually you may nicely integrate box, violin, bee/scatter plots [0]. For simple visual ANOVA testing box plots are great. On the other hand violin plots are great to quickly check distribution assumptions for testing and together with scatter plots give you a good impression of the sample.

[0] https://davidbaranger.com/2018/03/05/showing-your-data-scatt...

By @rhdunn - 5 months
When profiling slow queries/code I often collect the elapsed time of a test where I take 5-10 runs and calculate the mean/average, standard deviation, min, and max.

As well as using line charts on the average, I've used a box plot (with the edges of the box being the mean +/- 1 standard deviation) to get an idea of whether a given change is significant or not. I.e. if the boxes are close together I will ignore a change I've made, only committing changes that provide a significant jump in performance. The box plot is a useful way of visualizing that.

They can help with seeing highly variable performance (long box) from consistent performance (narrow box).

I can see this in the data (mean, standard deviation) but having it represented visually can help -- especially looking at the data over several iterations, or when looking for patterns from changing a variable (like the number of items in the data being processed).
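
A minimal sketch of the summary described above, assuming timings in milliseconds and the sample standard deviation (the comment does not say which variant is used). The "boxes overlap" check is only the visual heuristic from the comment turned into code, not a formal significance test, and the function names and timings are made up.

```python
from statistics import mean, stdev

def summarize(runs):
    """Mean, sample standard deviation, min, max and the mean +/- 1 std box."""
    m = mean(runs)
    s = stdev(runs)
    return {"mean": m, "std": s, "min": min(runs), "max": max(runs),
            "box": (m - s, m + s)}

def boxes_overlap(runs_a, runs_b):
    """True if the two mean +/- 1 std boxes share any range."""
    lo_a, hi_a = summarize(runs_a)["box"]
    lo_b, hi_b = summarize(runs_b)["box"]
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical timings (ms) before and after two different changes.
before = [102, 98, 101, 99, 100]
minor_change = [99, 101, 100, 98, 100]
major_change = [80, 82, 79, 81, 80]

print(summarize(before))
print(boxes_overlap(before, minor_change), boxes_overlap(before, major_change))
```

With these numbers the first change leaves the boxes overlapping and would be ignored under the commenter's rule, while the second separates them clearly.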

I've also used linear regression calculations when data has looked linear or quadratic to check/confirm that assumption. -- You can overlay that on top of the data by computing the values for each value of n along side the actual data average and then including the average and calculated values in a line chart.
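
The overlay described above can be sketched with an ordinary least-squares fit written out by hand. The input sizes and average timings below are invented, and using R² as the check on the "looks linear" assumption is one reasonable choice rather than necessarily what the commenter does.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot

# Hypothetical average elapsed time for each input size n.
sizes = [100, 200, 300, 400, 500]
averages = [12.1, 21.8, 32.5, 41.9, 52.2]

slope, intercept = fit_line(sizes, averages)
fitted = [slope * n + intercept for n in sizes]  # second series for the line chart
print(list(zip(sizes, averages, fitted)), r_squared(sizes, averages))
```

For data that looks quadratic, the same fit can be run against `n ** 2` instead of `n`, and the two R² values compared.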

By @nickdesb - 5 months
As the author of the original Nightingale article that kicked off this (wild) thread, maybe I can clarify a few things:

My fundamental concern with box plots is that no one has ever shown me a single scenario in which a given insight was clearer in a box plot than it would be in a simpler chart type (i.e., strip plot, distribution heatmap, or stacked histograms). If someone can show me even a hand-crafted, cherry-picked scenario with the same data shown as a (well-designed) box plot AND a strip plot, distribution heatmap and stacked histograms, and in which a potentially useful insight is clearer in the box plot than in the other chart types, I’ll happily change my opinion. I’m still waiting for someone to show me such a scenario, though.

In the meantime, I’m not sure why one would use box plots when simpler chart types are available that say the same thing about the data or, in many cases, say more about the data (show gaps, multi-modal distributions, etc.). Even if the audience is very used to reading box plots, they’ll still find strip plots, distribution heatmaps and stacked histograms to be simpler to read (and will actually see gaps, clusters, etc.)

How do I know that other distribution chart types are simpler to read than box plots? Because I’ve taught these chart types to literally thousands of people of all skill levels all over the world. Quartiles are just inherently less intuitive than bins or, in the case of strip plots, no delimiters to understand at all.

Like I said, if someone can show me a scenario like the one that I described above, though, I’ll happily change my mind…

Before people jump all over me, I should clarify what I mean by a “potentially useful insight.” For example, “showing the interquartile range” is not an “insight” in this context, it’s an “observation” because it doesn’t point to any kind of action or conclusion, in and of itself. A potentially useful insight would be something like, “The employee salaries in Company A are generally higher than those in Company B.” or “Most people make close to $80K in Company A, but the salaries are much more spread out in Company B.” Basically, an “insight” in this context is a piece of information that would point directly to some kind of action or conclusion.

By @michaelhoffman - 5 months
Wherever possible, I use sina plots, which provide many of the advantages of violin plots while actually showing the individual data points.

https://en.wikipedia.org/wiki/Sina_plot

https://cran.r-project.org/web/packages/sinaplot/vignettes/S...

Adding on a representation of mean in a different style (like a black bar) can be helpful. So can a boxplot-style indication of variance, in some cases.

By @zaptheimpaler - 5 months
I always find new types of plots very interesting. Is there a nice resource showing all the common types of plots, when to use them, alternatives, code etc?
By @pvaldes - 5 months
That problem was solved a long time ago. When a box plot is not enough, just use violin plots.

On gnu-R:

install.packages('ggplot2')

?ggplot2::geom_violin

By @__mharrison__ - 5 months
I've resorted to just teaching four plot types when I teach visualization.

- Bar

- Scatter

- Line

- Histogram

You can tell 90% of your stories with these plots. (If you pay attention to professional viz groups, Economist, NY Times, etc, they use these.)

Don't waste your time with other plots unless you have mastered these. When you master these, you will realize you don't need other charts.

By @kkfx - 5 months
Honestly? I do not care much about charts in general, while I do care a lot about the availability of the data used to produce a chart... In way too many cases I see plots and no data; sometimes the data are there but not easy to use. Another thing I care about is the ability to tweak a graph.

The above are among the reasons I prefer remote meetings to in-person ones when data are to be shown: anyone attending should have a computer ready to use, and IF the data are shared and readily usable I can live-tweak a plot and reason about it while I listen, and eventually pose relevant questions, showing something of my own in turn. Surely not all presentations are meant to be interactive sessions, but being able to interact even asynchronously (reading a journal article, playing with the data and eventually dropping a mail to the author) is a nice thing; it is typically needlessly hard today, whereas in technical terms it could be extremely simple.

That's another reason I have presentation software/office automation one instead of plain org-mode, Jupyter, R Studio etc because change things it's hard while it should be easy. Org-mode is excellent to present but not really interactive, I have to regenerate plots to see changes or push data to external software, Jupyter is not really meant to present, R Studio offer nice LaTeX integration and tabular view but do not offer nice means to present, though they are still FAR better then presentation software and even if have some safety aspects to be taken into account I prefer countless of time receiving an active document (org-mode, jupyter notebook etc) instead of a pdf or even worse some office formats.

By @klysm - 5 months
I think there is an aversion to just showing the damn distribution as a histogram or KDE. I hear arguments from product owners that it’s “too complex” etc.
By @moi2388 - 5 months
I’m probably wrong, but this entire article felt like an advertisement for violin plots without them being mentioned once.
By @Kalanos - 5 months
Plotly has an option on box plots that shows the individual points as well, which I like better than violins
By @y42 - 5 months
In short and unsurprisingly: Not every analysis and data set works with every visualisation.
By @singingfish - 5 months
And no mention of notched box plots which make a lot of the troublesome aspects go away?
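
For readers unfamiliar with the notch, the commonly cited rule is median ± 1.57 × IQR / √n, with non-overlapping notches read as rough evidence that two medians differ. A minimal sketch, assuming that formula and the "inclusive" quartile method of Python's `statistics.quantiles`; plotting libraries may compute quartiles slightly differently, and the function names and sample data are made up.

```python
import math
from statistics import quantiles

def notch_interval(values):
    """Notch limits around the median: median +/- 1.57 * IQR / sqrt(n)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return (median - half_width, median + half_width)

def notches_overlap(a, b):
    """True if the two notch intervals share any range."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

batch_a = [0, 1, 2, 3, 5, 7, 8, 9, 10]   # hypothetical batch
batch_b = [x + 20 for x in batch_a]      # same spread, shifted median
print(notch_interval(batch_a), notches_overlap(batch_a, batch_b))
```

The notch addresses the uncertainty of the median; it does not by itself reveal gaps or multiple modes.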
By @emilk - 5 months
Importantly, box plots are also ugly. Beauty matters.
By @ekianjo - 5 months
just use boxplots with an overlay of the actual data and any confusion goes away
By @svara - 5 months
The alternatives he proposes have their problems too.

Just plotting points will lead to saturation in high density areas that depends on point size and opacity.

Making bin color proportional to point density will require normalization to make the plot readable in many cases.

While I like these plots too in certain situations, I would argue they're actually less elegant than the boxplots for those reasons.

And come on, boxplots aren't that hard to explain to someone who already is used to working with percentiles.

By @flusteredBias - 5 months
ECDF plots are what I use.
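
For anyone who has not met them, an empirical CDF plots each sorted value against the fraction of observations at or below it, so no binning or quartile choice is involved. A minimal stdlib sketch with made-up data; the function names are illustrative.

```python
def ecdf(values):
    """Return sorted values and, for each, the fraction of points at or below it."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def fraction_at_or_below(values, threshold):
    """Read a single point off the ECDF: share of values <= threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)

sample = [3, 1, 2, 10, 2]  # hypothetical data
xs, ps = ecdf(sample)
print(list(zip(xs, ps)))
print(fraction_at_or_below(sample, 2))
```

Plotting `xs` against `ps` as a step line gives the ECDF; several groups can share one axis, which covers the multi-group comparison that box plots are typically used for.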
By @inSenCite - 5 months
been in love with violin plots
By @greentxt - 5 months
Just use a heat map instead. /s
By @bdjsiqoocwk - 5 months
The author just has a bad intuition. On the first picture he says "this looks like a small quantity". No, you can't say that. All you can say is that half the data points are in the shaded part. You don't know where the rest are.
By @SillyUsername - 5 months
So the diagram should not be used because of an education problem with some audiences?

Isn't that a bit like banning cars because some people can't drive?

Some diagrams are simply not for mass consumption and this is one, particularly because it is designed to illustrate an interpretation of ranges instead of the direct/linear representation of the raw data.

Of course I'd illustrate this fact as a Venn diagram comparing "box diagram" Vs "people" (intersection those who understand it) but I'm afraid the universal set may be mistaken as "those people who don't have eyes" rather than literally everything else.

Perhaps we should stop using that too, since it's non obvious what the universal set is.

All diagrams have some ambiguity and can be misinterpreted; sometimes it's deliberate (e.g. a bar chart's vertical axis not starting at 0, or a scale not being linear), and that's why there's the saying "There are lies, damned lies, and statistics." That doesn't mean some diagrams are not useful, just that they're not suitable for some audiences who may misinterpret the data.