August 7th, 2024

A/B testing mistakes I learned the hard way

Lior Neu-ner highlights key A/B testing mistakes, emphasizing the need for a clear hypothesis, proper result segmentation, careful user selection, and monitoring counter metrics to avoid misleading conclusions.


A/B testing can be a powerful tool for product development, but it is fraught with potential pitfalls. Lior Neu-ner shares key mistakes he has encountered while conducting A/B tests, emphasizing the importance of a clear hypothesis, proper segmentation of results, and careful user selection. A well-defined hypothesis should articulate the purpose of the test and expected outcomes, while aggregating results can obscure significant insights, such as performance differences across devices. Including users who are not affected by the test can skew results, leading to inaccurate conclusions. Additionally, premature analysis of test results can mislead decision-making, as early statistical significance may not hold. Neu-ner advises against rushing into experiments without preliminary testing, as this can result in biased outcomes if issues arise. Finally, he highlights the necessity of monitoring counter metrics to identify any unintended negative effects of changes made during testing. By adhering to these guidelines, engineers can enhance the effectiveness of their A/B testing processes.

- A clear hypothesis is crucial for effective A/B testing.

- Results should be segmented by relevant user properties to avoid misleading conclusions (a short sketch of per-segment analysis follows this list).

- Exclude unaffected users from experiments to ensure accurate data.

- Avoid making decisions based on incomplete data by adhering to predetermined test durations.

- Monitor counter metrics to detect any negative impacts from changes made during testing.
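
As a rough illustration of the segmentation point above, here is a minimal sketch that compares test and control conversion both in aggregate and per device; the data, column names, and numbers are hypothetical, not from the article.

```python
# Minimal sketch of per-segment analysis (hypothetical data and column names).
# Compares test vs. control conversion overall and within each device segment
# using a two-proportion z-test.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

data = pd.DataFrame({
    "variant":     ["control", "test", "control", "test"],
    "device":      ["mobile", "mobile", "desktop", "desktop"],
    "conversions": [420, 510, 380, 340],
    "visitors":    [5000, 5000, 4000, 4000],
})

def compare(df, label):
    """Two-proportion z-test between test and control for one slice of the data."""
    test = df[df["variant"] == "test"]
    control = df[df["variant"] == "control"]
    counts = [test["conversions"].sum(), control["conversions"].sum()]
    nobs = [test["visitors"].sum(), control["visitors"].sum()]
    _, p_value = proportions_ztest(counts, nobs)
    print(f"{label:>8}: test {counts[0] / nobs[0]:.1%} vs control {counts[1] / nobs[1]:.1%}, p={p_value:.3f}")

compare(data, "overall")                  # the aggregated view
for device, segment in data.groupby("device"):
    compare(segment, device)              # the per-segment view
```

With these made-up numbers the test variant wins on mobile but loses on desktop, a difference the aggregated comparison alone would hide.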

8 comments
By @light_hue_1 - 9 months
That's not Simpson's paradox!

> In fact, while the new flow worked great on mobile, conversion was lower on desktop – an insight we missed when we combined these metrics.

> This phenomenon is known as Simpson's paradox – i.e. when experiments show one outcome when analyzed at an aggregated level, but a different one when analyzed by subgroups.

There's nothing strange about finding out that some groups benefit and others lose out when dividing up your data. You're looking at an average, and some parts are positive while others are negative. Where's the paradox there?

Simpson's paradox is when, in aggregate, more button presses lead to more purchases, but then you look at desktop vs mobile and find that for both desktop and mobile more clicks don't mean more purchases (or worse, more clicks mean fewer purchases).

That's why it's a paradox. The association between two variables exists at the aggregate level but doesn't exist or is backwards when you split up the population. It's not a statement about the average performance of something.
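
To put numbers on that reversal, here is a small sketch with entirely made-up figures in which the test variant wins in aggregate yet loses in both device subgroups:

```python
# Hypothetical numbers illustrating Simpson's paradox: the aggregate comparison
# reverses when the same data is split by device. The traffic mix drives the flip:
# the test variant got most of its traffic on desktop, where baseline conversion
# is much higher.
segments = {
    #            (conversions, visitors)
    "desktop": {"control": (64, 200),   "test": (540, 1800)},
    "mobile":  {"control": (108, 1800), "test": (10, 200)},
}

def rate(conversions, visitors):
    return conversions / visitors

totals = {"control": [0, 0], "test": [0, 0]}
for device, arms in segments.items():
    for arm, (conversions, visitors) in arms.items():
        totals[arm][0] += conversions
        totals[arm][1] += visitors
    print(f"{device:>7}: control {rate(*arms['control']):.1%} vs test {rate(*arms['test']):.1%}")

print(f"overall: control {rate(*totals['control']):.1%} vs test {rate(*totals['test']):.1%}")
# Test loses on desktop (30.0% vs 32.0%) and on mobile (5.0% vs 6.0%),
# yet "wins" overall (27.5% vs 8.6%) -- the reversal that defines the paradox.
```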

I would add a 7th A/B testing mistake to that list: not learning about basic probability, statistical tests, power, etc. Flying by the seat of your pants when statistics are involved always ends badly.
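
On the power point, a back-of-the-envelope sample-size check before launching a test is cheap. Here is a minimal sketch using the standard two-proportion approximation; the baseline rate, detectable lift, alpha, and power are hypothetical:

```python
# Rough per-arm sample size for detecting a lift in conversion rate.
# Baseline 10%, minimum detectable lift to 11%, alpha=0.05 (two-sided), power=0.8.
# All of these inputs are assumptions chosen for illustration.
from scipy.stats import norm

p1, p2 = 0.10, 0.11
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_arm = ((z_alpha + z_beta) ** 2 *
             (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
print(f"~{n_per_arm:.0f} users per arm")  # on the order of 15,000 per arm
```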

By @a2128 - 9 months
I feel like I've too often seen products ship new (anti)features that are way too easy to accidentally click. Whenever I do accidentally click one, I imagine it's incrementing some statistics counter that ultimately shows the product managers super high engagement, clearly meaning users must love it since they're using it all the time.
By @clarle - 9 months
#2 is a slippery slope if you don't do it properly.

You might end up looking at lots of different slices of your data and come to the conclusion, "Oh, it looks like France shows a statistically significant negative effect from our new signup flow changes."

It's important to have a hypothesis for the given slice before you start the experiment, and not just hunt for outliers after the fact; otherwise you're just p-hacking [1].

[1]: https://en.wikipedia.org/wiki/p-hacking
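
When a set of slices is pre-registered, a multiple-comparisons correction helps keep the "France looks significant" trap in check. A minimal sketch using a Holm correction; the slice names and p-values are made up:

```python
# Hypothetical per-slice p-values from a pre-registered set of segments.
# A Holm correction controls the family-wise error rate across the slices.
from statsmodels.stats.multitest import multipletests

slices = ["US", "UK", "France", "Germany", "mobile", "desktop"]
p_values = [0.30, 0.64, 0.04, 0.52, 0.01, 0.47]  # made-up numbers

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for name, p_raw, p_adj, significant in zip(slices, p_values, p_adjusted, reject):
    print(f"{name:>8}: raw p={p_raw:.2f}, adjusted p={p_adj:.2f}, significant={significant}")
# France's raw p=0.04 no longer clears 0.05 after adjustment.
```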

By @sokoloff - 9 months
I recall getting into a heated debate with an analyst at my company over the topic of "peeking" (he was right; I was wrong, but it took me several days to finally understand what he was saying).

The temptation to "peek" and keep on peeking until the test confesses to the thing you want it to say is very high.
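
The cost of peeking is easy to see in a simulation of A/A tests (no real difference between arms), checking significance after every batch and stopping at the first "significant" result; the batch sizes and counts here are arbitrary:

```python
# Simulate A/A tests (both arms identical) and count how often repeated peeking
# produces a "significant" result at some point. Without peeking the false
# positive rate should be ~5%; with peeking it is much higher.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_batches, batch_size = 1000, 20, 100
false_positives = 0

for _ in range(n_experiments):
    a, b = np.empty(0), np.empty(0)
    for _ in range(n_batches):
        a = np.concatenate([a, rng.normal(size=batch_size)])
        b = np.concatenate([b, rng.normal(size=batch_size)])
        _, p_value = stats.ttest_ind(a, b)
        if p_value < 0.05:        # peek: stop as soon as it looks significant
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_experiments:.1%}")
# Typically well above the nominal 5% with this many peeks.
```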

By @iamcreasy - 9 months
The article says that 'Changing the color of the "Proceed to checkout" button will increase purchases' is a bad hypothesis because it is underspecified.

But what else is there to measure, other than checkout button clicks (and follow-up purchases), to gauge the effect of a button color change?

Or perhaps this is not a robust example to illustrate underspecification?

By @ImageXav - 9 months
Here's another one that I feel is often overlooked by traditional A/B testers: if you have multiple changes, don't simply test them independently. Learn about fractional factorial experiments and interactions, and design your experiment accordingly. You'll get a much more relevant result.
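
For a sense of what that looks like in practice, here is a minimal sketch of a 2x2 factorial analysis with an interaction term, using simulated data; the feature names, effect sizes, and model choice are all hypothetical:

```python
# Hypothetical 2x2 factorial A/B test: two features toggled independently,
# analyzed with an interaction term so we can see whether the features
# help individually but hurt in combination.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 4000
df = pd.DataFrame({
    "feature_a": rng.integers(0, 2, n),
    "feature_b": rng.integers(0, 2, n),
})
# Simulated outcome: each feature adds a small lift, but the combination takes it back.
lift = (0.02 * df["feature_a"] + 0.02 * df["feature_b"]
        - 0.05 * df["feature_a"] * df["feature_b"])
df["converted"] = (rng.random(n) < 0.10 + lift).astype(int)

model = smf.ols("converted ~ feature_a * feature_b", data=df).fit()
print(model.summary().tables[1])  # inspect the feature_a:feature_b interaction term
```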

My impression is that companies like to add and test a lot of features separately; individually these features are good, but together they form complex clutter and end up being a net negative.