August 7th, 2024

A/B testing mistakes I learned the hard way

Lior Neu-ner highlights key A/B testing mistakes, emphasizing the need for a clear hypothesis, proper result segmentation, careful user selection, and monitoring counter metrics to avoid misleading conclusions.


A/B testing can be a powerful tool for product development, but it is fraught with potential pitfalls. Lior Neu-ner shares key mistakes he has encountered while conducting A/B tests, emphasizing the importance of a clear hypothesis, proper segmentation of results, and careful user selection. A well-defined hypothesis should articulate the purpose of the test and expected outcomes, while aggregating results can obscure significant insights, such as performance differences across devices. Including users who are not affected by the test can skew results, leading to inaccurate conclusions. Additionally, premature analysis of test results can mislead decision-making, as early statistical significance may not hold. Neu-ner advises against rushing into experiments without preliminary testing, as this can result in biased outcomes if issues arise. Finally, he highlights the necessity of monitoring counter metrics to identify any unintended negative effects of changes made during testing. By adhering to these guidelines, engineers can enhance the effectiveness of their A/B testing processes.

- A clear hypothesis is crucial for effective A/B testing.

- Results should be segmented by relevant user properties to avoid misleading conclusions (a short sketch of per-segment analysis follows this list).

- Exclude unaffected users from experiments to ensure accurate data.

- Avoid making decisions based on incomplete data by adhering to predetermined test durations.

- Monitor counter metrics to detect any negative impacts from changes made during testing.
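
As a rough illustration of the segmentation point above, here is a minimal sketch that compares test and control conversion both in aggregate and per device; the data, column names, and numbers are hypothetical, not from the article.

```python
# Minimal sketch of per-segment analysis (hypothetical data and column names).
# Compares test vs. control conversion overall and within each device segment
# using a two-proportion z-test.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

data = pd.DataFrame({
    "variant":     ["control", "test", "control", "test"],
    "device":      ["mobile", "mobile", "desktop", "desktop"],
    "conversions": [420, 510, 380, 340],
    "visitors":    [5000, 5000, 4000, 4000],
})

def compare(df, label):
    """Two-proportion z-test between test and control for one slice of the data."""
    test = df[df["variant"] == "test"]
    control = df[df["variant"] == "control"]
    counts = [test["conversions"].sum(), control["conversions"].sum()]
    nobs = [test["visitors"].sum(), control["visitors"].sum()]
    _, p_value = proportions_ztest(counts, nobs)
    print(f"{label:>8}: test {counts[0] / nobs[0]:.1%} vs control {counts[1] / nobs[1]:.1%}, p={p_value:.3f}")

compare(data, "overall")                  # the aggregated view
for device, segment in data.groupby("device"):
    compare(segment, device)              # the per-segment view
```

With these made-up numbers the test variant wins on mobile but loses on desktop, a difference the aggregated comparison alone would hide.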

8 comments
By @light_hue_1 - 9 months
That's not Simpson's paradox!

> In fact, while the new flow worked great on mobile, conversion was lower on desktop – an insight we missed when we combined these metrics.

> This phenomenon is known as Simpson's paradox – i.e. when experiments show one outcome when analyzed at an aggregated level, but a different one when analyzed by subgroups.

There's nothing strange about finding out that some groups benefit and others lose out when dividing up your data. You're looking at an average, and some parts are positive while others are negative. Where's the paradox there?

Simpson's paradox is when, in aggregate, more button presses lead to more purchases, but then you look at desktop vs mobile and find that for both desktop and mobile more clicks don't mean more purchases (or worse, more clicks mean fewer purchases).

That's why it's a paradox. The association between two variables exists at the aggregate level but doesn't exist or is backwards when you split up the population. It's not a statement about the average performance of something.
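
To put numbers on that reversal, here is a small sketch with entirely made-up figures in which the test variant wins in aggregate yet loses in both device subgroups:

```python
# Hypothetical numbers illustrating Simpson's paradox: the aggregate comparison
# reverses when the same data is split by device. The traffic mix drives the flip:
# the test variant got most of its traffic on desktop, where baseline conversion
# is much higher.
segments = {
    #            (conversions, visitors)
    "desktop": {"control": (64, 200),   "test": (540, 1800)},
    "mobile":  {"control": (108, 1800), "test": (10, 200)},
}

def rate(conversions, visitors):
    return conversions / visitors

totals = {"control": [0, 0], "test": [0, 0]}
for device, arms in segments.items():
    for arm, (conversions, visitors) in arms.items():
        totals[arm][0] += conversions
        totals[arm][1] += visitors
    print(f"{device:>7}: control {rate(*arms['control']):.1%} vs test {rate(*arms['test']):.1%}")

print(f"overall: control {rate(*totals['control']):.1%} vs test {rate(*totals['test']):.1%}")
# Test loses on desktop (30.0% vs 32.0%) and on mobile (5.0% vs 6.0%),
# yet "wins" overall (27.5% vs 8.6%) -- the reversal that defines the paradox.
```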

I would add a 7th A/B testing mistake to that list: not learning about basic probability, statistical tests, power, etc. Flying by the seat of your pants when statistics are involved always ends badly.
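
On the power point, a back-of-the-envelope sample-size check before launching a test is cheap. Here is a minimal sketch using the standard two-proportion approximation; the baseline rate, detectable lift, alpha, and power are hypothetical:

```python
# Rough per-arm sample size for detecting a lift in conversion rate.
# Baseline 10%, minimum detectable lift to 11%, alpha=0.05 (two-sided), power=0.8.
# All of these inputs are assumptions chosen for illustration.
from scipy.stats import norm

p1, p2 = 0.10, 0.11
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_arm = ((z_alpha + z_beta) ** 2 *
             (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
print(f"~{n_per_arm:.0f} users per arm")  # on the order of 15,000 per arm
```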

By @a2128 - 9 months
I feel like I've too often seen products ship new (anti)features that are way too easy to accidentally click. Whenever I do accidentally click one, I imagine it's incrementing some statistics counter that ultimately shows the product managers super high engagement, clearly meaning users must love it since they're using it all the time.
By @clarle - 9 months
#2 is a slippery slope if you don't do it properly.

You might end up looking at lots of different slices of your data and come to the conclusion, "Oh, it looks like France shows a statistically significant negative effect from our new signup flow changes."

It's important to have a hypothesis for the given slice before you start the experiment, and not just hunt for outliers after the fact; otherwise you're just p-hacking [1].

[1]: https://en.wikipedia.org/wiki/p-hacking
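
When a set of slices is pre-registered, a multiple-comparisons correction helps keep the "France looks significant" trap in check. A minimal sketch using a Holm correction; the slice names and p-values are made up:

```python
# Hypothetical per-slice p-values from a pre-registered set of segments.
# A Holm correction controls the family-wise error rate across the slices.
from statsmodels.stats.multitest import multipletests

slices = ["US", "UK", "France", "Germany", "mobile", "desktop"]
p_values = [0.30, 0.64, 0.04, 0.52, 0.01, 0.47]  # made-up numbers

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for name, p_raw, p_adj, significant in zip(slices, p_values, p_adjusted, reject):
    print(f"{name:>8}: raw p={p_raw:.2f}, adjusted p={p_adj:.2f}, significant={significant}")
# France's raw p=0.04 no longer clears 0.05 after adjustment.
```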

By @sokoloff - 9 months
I recall getting into a heated debate with an analyst at my company over the topic of "peeking" (he was right; I was wrong, but it took me several days to finally understand what he was saying).

The temptation to "peek" and keep on peeking until the test confesses to the thing you want it to say is very high.
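
The cost of peeking is easy to see in a simulation of A/A tests (no real difference between arms), checking significance after every batch and stopping at the first "significant" result; the batch sizes and counts here are arbitrary:

```python
# Simulate A/A tests (both arms identical) and count how often repeated peeking
# produces a "significant" result at some point. Without peeking the false
# positive rate should be ~5%; with peeking it is much higher.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_batches, batch_size = 1000, 20, 100
false_positives = 0

for _ in range(n_experiments):
    a, b = np.empty(0), np.empty(0)
    for _ in range(n_batches):
        a = np.concatenate([a, rng.normal(size=batch_size)])
        b = np.concatenate([b, rng.normal(size=batch_size)])
        _, p_value = stats.ttest_ind(a, b)
        if p_value < 0.05:        # peek: stop as soon as it looks significant
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_experiments:.1%}")
# Typically well above the nominal 5% with this many peeks.
```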

By @iamcreasy - 9 months
The article says that 'Changing the color of the "Proceed to checkout" button will increase purchases' is a bad hypothesis because it is underspecified.

But what else is there to measure, other than checkout button clicks (and follow-up purchases), to gauge the effect of a button color change?

Or perhaps this is not a robust example to illustrate underspecification?

By @ImageXav - 9 months
Here's another one that I feel is often overlooked by traditional A/B testers: if you have multiple changes, don't simply test them independently. Learn about fractional factorial experiments and interactions, and design your experiment accordingly. You'll get a much more relevant result.
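
For a sense of what that looks like in practice, here is a minimal sketch of a 2x2 factorial analysis with an interaction term, using simulated data; the feature names, effect sizes, and model choice are all hypothetical:

```python
# Hypothetical 2x2 factorial A/B test: two features toggled independently,
# analyzed with an interaction term so we can see whether the features
# help individually but hurt in combination.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 4000
df = pd.DataFrame({
    "feature_a": rng.integers(0, 2, n),
    "feature_b": rng.integers(0, 2, n),
})
# Simulated outcome: each feature adds a small lift, but the combination takes it back.
lift = (0.02 * df["feature_a"] + 0.02 * df["feature_b"]
        - 0.05 * df["feature_a"] * df["feature_b"])
df["converted"] = (rng.random(n) < 0.10 + lift).astype(int)

model = smf.ols("converted ~ feature_a * feature_b", data=df).fit()
print(model.summary().tables[1])  # inspect the feature_a:feature_b interaction term
```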

My impression is that companies like to add and test a lot of features separately; individually these features are good, but together they form complex clutter and end up being a net negative.