A/B testing mistakes I learned the hard way
Lior Neu-ner highlights key A/B testing mistakes, emphasizing the need for a clear hypothesis, proper result segmentation, careful user selection, and monitoring counter metrics to avoid misleading conclusions.
A/B testing can be a powerful tool for product development, but it is fraught with potential pitfalls. Lior Neu-ner shares key mistakes he has encountered while conducting A/B tests, emphasizing the importance of a clear hypothesis, proper segmentation of results, and careful user selection. A well-defined hypothesis should articulate the purpose of the test and expected outcomes, while aggregating results can obscure significant insights, such as performance differences across devices. Including users who are not affected by the test can skew results, leading to inaccurate conclusions. Additionally, premature analysis of test results can mislead decision-making, as early statistical significance may not hold. Neu-ner advises against rushing into experiments without preliminary testing, as this can result in biased outcomes if issues arise. Finally, he highlights the necessity of monitoring counter metrics to identify any unintended negative effects of changes made during testing. By adhering to these guidelines, engineers can enhance the effectiveness of their A/B testing processes.
- A clear hypothesis is crucial for effective A/B testing.
- Results should be segmented by relevant user properties to avoid misleading conclusions (see the sketch after this list).
- Exclude unaffected users from experiments to ensure accurate data.
- Avoid making decisions based on incomplete data by adhering to predetermined test durations.
- Monitor counter metrics to detect any negative impacts from changes made during testing.
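To make the segmentation and significance points above concrete, here is a minimal sketch of checking each device segment separately with a two-proportion z-test via statsmodels. The segment names and conversion counts are hypothetical illustrations, not numbers from the article.

```python
# Minimal sketch: per-segment two-proportion z-tests, with made-up counts.
# Pooling everything would hide that the variant helps mobile but hurts desktop.
from statsmodels.stats.proportion import proportions_ztest

# (conversions, users) per arm; hypothetical numbers
segments = {
    "mobile":  {"control": (450, 5000), "variant": (520, 5000)},
    "desktop": {"control": (600, 5000), "variant": (555, 5000)},
}

for name, arms in segments.items():
    (c_conv, c_n), (v_conv, v_n) = arms["control"], arms["variant"]
    z, p = proportions_ztest(count=[v_conv, c_conv], nobs=[v_n, c_n])
    lift = v_conv / v_n - c_conv / c_n
    print(f"{name:7s} lift={lift:+.2%}  p={p:.3f}")

# Pooled comparison, which blurs the two segments together
c_conv = sum(a["control"][0] for a in segments.values())
c_n    = sum(a["control"][1] for a in segments.values())
v_conv = sum(a["variant"][0] for a in segments.values())
v_n    = sum(a["variant"][1] for a in segments.values())
z, p = proportions_ztest(count=[v_conv, c_conv], nobs=[v_n, c_n])
print(f"pooled  lift={v_conv / v_n - c_conv / c_n:+.2%}  p={p:.3f}")
```

The specific test matters less than the habit: the per-segment numbers are what stop a mobile win from masking a desktop regression.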
> In fact, while the new flow worked great on mobile, conversion was lower on desktop – an insight we missed when we combined these metrics.
> This phenomenon is known as Simpson's paradox – i.e. when experiments show one outcome when analyzed at an aggregated level, but a different one when analyzed by subgroups.
There's nothing strange about finding out that some groups benefit and others lose out when dividing up your data. You're looking at an average, and some parts are positive while others are negative. Where's the paradox there?
Simpson's paradox is when, in aggregate, more button presses lead to more purchases, but when you look at desktop and mobile separately, you find that in both segments more clicks don't mean more purchases (or worse, more clicks mean fewer purchases).
That's why it's a paradox. The association between two variables exists at the aggregate level but doesn't exist or is backwards when you split up the population. It's not a statement about the average performance of something.
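For concreteness, here is a small made-up example of that reversal (the numbers are illustrative, not from the article): the new flow converts better inside each segment, yet worse in the combined totals, because the two segments have different base rates and very different sample sizes.

```python
# Made-up counts illustrating Simpson's paradox: (conversions, users) per flow.
segments = {
    "desktop": {"new": (81, 87),   "old": (234, 270)},
    "mobile":  {"new": (192, 263), "old": (55, 80)},
}

def rate(conv, users):
    return conv / users

for name, d in segments.items():
    print(f"{name}: new {rate(*d['new']):.1%} vs old {rate(*d['old']):.1%}")
# desktop: new 93.1% vs old 86.7%  -> new flow wins
# mobile:  new 73.0% vs old 68.8%  -> new flow wins

new_total = (81 + 192, 87 + 263)   # (273, 350)
old_total = (234 + 55, 270 + 80)   # (289, 350)
print(f"combined: new {rate(*new_total):.1%} vs old {rate(*old_total):.1%}")
# combined: new 78.0% vs old 82.6% -> the old flow wins; the aggregate reverses
```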
I would add a 7th A/B testing mistake to that list: not learning about basic probability, statistical tests, power, etc. Flying by the seat of your pants when statistics are involved always ends badly.
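As one small illustration of the power point (the baseline rate and target lift here are hypothetical, not from the article or the comment), statsmodels can answer "how many users per variant do I need to detect a lift from 10% to 12% conversion at 80% power?":

```python
# Rough sample-size sketch for a hypothetical 10% -> 12% conversion lift,
# two-sided alpha = 0.05, power = 0.80.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.12, 0.10)      # Cohen's h for 12% vs 10%
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0,
)
print(f"~{n_per_variant:,.0f} users needed per variant")   # roughly 3,800
```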
You might end up looking at lots of different slices of your data and come to the conclusion, "Oh, it looks like France is a statistically significant negative on our new signup flow changes."
It's important to make sure you have a hypothesis for the given slice before you start the experiment, rather than hunting for outliers after the fact; otherwise you're just p-hacking [1].
The temptation to "peek" and keep on peeking until the test confesses to the thing you want it to say is very high.
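To see how big the peeking effect is, here is a toy simulation (my own illustration, not from the article or the comment): an A/A test where the two variants are identical, so every "significant" result is a false positive. Testing once at the planned end keeps false positives near the nominal 5%; checking after every batch and stopping at the first p < 0.05 pushes them far higher.

```python
# Toy A/A simulation: both variants convert at the same 10% rate, so any
# "significant" result is a false positive. Compare peeking after every batch
# with a single test at the planned end of the experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rate, batches, batch_size, n_sims = 0.10, 20, 500, 2000

def p_value(conv_a, conv_b, n):
    """Two-proportion z-test p-value with n users per variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_a - conv_b) / n / se
    return 2 * stats.norm.sf(abs(z))

peek_fp = final_fp = 0
for _ in range(n_sims):
    # cumulative conversions per variant after each batch
    a = (rng.random((batches, batch_size)) < rate).sum(axis=1).cumsum()
    b = (rng.random((batches, batch_size)) < rate).sum(axis=1).cumsum()
    ns = np.arange(1, batches + 1) * batch_size
    pvals = [p_value(a[k], b[k], ns[k]) for k in range(batches)]
    peek_fp += any(p < 0.05 for p in pvals)   # declare a "win" at the first peek
    final_fp += pvals[-1] < 0.05              # look only once, at the end

print(f"false positive rate, peeking every batch: {peek_fp / n_sims:.1%}")
print(f"false positive rate, single final test:   {final_fp / n_sims:.1%}")
```

With this many peeks the simulated false-positive rate comes out several times the nominal 5%, while the single final test stays close to 5%.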
But what else is there to measure, other than checkout button clicks (and follow-up purchases), to gauge the effect of a button color change?
Or perhaps this is not a robust example to illustrate underspecification?
My impression is that companies like to add/test a lot of features separately - and individually these features are good, but together they form complex clutter and end up being a net negative.
Related
You Can't Build Apple with Venture Capital
Humane, a startup, faced challenges with its "Ai Pin" device despite raising $230 million. Criticized for weight, battery life, and functionality, the late pivot to AI was deemed desperate. Venture capital risks and quick idea testing are highlighted, contrasting startup and established company product development processes.
Synthetic User Research Is a Terrible Idea
Synthetic User Research criticized by Matthew Smith for bias in AI outputs. Using AI may lead to generalized results and fabricated specifics, hindering valuable software development insights. Smith advocates for controlled, unbiased research methods.
Automated Test-Case Reduction
Adrian Sampson explains automated test-case reduction techniques using the Shrinkray reducer to debug an interpreter bug. He emphasizes the importance of effective test scripts and occasional manual intervention for efficient bug identification and resolution.
Fear of over-engineering has killed engineering altogether
The article critiques the tech industry's focus on speed over engineering rigor, advocating for "Napkin Math" and Fermi problems to improve decision-making and project outcomes through basic calculations.
Cringey, but True: How Uber Tests Payments in Production
Uber tests payment systems in production to identify real-world bugs, rolling out new methods incrementally and treating each deployment as an experiment to enhance reliability and efficiency based on user feedback.