August 7th, 2024

Cringey, but True: How Uber Tests Payments in Production

Uber tests payment systems in production to identify real-world bugs, rolling out new methods incrementally and treating each deployment as an experiment to enhance reliability and efficiency based on user feedback.

Uber employs a unique approach to testing its payment systems by conducting tests in production rather than relying solely on staging environments. This method, while often viewed with skepticism by engineers, allows Uber to identify and address bugs that may not surface in a controlled setting. The rationale behind this strategy is that as software matures, it becomes increasingly difficult to find and fix issues in a staging environment, which may not accurately replicate real-world conditions. Uber's testing process involves rolling out new payment methods incrementally, starting with a small, representative user base to monitor performance and quickly roll back if issues arise. This approach emphasizes the importance of real user feedback and data in refining software, particularly in the high-stakes realm of payment processing. By treating each deployment as an experiment, Uber can adapt and improve its systems based on actual user interactions, ultimately enhancing the reliability and efficiency of its payment solutions.

- Uber tests its payment systems in production to identify real-world bugs.

- The company rolls out new payment methods incrementally to minimize risk.

- Each deployment is treated as an experiment to gather data and improve systems.

- Staging environments are seen as less effective compared to real-world testing.

- The approach emphasizes the importance of user feedback in software development.
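
The summary above describes rolling new payment methods out to a small, representative slice of users first, widening gradually, and rolling back quickly if something goes wrong. A minimal sketch of what such a rollout gate could look like follows; the function names, flag names, and thresholds are illustrative assumptions, not Uber's actual implementation.

```python
import hashlib

# Hypothetical rollout configuration: payment method -> fraction of users enabled.
ROLLOUT_PERCENTAGES = {
    "new_wallet_provider": 0.01,  # start with 1% of users
}

def bucket_for_user(user_id: str) -> float:
    """Deterministically map a user to a value in [0, 1) so the same
    user always falls inside or outside the rollout slice."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def payment_method_enabled(user_id: str, method: str) -> bool:
    """Check whether this user is inside the current rollout slice."""
    return bucket_for_user(user_id) < ROLLOUT_PERCENTAGES.get(method, 0.0)

# Widening the rollout means raising the percentage; rolling back means setting it to 0.
if payment_method_enabled("user-123", "new_wallet_provider"):
    print("offer the new payment method")
else:
    print("fall back to existing payment methods")
```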

AI: What people are saying
The comments on the article about Uber's payment testing approach reveal a mix of opinions and insights from industry professionals.
  • Many commenters agree that testing in production is a common practice in the industry, especially for payment systems, due to the limitations of staging environments.
  • Some criticize the article for being overly simplistic or lacking depth, while others find value in its insights about real-world testing.
  • There are discussions about the challenges of using testing environments, including issues with payment provider APIs and the high costs associated with testing in production.
  • Several commenters suggest alternative testing strategies, such as parallel deployments or using cloned requests to minimize risks.
  • Overall, there is a consensus that while testing in production can be risky, it is often necessary to uncover bugs that staging environments cannot replicate.
38 comments
By @andrewl-hn - 9 months
Isn't that what everybody does in the industry?!

Every single place I have worked at in the past 20 years tests payments using real cards and real API endpoints. Yes, refunds cost a few pennies and sometimes can't be automated, but most payment providers simply do not offer testing APIs of sufficient quality.

Situations where a testing endpoint had one set of bugs not found in production, and vice versa, were so ubiquitous from the mid-2000s to the mid-2010s that many teams chose not to use testing endpoints at all - it's too much work to work around bugs unique to an environment that no real customers actually hit. And now a whole generation of developers has grown up in a world of bad testing APIs from PayPal, Authorize.net, Braintree, BalancedPayments (remember them?), early Stripe, etc. So now it has become institutional knowledge: "do not use testing endpoints for payments".

To be precise, people often use testing endpoints in the early stages of development, when there isn't yet any payment code at all; but before the product launch things get switched to production endpoints, and from that point on testing endpoints aren't used at all. Even for local development, people usually use corporate cards if necessary.

I have a suspicion that things may be different in the US, since many payment providers' testing environments simulate a typical domestic US scenario: credit cards rather than debit, no 3-D Secure, no strict payment jurisdiction restrictions, etc.

By @andriesm - 9 months
I see several comments calling this piece "fluffy" without much real insight, and I have to respectfully disagree. I'm 48 and wrote my first code at 8, still write code for myself, have managed teams, held all manner of roles and done some startups. This article is solid gold.

I'm surprised people think this article doesn't have much of importance to say. I suspect their code probably crashes a lot in production, and that this attitude will keep killing startups or otherwise destroying significant amounts of shareholder value.

They think the article is banal and obvious. They will not really take the key insights to heart and truly live them.

Crowdstrike is the perfect example of this!!!

And for every CrowdStrike there are tons of startups that don't make the news but end up burning their early-adopter users through an inability to deal with bugs properly, delaying their own success unnecessarily, or even turning what would have been massive business successes into technical morasses. Imagine failing to capture your business's full potential because of a bad approach to software defects!

By @zadokshi - 9 months
This article can be rewritten into one line:

“Not all bugs can be found until you deploy to production. So deploying to production can be called ‘testing in production’”

By @madaxe_again - 9 months
From my experience building medium-scale ecommerce systems, along with innumerable payment integrations of various flavours, this isn't unreasonable, for a few reasons.

Firstly, payment service providers honestly suck at providing a coherent staging environment. Either it’ll be out of date, or ahead of production, or full of garbage data that you can’t clear that breaks their outputs, or just plain not representative of the production environment. You’ll have stuff check out perfectly in staging only to be a hot mess on their live environment.

Secondly, if you’re doing this stuff at scale, it’s not as simple as “make an API call and get a result” - you’ve got your egress and ingress to worry about, at different levels (NAT, load balancing, packet routing, http(s) proxies), and there’s a host of stuff that can go wrong for subtle reasons.

We used to (for they are now just a shopify shop since my departure a decade ago) do exactly as is described - test in staging as much as it is useful, and then go live with an immediate test built into the deployment toolchain, with automatic rollback in case of failure for any reason.

It worked. The only payment issues we ever had, after we realised that testing on staging was damned near meaningless, were on the payment gateway's side.
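
A rough sketch of that kind of deploy-then-verify step follows; the `deploy`, `run_live_smoke_test`, and `rollback` helpers are hypothetical stand-ins for whatever the real toolchain provided, not the commenter's actual code.

```python
import sys

def deploy(version: str) -> None:
    """Hypothetical: push the new release to the live environment."""
    print(f"deploying {version}")

def run_live_smoke_test() -> bool:
    """Hypothetical: place a small real transaction through the live
    payment gateway, verify it is authorised, then void/refund it."""
    return True

def rollback(previous_version: str) -> None:
    """Hypothetical: restore the previously known-good release."""
    print(f"rolling back to {previous_version}")

def release(new_version: str, previous_version: str) -> None:
    deploy(new_version)
    if not run_live_smoke_test():
        # Any failure, for any reason, triggers an automatic rollback.
        rollback(previous_version)
        sys.exit(1)
    print("release verified against the live payment gateway")

release("2024-08-07.1", "2024-08-06.3")
```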

By @jatins - 9 months
Extremely fluffy piece. 20% in and not one valuable piece of information
By @TYPE_FASTER - 9 months
We worked with a payment processor to implement billing for our services via credit card. According to the payment processor, the QA environment for one of the major credit cards had been broken for a while, so we tested in production.

We were testing billing for customers who were going to pay us, so putting a small charge on a corporate card that would come back to us wasn't a big deal. I just remember being slightly surprised that testing something like credit card payments was done against the production environment.

By @takumo - 9 months
Yes, this article is probably longer and fluffier than it needs to be but there are some real truths here.

Payments are one of the original service-oriented architecture systems: in production, your payment is processed by at least three or four parties, each of which will call several systems or sub-systems to process it.

This method clearly works for Uber, who have a lot of payments going through their systems most of which are of a relatively small value. Dropping a payment and either asking the user to pay via a different option or simply writing off the revenue for a handful of transactions is probably workable for them.

I have the opposite situation: the number of transactions we process is relatively low, but the average value of these transactions is high, well in excess of 1,000 USD. This leads to the following issues:

1. Screwing up a payment and asking the user to try again can be a big hit to user confidence.
2. We can't write off even a single payment/transaction; they're too high-value to write off.
3. Processing fees and refunds for making test transactions in production are too expensive. If a test costs more than $10 (to test in production we must test with production transaction values), that's going to rack up quickly.

By @throwaway82498 - 9 months
Uber had, and probably still has, a sophisticated setup for directing prod traffic for specific requests to/from developer laptops, for isolating test tenancies in prod services, for simulating trips using test tenancies, for automatically detecting and rolling back deployments based on everything from the usual observability metrics to black box testing against prod, and last but not least, good unit test coverage.

I bet their payments team runs code before it gets deployed. The article seems to imply that Uber engineers don't bother to test code before they land it, when in reality they do test it, and they also catch other stuff afterwards too.
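
For illustration, here is a minimal sketch of the test-tenancy idea described above. The header name and routing logic are assumptions, not Uber's actual mechanism: requests carrying a test-tenancy marker flow through the production service but are pinned to isolated test data and never hit the real payment processor.

```python
from dataclasses import dataclass

TEST_TENANCY_HEADER = "x-test-tenancy"  # hypothetical header name

@dataclass
class PaymentRequest:
    headers: dict
    amount_cents: int
    user_id: str

def is_test_tenancy(request: PaymentRequest) -> bool:
    return request.headers.get(TEST_TENANCY_HEADER) == "1"

def charge(request: PaymentRequest) -> str:
    if is_test_tenancy(request):
        # Same production code path, but writes land in an isolated test
        # tenancy and the processor call is replaced with a sandbox call.
        record_to_test_tenancy(request)
        return "test-authorized"
    return call_real_payment_processor(request)

def record_to_test_tenancy(request: PaymentRequest) -> None:
    print(f"[test tenancy] recorded {request.amount_cents} for {request.user_id}")

def call_real_payment_processor(request: PaymentRequest) -> str:
    # Placeholder for the real processor integration.
    return "authorized"

print(charge(PaymentRequest({"x-test-tenancy": "1"}, 500, "rider-42")))
```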

By @cheschire - 9 months
> software is not like other machines. Most machines, in time, rot and decay. But software is just information: if it’s correct, it stays that way. Hardware does need replacement, but the correct software that runs on it keeps running.

Unless you have some empowered person or group in your organization, levels above your team, that is allowed to constantly move the goalposts because of “cybersecurity!!1”, so that even the most mundane internal-only systems have to be kept on the latest versions of everything ever just so their scanning software shows “green”. Probably because their own OKRs are based on how many green circles they keep, or something.

They’re cyberaccountants.

By @snowstormsun - 9 months
> First, you have to copy all production data. It’s expensive, and a reckless breach in privacy and security, but it’s doable.

So, what does "doable" mean in this context? We unnecessarily increased the attack surface for production data and until today haven't suffered a data breach because of it?

A staging env with actual prod data now needs to be treated as a production environment. A system is only as secure as its weakest link, so an attacker will have an easier time getting into that "staging" environment where things are tested out, no?

By @randomgiy3142 - 9 months
I tried explaining to people that you’re dealing with systems so antiquated that places still accept Diner’s Club cards. Accepting a credit card at all was a big deal, because you literally copied a number and hoped it worked. People have cards with no email associated with them. Furthermore, there are a ton of settlement nuances. It’d be like building a browser if you were an alien who was given the RFC specs.

I’ve worked with giant companies working directly with providers. Testing legalese and reality are far apart. In no scenario would we have the customer “test” a major new feature rollout. We’d have a budget, and someone would make a real purchase and then donate the goods to charity - or, more usually, it was office candy for a month. I doubt the budget was even touched. We likely had provisions that prevented a $10k charge on a $15 product; that never happened. The only issue was that it’d skip normal QA (India has weird rules) and usually end up being a frivolous purchase or purchases on corporate and private cards.

By @brynb - 9 months
i've built tons of very intricate payments systems over the past 10 years and i honestly have no idea how "payments engineer" is even worthy of a distinct job title. it's a thing people do in the course of building products. ridiculous
By @_heimdall - 9 months
A couple large corporations I worked for had two instances of prod, geographically isolated with one acting as a fallback in case the primary went down. This isn't particularly novel at all, but what I was always interested in was using a similar setup for testing production prior to flipping the release live.

Effectively you'd just have prod and staging with identical deployment configuration. The benefit would be promoting the exact staging release to prod as soon as tests pass.

That said, I've never tried this and I'm sure there are good arguments for avoiding the added complexity of regularly flipping production between two different environments.

By @noiv - 9 months
Just an idea, can't you just swap staging and production? So, actually the system you've tested goes live by switching nothing more than a pointer (no deployment involved).

Won’t rising support costs at some point suggest it’s cheaper to have two swappable live systems than the alternative?
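
A toy sketch of that pointer swap, assuming a router that holds nothing more than the name of the currently live environment (the environment names and URLs are made up): going live, or rolling back, is a single assignment rather than a deployment.

```python
class EnvironmentRouter:
    """Keeps two identical environments and a pointer to the live one."""

    def __init__(self):
        self.environments = {
            "blue": "https://blue.payments.internal",   # hypothetical URLs
            "green": "https://green.payments.internal",
        }
        self.live = "blue"  # "green" is currently the staging/candidate side

    def live_url(self) -> str:
        return self.environments[self.live]

    def swap(self) -> None:
        # Promote the tested candidate by flipping the pointer; the old
        # live side becomes the next staging/rollback target.
        self.live = "green" if self.live == "blue" else "blue"

router = EnvironmentRouter()
print(router.live_url())  # blue serves traffic
router.swap()             # the environment you just tested is now live
print(router.live_url())
```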

By @usernamed7 - 9 months
I agree with others calling this fluffy. I bounced after this:

> to test your payment systems in sandbox for an amount of time that’s reasonable. And not a second more.

For an amount of time that is reasonable? And not a second more? What is this drivel?

By @graeme - 9 months
What do people with smaller companies do to test with real cards? The terms of credit cards usually disallow using your own card to make a purchase from yourself.
By @robertlagrant - 9 months
> I really like how Charity Majors put it: “staging is just a glorified laptop”. Only production is production.

Production is also just a glorified laptop.

By @trollied - 9 months
Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in.
By @gabrieledarrigo - 9 months
I expected a deeper, more detailed article, but my hopes were trashed just after the first, poor, introductory section.
By @mannyv - 9 months
To be honest, errors in payment processing are hard to create and reproduce in test. Plus there are errors that apparently never occur anywhere except in production.

So yeah, "testing" in production is normal for all payment systems.

By @JoosToopit - 9 months
Pure graphomania.

Look, ma, I'm a blogger! Wait, no scratch that - I'm a WRITER!

By @cuttysnark - 9 months
cc: 4242424242424242

cvc: 424

exp: 2/4/24

Fond memories of speed-running the checkout flow in Stripe sandbox.
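
For anyone who hasn't done that speed-run: a minimal sketch of a sandbox charge with Stripe's Python library, using its documented test-mode payment method for that card. The key and amount are placeholders; nothing here touches real money.

```python
import stripe

# Test-mode secret key (placeholder).
stripe.api_key = "sk_test_..."

# "pm_card_visa" is Stripe's test-mode payment method for 4242 4242 4242 4242.
intent = stripe.PaymentIntent.create(
    amount=2000,                      # $20.00, in cents
    currency="usd",
    payment_method="pm_card_visa",
    payment_method_types=["card"],
    confirm=True,
)

print(intent.status)  # expect "succeeded" in the sandbox
```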

By @lofaszvanitt - 9 months
Payment systems are the blogs of the early 2000s.
By @aloknnikhil - 9 months
I couldn't bother myself to read the whole article. Got GPT-4 to summarize the main points. Not as much insight as I thought I would get going in.

1. *Testing in Staging vs. Production*:
   - Most engineers prefer testing in staging due to a sense of control.
   - There's a misconception that it's an either/or situation between staging and production testing. In reality, both are necessary.

2. *Importance of Production Testing*:
   - Staging environments can’t replicate all possible real-world scenarios.
   - Production testing is essential to identify complex, real-world issues missed in staging.

3. *Uber's Approach to Testing*:
   - Uber tests its payment systems in production.
   - They have developed tools (Cerberus and Deputy) to facilitate transparent interaction with real systems and gather responses effectively.

4. *Every Deployment as an Experiment*:
   - Every deployment is treated as a hypothesis to be validated against business metrics.
   - Metrics and monitoring are crucial to determine the success of deployment.

5. *First Rollout Region*:
   - Uber chooses a specific first rollout region to minimize risk and impact.
   - Initial rollouts are conducted in regions that are small but significant for practical monitoring.

6. *Canary Deployments*:
   - Uber conducts canary deployments to a subset of users to detect and mitigate potential issues early.
   - This approach helps in identifying and fixing issues with minimal impact (see the sketch after this list).

7. *Examples of Issues Discovered Early*:
   - Uber detected significant issues with GooglePay during its cautious rollout in Portugal, which would have been difficult to identify in a staging environment alone.

8. *Philosophy on Software Quality*:
   - True robustness and resiliency come from real-world usage and the continuous fixing of encountered issues.
   - Only production can provide the real stakes and conditions needed for thorough validation.

9. *Author and Newsletter*:
   - Alvaro Duran, author of “The Payments Engineer Playbook”, emphasizes the importance of sharing and learning from real-world experiences in payments systems.
   - Encourages readers to engage with the content and share it with colleagues for broader impact.
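
As a hedged illustration of points 4 and 6 above - not Uber's Cerberus/Deputy tooling, just made-up metric names and thresholds - a canary check might boil down to comparing the canary's payment error rate against the baseline and deciding whether to continue the rollout or roll back.

```python
# Hypothetical metric snapshots pulled from monitoring for the same time window.
baseline = {"payments_attempted": 120_000, "payments_failed": 240}
canary = {"payments_attempted": 1_200, "payments_failed": 9}

MAX_RELATIVE_DEGRADATION = 2.0  # canary may be at most 2x the baseline error rate

def error_rate(metrics: dict) -> float:
    return metrics["payments_failed"] / max(metrics["payments_attempted"], 1)

def canary_verdict(baseline: dict, canary: dict) -> str:
    base_rate = error_rate(baseline)
    canary_rate = error_rate(canary)
    if canary_rate > base_rate * MAX_RELATIVE_DEGRADATION:
        return "roll back"           # the experiment failed its hypothesis
    return "continue rollout"        # widen to the next region / larger cohort

# 0.75% canary error rate vs 0.2% baseline -> roll back
print(canary_verdict(baseline, canary))
```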

By @ram_rar - 9 months
Articles like these need a TL;DR: testing in prod is a tale as old as time.

It would have been more insightful to cover the underlying infra/tech that enables this seamlessly.

By @tqi - 9 months
> For Uber, every deployment is an experiment

Blindly experimenting without a clear hypothesis is a great way to ship statistical noise.

By @kelsey98765431 - 9 months
my hot take is to test in every environment... what a concept. the even deeper hot take here is to implement mocks of your integrated environments AND THEN IMPLEMENT THEIR SYSTEMS! good testing has a side effect of eventually eliminating technical debt, because the same set of tests that ensures your application is working can test whether your reimplementation of your upstream integration is working! ta da, you are now a growth company.
By @lucw - 9 months
Reminder that if you test a live payment on a new Stripe deployment, you will get INSTANTLY banned. Don't do a live test with a credit card in your name!!
By @mrbluecoat - 9 months
> For Uber, every deployment is an experiment

Me: Let's do that!

Boss: Ummm...

By @ninju - 9 months
Testing in Production

The Crowdstrike philosophy /s

By @NotGMan - 9 months
TLDR: some bugs can only happen in a real production environment, so expect them and be ready when deploying. Thinking your deploy will be ok because staging env passed all tests is delusional.
By @AppliedQuantum - 9 months
Or, one could test with a production-parallel deployment. Clone all requests to a parallel test system, and use the same production data for enrichment and validation in both the current production system and the new one. Then automatically compare the outputs of both systems for the fields that have to be the same between them, and automatically check the expected changed outputs as well.

Once there are no errors in the new system, you start switching over the systems in a controlled manner where the new system increasingly takes on the production role, and the old one still processes cloned requests for a while as a sanity check…

This way you don’t need an unrealistic staging environment, and you are not introducing any errors into production.

It worked more than 20 years ago when I architected this for a system that had to process 50M transactions every hour.
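
A small sketch of that request-cloning comparison, under the assumption that both systems expose the same handler interface and that the fields that must match exactly are known up front; only the current system's response is ever returned to the caller, so the candidate can fail without affecting production.

```python
FIELDS_THAT_MUST_MATCH = ("amount", "currency", "authorization_status")

def process_with_current(request: dict) -> dict:
    # Placeholder for the existing production system.
    return {"amount": request["amount"], "currency": "USD", "authorization_status": "approved"}

def process_with_candidate(request: dict) -> dict:
    # Placeholder for the new system processing the cloned request.
    return {"amount": request["amount"], "currency": "USD", "authorization_status": "approved"}

def shadow_compare(request: dict) -> list[str]:
    """Run the cloned request through both systems and report mismatches
    on the fields that must be identical."""
    current = process_with_current(request)
    candidate = process_with_candidate(dict(request))  # clone the request
    return [
        f"{field}: {current.get(field)!r} != {candidate.get(field)!r}"
        for field in FIELDS_THAT_MUST_MATCH
        if current.get(field) != candidate.get(field)
    ]

mismatches = shadow_compare({"amount": 1999, "card_token": "tok_test"})
print(mismatches or "candidate system matches production")
```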