July 8th, 2024

How we tamed Node.js event loop lag: a deepdive

The Trigger.dev team resolved performance issues in their Node.js app that were caused by event loop lag. They identified Prisma transaction timeouts, network congestion from excessive traffic, and an inefficient nested loop. The fixes reduced instances of event loop lag, and the team plans to optimize payload handling further to improve reliability.

The article describes how the team at Trigger.dev identified and resolved performance issues in their Node.js application caused by event loop lag. Crashes and high CPU usage first led them to investigate errors such as Prisma transaction timeouts and network traffic spikes. By analyzing AWS logs, they discovered a single IP address generating excessive traffic to a specific endpoint, causing network congestion. Further investigation revealed a nested loop that processed logs inefficiently, leading to high CPU usage. After shipping a fix and monitoring event loop lag, they addressed further issues by limiting log sizes and optimizing payload calculations. Once these fixes were deployed, they observed a significant reduction in event loop lag instances. Going forward, the team plans to optimize payload handling further to improve application reliability. The article emphasizes the importance of distributing main-thread work efficiently in Node.js applications to prevent event loop lag and keep the application responsive for all clients.
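
As a rough illustration of what monitoring event loop lag can look like, the sketch below uses Node's built-in `monitorEventLoopDelay` from `node:perf_hooks`. It is not taken from the Trigger.dev article; the helper names `measureLoopDelay` and `blockFor` and the 10 ms sampling resolution are assumptions chosen for the example.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Hypothetical helper: samples event loop delay while `work` runs and
// returns summary statistics converted from nanoseconds to milliseconds.
async function measureLoopDelay(work, resolutionMs = 10) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  try {
    await work();
    // Give the sampling timer a chance to fire once more after the work.
    await new Promise((resolve) => setTimeout(resolve, resolutionMs * 3));
  } finally {
    histogram.disable();
  }
  return {
    maxMs: histogram.max / 1e6,
    meanMs: histogram.mean / 1e6,
    p99Ms: histogram.percentile(99) / 1e6,
  };
}

// Hypothetical helper: simulates synchronous CPU-bound work by spinning
// on the main thread for roughly `ms` milliseconds.
function blockFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) {
    // busy-wait on purpose
  }
}
```

Calling `measureLoopDelay(async () => blockFor(200))` should report a noticeably larger `maxMs` than a run with no blocking work. The histogram values are reported by Node in nanoseconds, hence the division by `1e6`.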

Related

Optimizing JavaScript for Fun and for Profit

Optimizing JavaScript code for performance involves benchmarking and avoiding unnecessary work, string comparisons, and diverse object shapes. JavaScript engines optimize based on object shapes, which affects array/object methods and indirection. Creating objects with the same shape helps the engine optimize, and the article cautions against slower functional programming methods. Indirection such as proxy objects and function calls also carries a performance cost. Code examples and benchmarks demonstrate the differences.

JavaScript Visualized – Event Loop, Web APIs, (Micro)Task Queue [video]

The event loop in JavaScript is crucial for managing asynchronous tasks efficiently. It includes the call stack, web APIs, task queue, and microtask queue, enabling non-blocking operations.

Bad habits that stop engineering teams from high-performance

Bad habits hinder engineering team performance. The article stresses the importance of observability in software development, including Elastic's role in OpenTelemetry. It also discusses CI/CD practices, cloud-native tech updates, data management solutions, mobile testing advancements, API tools, DevSecOps, and team culture.

The weirdest QNX bug I've ever encountered

The author encountered a CPU usage problem in a QNX system's 'ps' utility that was caused by a 15-year-old bug. Debugging revealed a race condition, leading to code modifications and a shift towards open-source solutions.

Four lines of code it was four lines of code

The programmer resolved a CPU utilization issue by removing unnecessary Unix domain socket code from a TCP and TLS service handler. This debugging process emphasized meticulous code review and system interaction understanding.

7 comments
By @ricardobeat - 6 months
This is not about “taming lag” as suggested by the title, which implies some form of failure on node’s part.

They accidentally wrote synchronous O(n^2) code that hogged the CPU, blocking the event loop, then fixed it. But that doesn’t sound as adventurous…

Otherwise a solid example of using observability tools to debug a live issue.
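
To make the O(n^2) point concrete, here is a minimal sketch of the general pattern. The article's actual code is not reproduced here, so the log shape (`id`, `parentId`) and both function names are hypothetical. The first version scans the whole array for every element; the second builds a `Map` once and does constant-time lookups.

```javascript
// Hypothetical data shape: each log has a unique `id` and an optional `parentId`.

// Quadratic: `find` walks the whole array once per log.
function attachParentsQuadratic(logs) {
  return logs.map((log) => ({
    ...log,
    parent: logs.find((other) => other.id === log.parentId) ?? null,
  }));
}

// Linear: build an index once, then look each parent up by key.
// Assumes ids are unique; with duplicate ids the two versions can differ.
function attachParentsLinear(logs) {
  const byId = new Map(logs.map((log) => [log.id, log]));
  return logs.map((log) => ({
    ...log,
    parent: byId.get(log.parentId) ?? null,
  }));
}
```

Both versions are still synchronous, so a large enough input blocks the event loop either way; the indexed version just moves that threshold much further out.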

By @dexwiz - 6 months
Says event loop in the title, but the real culprit is a non-paginated endpoint with a nested loop. Pagination or guard rails are basic things for customer-facing features. Any time you design a service for X items, some will try it with 10x-1000x items. Be ready for that.
By @williamdclt - 6 months
I’m a bit confused by the monitoring described. Event loop lag is insidious because it doesn’t affect only the slow part of your app, it affects everything: one small part of a request takes seconds, making every concurrent request take seconds. Generally, I’ve found that when the event loop is having lag issues, you can’t really trust much of your application monitoring (OTel spans are very long, but it’s actually just waiting for the event loop). How then did they find the root causes of these lag issues?

As an aside, it’s a bit weird to create a span to mark that something happened; OTel events are made for that.

By @jauntywundrkind - 6 months
There's an easy but terrible fix for event loop lag. By all means, limit your work, do hard stuff like these folks did. But. If you just want to stop the suffering, all you need to do is yield periodically!

  if (n % 1000 === 0) await require('node:timers/promises').setImmediate()
If you sleep in your async functions, other work can flow. Just unblock the other work.

Node calls this partitioning your work. https://nodejs.org/en/learn/asynchronous-work/dont-block-the...

Maybe don't use this specific library but it's pretty easy to rebuild everyday array iterators (forEach, reduce, map) to automatically yield every n iterations. https://www.npmjs.com/package/nice-loops
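
A minimal sketch of such a yielding iterator, assuming Node 15 or later for `node:timers/promises`. The name `mapWithYield` and the default of yielding every 1000 items are hypothetical choices for the example; this is not the API of the library linked above.

```javascript
const { setImmediate: yieldToEventLoop } = require('node:timers/promises');

// Hypothetical helper: like Array.prototype.map, but pauses every `every`
// items so that queued I/O callbacks and timers get a chance to run.
async function mapWithYield(items, fn, every = 1000) {
  const results = new Array(items.length);
  for (let i = 0; i < items.length; i++) {
    results[i] = fn(items[i], i);
    if ((i + 1) % every === 0) {
      // Resolves on a later turn of the event loop, unblocking other work.
      await yieldToEventLoop();
    }
  }
  return results;
}
```

Usage would look like `const sizes = await mapWithYield(logs, (log) => JSON.stringify(log).length)`. The total CPU time is unchanged (slightly higher, in fact), but it is split into chunks so other requests are not starved while the loop runs.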

By @moonlion_eth - 6 months
How we turned our amateur code into click bait
By @cpursley - 6 months
tl;dr: should have used Erlang/Elixir as well as our customers.