June 19th, 2024

AI-powered conversion from Enzyme to React Testing Library

Slack engineers transitioned from Enzyme to React Testing Library due to React 18 compatibility issues. They used AST transformations and LLMs for automated conversion, achieving an 80% success rate.

Slack engineers had to move from Enzyme to React Testing Library (RTL) because Enzyme does not support React 18. With over 15,000 tests to convert, they explored automated approaches based on Abstract Syntax Tree (AST) transformations and Large Language Models (LLMs). AST transformations were accurate only where conversion rules had been written by hand, while LLMs showed promise but produced inconsistent output. A hybrid approach combining the two achieved an 80% conversion success rate. To improve results further, the team collected DOM tree data to give the LLM context and constrained it with prompts and AST-generated instructions. Some cases still needed manual intervention, and the project illustrates the value of combining human insight with automation in code migrations.

Related

We no longer use LangChain for building our AI agents

Octomind switched from LangChain due to its inflexibility and excessive abstractions, opting for modular building blocks instead. This change simplified their codebase, increased productivity, and emphasized the importance of well-designed abstractions in AI development.

Exposition of Front End Build Systems

Frontend build systems are crucial in web development, involving transpilation, bundling, and minification steps. Tools like Babel and Webpack optimize code for performance and developer experience. Various bundlers like Webpack, Rollup, Parcel, esbuild, and Turbopack are compared for features and performance.

Software Engineering Practices (2022)

Gergely Orosz sparked a Twitter discussion on software engineering practices. Simon Willison elaborated on key practices in a blog post, emphasizing documentation, test data creation, database migrations, templates, code formatting, environment setup automation, and preview environments. Willison highlights the productivity and quality benefits of investing in these practices and recommends tools like Docker, Gitpod, and Codespaces for implementation.

My weekend project turned into a 3 years journey

Anthony's note-taking app journey spans 3 years, evolving from a secure Markdown tool to a complex Electron/React project with code execution capabilities. Facing challenges in store publishing, he prioritizes user feedback and simplicity, opting for a custom online deployment solution.

Homegrown Rendering with Rust

Embark Studios develops a creative platform for user-generated content, emphasizing gameplay over graphics. They leverage Rust for 3D rendering, introducing the experimental "kajiya" renderer for learning purposes. The team aims to simplify rendering for user-generated content, utilizing Vulkan API and Rust's versatility for GPU programming. They seek to enhance Rust's ecosystem for GPU programming.

24 comments
By @morgante - 7 months
The Slack engineering blog[0] is more pragmatic, and shows more about how the approaches were actually combined.

This is basically our whole business at grit.io and we also take a hybrid approach. We've learned a fair amount from building our own tooling and delivering thousands of customer migrations.

1. Pure AI is likely to be inconsistent in surprising ways, and it's hard to iterate quickly. Especially on a large codebase, you can't interactively re-apply the full transform a bunch.

2. A significant reason syntactic tools (like jscodeshift) fall down is just that most codemod scripts are pretty verbose and hard to iterate on. We ended up open sourcing our own codemod engine[1], which has its own warts, but the declarative model makes handling exception cases much faster.

3. No matter what you do, you need to have an interactive feedback loop. We do two levels of iteration/feedback: (a) automatically run tests and verify/edit transformations based on their output, (b) present candidate files for approval / feedback and actually integrate feedback provided back into your transformation engine.

[0] https://slack.engineering/balancing-old-tricks-with-new-feat...

[1] https://github.com/getgrit/gritql
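The first level of feedback in point 3, running tests and keeping only transformations that verify, can be sketched roughly as follows. This is a minimal illustration, not Grit's or Slack's implementation: `rule_based`, `migrate`, and `passes_tests` are hypothetical names, the regex stands in for a real AST codemod, and `passes_tests` stands in for actually running the converted test file. An LLM-backed transform would simply be another callable in the list.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Result:
    source: str
    status: str  # "converted" or "needs_review"


def rule_based(source: str) -> Optional[str]:
    """Deterministic rewrite for one known pattern; None if it does not apply.

    A regex is used only to keep the sketch short; a real codemod would
    operate on the syntax tree.
    """
    pattern = r"wrapper\.find\('button'\)\.simulate\('click'\)"
    replacement = "fireEvent.click(screen.getByRole('button'))"
    rewritten, count = re.subn(pattern, replacement, source)
    return rewritten if count else None


def migrate(
    source: str,
    transforms: Sequence[Callable[[str], Optional[str]]],
    passes_tests: Callable[[str], bool],
) -> Result:
    """Try each transform in order and keep the first output that verifies."""
    for transform in transforms:
        candidate = transform(source)
        if candidate is not None and passes_tests(candidate):
            return Result(candidate, "converted")
    # Nothing verified: keep the original and queue it for a human.
    return Result(source, "needs_review")
```

Ordering the transforms from most to least deterministic keeps the cheap, predictable rules in front and reserves anything probabilistic for the leftovers.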

By @anymouse123456 - 7 months
The actual efficiency claim (which is also likely incorrect) is inverted from the original article, "We examined the conversion rates of approximately 2,300 individual test cases spread out within 338 files. Among these, approximately 500 test cases were successfully converted, executed, and passed. This highlights how effective AI can be, leading to a significant saving of 22% of developer time."

Reading that leads me to believe that 22% of the conversions succeeded and someone at Slack is making up numbers about developer time.
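The arithmetic behind that reading can be checked from the two figures in the quote. A minimal sketch, with variable names chosen here for illustration:

```python
# Figures as quoted from the Slack post (both are stated as approximate).
total_test_cases = 2300
passed_after_conversion = 500

# Share of test cases that were converted, executed, and passed.
pass_rate = passed_after_conversion / total_test_cases

print(f"{pass_rate:.1%}")  # 21.7%
```

That rounds to the 22% in the post. It measures the share of converted tests that passed and, by itself, says nothing about hours saved.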

By @jmull - 7 months
> saving considerable developer time of at least 22% of 10,000 hours

I wonder how much time or money it would take to just update Enzyme to support React 18? (fork, or, god forbid, by supporting development of the actual project).

Nah, let's play with LLMs instead, and retask all the frontend teams in the company to rewriting unit tests to a new framework we won't support either.

I guess when you're swimming in pools of money there's no need to do reasonable things.

By @AmalgatedAmoeba - 7 months
The conversion is between two testing libraries for React. Not to be too cynical (this sort of work seems to me like a pretty good niche for llms), but I don’t think I’d be that far off of 80% with just vim macros…
By @muglug - 7 months
For people unfamiliar with Enzyme and RTL, this was the basic problem:

Each test made assertions about a rendered DOM from a given React component.

Enzyme’s API allows you to query a snippet of rendered DOM using a traditional selector e.g. get the text of the DOM node with id=“foo”. RTL’s API requires you to say something like “get the text of the second header element”, and prevents you from using selectors.

To do the transformation successfully you have to run the tests, first to render each snippet, then have some system for taking those rendered snippets and the Enzyme code that queries them and converting the Enzyme code to roughly-equivalent RTL calls.

That’s what the LLM was tasked with here.
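To make the shape of that task concrete, here is a toy sketch of the mapping for the simplest case, an id selector such as Enzyme's `wrapper.find('#foo').text()`. It is not Slack's tooling: `IdLookup`, `suggest_rtl_query`, and the tiny `TAG_TO_ROLE` table are hypothetical, and the sketch ignores nested elements with the same tag, quoting, and everything else a real converter would face. The emitted strings use RTL's `screen.getByRole` and `screen.getByText` queries.

```python
from html.parser import HTMLParser

# Small, incomplete tag-to-ARIA-role table, for illustration only.
TAG_TO_ROLE = {"h1": "heading", "h2": "heading", "h3": "heading", "button": "button"}


class IdLookup(HTMLParser):
    """Records the tag and text of the element carrying a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.tag = None
        self.text = ""
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.tag = tag
            self._inside = True

    def handle_endtag(self, tag):
        # Simplification: assumes no nested element with the same tag name.
        if self._inside and tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


def suggest_rtl_query(rendered_html, element_id):
    """Suggest an RTL query string for the element an id selector matched."""
    lookup = IdLookup(element_id)
    lookup.feed(rendered_html)
    lookup.close()
    if lookup.tag is None:
        return None  # nothing rendered with that id; leave for manual review
    text = lookup.text.strip()
    role = TAG_TO_ROLE.get(lookup.tag)
    if role:
        return f"screen.getByRole('{role}', {{ name: '{text}' }})"
    return f"screen.getByText('{text}')"
```

The point of the sketch is the dependency on rendered output: the selector alone does not say whether `#foo` is a heading, a button, or a paragraph, so the right query cannot be chosen from the test source by itself.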

By @denys_potapov - 7 months
It's a 2024 webdev summary, nothing can be added:

New React version made the lib obsolete, we used LLM to fix it (1/5 success rate)

By @jmartin2683 - 7 months
Sounds like a nightmare to be involved with anything that is written in react and requires 15,000 unit tests.
By @semanser - 7 months
I’m working on a similar project (DepsHub) where LLMs are used to make major library updates as smooth as possible. While it doesn’t work in 100% of cases, it really helps to minimize all the noise while keeping your project up to date. I’m not surprised Slack decided to go this way as well.
By @dwringer - 7 months
It feels to me that there may be even more potential in flipping this idea around - human coders write tests to exact specifications, then an llm-using coding system evolves code until it passes the tests.
By @__jonas - 7 months
Seems like a reasonable approach. I wonder if it took less time than it would have taken to build some rule-based codemod script that operates on the AST, but I assume it did.
By @azangru - 7 months
We did this for our codebase (several hundred tests) manually, two or three years ago (the problems were already apparent with React 17). It helped that we never used Enzyme's shallow renderer, because that type of testing was already falling out of favor by the late 2010s.

The next frontier is ditching jest and jsdom in favor of testing in a real browser. But I am not sure the path for getting there is clear yet in the community.

By @larodi - 7 months
Another proof that this probabilistic, stochastic approach works on the prediction/token level, but not on the semantic level, where it needs a discrete system. This is essentially reminiscent of a RAG setup and is similar in nature.

Perhaps reiterating my previous sentiment: such application of LLMs together with discrete structures brings/hides much more value than chatbots, which will soon be considered a mere console UI.

By @trescenzi - 7 months
Slightly tangential but one of the largest problems I’ve had working with React Testing Library is a huge number of tests that pass when they should fail. This might be because of me and my team misusing it but regularly a test will be written, seem like it’s testing something, and pass but if you flip the condition, or break the component it doesn’t fail as expected. I’d really worry that any mass automated, or honestly manual, method for test conversion would result in a large percentage of tests which seem to be of value but actually just pass without testing anything.
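One cheap guard against that failure mode, whether tests are converted by hand or by tool, is to run each test against a deliberately broken implementation and confirm that it fails. The sketch below is language-agnostic and written in Python rather than RTL; `render_greeting` and the other names are stand-ins invented for illustration.

```python
def render_greeting(name):
    """Stand-in for a component: returns the text it would render."""
    return f"Hello, {name}!"


def render_greeting_broken(name):
    """Deliberately broken variant used to check that tests can fail."""
    return ""


def vacuous_test(render):
    # Only checks that rendering returns something, so it cannot catch a regression.
    assert render("Ada") is not None


def meaningful_test(render):
    assert render("Ada") == "Hello, Ada!"


def detects_breakage(test, broken_render):
    """True if the test fails when run against the broken implementation."""
    try:
        test(broken_render)
    except AssertionError:
        return True
    return False
```

A test for which `detects_breakage` returns `False` passes without constraining the component and is a candidate for review.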
By @viralpraxis - 7 months
Can someone elaborate if the term “AST” is used correctly in the article?

I’ve been playing with a mutation-injection framework for my master’s thesis for some time. I had to use LibCST to preserve syntax information which is usually lost during AST serialization/deserialization (like whitespace, indentation and so on). I thought that the difference between abstract and concrete trees is that it’s guaranteed a CST won’t lose any information, so it can be used for specific tasks where ASTs are useless. So, did they actually use a CST-based approach?
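The information loss in question is easy to reproduce with Python's standard `ast` module (`ast.unparse` requires Python 3.9 or later). This small sketch says nothing about which kind of tree Slack's tooling used; it only shows why a plain AST round trip is a problem when formatting and comments must survive.

```python
import ast

source = "x = ( 1 +  2 )  # running total\n"

# Parse to an AST, then regenerate source from the tree.
round_tripped = ast.unparse(ast.parse(source))

print(round_tripped)  # x = 1 + 2
```

The comment, the redundant parentheses, and the irregular spacing are all gone, because none of them are represented in the abstract tree.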

By @skywhopper - 7 months
Pretty misleading summary, given that LLMs played only a tiny part in the effort, and probably took more time to integrate than it saved in what is otherwise a pretty standard conversion pipeline, although I’m sure it’s heavily in the Slack engineers’ interest to go along with the AI story to please the Salesforce bosses who have mandated AI must be used in every task. Just don’t fall for the spin here, and think this will actually save you time on a similar effort.
By @29athrowaway - 7 months
Saving 22% of 15,000 tests is 3,300 tests.

While 22% sounds low, saving yourself the effort to rewrite 3,300 tests is a good achievement.

By @torginus - 7 months
Just to shamelessly plug one of my old projects, I did something like this at a German industrial engineering firm - they wanted us to rewrite a huge base of old tests written in TCL into C#.

It was supposed to take 6 months for 12 people.

Using an AST parser I wrote a program in two weeks that converted like half the tests flawlessly, with about another third needing minor massaging, and the rest having to be done by hand (I could've done better by handling more corner cases, but I kinda gave up once I hit diminishing returns).

Although it helped a bunch that most tests were brain dead simple.

Reaction was mixed - the newly appointed manager was kinda fuming that his first project's glory was stolen from him by an Assi, and the guys under him missed out on half a year of leisurely work.

I left a month after that, but what I heard is that they decided to pretend that my solution didn't exist on the management level, and the devs just ended up manually copypasting the output of my tool, and did a day's planned work in 20 minutes, with the whole thing taking 6 months as planned.

By @anymouse123456 - 7 months
Misleading title. Maybe try this one?

"Slack uses ASTs to convert test code from Enzyme to React with 22% success rate"

This article is a poor summary of the actual article, which is at least linked to Slack's engineering blog [0].

[0] https://slack.engineering/balancing-old-tricks-with-new-feat...

[updated]

By @gjvc - 7 months
infoq has gone to pure shit
By @Aurornis - 7 months
This is from the actual Slack blog post:

> We examined the conversion rates of approximately 2,300 individual test cases spread out within 338 files. Among these, approximately 500 test cases were successfully converted, executed, and passed. This highlights how effective AI can be, leading to a significant saving of 22% of developer time. It’s important to note that this 22% time saving represents only the documented cases where the test case passed.

So the blog post says they converted 22% of tests, which they claim as saving 22% of developer time, which InfoQ interpreted as converting 80% of tests automatically?

Am I missing something? Or is this InfoQ article just completely misinterpreting the blog post it’s supposed to be reporting on?

The topic itself is interesting, but between all of the statistics games and editorializing of the already editorialized blog post, it feels like I’m doing heavy work just to figure out what’s going on.