August 29th, 2024

OpenAI is shockingly good at unminifying code

The article illustrates how ChatGPT can reverse engineer and unminify JavaScript code in a React application, providing a clear breakdown and a readable TypeScript version for learning purposes.


The article discusses the use of ChatGPT to reverse engineer and unminify JavaScript code, specifically within a React application. The author, Frank Fiegel, encountered a minified code block while exploring a component that displayed dynamic ASCII art. Instead of manually deciphering the code or looking for a source map, he decided to leverage ChatGPT to explain the code's functionality. The AI provided a breakdown of the code, detailing its components, such as character set selection, dynamic character generation, and the React component responsible for rendering the ASCII art. Following this, the author requested a TypeScript version of the code, which ChatGPT delivered in a more human-readable format. Although the AI's response missed some implementation details, it was deemed sufficiently clear and useful for learning purposes. The article highlights the potential of using AI tools like ChatGPT for code comprehension and transformation, showcasing a practical application in software development.

- ChatGPT can effectively unminify and explain complex JavaScript code.

- The AI-generated TypeScript version of the code is readable and useful for learning.

- The original code generates dynamic ASCII art based on window size and time.

- Using AI tools can streamline the process of understanding and rewriting code.

- The author found the AI's output valuable despite minor omissions in implementation details.
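As a toy illustration of the transformation the article describes (not the article's actual ASCII-art code, which is not reproduced here), a minified snippet and a hand-unminified equivalent might look like:

```javascript
// Hypothetical example of the kind of transformation the article describes.
// Minified form, as a bundler might emit it:
const f=(w,t)=>{const c="abc123".split("");return c[(w+t)%c.length]};

// The same logic, unminified with descriptive names:
const CHARACTERS = "abc123".split("");

// Picks a character from the set based on window width and elapsed time.
function pickCharacter(windowWidth, timeMs) {
  return CHARACTERS[(windowWidth + timeMs) % CHARACTERS.length];
}

console.log(f(80, 5) === pickCharacter(80, 5)); // both implement the same logic
```

The unminified version carries the same behavior; only the names and layout differ, which is exactly the information minification throws away and the LLM has to reinvent.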

AI: What people are saying
The comments on the article about ChatGPT's ability to reverse engineer and unminify JavaScript code reveal several key themes and points of discussion.
  • Many users share their positive experiences using LLMs for code unminification and refactoring, highlighting the efficiency and clarity they provide.
  • Concerns are raised about the limitations of LLMs, particularly in handling obfuscated code and ensuring semantic fidelity between minified and unminified versions.
  • Some commenters discuss the implications of LLMs on code security and obfuscation, questioning the effectiveness of minification as a protective measure.
  • There is a debate over the necessity of LLMs for tasks that can be accomplished with existing tools, such as beautifiers for JavaScript.
  • Several users express curiosity about the future capabilities of LLMs, including potential applications in reverse engineering and decompilation.
59 comments
By @jehna1 - 8 months
Author of HumanifyJS here! I've created an LLM-based tool specifically for this, which uses LLMs at the AST level to guarantee that the code keeps working after the unminification step:

https://github.com/jehna/humanify

By @lifthrasiir - 8 months
JS minification is fairly mechanical and comparatively simple, so inverting it should be relatively easy. It would of course be tedious to do manually in general, but the transformations themselves are fairly limited, so it is possible to read minified code with only some notes to track mangled identifiers.
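The mechanical transformations the comment refers to can be sketched with a toy identifier mangler (a deliberate simplification; real minifiers like terser or esbuild operate on the AST, not with regexes):

```javascript
// Toy identifier mangler: a crude sketch of one minification pass.
function mangleIdentifiers(source, names) {
  let out = source;
  names.forEach((name, i) => {
    // Assign short names a, b, c, ... This is the step that is hard
    // to invert, since the original names are simply gone.
    const short = String.fromCharCode(97 + i); // 97 = 'a'
    out = out.replace(new RegExp(`\\b${name}\\b`, "g"), short);
  });
  // Collapse whitespace, the other classic minification step.
  return out.replace(/\s+/g, " ").trim();
}

const original = "function add(first, second) { return first + second; }";
console.log(mangleIdentifiers(original, ["add", "first", "second"]));
// → "function a(b, c) { return b + c; }"
```

Running the renamer forward is trivial; recovering `add`, `first`, and `second` from `a`, `b`, `c` is the part that requires understanding what the code does.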

A more general unminification or unobfuscation still seems to be an open problem. In the past I wrote a handful of intentionally obfuscated programs, and in my experience ChatGPT couldn't understand them even at the surface level. For example, a gist for my 160-byte-long Brainfuck interpreter in C had a comment trying to use GPT-4 to explain the code [1], but the "clarified version" bore zero similarity to the original code...

[1] https://gist.github.com/lifthrasiir/596667#gistcomment-47512...

By @albert_e - 8 months
Should the title say ChatGPT or gpt-4 (the model) instead of OpenAI (the company)?
By @j_maffe - 8 months
LLMs are excellent at text transformation. It's their core strength and I don't see it being used enough.
By @bdcravens - 8 months
I'm sure there's some number greater than zero of developers who are upset because they use minification as a means of obfuscation.

Reminds me of the tool that was provided in older versions of ColdFusion that would "encrypt" your code. It was a very weak algorithm, and didn't take long for someone to write a decrypter. Nevertheless some people didn't like this, because they were using this tool, thinking it was safe for selling their code without giving access to source. (In the late 90s/early 2000s before open source was the overwhelming default)

By @ninetyninenine - 8 months
This is an example of superior intellectual performance to humans.

There’s no denying it. This task is intellectual. It does not involve rote memorization. There are not tons and tons of data pairs of minified and unminified code on the web for LLMs to learn from.

The LLM understands what it is unminifying, and in this regard it is generally superior to humans. But only in this specific subject.

By @jackconsidine - 8 months
That's interesting. It's gotten a lot better I guess. A little over a year ago, I tried to use GPT to assist me in deobfuscating malicious code (someone emailed me asking for help with their hacked WP site via custom plugin). I got much further just stepping through the code myself.

After reading through this article, I tried again [0]. It gave me something to understand, though the code is obfuscated enough to essentially eval unreadable strings (via the Window object), so the LLM is not enough on its own.

Here was an excerpt of the report I sent to the person:

> For what it’s worth, I dug through the heavily obfuscated JavaScript code and was able to decipher logic that it:

> - Listens for a page load

> - Invokes a facade of calculations which are in theory constant

> - Redirects the page to a malicious site (unk or something)

[0] https://chatgpt.com/share/f51fbd50-8df0-49e9-86ef-fc972bca6b...
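The pattern described above, where calls are hidden behind strings resolved through the global object, looks roughly like this harmless sketch (not the actual malware from the report):

```javascript
// Harmless sketch of the obfuscation pattern described above: function
// names are split into fragments and resolved through the global object
// at runtime, so static readers see only string noise.
const g = globalThis; // stands in for `window` in a browser

const parts = ["par", "se", "Int"]; // "parseInt", split to defeat grep
const fnName = parts.join("");

// g["parseInt"] is looked up dynamically. An LLM (or human) has to
// evaluate the string concatenation to know what is being called.
const result = g[fnName]("42", 10);
console.log(result); // 42
```

When the strings themselves are further encoded or fetched at runtime, static analysis alone cannot recover the target, which is why the LLM's output was not enough on its own in this case.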

By @api - 8 months
Anyone working on decompiler LLMs? Seems like we could render all code open source.

Training data would be easy to make in this case. Build tons of free GitHub code with various compilers and train on inverting compilation. This is a case where synthetic training data is appropriate and quite easy to generate.

You could train the decompiler to just invert compilation and then use existing larger code LLMs to do things like add comments.

By @creesch - 8 months
This is very close to how I often use LLMs [0]. A first step in deciphering code where I would otherwise need to, to use the author's words, power through reading the code myself.

It has been incredibly liberating to just feed it a spaghetti mess, ask to detangle it in a more readable way and go from there.

As the author also discovered, LLMs will sometimes miss some details, but that is alright as I will be catching those myself.

Another use case is when I understand what the code does, but can't quite wrap my head around why it is done in that specific way. Specifically, where the author of the code is no longer with the company. I will then simply put the method in the LLM chat, explain what it does, and just ask it why some things might be done in a specific way.

Again, it isn't always perfect, but more often than not it comes with explanations that actually make sense, hold up under scrutiny and give me new insights. It actually has prevented me once or twice from refactoring something in a way that would have caught me headaches down the line.

[0] ChatGPT and, more recently, openwebUI as a front end to various other models (Claude variants mostly) to see the differences. This also allows for some fun concepts, like having different models review each other's answers.

By @eqvinox - 8 months
Okay, but if the unminified code doesn't match the minified code (as noted at the end "it looks like LLM response overlooked a few implementation details"), that massively diminishes its usefulness — especially since in a lot of cases you can't trivially run the code and look for differences like the article does.

[ed.: looks like this was an encoding problem, cf. thread below. I'm still a little concerned about correctness though.]

By @fasteddie31003 - 8 months
I recognized this a few months back when I wanted to see the algorithm a website used to do a calculation. I just put the minified JS in ChatGPT and figured it out pretty easily. Let's take this a few steps further. What happens when an LLM can clone a whole SaaS app? Let's say I wanted to clone HubSpot. If an LLM can interact with a browser, figure out how a UI works, and take code hints from unminified code, I think we could see all SaaS apps be commoditized. The backend would be proprietary, but it could figure out API formats and suggest a backend architecture.

All this makes me think AI's are going to be a strong deflationary force in the future.

By @interstice - 8 months
Have used Claude to reverse engineer some minified shopify javascript code recently. Definitely handy for unpicking things.
By @nutanc - 8 months
Had tweeted about this some time back. Found a component which was open source earlier and was then removed, with only minified JS provided. Give the JS to Claude and get the original component back. It even gave the component good class names and function names.

Actually, this opens up a bigger question. What if I like an open source project but don't like its license? I could just prompt an AI by giving it the open source code and asking it to rewrite it, or write it in some other language. I'd have to look up the rules on whether this is allowed or would be considered copying, and how would a judge prove it?

By @ziptron - 8 months
>I apologize, GPT-4, for mistakenly accusing you of making mistakes.

I am testing large language models against a ground truth data set we created internally. Quite often when there is a mismatch, I realize the ground truth dataset is wrong, and I feel exactly like the author did.

By @smusamashah - 8 months
LLMs are trained to predict the next token. But examples like these suggest they have also learned patterns. If rot13 is applied to this minified code, will an LLM still find meaning in it? If it still could, it's doing more than just predicting next tokens. Need to try it.

edit: ChatGPT figured out that it's rot13 and couldn't explain the code directly without deobfuscating it first.
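For anyone who wants to repeat this experiment, rot13 over a source string is a few lines:

```javascript
// rot13: rotate ASCII letters by 13 places, leaving other chars alone.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // char code of 'A' or 'a'
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

const minified = "const f=(a,b)=>a+b;";
const encoded = rot13(minified);
console.log(encoded);        // "pbafg s=(n,o)=>n+o;"
console.log(rot13(encoded)); // rot13 is its own inverse
```

Since the rotation preserves punctuation and structure, the encoded text still looks code-shaped, which may be what tips the model off.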

By @redbell - 8 months
That's an interesting finding so far!

> The provided code is quite complex, but I'll break it down into a more understandable format, explaining its different parts and their functionalities.

Reading the above statement generated by ChatGPT, I asked myself: will we live to see the day when these LLMs can take a large binary executable as input, read it, analyze it, understand it, then reply with the above statement?

> I followed up asking to "implement equivalent code in TypeScript and make it human readable" and got the following response. To my surprise, the response is not only good enough, but it is also very readable.

What if that day comes and we can ask these LLMs to rewrite the binary code in [almost] any programming language we want? This would be exciting, yet scary, to just think about!

By @blueyes - 8 months
Unminification (obfuscation removal) can also be applied to text. Most specialties develop a jargon that allows insiders to communicate complex ideas quickly; that shorthand excludes outsiders. Large language models can make specialist jargon transparent and thereby expand the circle of people whose understanding applies to specialized fields. Essentially, they solve the problem of mapping specialized, jargonized concepts to things the outside reader already knows. Anyone who wants to learn needs this, and I hope it will become part of students' learning paths.
By @gnutrino - 8 months
The site the post mentions for the original code (https://reactive.network/hackathon) is an accessibility nightmare.
By @Mc91 - 8 months
It is good at unminifying and "minifying" as well.

I have been doing the Leetcode thing recently, and even became a subscriber to Leetcode.

What I have been doing is I go through the Grind 75 list (Blind 75 successor list), look for the best big O time and space editorial answer, which often has a Java example, and then go to ChatGPT (I subscribe) or Perplexity (don't subscribe to Pro - yet) and say "convert this to Kotlin", which is the language I know best. Jetbrains IDE or Android Studio is capable of doing this, but Perplexity and ChatGPT are usually capable of doing this as well.

Then I say "make this code more compact". Usually I give it some constraints too - keep the big O space and time complexity the same or lower it, keep the function signature of the assigned function the same, and keep the return explicit, make sure no Kotlin non-null assertions crop up. Sometimes I continually have it run these instructions on each version of the iterated code.

I usually test that the code compiles and returns the correct answers for examples after each iteration of compacting. I also copy answers from one to the other - Perplexity to ChatGPT and then back to Perplexity. The code does not always compile, or give the right answers for the examples. Sometimes I overcompact it - what is clear in four lines becomes too confusing in three compacted lines. I'm not looking for the most compact answer, but a clear answer that is as compact as possible.

One question asked about Strings and then later said, what if this was Unicode? So now for String manipulation questions I say assume the String is Unicode, and then at the end say show the answer for ASCII or Unicode. Sometimes the big O time is tricky: it is time O(m+n), say, but since m is always equal to or less than n in the program, it is actually O(n), and both Perplexity and ChatGPT can miss that until it is explained.

People bemoan Leetcode as a waste of time, but I am wasting even less time with it, as ChatGPT and Perplexity are helping give me the code I will be demonstrating in interviews. The common advice I have heard from everywhere is don't waste time trying to figure out the answers myself - just look at the given answers, learn them, and then look for patterns (like binary search problems, which are usually similar), so that is what I am doing.

Initially I was a ChatGPT and Perplexity skeptic for early versions of those sites, in terms of programming, as they stumbled more, but these self-contained examples and procedures they seem well-suited for. Not that they don't hallucinate or give programs that don't compile, or give the wrong answers sometimes, but it saves me time ultimately.

By @Tistel - 8 months
This might be fun:

Train on Java compiled to class files. Then go from class files back to Java.

Or even:

Train on Java compiled to class files, and have separate models that train from Clojure to class files and Scala to class files. Then see if you can find some crufty (but important) old Java project and go: crufty Java -> class -> Clojure (or Scala).

If you could do the same with source -> machine instructions, maybe COBOL to C++! Or whatever.

By @nwoli - 8 months
Hopefully it can help do this on emscripten output too, and help adblockers decipher code obfuscated for that purpose.
By @ervinxie - 8 months
LLMs are very good at text reading. LLMs read tokenized text, while humans use their eyes to view words. Another scenario: ChatGPT is good at analyzing C++ template error messages, which are usually long and hard for humans to understand.
By @xanderlewis - 8 months
Is there any reason why it’s ‘OpenAI’ in the title rather than ‘ChatGPT’?
By @BeefWellington - 8 months
You can do this on minified code with beautifiers like js-beautify, for example. It's not clear why we need to make this an LLM task when we have existing simple scripts to do it.
By @rpigab - 8 months
I can see some ways to use this and easily check that the LLM is not hallucinating parts of it, because you can ask the LLM to unminify (or deobfuscate) some component, then request unit tests to be written by the LLM, then humanly check that the unit tests are meaningful and that they don't miss things on the unminified code, then run the tests on the original minified version to confirm the LLM's work, maybe set up some mutation testing if it is relevant.
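The cross-checking loop described above can be sketched as a differential test; both functions here are hypothetical stand-ins for the minified original and the LLM's rewrite:

```javascript
// Differential check: the same test cases must pass against both the
// original minified function and the LLM's unminified rewrite.
// Both implementations below are illustrative stand-ins.
const minified = (a) => a.reduce((x, y) => x + y, 0);

function unminified(numbers) {
  let total = 0;
  for (const n of numbers) total += n;
  return total;
}

const testCases = [[], [1], [1, 2, 3], [-5, 5]];
const agrees = testCases.every(
  (input) => minified(input) === unminified(input)
);
console.log(agrees ? "versions agree on all cases" : "mismatch found");
```

Tests written against the readable version and then executed against the minified original give at least sampled evidence of semantic fidelity, even though they cannot prove it in general.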
By @amelius - 8 months
This post basically says that I don't need to document my code anymore. No more comments, they can be generated automatically. Hurray!
By @spacecadet - 8 months
I use LLMs to assist with reverse engineering all the time right now. From minified JS to binaries, alongside Ghidra, it's very helpful.
By @shubhamjain - 8 months
An interesting use-case of this capability is refactoring, which, for me, ChatGPT has been unmistakably good at. It's amazing how I can throw garbage code I wrote at ChatGPT, ask it to refactor, and get clean code that I can use without worrying if it's going to work or not, because in 99% of cases it works without breaking anything.
By @_rwo - 8 months
LLMs are great for self-contained boring tasks; recently I have started to refactor Ruby tests with a simple prompt (getting rid of various rspec syntax in favor of more explicit notation at the cost of code duplication, so kinda like unminifying things, I guess). It works _ridiculously_ well too.
By @1vuio0pswjnm7 - 8 months
Slightly off-topic, but I remain perplexed at how "minified" JavaScript is acceptable to software developers commenting online, while terse code in any other language, e.g. one-letter variable names, is unacceptable to a majority of the same online commenters.
By @emporas - 8 months
There is also topiary. From their website: "The universal code formatter". I think it doesn't work with JavaScript source at the moment, but it surely will in the future.

[1] https://topiary.tweag.io/

By @fergie - 8 months
* takes out soap box and stands on it *

We should go back to uncompiled JavaScript code, our democracy depends on it.

By @bredren - 8 months
Would have been cool if this had been used in that air con reverse engineering story yesterday.

I noticed while reading the blog entry that the author described using a search engine multiple times and thought, "I would have asked ChatGPT first for that."

By @joshdavham - 8 months
Are there any serious security implications for this? Of course obfuscation through minification won't work anymore, but I'm not sure if that's really all that serious of an issue at the end of the day.
By @andrewmcwatters - 8 months
I’ve tried using LLMs to deobfuscate libraries like fingerprintjs-pro to understand which specific heuristics and implementation details they use to detect bots.

They mostly fail. A human reverse engineer will still do better.

By @cedws - 8 months
I’m hoping LLMs get better at decompiling/RE’ing assembly because it’s a very laborious process. Currently I don’t think they have enough in their training sets to be very good at it.
By @foxhop - 8 months
Here's a hint, STOP MINIFYING CODE! gzip over transport is enough.
By @VMG - 8 months
it is also pretty good at decompiling - try feeding it the output of https://godbolt.org/
By @tanepiper - 8 months
I find LLMs good at these kinds of tasks, and also at converting between CSV and JSON, for example (although you have to remind them not to be lazy and to do the whole file).
By @l5870uoo9y - 8 months
It is also shockingly good at converting/extracting data to CSV or JSON, but not JSONL. Even the less capable model, `gpt-4o-mini`, can "reliably" parse database schemas in various formats into CSV with the structure:

```csv
table_name,column_name,data_type
table_name,column_name1,data_type
table_name,column_name2,data_type
...
```

I have been running it in production for months[1] as a way to import and optimize database schemas for AI consumption. This performs much better than including the `schema.sql` file in the prompt.

[1]: https://www.sqlai.ai/app/datasources/add/database-schema/ai-...
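A schema flattened into that three-column CSV shape can be turned back into a structured object with a naive parser (a sketch that assumes no quoted commas in identifiers):

```javascript
// Parse table_name,column_name,data_type rows into a schema object.
// Naive split on commas — real inputs would need CSV quoting rules.
function parseSchemaCsv(csv) {
  const schema = {};
  for (const line of csv.trim().split("\n").slice(1)) { // skip header row
    const [table, column, type] = line.split(",");
    (schema[table] ??= []).push({ column, type });
  }
  return schema;
}

const csv = `table_name,column_name,data_type
users,id,integer
users,email,text
orders,id,integer`;

console.log(parseSchemaCsv(csv).users.length); // 2
```

The flat row-per-column format is easy for an LLM to emit token by token, which plausibly explains why it parses more reliably than a full `schema.sql` dump.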

By @camillomiller - 8 months
Most expensive unminify software in history
By @darepublic - 8 months
This is cool, but I would have the worry that it got something wrong, which is a general LLM problem.
By @prologist11 - 8 months
I have to ask the obvious question: how do you know the unminified code is semantically equivalent to the minified code? If someone knows how to verify LLM code transformations for semantic fidelity then I'd like to know because I think that would qualify as a major breakthrough for programming languages and semantics.
By @antonoo - 8 months
Is this code available in ChatGPT's training data?

Tried hard, couldn't find any similar code.

By @gagabity - 8 months
Wonder how it does against other obfuscation and general decompiled code fixing.
By @runiq - 8 months
Please let this one have knock-on effects on reverse engineering.
By @Tepix - 8 months
punkpeye could also have asked the LLM to replace the cryptic function and variable names with nice ones. I'm hopeful it would have done a good job.
By @nashashmi - 8 months
Looks like the end is here for security via obscurity.
By @lostdev - 8 months
Why not just use a beautifier?
By @bravetraveler - 8 months
LLMs are good at modeling and transforming text, news at 11. AI proponent hypes AI. I could go on, but I shouldn't have been this sarcastic to start with
By @Julesman - 8 months
"Usually, I would just power through reading the minimized code..."

Huh? Is this a thing? There are endless online code formatting sites. It takes two seconds. Why would anyone ever do this? I don't get it.

By @nprateem - 8 months
And shockingly shit at writing articles that don't sound like essays.
By @MetaverseClub - 8 months
wow, is openAI such a great magic to you?
By @nnurmanov - 8 months
Yet another surprising side effect of LLMs.
By @samstave - 8 months
Dont know if this will apply directly here, but --

As someone who is "not a developer", I use the following process to help me:

1. I setup StyleGuide rules for the AI, telling it how to write out my files/scripts:

- Always provide full path, description of function, invocation examples, and version number.

- Frequently have it summarize and explain the project, project logic, and a particular file's functions.

- Have it create a README.MD for the file/project

- Tell it to give me mermaid diagrams and swim diagrams for the logic/code/project/process

- Close prompts with "Review, Explain, Propose, Confirm, Execute" <-- This has it review the code/problem/prompt, explain what it understands, propose what it's been asked to provide, confirm that it's correct (or I add more detail here), then execute and create the artifacts.

I do this because Claude and ChatGPT are FN malevolent in their ignoring of project files/context, and they hallucinate as soon as their context window/memory fills up.

Further, they very frequently "forget" to refer to the uploaded project context files and the artifacts they themselves have proposed and written, etc.

But asking for a README with mermaid diagrams for the code and logic is helpful to keep me on track.
