Show HN: DOM to Semantic Markdown – For LLMs
The GitHub repository hosts "DOM to Semantic Markdown," converting HTML to Semantic Markdown for Large Language Models. It features AST conversion, main content detection, metadata capture, URL shortening, and npm support.
The GitHub repository contains a tool called "DOM to Semantic Markdown" that converts HTML DOM to Semantic Markdown for Large Language Models (LLMs). It offers HTML-to-Markdown AST conversion, main content detection for identifying the primary webpage content, metadata capture, URL refification for shortening long URLs, and customizable options, and it supports both browser and Node.js environments. The Abstract Syntax Tree (AST) allows content to be manipulated before serialization. The tool can be installed via npm and exposes functions such as `convertHtmlToMarkdown` and `convertElementToMarkdown`. Contributions are welcome, and the tool is released under the MIT License; for more details or help, users can inquire within the repository.
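For orientation, a minimal usage sketch based on the function names mentioned above. The package name mirrors the repository name, and the option flags are assumptions, not confirmed API:

```typescript
// A minimal sketch -- the option flags below are hypothetical; check the
// repository's README for the real option names.
import { convertHtmlToMarkdown } from 'dom-to-semantic-markdown';

const html =
  '<article><h1>Hello</h1><p>See the <a href="https://example.com/a/very/long/path">docs</a>.</p></article>';

const markdown = convertHtmlToMarkdown(html, {
  extractMainContent: true, // hypothetical flag: main content detection
  refifyUrls: true,         // hypothetical flag: URL refification
});

console.log(markdown);
```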
A trick I've found that seems to work well is leaving some kind of id or coordinate marker on each column and adding it to each cell. You could probably do that while still emitting valid markdown by putting the metadata in HTML comments, though it's hard to say how well an LLM understands that format.
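As a concrete (hypothetical) illustration of that idea, each cell could carry its column id in an HTML comment, which most Markdown renderers ignore:

```markdown
| Name <!-- col:name --> | Price <!-- col:price --> |
| --- | --- |
| Widget <!-- col:name --> | $4.99 <!-- col:price --> |
| Gadget <!-- col:name --> | $7.50 <!-- col:price --> |
```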
Are there any data or benchmarks available showing what kinds of text content LLMs understand best? Is it generally accepted at this point that they "understand" markdown better than HTML?
I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many variants; see https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we got the best results by keeping HTML tables as HTML and only converting the non-table parts of the HTML doc to markdown. We also strip the table tags of all attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious whether there are LLM performance differences between the author's approach here (which seems to repeat column names for each cell?) and simply preserving the original HTML table structure.
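A sketch of the attribute-stripping step described above (not the commenter's actual code), assuming a DOM `Document` from a browser or jsdom:

```typescript
// Attributes that carry table semantics and should survive stripping.
const KEEP = new Set(['colspan', 'rowspan']);

function stripTableAttributes(doc: Document): void {
  // Walk every element of the table markup and drop non-semantic attributes.
  for (const el of doc.querySelectorAll('table, thead, tbody, tr, th, td')) {
    for (const attr of Array.from(el.attributes)) {
      if (!KEEP.has(attr.name.toLowerCase())) {
        el.removeAttribute(attr.name);
      }
    }
  }
}
```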
One request: it would be great if you also had an option for exposing the page's schema markup (the structured data embedded in the HTML).
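A hypothetical helper for such an option, assuming the schema is embedded as JSON-LD (the common case for schema.org data); this is not part of the library:

```typescript
// Collect every JSON-LD block from the document. Returns the parsed objects.
function extractJsonLd(doc: Document): unknown[] {
  const blocks: unknown[] = [];
  for (const script of doc.querySelectorAll('script[type="application/ld+json"]')) {
    try {
      blocks.push(JSON.parse(script.textContent ?? ''));
    } catch {
      // Ignore malformed JSON-LD rather than failing the whole conversion.
    }
  }
  return blocks;
}
```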
The quick solution was to use Beautiful Soup and readability-lxml to try to extract the main article contents and then send them to an LLM.
The results are OK when the markup is semantic. Often it is not; then you run into tables, images, weirdly positioned footnotes, etc.
I believe the best way to extract information the way it was intended to be presented is to screenshot the page and send it to a multimodal LLM for "interpretation". Has anyone experimented with that approach?
——
The aspirational goal for the tool is to be the Presidential Daily Brief, but for everyone.
Also, when trying to run the Node example from your README, I had to use `new dom.window.DOMParser()` instead of `dom.window.DOMParser`.
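A reconstruction of that Node setup with the fix applied. `DOMParser` is a constructor on the jsdom window, so it has to be instantiated with `new`; the option name passed to the converter is an assumption:

```typescript
import { JSDOM } from 'jsdom';
import { convertHtmlToMarkdown } from 'dom-to-semantic-markdown';

const dom = new JSDOM('<!DOCTYPE html>');
const parser = new dom.window.DOMParser(); // not `dom.window.DOMParser`

const markdown = convertHtmlToMarkdown('<h1>Hello</h1>', {
  overrideDOMParser: parser, // assumed option name
});

console.log(markdown);
```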
A few thoughts:
1. URL Refification[sic] would only save tokens if a link is referred to many times, right? Otherwise it seems best to keep locality of reference. Though to the degree that URLs are opaque to the LLM, I suppose they could be turned into references without any destination in the source at all, and if the LLM refers to a ref link you just look up the real link in the mapping.
2. Several of the suggestions here could be alternate serializations of the AST, but it's not clear to me how abstract the AST is (especially since it's labelled htmlToMarkdownAST). And now that I look at the source, it's somewhat abstract but not entirely: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... – when writing code like this, I find that keeping the AST fairly abstract also helps with the implementation. (That said, you'll probably still be making something Markdown-ish, because you'll be preserving only the data Markdown is able to represent.)
3. With a more formal AST you could replace the big switch in https://github.com/romansky/dom-to-semantic-markdown/blob/ma... with a class that can be subclassed to override how particular nodes are serialized (see the sketch after this list).
4. But I can also imagine something where there's a node type like "markdown-literal" and to change the serialization someone could, say, go through and find all the type:"table" nodes and translate them into type:"markdown-literal" and then serialize the result.
5. A more advanced parser might also turn things like headers into sections, introducing more of a tree of nodes (I think the AST is currently flat?). I think it's likely that an LLM would follow `<header-name-slug>...</header-name-slug>` better than `# Header Name\n ....` (at least sometimes, as an option).
6. Even fancier: run it with a full renderer (not sure what the options are these days) and start using getComputedStyle() and heuristics based on bounding boxes and the like to infer even more structure.
7. Another use case that could be useful is to be able to "name" pieces of the document so the LLM can refer to them. The result doesn't have to be valid Markdown, really, just a unique identifier put in the right position. (In a sense this is what URL reification can do, but only for URLs?)
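Building on points 3 and 4 above, a sketch of what a subclassable serializer could look like, using a simplified, hypothetical AST shape rather than the repository's actual node types:

```typescript
// Hypothetical AST: a flat list of discriminated-union nodes.
type MdNode =
  | { type: 'heading'; level: number; text: string }
  | { type: 'paragraph'; text: string }
  | { type: 'table'; header: string[]; rows: string[][] }
  | { type: 'markdown-literal'; markdown: string };

class MarkdownSerializer {
  serialize(nodes: MdNode[]): string {
    return nodes.map((n) => this.serializeNode(n)).join('\n\n');
  }

  // One method per node type, dispatched here; subclasses override this.
  protected serializeNode(node: MdNode): string {
    switch (node.type) {
      case 'heading':
        return `${'#'.repeat(node.level)} ${node.text}`;
      case 'paragraph':
        return node.text;
      case 'table': {
        // Default: a plain pipe table.
        const head = `| ${node.header.join(' | ')} |`;
        const rule = `| ${node.header.map(() => '---').join(' | ')} |`;
        const body = node.rows.map((r) => `| ${r.join(' | ')} |`).join('\n');
        return [head, rule, body].join('\n');
      }
      case 'markdown-literal':
        return node.markdown;
    }
  }
}

// Point 3: subclass and override just the table case, e.g. to keep tables
// as HTML (as another commenter suggested) while everything else stays Markdown.
class HtmlTableSerializer extends MarkdownSerializer {
  protected override serializeNode(node: MdNode): string {
    if (node.type !== 'table') return super.serializeNode(node);
    const head = `<tr>${node.header.map((h) => `<th>${h}</th>`).join('')}</tr>`;
    const rows = node.rows
      .map((r) => `<tr>${r.map((c) => `<td>${c}</td>`).join('')}</tr>`)
      .join('');
    return `<table>${head}${rows}</table>`;
  }
}
```

Point 4 would instead be a preprocessing pass: walk the AST, replace each `table` node with a `markdown-literal` node containing whatever serialization you want, and let the base serializer handle the rest.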
Related
AI-powered conversion from Enzyme to React Testing Library
Slack engineers transitioned from Enzyme to React Testing Library due to React 18 compatibility issues. They used AST transformations and LLMs for automated conversion, achieving an 80% success rate.
The Eternal Truth of Markdown
Markdown, a lightweight markup alternative to HTML, enables diverse document formats from plain text. Despite lacking standardization, it thrives on its adaptability and simplicity, appealing to writers and programmers alike.
Igneous Linearizer: semi-structured source code
The Igneous Linearizer project enhances source code in Obsidian Markdown format, enabling features like links and transclusion. It sacrifices AST correctness for compatibility with text editors and Git, benefiting literate programming. Users must follow specific input file formats for optimal use.
Show HN: a Rust lib to trigger actions based on your screen activity (with LLMs)
The GitHub project "Screen Pipe" uses Large Language Models to convert screen content into actions. It is implemented in Rust + WASM and inspired by `adept.ai`, `rewind.ai`, and Apple Shortcuts. Open source under the MIT license.
Converting Codebases with LLMs
Mantle discusses using Large Language Models (LLMs) to convert codebases, emphasizing benefits like improved maintainability and performance. They highlight strategies for automating code translation and optimizing the process.