July 23rd, 2024

Show HN: DOM to Semantic Markdown – For LLMs

The GitHub repository hosts "DOM to Semantic Markdown," converting HTML to Semantic Markdown for Large Language Models. It features AST conversion, main content detection, metadata capture, URL shortening, and npm support.

The GitHub repository contains a tool called "DOM to Semantic Markdown" that converts the HTML DOM to Semantic Markdown for use with Large Language Models (LLMs). It converts HTML to a Semantic Markdown Abstract Syntax Tree (AST) that can be used for content manipulation, detects a page's primary content (main content detection), captures metadata, and applies URL refification to shorten long URLs. It offers customizable options and runs in both browser and Node.js environments. The tool can be installed via npm and includes functions such as `convertHtmlToMarkdown` and `convertElementToMarkdown`. Contributions are welcome, and the tool is released under the MIT License. For more details or help, users can ask in the repository.
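
As a rough illustration of the URL refification idea mentioned above, the sketch below swaps each inline link target for a short label and keeps a lookup table so the original URL can be restored later. The names `refifyUrls` and `resolveRef` and the `ref1`, `ref2` label format are assumptions made for this example; they are not the library's actual API or output format.

```typescript
// Hypothetical sketch of URL refification; names and label format are
// illustrative assumptions, not taken from dom-to-semantic-markdown.
interface RefifiedMarkdown {
  markdown: string;                    // Markdown with shortened link targets
  refToUrl: { [ref: string]: string }; // label -> original URL
}

function refifyUrls(markdown: string): RefifiedMarkdown {
  const refToUrl: { [ref: string]: string } = {};
  const urlToRef: { [url: string]: string } = {};
  let counter = 0;

  // Matches the "](https://...)" tail of an inline Markdown link.
  // URLs that themselves contain ")" are not handled by this simple pattern.
  const linkTarget = /\]\((https?:\/\/[^\s)]+)\)/g;
  const rewritten = markdown.replace(linkTarget, (_match: string, url: string): string => {
    if (urlToRef[url] === undefined) {
      counter += 1;
      urlToRef[url] = "ref" + counter;
      refToUrl[urlToRef[url]] = url;
    }
    return "](" + urlToRef[url] + ")";
  });

  return { markdown: rewritten, refToUrl: refToUrl };
}

// Map a label that the LLM echoed back to the real URL, if it is known.
function resolveRef(refToUrl: { [ref: string]: string }, ref: string): string | undefined {
  return refToUrl[ref];
}

const refified = refifyUrls(
  "See [docs](https://example.com/a/very/long/path?x=1) and " +
  "[docs again](https://example.com/a/very/long/path?x=1), or [home](https://example.com/).",
);
console.log(refified.markdown);
console.log(resolveRef(refified.refToUrl, "ref2"));
```

In this sketch a repeated URL reuses its label, so the token savings grow with how often a long link recurs in the page.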

19 comments
By @mistercow - 3 months
This is cool. When dealing with tables, you might want to explore departing from markdown. I’ve found that LLMs tend to struggle with tables that have large numbers of columns containing similar data types. Correlating a row is easy enough, because the data is all together, but connecting a cell back to its column becomes a counting task, which appears to be pretty rough.

A trick I’ve found that seems to work well is leaving some kind of ID or coordinate marker on each column and adding that marker to each cell. You could probably do that while still having valid markdown if you put the metadata in HTML comments, although it’s hard to say how well an LLM will understand that format.
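
A minimal sketch of that trick, assuming a table is available as plain header and row arrays: every cell is prefixed with an HTML comment naming its column, so the output stays valid markdown. The helper name `renderTableWithColumnMarkers` and the `c1`, `c2` marker format are made up for this example, and whether an LLM actually benefits is untested here.

```typescript
// Hypothetical helper: render rows as a Markdown table whose cells each
// carry a column marker inside an HTML comment, e.g. "<!-- c2 --> 10".
function renderTableWithColumnMarkers(headers: string[], rows: string[][]): string {
  const mark = (col: number, text: string): string => "<!-- c" + (col + 1) + " --> " + text;
  const line = (cells: string[]): string => "| " + cells.join(" | ") + " |";

  const headerLine = line(headers.map((h, i) => mark(i, h)));
  const separator = line(headers.map(() => "---"));
  const body = rows.map((row) => line(row.map((cell, i) => mark(i, cell))));

  return [headerLine, separator].concat(body).join("\n");
}

const markedTable = renderTableWithColumnMarkers(
  ["Region", "Q1", "Q2"],
  [["North", "10", "12"], ["South", "7", "9"]],
);
console.log(markedTable);
```

Cells containing a literal `|` would need escaping, which this sketch leaves out.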

By @gmaster1440 - 3 months
> Semantic Clarity: Converts web content to a format more easily "understandable" for LLMs, enhancing their processing and reasoning capabilities.

Are there any data or benchmarks available that show what kind of text content LLMs understand best? Is it generally understood at this point that they "understand" markdown better than html?

By @DeveloperErrata - 3 months
It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over big collections of HTML docs. In our case the primary motivator was higher token efficiency (i.e., representing the same semantic content with a smaller token count).

I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML and converting only the non-table parts of the HTML doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious whether there are LLM performance differences between the approach the author uses here (seems like it's based on repeating column names for each cell?) and just preserving the original HTML table structure.
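
The attribute stripping described above could look roughly like the following. This is a regex-based sketch for well-formed markup only, not the commenter's production code; the function name `stripTableAttributes` is hypothetical, and a real pipeline would more likely walk a parsed DOM.

```typescript
// Attributes that change how a table is read; everything else is dropped.
const KEPT_ATTRIBUTES = ["colspan", "rowspan"];

// Regex-based sketch: rewrites opening table-related tags, keeping only
// colspan/rowspan. Assumes attribute values do not contain ">".
function stripTableAttributes(html: string): string {
  const openTag = /<(table|thead|tbody|tfoot|tr|th|td)((?:\s[^>]*)?)>/gi;
  return html.replace(openTag, (_match: string, tagName: string, rawAttrs: string): string => {
    const attribute = /([^\s"'=<>\/]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)/g;
    const kept: string[] = [];
    let found = attribute.exec(rawAttrs);
    while (found !== null) {
      const attrName = found[1].toLowerCase();
      if (KEPT_ATTRIBUTES.indexOf(attrName) !== -1) {
        kept.push(attrName + "=" + found[2]);
      }
      found = attribute.exec(rawAttrs);
    }
    const tag = tagName.toLowerCase();
    return kept.length > 0 ? "<" + tag + " " + kept.join(" ") + ">" : "<" + tag + ">";
  });
}

const cleanedTable = stripTableAttributes(
  '<table class="grid" id="t1"><tr><th colspan="2" style="color:red">Total</th></tr>' +
  '<tr><td data-x="1">a</td><td rowspan=2>b</td></tr></table>',
);
console.log(cleanedTable);
```

Closing tags and non-table elements pass through untouched, so the function can be applied to a whole document before the non-table parts are converted to markdown.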

By @richardreeze - 3 months
This is really cool. I've already implemented it in one of my tools (I found it to work better than the Turndown/Readability combination I was previously using).

One request: It would be great if you also had an option for showing the page's schema (which is contained inside the HTML).

By @la_fayette - 3 months
The scoring approach to extracting the main content of web pages seems interesting. I am aware of the large body of research on that subject, going back decades, with sophisticated image- or NLP-based approaches. Since this extraction is critical to the quality of the LLM response, it would be good to know how well this performs. E.g., you could test it against a test dataset (https://github.com/scrapinghub/article-extraction-benchmark). Also, you could provide the option to plug in another extraction algorithm, since there are other implementations available... just some ideas for improvement...
By @gradientDissent - 3 months
Nice work. Main content extraction based on the <main> tag won’t work with most web pages these days. Arc90's Readability algorithm could help.
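
For readers unfamiliar with score-based extraction, here is a minimal sketch of the general idea: reward blocks with a lot of comma-rich prose and discount blocks whose text is mostly links. It is only loosely inspired by Readability-style heuristics; the `CandidateBlock` shape, the weights, and the function names are assumptions for this example and do not reproduce Readability or this library's scoring.

```typescript
// Hypothetical, simplified view of a candidate container element.
interface CandidateBlock {
  selector: string; // where the block came from, e.g. "div.post"
  text: string;     // all visible text inside the block
  linkText: string; // the part of that text that sits inside <a> elements
}

// Long, comma-rich prose scores high; link-heavy navigation is discounted.
function scoreBlock(block: CandidateBlock): number {
  const textLength = block.text.length;
  if (textLength === 0) {
    return 0;
  }
  const linkDensity = block.linkText.length / textLength;
  const commas = block.text.split(",").length - 1;
  const lengthBonus = Math.min(Math.floor(textLength / 100), 3);
  return (1 + commas + lengthBonus) * (1 - linkDensity);
}

// Return the highest-scoring block, or undefined if nothing scores above 0.
function pickMainContent(blocks: CandidateBlock[]): CandidateBlock | undefined {
  let best: CandidateBlock | undefined;
  let bestScore = 0;
  for (const block of blocks) {
    const score = scoreBlock(block);
    if (score > bestScore) {
      best = block;
      bestScore = score;
    }
  }
  return best;
}

const candidates: CandidateBlock[] = [
  { selector: "nav", text: "Home About Contact", linkText: "Home About Contact" },
  { selector: "div.post", text: "Markdown is compact, readable, and cheap to tokenize.", linkText: "" },
  { selector: "footer", text: "Privacy, Terms", linkText: "Privacy, Terms" },
];
const picked = pickMainContent(candidates);
console.log(picked ? picked.selector : "no main content found");
```

Because the heuristic never looks at tag names, it also applies to pages where everything is a `div`.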
By @kartoolOz - 3 months
WebArena does this really well with what it calls the "accessibility_tree": https://github.com/web-arena-x/webarena/blob/main/browser_en...
By @nvartolomei - 3 months
While I was writing a tool for myself to summarise daily the top N posts from HN, Google Trends, and RSS feed subscriptions I had the same problem.

The quick solution was to use Beautiful Soup and readability-lxml to try to get the main article contents and then send them to an LLM.

The results are ok when the markup is semantic. Often it is not. Then you have tables, images, weirdly positioned footnotes, etc.

I believe the best way to extract information the way it was intended to be presented is to screenshot the page and send it to a multimodal LLM for “interpretation”. Anyone experimented with that approach?

——

The aspirational goal for the tool is to be the Presidential Daily Brief, but for everyone.

By @KolenCh - 3 months
I am curious how it would compare to using pandoc with a readability algorithm, for example.
By @alexliu518 - 3 months
Converting web pages to Markdown is a common requirement. I have found that Turndown does a good job, but it cannot handle all dynamic web page content. As far as I know, processing dynamic web pages requires targeted adaptation, for example Google extensions such as Web2Markdown.
By @throwthrowuknow - 3 months
Thank you! I’m always looking for new options to use for archiving and ingesting web pages and this looks great! Even better that it’s an npm package!
By @nbbaier - 3 months
This is really cool! Any plans to add Deno support? This would be a great fit for environments like val.town[0], but they are based on a Deno runtime and I don't think this will work out of the box.

Also, when trying to run the node example from your readme, I had to use `new dom.window.DOMParser()` instead of `dom.window.DOMParser`

[0]: https://val.town

By @KolenCh - 3 months
Has anyone compared the performance of HTML input with other formats? I did an informal comparison, and from a few tests it seems HTML input is better. I thought markdown input would be more efficient too, but I’d like to see a more systematic comparison to see whether that is the case.
By @brightvegetable - 3 months
This is great, I was just in need of something like this. Thanks!
By @explosion-s - 3 months
How is this different from any other HTML to markdown library, like Showdown or Turndown? Are there any specific features that make it better for LLMs specifically, instead of just converting HTML to MD?
By @Layvier - 3 months
Nice, we have this exact use case for data extraction from scraped webpages. We've been using html-to-md; how does this compare to it?
By @Zetaphor - 3 months
A browser demo would be a nice addition to this readme
By @DevX101 - 3 months
Problem is, with modern websites, everything is a div and you can't necessarily infer semantic meaning from the DOM elements.
By @ianbicking - 3 months
This is a great idea! There's an exceedingly large amount of junk in a typical HTML page that an LLM can't use in any useful way.

A few thoughts:

1. URL Refification[sic] would only save tokens if a link is referred to many times, right? Otherwise it seems best to keep locality of reference. Though to the degree that URLs are opaque to the LLM, I suppose they could be turned into references without any destination in the source at all, and if the LLM refers to a ref link you just look up the real link in the mapping.

2. Several of the suggestions here could be alternate serializations of the AST, but it's not clear to me how abstract the AST is (especially since it's labelled as htmlToMarkdownAST). And now that I look at the source it's kind of abstract but not entirely: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... – when writing code like this I find that keeping the AST fairly abstract also helps with the implementation. (That said, you'll probably still be making something that is Markdown-ish because you'll be preserving only the data Markdown is able to represent.)

3. With a more formal AST you could replace the big switch in https://github.com/romansky/dom-to-semantic-markdown/blob/ma... with a class that can be subclassed to override how particular nodes are serialized.

4. But I can also imagine something where there's a node type like "markdown-literal" and to change the serialization someone could, say, go through and find all the type:"table" nodes and translate them into type:"markdown-literal" and then serialize the result.

5. A more advanced parsing might also turn things like headers into sections, and introduce more of a tree of nodes (I think the AST is flat currently?). I think it's likely that an LLM would follow `<header-name-slug>...</header-name-slug>` better than `# Header Name\n ....` (at least sometimes, as an option).

6. Even fancier would be running it with some full renderer (not sure what the options are these days) and starting to use getComputedStyle() and heuristics based on bounding boxes and stuff like that to infer even more structure.

7. Another use case that could be useful is to be able to "name" pieces of the document so the LLM can refer to them. The result doesn't have to be valid Markdown, really, just a unique identifier put in the right position. (In a sense this is what URL reification can do, but only for URLs?)
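
Points 3 and 4 above can be sketched roughly as follows. The node types and class names here (`AstNode`, `MarkdownSerializer`, `HtmlTableSerializer`, `tablesToLiterals`) are simplified, hypothetical stand-ins and not the library's actual AST or serializer; the intent is only to show the two extension styles side by side.

```typescript
// Hypothetical, simplified node types; not the library's actual AST.
type AstNode =
  | { type: "heading"; level: number; text: string }
  | { type: "paragraph"; text: string }
  | { type: "table"; headers: string[]; rows: string[][] }
  | { type: "markdown-literal"; value: string };

class MarkdownSerializer {
  serialize(nodes: AstNode[]): string {
    return nodes.map((node) => this.serializeNode(node)).join("\n\n");
  }

  // One method per node type, so subclasses can override just one of them.
  protected serializeNode(node: AstNode): string {
    switch (node.type) {
      case "heading": return this.heading(node.level, node.text);
      case "paragraph": return node.text;
      case "table": return this.table(node.headers, node.rows);
      case "markdown-literal": return node.value; // emitted verbatim
    }
  }

  protected heading(level: number, text: string): string {
    return new Array(level + 1).join("#") + " " + text;
  }

  protected table(headers: string[], rows: string[][]): string {
    const line = (cells: string[]): string => "| " + cells.join(" | ") + " |";
    const separator = line(headers.map(() => "---"));
    return [line(headers), separator].concat(rows.map((r) => line(r))).join("\n");
  }
}

// Point 3: subclass and override how one node type is serialized.
class HtmlTableSerializer extends MarkdownSerializer {
  protected table(headers: string[], rows: string[][]): string {
    const row = (tag: string, values: string[]): string =>
      "<tr>" + values.map((v) => "<" + tag + ">" + v + "</" + tag + ">").join("") + "</tr>";
    return "<table>" + row("th", headers) + rows.map((r) => row("td", r)).join("") + "</table>";
  }
}

// Point 4: rewrite table nodes into literal nodes, then serialize as usual.
function tablesToLiterals(
  nodes: AstNode[],
  render: (headers: string[], rows: string[][]) => string,
): AstNode[] {
  return nodes.map((node): AstNode =>
    node.type === "table"
      ? { type: "markdown-literal", value: render(node.headers, node.rows) }
      : node);
}

const sampleDoc: AstNode[] = [
  { type: "heading", level: 2, text: "Prices" },
  { type: "table", headers: ["Item", "Cost"], rows: [["Tea", "3"]] },
];
const asMarkdown = new MarkdownSerializer().serialize(sampleDoc);
const asHtmlTable = new HtmlTableSerializer().serialize(sampleDoc);
const viaLiterals = new MarkdownSerializer().serialize(
  tablesToLiterals(sampleDoc, (headers, rows) =>
    rows.map((r) => headers.map((h, i) => h + ": " + r[i]).join(", ")).join("\n")),
);
console.log(asMarkdown);
console.log(asHtmlTable);
console.log(viaLiterals);
```

Subclassing keeps the change inside the serializer, while the literal-node approach leaves the serializer alone and lets callers transform the tree first; which is preferable likely depends on how stable the AST shape is.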