August 13th, 2024

Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data

Trellis is an AI-powered ETL tool that converts unstructured data into structured SQL formats, addressing enterprise data management challenges, particularly in financial services, using advanced AI techniques for optimization.

Excitement, Curiosity, Optimism

Trellis, founded by Jacky and Mac, is an AI-powered ETL tool designed to convert unstructured data, such as phone calls, PDFs, and chats, into structured SQL formats based on user-defined schemas. This innovation aims to assist data and operations teams in automating manual data entry and executing SQL queries on disorganized data. The founders, who met at the Stanford AI lab, identified a significant challenge in enterprise data management: 80% of enterprise data is unstructured, which traditional platforms struggle to process. Trellis addresses this issue by utilizing advanced techniques, including LLM-based map-reduce for long documents and model routing to optimize transformation processes. The tool has seen applications in various sectors, particularly in financial services, where it helps streamline the processing of complex documents and enhances operational efficiency. Users can explore a demo and a showcase featuring an analysis of Enron emails, highlighting Trellis's capabilities. The founders invite feedback and offer integration options for interested users, emphasizing their commitment to improving workflows related to unstructured data.

- Trellis transforms unstructured data into structured SQL formats.

- The tool addresses the challenge of managing 80% of enterprise data that is unstructured.

- It employs advanced AI techniques for document processing and model optimization.

- Applications are particularly notable in financial services and customer support.

- Users can access demos and integration options to explore Trellis's capabilities.
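As a rough illustration of the "LLM-based map-reduce for long documents" idea mentioned in the summary, the sketch below splits a long document into chunks, extracts fields from each chunk (the map step), and merges the partial results into one row (the reduce step). It is not Trellis's implementation. All names are hypothetical, and `extract_fields` is a regex stand-in for an LLM call so that the example runs on its own.

```python
import re
from typing import Callable, Optional

Row = dict[str, Optional[str]]


def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Group paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole in this sketch.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def extract_fields(chunk: str) -> Row:
    """Hypothetical stand-in for an LLM extraction call (assumption: regex only)."""
    invoice = re.search(r"Invoice #(\w+)", chunk)
    total = re.search(r"Total: \$([\d.]+)", chunk)
    return {
        "invoice_id": invoice.group(1) if invoice else None,
        "total": total.group(1) if total else None,
    }


def reduce_rows(rows: list[Row]) -> Row:
    """Reduce step: keep the first non-null value seen for each field."""
    merged: Row = {}
    for row in rows:
        for key, value in row.items():
            if merged.get(key) is None:
                merged[key] = value
    return merged


def map_reduce_extract(
    text: str,
    max_chars: int = 200,
    extractor: Callable[[str], Row] = extract_fields,
) -> Row:
    """Map the extractor over chunks, then reduce the partial rows into one."""
    return reduce_rows([extractor(c) for c in split_into_chunks(text, max_chars)])


document = "Invoice #A17\n\n" + "filler " * 40 + "\n\nTotal: $99.50"
row = map_reduce_extract(document)
```

Keeping the first non-null value is only one possible reduce policy. A real pipeline might instead vote across chunks or ask a model to reconcile conflicts.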

Trellis (YC W24) is hiring engineer to build AI-powered ETL for unstructured data

Trellis, a startup backed by Y Combinator, General Catalyst, and investors from Google, Salesforce, and JP Morgan Chase, seeks a Founding Engineer. The role involves developing AI-powered data infrastructure and requires skills in Python, Go, ML/NLP, and cloud technologies. Founded in 2023, Trellis offers opportunities in cutting-edge AI and data projects.

Show HN: txtai: open-source, production-focused vector search and RAG

The txtai tool is a versatile embeddings database for semantic search, LLM orchestration, and language model workflows. It supports vector search with SQL, RAG, topic modeling, and more. Users can create embeddings for various data types and utilize language models for diverse tasks. Txtai is open-source and supports multiple programming languages.

txtai: Open-source vector search and RAG for minimalists

txtai is a versatile tool for semantic search, LLM orchestration, and language model workflows. It offers features like vector search with SQL, topic modeling, and multimodal indexing, supporting various tasks with language models. It is built with Python and is open source under the Apache 2.0 license.

Trellis (YC W24) is hiring engineer to build AI-powered ETL for unstructured data

Trellis, a startup backed by Y Combinator, seeks a Founding Engineer for backend and ML infrastructure. They aim to create an AI-powered Snowflake for unstructured data, offering opportunities in pioneering AI, data infrastructure, and database development.

Trellis (YC W24) is hiring eng to build AI workflows for unstructured data

Trellis, a Y Combinator-backed startup, seeks a founding engineer for its machine learning team, offering a salary of $110K-$225K and equity. Candidates need experience in full-stack development and relevant technologies.

AI: What people are saying

The comments on Trellis highlight various perspectives on the AI-powered ETL tool's potential and challenges in the market.

- Many users express excitement about the tool's ability to extract data from unstructured sources like PDFs, emphasizing its value in sectors like finance and healthcare.

- Several commenters share their own experiences with similar technologies, discussing challenges such as accuracy and the need for manual review.

- Concerns are raised about competition and the sustainability of Trellis's business model in a rapidly evolving market.

- Questions about the tool's accuracy, integration capabilities, and compliance with regulations like HIPAA are prevalent.

- Overall, there is a mix of enthusiasm for the innovation and skepticism regarding its practical implementation and market fit.

46 comments

By @john_horton - 9 months

Very cool - I've been working on an open source python package that lets you do some similar things (https://github.com/expectedparrot/edsl).

Here's an example of the Enron email demo using the edsl syntax/package & a few different LLMs: https://www.expectedparrot.com/content/6607caa1-efc5-439f-85...

By @makk - 9 months

> a major commercial bank I work with couldn’t improve credit risk models because critical data was stuck in PDFs and emails.

Great use case! Worked on exactly this a decade ago. It was Hard™ then. Could only make so much progress. Getting this right is a huge value unlock. Congrats!

By @bustodisgusto - 9 months

We built something tangentially related at SoundTrace.

Basically when we onboard a new client they dump all their audiograms on us as PDFs.

The data extraction needs to be perfect because the table values are used to detect hearing loss over time.

We settled on a pipeline that looks roughly like

PDF -> gpto pre filter phase -> OCR to extract text tables and forms -> things branch out here

We do a direct parse of forms and text through an LLM

Extract audiogram graphs and send them to a foundation convnet

Attempt to parse tables programmatically

-> an audiogram might have 3 separate places where the values are so we pass the results of all three of these routes through Claude sonnet and if they match they get auto approved. If they don’t, they get flagged for manual review.

All in all it’s been a journey but the accuracy is near 100 percent. These tools are incredible
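The final agreement step in the pipeline above can be sketched roughly as follows. The comment describes passing the three routes' results through Claude Sonnet. This sketch swaps in a plain deterministic comparison, so it only illustrates the auto-approve versus manual-review decision. The route names, field names, and the `tolerance` parameter are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewDecision:
    status: str                # "auto_approved" or "manual_review"
    values: dict[str, float]   # agreed values (empty when flagged)
    disagreements: list[str]   # fields on which the routes differ


def reconcile(
    routes: dict[str, dict[str, float]], tolerance: float = 0.0
) -> ReviewDecision:
    """Auto-approve only if every route reports every field with matching values."""
    fields = sorted({name for result in routes.values() for name in result})
    agreed: dict[str, float] = {}
    disagreements: list[str] = []
    for name in fields:
        readings = [result.get(name) for result in routes.values()]
        # A missing reading or a spread above the tolerance counts as a mismatch.
        if None in readings or max(readings) - min(readings) > tolerance:
            disagreements.append(name)
        else:
            agreed[name] = readings[0]
    if disagreements:
        return ReviewDecision("manual_review", {}, disagreements)
    return ReviewDecision("auto_approved", agreed, [])


# Hypothetical outputs of the three extraction routes (made-up values).
matching = {
    "llm_forms": {"left_1000hz": 20.0, "left_4000hz": 35.0},
    "graph_convnet": {"left_1000hz": 20.0, "left_4000hz": 35.0},
    "table_parser": {"left_1000hz": 20.0, "left_4000hz": 35.0},
}
conflicting = {**matching, "table_parser": {"left_1000hz": 20.0, "left_4000hz": 40.0}}

approved = reconcile(matching)
flagged = reconcile(conflicting)
```

Requiring every route to report every field is the strictest policy. Since the comment notes that an audiogram might carry its values in up to three places, a real system may need to tolerate routes that legitimately find nothing.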

By @icey - 9 months

Great idea. I used to work at Instabase, which you probably compete with. The better you are at dealing with dodgy PDFs and document scans, the more valuable this will be to big banks, shipping companies, etc.

By @cs702 - 9 months

Congratulations on launching!

Trellis looks amazing... but only if it works well enough, i.e., if the rate of edge cases that trip up the service consistently remains close to 0%.

Every organization in the world needs and wants this, like, right now.

If you make it work well enough, you'll have customers knocking on your door around the clock.

I'm going to take a look. Like others here, I'm rooting for you guys to succeed.

By @shcheklein - 9 months

Hey, congrats! Are you competing / is there some overlap / what are the key differences with Roe AI (YC W24) - roe.ai (just launched recently on HN https://news.ycombinator.com/item?id=41202694 as well).

By @iudexgundyr - 9 months

Interesting! One quick question, how did you validate your data and ensure its correctness, since the ground truth is unstructured?

By @artembugara - 9 months

Hey folks. Congrats on the launch.

Everyone here knows that it's a really big problem that no one has nailed yet.

My 2 cents:

1. It took us (newscatcherapi.com) three years to realize that customers with the biggest problems and with the biggest budgets are the most underserved. The reason is that everyone is building an infinitely scalable AI/LLM/whatever to gain insights from news.

In reality, this NLP/AI works quite OK out of the box but is not ideal for everyone at the same time. So we decided to do Palantir-like onboarding/integration for each customer. We charge 25x more, but customers have a perfect tailor-made solution and a high ROI.

I see you already do the same! "99%+ accuracy with fine-tuning and human-in-the-loop" is what worked great for us. This way, your competitor is a human on payroll (very expensive) and not AWS Tesseract.

Going from 95% to 99% is just a fractional improvement, but it can be the change from "not good enough" to a "great solution", and that can be priced differently.

2. "AI-powered workflow for unstructured data" what does it even mean? Why don't you say "99%+ accuracy extraction"? It's 2024, everyone is using AI, and everyone knows you need 2 hours to start applying AI from 0. So don't lower my expectations.

By @rahimnathwani - 9 months

I've had to do some of this recently, as a one-off, to extract the same fields from thousands of scanned documents.

I used OpenAI's function calling (via Langchain's https://python.langchain.com/v0.1/docs/modules/model_io/chat... API).

Some of the challenges I had:

1. poor recall for some fields, even with a wide variety of input document formats

2. needing to experiment with the json schema (particularly field descriptions) to get the best info out, and ignore superfluous information

3. for each long document, deciding whether to send the whole document in the context, or only the most relevant chunks (using traditional text search and semantic vector search)

4. poor quality OCR

From the demo video, it seems like your main innovation is allowing a non-technical user to do #2 in an iterative fashion. Have I understood correctly?
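Challenge 3 above (whole document versus only the most relevant chunks) can be sketched as below. This is a minimal illustration using plain term-overlap scoring as a stand-in for the "traditional text search" the comment mentions. It does not use semantic vector search, and the function names and character budget are assumptions.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(chunk: str, query_terms: set[str]) -> int:
    """Count how often the query terms occur in the chunk."""
    counts = Counter(tokenize(chunk))
    return sum(counts[term] for term in query_terms)


def select_chunks(chunks: list[str], query: str, budget_chars: int) -> list[str]:
    """Pick the highest-scoring chunks that fit the budget, in document order."""
    terms = set(tokenize(query))
    ranked = sorted(
        range(len(chunks)), key=lambda i: score(chunks[i], terms), reverse=True
    )
    chosen: list[int] = []
    used = 0
    for i in ranked:
        if score(chunks[i], terms) == 0:
            break  # the remaining chunks share no terms with the query
        if used + len(chunks[i]) <= budget_chars:
            chosen.append(i)
            used += len(chunks[i])
    return [chunks[i] for i in sorted(chosen)]


def build_context(chunks: list[str], query: str, budget_chars: int) -> str:
    """Send the whole document if it fits, otherwise only the relevant chunks."""
    whole = "\n\n".join(chunks)
    if len(whole) <= budget_chars:
        return whole
    return "\n\n".join(select_chunks(chunks, query, budget_chars))


chunks = [
    "The policy number is P-123.",
    "Weather was sunny that day.",
    "Policy holder: Jane Doe, policy start 2020.",
]
context = build_context(chunks, "policy number holder", budget_chars=80)
```

The budget here counts characters for simplicity. With a real model the limit would be measured in tokens.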

By @natural1 - 9 months

Has Trellis explored partnerships or integrations with major ERP systems or existing ETL pipelines? The ability to seamlessly fit into existing enterprise architectures could be a significant competitive advantage and a compelling value proposition for large enterprises looking to modernize their data infrastructure.

By @atak1 - 9 months

Congrats on launching! Wish we had this years ago at Flexport for our ops / science teams. Traditional ML approaches are expensive, and the idea of defining your final shape of data and automating the ETL process is the best abstraction out there.

Rooting for you guys!

By @skeptrune - 9 months

Both fulltext (BM25 or SPLADE) and dense vector search have issues with documents of different lengths. Part of what makes recursive sentence splitting work so well are its length normalization properties.

Filters are a really important feature downstream of that which this system can provide.

We have also worked with the Enron corpus for demos. Fast, reliable ETL for a set of documents that large is more difficult than it seems, and a commendable problem to solve.

Exciting stuff!
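One common reading of recursive splitting is sketched below: split on the coarsest separator first, merge adjacent pieces while they still fit, and fall back to finer separators only for pieces that are too long. Every chunk then stays under the same size cap, which is the length-normalizing effect described above. The separator list and the `max_len` value are assumptions, not the commenter's actual settings.

```python
def recursive_split(
    text: str,
    max_len: int,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most max_len characters."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-cut the text at max_len boundaries.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate  # merge adjacent pieces while they still fit
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # The piece is still too long: recurse with the finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Alpha beta gamma. Delta epsilon zeta eta theta. Iota.\n\n"
    "Kappa lambda mu."
)
pieces = recursive_split(text, max_len=30)
```

Note that this sketch drops the separator at each chunk boundary, so a sentence-ending period can be lost where a split occurs.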

By @cellu - 9 months

Really cool project! I'm doing something similar at a very small scale for my personal project using TypeChat with Zod (https://github.com/microsoft/Typechat) and Unstructured (https://unstructured.io/)

By @macklinkachorn - 9 months

Getting a lot of love from HN so the demo site and data processing might slow down by quite a bit. We're fixing it right now!

By @aiden3 - 9 months

What about a pdf with many separate datapoints on it?

For instance, I have 100 pdfs, each with 10-100 individual products listed (in different formats).

I want to create a single table with one row per product appearing in any of the PDFs, with various details like price, product description, etc.

From what I can tell from the demo, it seems like 1 file = 1 row in Trellis?
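The shape being asked for can be sketched as a flattening step: if an extractor returned a list of products per file, turning that into one row per product is straightforward. Whether Trellis supports this is not stated in the thread. The input structure, field names, and values below are made up for illustration.

```python
from typing import Any


def flatten_products(
    extracted: dict[str, list[dict[str, Any]]]
) -> list[dict[str, Any]]:
    """Turn {file: [product, ...]} into one row per product, tagged with its source."""
    rows: list[dict[str, Any]] = []
    for source_file, products in extracted.items():
        for position, product in enumerate(products, start=1):
            rows.append(
                {"source_file": source_file, "position": position, **product}
            )
    return rows


# Hypothetical per-file extraction results (made-up data).
extracted = {
    "catalog_a.pdf": [
        {"description": "Widget", "price": 9.99},
        {"description": "Gadget", "price": 24.50},
    ],
    "catalog_b.pdf": [
        {"description": "Sprocket", "price": 3.25},
    ],
}
rows = flatten_products(extracted)
```

Keeping `source_file` and `position` on each row makes it possible to trace a product back to the PDF it came from during manual review.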

By @ellis0n - 8 months

A good project, I think it will help many projects that are in chaos due to an overload of information. But many projects are in chaos created by managers with chaos in their heads. This will only add more chaos to such projects.

By @serjester - 9 months

It seems like your business strategy is contingent on foundational model providers not improving their product on a couple dimensions: price, grounding accuracy and file handling. This is a risky strategy, especially in such a competitive market. Wishing you the best of luck.

By @purplepatrick - 9 months

Two quick questions: any plans on being HIPAA compliant? Probably one of the biggest use cases for this is in health insurance, etc.

How do your capabilities compare to Google Document AI or Watson SDU? Also what about standalone competitors such as Indico Data or DocuPanda?

By @dmahanta - 9 months

Didn't work for me as expected.

By @MoritzWall - 9 months

> And many companies today want data preprocessing in ETL pipelines and data ingestion for RAG.

I'm curious, have you (or your customers) deployed this in a RAG use case already, and what have been the results like?

By @bitshaker - 9 months

Digitizing and organizing old document scans for birth, marriage, and death records would be a huge win for genealogy research. The Mormon church would be a great customer for you.

By @chrisweekly - 9 months

disclaimer: I'm a barely-informed layperson, not any kind of AI expert

non-snarky genuine question: is "generate structured data from unstructured data using AI" intended to be a moat or differentiator?

catalyst for my question: I just read about this capability becoming available from other AI vendors, e.g.

https://openai.com/index/introducing-structured-outputs-in-t...

By @usehexus - 9 months

Congrats on the launch! This is a great idea! Many use cases.

By @szawinis - 8 months

Super cool! This really is a big problem that's waiting for someone to have a solution that fully nails it.

By @sidcool - 9 months

Congrats on launching. What model or AI do you use underneath?

By @hubraumhugo - 9 months

You mention validation and schema guarantees as key features for high accuracy. Are you using an LLM-as-a-judge combined with traditional checks for this?

By @EarlyOom - 9 months

Curious how this compares to platforms like https://unstructured.io/

By @inglor - 9 months

I don't understand why you need an LLM for this, wouldn't a simple NER + entity normalization do this at a fraction of the cost?

(congrats on the launch!)

By @doctorpangloss - 9 months

> At the Stanford AI lab where we met... 80% of enterprise data is unstructured, and traditional platforms can’t handle it

You guys came out of an academic lab, so you must know that hypothesis fishing expeditions are not viable.

> ... a major commercial bank... couldn’t improve credit risk models because critical data was stuck in PDFs and emails.

In this example there will be no improvement to the risk model or whatever, because 19/20 times there will be no improvement. In an academic setting this is seen as normal, but in a business setting with no executive champions, only product managers, this will be seen as a failure, and it will be associated with you and your technology, which is bad.

Unfortunately these people are not willing to pay more money for less risk. What they want is a base consulting cost (i.e., a non-venture business) to identify the lowest risk, promotion worthy endeavor, and then they want to pay as little as possible to achieve that. In a sense, the kind of customers who need unstructured data ETLs are poorly positioned to use such a technology, because they don't value technology generally, they aren't forward looking.

Assembling attractive websites that are really features on top of Dagster? There's a lot of value in that. Question is, are people willing to pay for that? Anyone can make attractive Dagster UIs, anyone can do Python glue. It's very challenging to differentiate yourselves, even when you feel like you have some customers, because eventually, one of those middlemen at BankCo is going to punch your USP into Google, and find the pre-existing services with huge account management teams (i.e., the hand holding consulting business people really pay for) that outpace you.

By @aviguptakonda - 9 months

Wow, this is game changing! With your inventions, interestingly we might also be discovering reverse ETL use cases, where the insights/analytics obtained from the troves of unstructured data can be fed back into ERP/CRM/HCM systems, closing the complete loop and amplifying more business value!! Congratulations to the Trellis team :) Regards, Avinash

By @mehulashah - 9 months

Congratulations on the launch! This is the right way to think about LLMs and document processing.

By @xkq - 9 months

Super sick. I’m building this at work rn. Definitely a cool technical problem. Good luck!

By @rmbyrro - 9 months

Domains should start with your company name. Like trellishq.com

Because browsers have an autocomplete feature.

By @vinibrito - 9 months

Nice! How's accuracy of produced data?

By @nosmokewhereiam - 9 months

Love the name! Electronic gardening vibe.

By @blotterfyi - 9 months

Just want to say, this is pretty cool.

By @darkhorse13 - 9 months

Congrats on the launch. Serious question though, does YC only fund AI companies these days?

By @wilburli - 9 months

this is dope!

By @ymoondhra - 9 months

Intriguing!

By @destraynor - 9 months

Congrats on the launch, and thanks for using Intercom (co-founder here)

By @constantinum - 9 months

Congrats on the launch! For anyone curious who wants to dig deep and solve document processing workflows via open-source, do try Unstract https://github.com/Zipstack/unstract

By @localfirst - 9 months

looks like more solutions looking for a problem that can be solved at the vendor level

Related

Trellis (YC W24) is hiring engineer to build AI-powered ETL for unstructured data

Show HN: txtai: open-source, production-focused vector search and RAG

txtai: Open-source vector search and RAG for minimalists

Trellis (YC W24) is hiring engineer to build AI-powered ETL for unstructured data

Trellis (YC W24) is hiring eng to build AI workflows for unstructured data
