August 15th, 2024

Show HN: Denormalized – Embeddable Stream Processing in Rust and DataFusion

Denormalized is an in-development stream processing engine built on Apache DataFusion, with Kafka support. Getting started requires Docker and Rust/Cargo, and more features are planned on the roadmap.


Denormalized is a fast embeddable stream processing engine built on Apache DataFusion, aimed at real-time stream processing with support for Kafka as both a source and sink. The project is currently in development, and the team is seeking design partners to collaborate on specific use cases. Users can engage with the developers through GitHub issues or email. To get started, users need Docker and Rust/Cargo installed. The quickstart guide includes instructions for running Kafka in Docker, emitting sample data, and performing simple streaming aggregations. Additional examples, such as a Kafka ridesharing scenario, are also provided. The roadmap outlines completed features like stream aggregation and joins, with future plans for checkpointing, session windows, a stateful UDF API, and integrations with DuckDB, PostgreSQL, Python, and TypeScript, along with a user interface. The project is maintained by a team in San Francisco, and inquiries can be directed to hello@denormalized.io or through GitHub.
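To make "simple streaming aggregations" concrete, here is a minimal sketch in plain Rust of a tumbling-window aggregation, where each event is assigned to a fixed-size time window and a count/min/max/average is kept per window and key. It does not use the Denormalized or DataFusion APIs; the `Reading` struct, its field names, and the `tumbling_aggregate` function are hypothetical names chosen for illustration. It also runs over a finite slice, whereas a real engine consumes an unbounded stream and decides when each window can be emitted.

```rust
use std::collections::BTreeMap;

/// One event from a hypothetical sensor stream (field names are assumptions).
#[derive(Debug, Clone)]
struct Reading {
    sensor: String,
    ts_ms: u64, // event time in milliseconds
    value: f64,
}

/// Running aggregate for one (window, sensor) pair.
#[derive(Debug, Clone, PartialEq)]
struct Agg {
    count: u64,
    min: f64,
    max: f64,
    sum: f64,
}

impl Agg {
    fn new(v: f64) -> Self {
        Agg { count: 1, min: v, max: v, sum: v }
    }

    fn update(&mut self, v: f64) {
        self.count += 1;
        self.min = self.min.min(v);
        self.max = self.max.max(v);
        self.sum += v;
    }

    fn avg(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// Groups readings into fixed, non-overlapping windows of `window_ms`
/// milliseconds and aggregates per (window start, sensor).
fn tumbling_aggregate(readings: &[Reading], window_ms: u64) -> BTreeMap<(u64, String), Agg> {
    assert!(window_ms > 0, "window size must be positive");
    let mut out: BTreeMap<(u64, String), Agg> = BTreeMap::new();
    for r in readings {
        // Round the event time down to the start of its window.
        let window_start = r.ts_ms - r.ts_ms % window_ms;
        out.entry((window_start, r.sensor.clone()))
            .and_modify(|a| a.update(r.value))
            .or_insert_with(|| Agg::new(r.value));
    }
    out
}

fn main() {
    let readings = vec![
        Reading { sensor: "a".to_string(), ts_ms: 100, value: 1.0 },
        Reading { sensor: "a".to_string(), ts_ms: 900, value: 3.0 },
        Reading { sensor: "b".to_string(), ts_ms: 950, value: 10.0 },
        Reading { sensor: "a".to_string(), ts_ms: 1100, value: 5.0 },
    ];
    for ((start, sensor), agg) in tumbling_aggregate(&readings, 1000) {
        println!(
            "window_start={start} sensor={sensor} count={} min={} max={} avg={}",
            agg.count, agg.min, agg.max, agg.avg()
        );
    }
}
```

A stream processing engine adds the parts this sketch leaves out: reading continuously from a source such as Kafka, handling late or out-of-order events, and writing finished windows to a sink.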

- Denormalized is a stream processing engine built on Apache DataFusion.

- It supports Kafka for real-time data processing and is currently in development.

- Users can start using it with Docker and Rust/Cargo.

- Future features include checkpointing, session windows, and various integrations.

- The project team is open to collaboration and inquiries via GitHub or email.

AI: What people are saying
The comments on the article about Denormalized reflect a mix of interest and inquiries regarding the new stream processing engine.
  • Users express excitement about the project's potential and ease of setup.
  • Several commenters inquire about specific features, such as support for OLAP use cases and pluggable data sources.
  • There is curiosity about how Denormalized compares to existing solutions like Arroyo, Fluvio, and Flink.
  • Founders and developers from related projects show interest in collaboration and integration with Denormalized.
  • Many users are eager for future features, including a Python SDK and TypeScript bindings.
16 comments
By @dman - 6 months
This looks super interesting. I built https://github.com/finos/perspective in a past life but have been out of the streaming analytics game for some time. Nice to see single machine efficiency be a focus, will give this a try and post feedback on github.
By @emgeee - 6 months
Other founder here -- we've been working on this now for several months and have had a lot of fun building on top of arrow and datafusion
By @theLiminator - 6 months
Are you going to support OLAP use cases as well? I haven't yet found a really nice hybrid batch/streaming query engine with dataframe support.

Ideally, you'd support an api similar to Polars (which I have found to be the nicest thus far).

It'd also be important/useful to support Python udfs (think numpy/jax/etc.).

It'd be very cool if you could collaborate with or even tap into the polars frontend. If you could execute polars logical plans but with a streaming source, that would be huge.

By @j-pb - 6 months
I'd be curious to know what your thoughts on differential/timely dataflow are. Superficially it seems that it might be possible to integrate the existing Rust infrastructure from those libraries with DataFusion and Arrow, which could give you quite a few operators for free, and provide your users with the very nice incremental query/streaming-as-view-maintenance model.
By @ethegwo - 6 months
Neat, founder of https://tonbo.io/ here. I am excited to see someone bring stream processing to DataFusion. We are working on an Arrow-native embedded DB and plan to support DataFusion in the next release, and we're interested in building the streaming feature on Denormalized.
By @shrisukhani - 6 months
Interesting. What use cases are you guys targeting with this?
By @stereosky - 6 months
Congratulations on launching your project! We spoke back in March at a Kafka Summit London social meetup and talked all things Python and Kafka (I work on https://github.com/quixio/quix-streams). Always great to see a new stream processing project tackle a new segment
By @eXpl0it3r - 6 months
For someone not deep in the topic, what is a "Streaming Processing Engine"?

All the descriptions of Denormalized use the term, so if you don't know it, it's kind of impossible to understand what Denormalized is or what it is trying to solve.

By @nonlogical - 6 months
This looks totally awesome! Easy to set up, memory-efficient, streaming, real-time data aggregation, compilable to a single self-contained binary: that is a dream come true.

Bookmarked for future projects!

By @ztratar - 6 months
Will be excited to see the typescript bindings once out. We may be able to use this to handle some of our workloads at Embra.

Will reach out! Congrats on the ship.

By @drawnwren - 6 months
What differentiates you from, e.g., Arroyo and Fluvio?
By @franciscojarceo - 6 months
Can't wait for the Python SDK!
By @lhnz - 6 months
Do you have plans to make the data sources pluggable instead of being Kafka specific?
By @akshay2881 - 6 months
Nice! How feature complete is this with current industry standards like Flink?
By @rNULLED - 6 months
Looks cool! I’ll try it out for my ambitious project :)