June 19th, 2024

Open Source Python ETL

Amphi is an open-source Python ETL tool for data extraction, preparation, and cleaning. It offers a graphical interface, supports structured and unstructured data, promotes low-code development, and integrates generative AI. Available for public beta testing in JupyterLab.


Amphi is an open-source Python-based ETL tool focused on extracting, preparing, and cleaning data from various sources and formats. It excels at data integration, data extraction, preparation for data science, and API retrieval. With a graphical user interface, users can design data pipelines and generate native Python code deployable anywhere. The tool supports structured and unstructured file ingestion, data cleansing, API retrieval, and enrichment. Amphi promotes low-code development, reducing development and maintenance time compared to traditional coding. It generates Python code for deployment in various environments, ensuring flexibility and no lock-in. Data is stored and processed locally for privacy and security. Amphi is community-driven, encouraging collaboration and sharing of pipeline definitions. It aims to build a global community of data practitioners, catering to both novices and experts. The tool is AI-native, integrating generative AI capabilities for AI-oriented use cases. Amphi is available for public beta testing in JupyterLab.
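Since ETL (extract, transform, load) is never spelled out in the summary, a minimal sketch of the three stages may help. This is not Amphi's generated code; the sample data, the `people` table, and the function names are assumptions made up for illustration, and only the Python standard library is used.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with the usual problems: stray whitespace,
# inconsistent casing, and a missing value.
RAW_CSV = """name,age,city
 Alice ,30,Paris
Bob,,Berlin
carol,41, paris
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalise casing, drop rows without an age."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if not age:
            continue  # skip incomplete records
        cleaned.append(
            (row["name"].strip().title(), int(age), row["city"].strip().title())
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, age, city FROM people ORDER BY name").fetchall()
print(rows)
```

A graphical tool would let each of these stages be drawn as a node; the point of the sketch is only that the resulting pipeline is ordinary Python that runs anywhere.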

Lessons Learned from Scaling to Multi-Terabyte Datasets

The article shares insights on scaling to multi-terabyte datasets and stresses evaluating an algorithm before scaling it. It covers Joblib and GNU Parallel for single-machine scaling, the transition to multiple machines, and the performance and cost implications of each. It also gives recommendations for parallel workloads and analytical tasks using AWS Batch, Dask, and Spark, and discusses tool selection based on team size and workload.

OpenPipe Mixture of Agents: Outperform GPT-4 at 1/25th the Cost

OpenPipe's cost-effective agent mixture surpasses GPT-4, promising advanced language processing at a fraction of the cost. This innovation could disrupt the market with its high-performance, affordable language solutions.

Show HN: Eidos – Offline alternative to Notion

The Eidos project on GitHub offers a personal data management framework as a Progressive Web App with AI features. Customizable with extensions and scripting, it leverages sqlite-wasm technology for chromium-based browsers.

Shape Rotation 101: An Intro to Einsum and Jax Transformers

Einsum notation simplifies tensor operations in libraries like NumPy, PyTorch, and Jax. Jax Transformers showcase efficient tensor operations in deep learning tasks, emphasizing speed and memory benefits for research and production environments.
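As a rough illustration of the notation, the sketch below uses NumPy's `np.einsum`; the arrays and variable names are arbitrary examples, not taken from the linked article. Each subscript string names the axes of the inputs, and any index missing from the output side is summed over.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # shape (2, 3)
b = np.arange(12).reshape(3, 4)  # shape (3, 4)

# "ij,jk->ik": j appears in both inputs but not in the output, so it is
# summed over, which is exactly matrix multiplication.
matmul = np.einsum("ij,jk->ik", a, b)

# "ij->ji": no index is dropped, the axes are only reordered (transpose).
transpose = np.einsum("ij->ji", a)

# "ij->i": j is dropped from the output, so each row is summed.
row_sums = np.einsum("ij->i", a)

print(matmul.shape, transpose.shape, row_sums)
```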

Are AlphaFold's new results a miracle?

AlphaFold 3 by DeepMind excels in predicting molecule-protein binding, surpassing AutoDock Vina. Concerns about data redundancy, generalization, and molecular interaction understanding prompt scrutiny for drug discovery reliability.

28 comments

By @thibautdr - 10 months

Hi everyone, thanks for posting Amphi :)

To give some context, Amphi is a low-code ETL tool for both structured and unstructured data. The key use cases include file integration, data preparation, data migration, and creating data pipelines for AI tasks like data extraction and RAG. What sets it apart from traditional ETL tools is that it generates Python code that you own and can deploy anywhere. Amphi is available as a standalone web app or as a JupyterLab extension.

Visit the GitHub: https://github.com/amphi-ai/amphi-etl Give it a try and let me know what you think

By @ic_fly2 - 10 months

With all the data issues around quality and normalisation, I often get the impression that enabling more people with non-CS backgrounds to do this work is not necessarily a good thing.

In other words, if writing Python and SQL is the skill requirement that stops you from making an ETL pipeline, maybe do something else.

By @jamesblonde - 10 months

#dang The title needs changing - it's not open-source, it is licensed under ELv2 (Elastic License v2).

By @mritchie712 - 10 months

If you're looking for "open source Python ETL", two things that are better options:

https://dlthub.com/

https://hub.meltano.com/

We[0] use Meltano in production and I'm happy with it. I've played around with dlt and it's great, it just doesn't have a ton of sources yet.

[0] https://www.definite.app/

By @awesomebytes - 10 months

I was not familiar with the acronym ETL and it is not explained anywhere on the website! My feedback would be to at least spell it out once, at the first instance, so others like me will know what they are reading :)

By @mrwyz - 10 months

Not open source. Misleading title.

By @paulvnickerson - 10 months

Very cool, thanks for sharing. Does it support the pandas-like rapidsai dask_cudf framework? (https://docs.rapids.ai/api/dask-cudf/stable/)

By @cvalka - 10 months

THIS IS NOT OPEN SOURCE!

By @whalesalad - 10 months

Been happy with Dagster but this looks interesting.

By @tayloramurphy - 10 months

I'm curious as to the story of how things like this come to be. It seems like there are already a ton of "open source python ETL" tools on the market. Was this a passion project by the author? Was this born out of academia? Was there a specific problem they were trying to solve that others didn't? It's not necessary to answer these questions in the docs but it is useful for folks who may be familiar with the other options out there.

By @gregw2 - 10 months

Isn’t pandas-centric ETL much more memory intensive and less compute efficient than using SQL?

By @C4stor - 10 months

It's a good idea, but from the docs it looks like the high level abstractions are wrong.

If my data pipeline is "take this table, filter it, output it", I really don't want to use a "csv file input" or an "excel file output".

I want to say "for anything in the pipeline I define that behaves like a table, apply this transformation to it", so that I can swap my storage later without touching the pipeline.

Same things for output. Personally I want to say "this goes to a file" at the pipeline level, and the details of the serialization should be changeable instantly.

That being said, can't complain about a free tool, kudos on making it available!
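The abstraction this comment asks for can be sketched roughly as follows. This is one possible design, not Amphi's API: `TableSource`, `TableSink`, `run_pipeline`, and the concrete classes are hypothetical names invented for the example. The transformation only sees "something that yields rows", so the storage on either end can be swapped without touching it.

```python
import csv
import io
from typing import Callable, Iterable, Iterator, Protocol

Row = dict[str, str]


class TableSource(Protocol):
    """Anything that can yield rows behaves like a table input."""
    def rows(self) -> Iterator[Row]: ...


class TableSink(Protocol):
    """Anything that can accept rows behaves like a table output."""
    def write(self, rows: Iterable[Row]) -> None: ...


class CsvTextSource:
    """CSV-backed source; the format detail lives here, not in the pipeline."""
    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[Row]:
        yield from csv.DictReader(io.StringIO(self.text))


class InMemorySource:
    """Alternative source backed by a plain list of dicts."""
    def __init__(self, data: list[Row]) -> None:
        self.data = data

    def rows(self) -> Iterator[Row]:
        yield from self.data


class ListSink:
    """Sink that collects rows in memory; a file or database sink would fit the same slot."""
    def __init__(self) -> None:
        self.collected: list[Row] = []

    def write(self, rows: Iterable[Row]) -> None:
        self.collected.extend(rows)


def run_pipeline(
    source: TableSource,
    transform: Callable[[Iterable[Row]], Iterable[Row]],
    sink: TableSink,
) -> None:
    """The pipeline knows nothing about CSV, Excel, or any storage format."""
    sink.write(transform(source.rows()))


def only_active(rows: Iterable[Row]) -> Iterator[Row]:
    """Example transformation: keep rows whose status is 'active'."""
    return (r for r in rows if r["status"] == "active")


# Same transformation, two different storages.
csv_sink = ListSink()
run_pipeline(CsvTextSource("id,status\n1,active\n2,inactive\n"), only_active, csv_sink)

mem_sink = ListSink()
run_pipeline(
    InMemorySource([{"id": "1", "status": "active"}, {"id": "2", "status": "inactive"}]),
    only_active,
    mem_sink,
)
print(csv_sink.collected, mem_sink.collected)
```

Swapping the input from CSV to another format then means writing one new source class, while `only_active` and `run_pipeline` stay unchanged.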

By @whazor - 10 months

"Python ETL", GitHub language statistics: TypeScript 87.1%

It looks nice though.

By @Joeboy - 10 months

Since there are "ETL" people here, I have a couple of naive questions, in case anybody can answer:

1) Are there any "standard"-ish (or popular-ish) file formats for node-based / low-code pipelines?

2) Is there any such format that's also reasonably human readable / writable?

3) Are there low-code ETL apps that (can) run in the browser, probably using WASM?

Thanks and sorry if these are dumb questions.

By @tiraz - 10 months

How does it distinguish itself from Dagster or Prefect? Both have been around for quite some time and also have a GUI, but offer a much larger feature set.

By @anakaine - 10 months

Hey, I really like the design. I currently have a lot of ETL going on through various mechanisms, but the thing that is always difficult to communicate to BAs, PMs, and anyone else is a graphical "what is this thing doing and how". This is neat for those of us who are visual.

By @vekker - 10 months

Does this also manage the infrastructure side of ETL? Usually some parts in a complex ETL process take a lot more processing power, so are run on different machines. From a quick glance at this, it seems like a WYSIWYG ETL tool for running ETL jobs on one machine?

By @febed - 10 months

Which open source Python-based ETL tool would one recommend for someone starting an ETL project today? It’s a data-volume-heavy project with a lot of interdependencies between import tasks.

By @mitjafelicijan - 10 months

This is actually exactly what I needed for my current project!

By @nextworddev - 10 months

If you are an enterprise, just go with Databricks Lakeflow.

By @olavgg - 10 months

This looks visually similar to Apache NiFi.

By @deknos - 10 months

Is it true opensource / free software, or are there non opensource parts?

By @v3ss0n - 10 months

What's the difference compared to Windmill?

By @rldjbpin - 10 months

as open source as open weights models, but will companies adopt it solely on pricing?

By @Kalanos - 10 months

Reminds me of Elyra

By @iblaine - 10 months

Low-code ETL tools (Informatica, Appworx, Talend, Pentaho, SSIS) were the original services for ELT/ETL. A lot of progress was made in moving towards ETL-as-code, starting with Airflow/Luigi. Going back to low code seems backwards at this point.

(I have used all of the above tools in my 15+ yr career. Code as ETL was a huge industry shift)

By @kkfx - 10 months

Do not get me wrong, I appreciate and thank anyone who contributes to FLOSS, but all the low/no-code approaches I have seen turn out to be garbage. IMVHO the reality is that people need to be trained and become capable of fishing on their own instead of being handed fish every day.

ML in ETL is needed for the raw initial classification of documents received in various formats from various sources and for cleaning up scanned crap, no more than that. All the effort to plug in LLMs has so far been, and I bet will be for the next 10 years, a disaster.

ETL is something that should not exist in a modern world, because we should exchange data in usable formats instead of having to import them with all sorts of gimmicks. We do not have such an acculturated world, but at least we can try to simplify and teach instead of adding entropy.
