Open Source Python ETL
Amphi is an open-source Python ETL tool for data extraction, preparation, and cleaning. It offers a graphical interface, supports structured and unstructured data, promotes low-code development, and integrates generative AI. Available for public beta testing in JupyterLab.
Amphi is an open-source, Python-based ETL tool focused on extracting, preparing, and cleaning data from various sources and formats. Typical uses include data integration, preparation for data science, and API retrieval. Through a graphical interface, users design data pipelines and generate native Python code that can be deployed anywhere, avoiding lock-in. The tool supports structured and unstructured file ingestion, data cleansing, API retrieval, and enrichment. Amphi promotes low-code development, reducing development and maintenance time compared to hand-written pipelines. Data is stored and processed locally for privacy and security. Amphi is community-driven, encouraging collaboration and sharing of pipeline definitions, and aims to build a global community of data practitioners, catering to both novices and experts. The tool is AI-native, integrating generative AI capabilities for AI-oriented use cases. Amphi is available for public beta testing in JupyterLab.
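The code-generation claim is easier to picture with an example. Below is a minimal sketch, not Amphi's actual output, of the kind of plain pandas pipeline such a tool might emit for a simple clean-and-export job; the file paths and column handling are illustrative assumptions.

    import pandas as pd

    # Hypothetical input/output paths; a real generated pipeline may differ.
    SOURCE_PATH = "customers.csv"
    TARGET_PATH = "customers_clean.csv"

    # Extract: read the raw file into a DataFrame.
    df = pd.read_csv(SOURCE_PATH)

    # Transform: drop duplicate rows, normalize column names, fill missing values.
    df = df.drop_duplicates()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.fillna({"email": "unknown"})

    # Load: write the cleaned data back out as plain CSV.
    df.to_csv(TARGET_PATH, index=False)

The point of generating code like this, rather than running pipelines inside a proprietary engine, is that the output is ordinary Python you can version, review, and deploy anywhere.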
Related
Lessons Learned from Scaling to Multi-Terabyte Datasets
Insights on scaling to multi-terabyte datasets, emphasizing algorithm evaluation before scaling. Tools like Joblib and GNU Parallel for single machine scaling, transitioning to multiple machines, and comparing performance/cost implications. Recommendations for parallel workloads and analytical tasks using AWS Batch, Dask, and Spark. Considerations for tool selection based on team size and workload.
OpenPipe Mixture of Agents: Outperform GPT-4 at 1/25th the Cost
OpenPipe's cost-effective agent mixture surpasses GPT-4, promising advanced language processing at a fraction of the cost. This innovation could disrupt the market with its high-performance, affordable language solutions.
Show HN: Eidos – Offline alternative to Notion
The Eidos project on GitHub offers a personal data management framework as a Progressive Web App with AI features. Customizable with extensions and scripting, it leverages sqlite-wasm technology for chromium-based browsers.
Shape Rotation 101: An Intro to Einsum and Jax Transformers
Einsum notation simplifies tensor operations in libraries like NumPy, PyTorch, and Jax. Jax Transformers showcase efficient tensor operations in deep learning tasks, emphasizing speed and memory benefits for research and production environments.
Are AlphaFold's new results a miracle?
AlphaFold 3 by DeepMind excels in predicting molecule-protein binding, surpassing AutoDock Vina. Concerns about data redundancy, generalization, and molecular interaction understanding prompt scrutiny for drug discovery reliability.
To give some context, Amphi is a low-code ETL tool for both structured and unstructured data. The key use cases include file integration, data preparation, data migration, and creating data pipelines for AI tasks like data extraction and RAG. What sets it apart from traditional ETL tools is that it generates Python code that you own and can deploy anywhere. Amphi is available as a standalone web app or as a JupyterLab extension.
Visit the GitHub repo: https://github.com/amphi-ai/amphi-etl. Give it a try and let me know what you think.
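As a rough illustration of the "data pipelines for RAG" use case mentioned above, the preparation step often amounts to reading documents and splitting them into overlapping chunks before embedding. The sketch below is hypothetical and not part of Amphi; the docs folder, chunk sizes, and chunk_text helper are assumptions.

    from pathlib import Path

    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split text into overlapping character chunks for downstream embedding."""
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap
        return chunks

    # Read every .txt document in a (hypothetical) folder and chunk it.
    records = []
    for path in Path("docs").glob("*.txt"):
        for i, chunk in enumerate(chunk_text(path.read_text(encoding="utf-8"))):
            records.append({"source": path.name, "chunk_id": i, "text": chunk})

    # 'records' would then feed an embedding/indexing step of the RAG pipeline.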
In other words, if writing Python and SQL is the skill requirement that stops you from making an ETL pipeline, maybe do something else.
We[0] use Meltano in production and I'm happy with it. I've played around with dlt and it's great, just not a ton of sources yet.
If my data pipeline is "take this table, filter it, output it", I really don't want to use a "CSV file input" or an "Excel file output".
I want to say "take anything in the pipeline that I define as behaving like a table and apply this transformation to it", so that I can swap my storage later without touching the pipeline.
Same thing for output. Personally I want to say "this goes to a file" at the pipeline level, and the details of the serialization should be changeable instantly.
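For what it's worth, the decoupling described above is easy to sketch in plain Python/pandas. This is not Amphi's API, just the general pattern; the column name, file paths, and function names are hypothetical.

    from typing import Callable

    import pandas as pd

    # The pipeline only depends on "something that yields a table" and
    # "something that accepts a table"; storage and serialization live outside it.
    Source = Callable[[], pd.DataFrame]
    Sink = Callable[[pd.DataFrame], None]

    def run_pipeline(source: Source, sink: Sink) -> None:
        df = source()
        df = df[df["amount"] > 0]  # the actual transformation: filter rows
        sink(df)

    def csv_source() -> pd.DataFrame:
        return pd.read_csv("orders.csv")

    def json_sink(df: pd.DataFrame) -> None:
        # Swap this for Parquet, a database write, etc. without touching run_pipeline.
        df.to_json("orders_filtered.json", orient="records")

    run_pipeline(csv_source, json_sink)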
That being said, I can't complain about a free tool. Kudos on making it available!
It looks nice though.
1) Are there any "standard"-ish (or popular-ish) file formats for node-based / low-code pipelines?
2) Is there any such format that's also reasonably human readable / writable?
3) Are there low-code ETL apps that (can) run in the browser, probably using WASM?
Thanks and sorry if these are dumb questions.
(I have used all of the above tools in my 15+ yr career. Code as ETL was a huge industry shift)
ML in ETL is needed for the raw initial classification of documents received in various formats from various sources, and for cleaning up scanned junk, but no more than that. All the effort to plug in LLMs has so far been a disaster, and I bet it will stay that way for the next 10 years.
ETL is something that should not exist in a modern world, because we should exchange data in usable formats instead of having to import it with all sorts of gimmicks. We do not have such an acculturated world, but at least we can try to simplify and teach instead of adding entropy.