Show HN: Mandala – Automatically save, query and version Python computations
The GitHub repository describes "mandala," a Python library designed to streamline ML experiment tracking. It introduces the `@op` decorator and the `ComputationFrame` data structure to record the details of Python function calls (inputs, outputs, and code), supporting iterative Python development. The library also provides version control for code and results, a query interface, and a way to structure imperative code executions into a comprehensive computation graph.
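The core idea, a decorator that memoizes function calls and records them into a queryable log, can be sketched in plain Python. This is not mandala's actual API, just a minimal illustration of the pattern:

```python
import functools
import hashlib
import pickle

def content_key(fn, args, kwargs):
    """Key a call by function name plus pickled arguments."""
    payload = pickle.dumps((fn.__qualname__, args, sorted(kwargs.items())))
    return hashlib.sha256(payload).hexdigest()

class Storage:
    """Toy call store: memoizes results and records each new call."""
    def __init__(self):
        self.cache = {}   # call key -> result
        self.calls = []   # (function name, args, result) log

    def op(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = content_key(fn, args, kwargs)
            if key not in self.cache:
                self.cache[key] = fn(*args, **kwargs)
                self.calls.append((fn.__qualname__, args, self.cache[key]))
            return self.cache[key]
        return wrapper

storage = Storage()

@storage.op
def inc(x):
    return x + 1

inc(41)
inc(41)  # cache hit: not recomputed, not logged twice
```

The real library additionally tracks which outputs feed into which downstream calls, which is what turns this flat log into a computation graph.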
Related
What's up Python? Django get background tasks, a new REPL, bye bye gunicorn
Several Python updates include Django's background task integration, a new lightweight Python REPL, Uvicorn's multiprocessing support, and PyPI blocking outlook.com emails to combat bot registrations, enhancing Python development and security.
Show HN: ControlFlow – open-source AI workflows
ControlFlow is a Python framework for agentic AI workflows. The GitHub repository offers installation guidelines, examples, features, and development instructions. Users can find detailed guidance and support for building AI workflows.
Open-Source Perplexity – Omniplex
The Omniplex open-source project on GitHub focuses on core functionality, Plugins Development, and Multi-LLM Support. It utilizes TypeScript, React, Redux, Next.js, Firebase, and integrates with services like OpenAI and Firebase. Community contributions are welcomed.
Show HN: Improve LLM Performance by Maximizing Iterative Development
Palico AI is an LLM Development Framework on GitHub for streamlined LLM app development. It offers modular app creation, cloud deployment, integration, and management through Palico Studio, with various components and tools available.
Karpathy: Let's reproduce GPT-2 (1.6B): one 8XH100 node 24h $672 in llm.c
The GitHub repository focuses on the "llm.c" project by Andrej Karpathy, aiming to implement Large Language Models in C/CUDA without extensive libraries. It emphasizes pretraining GPT-2 and GPT-3 models.
Do you support persisting into external stores?
You mention IncPy in the README. Have you discussed this project with Philip Guo? https://pg.ucsd.edu/
What is the memory and CPU overhead?
How does the framework handle dependencies on external libraries or system-level changes that might affect reproducibility?
How do you rollback state when it has memoized a broken computation? How does one decide which memoizations to invalidate vs keep?
Your addition of code/runtime dependencies intrigues me. I will probably take a look at your code to try to understand this better.
I somehow doubt there's enough overlap for us to open source our work and try to merge with yours, but it's really cool to see other people working on similar concepts. I predict we'll see a lot more frameworks like these that lean on mathematical principles like functional purity in the future.
I can't buy into a whole framework in my current context, but I would really like a way to roll my own content hashing for ad hoc caching within an existing system, where the hash automatically incorporates the specific implementation logic that produces the data I want to cache (so that the hash automatically changes when the implementation does).
E.g., given a Python function foo, I want a hash of the code that implements foo (transitive within my project is fine).
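One way to roll this by hand is to hash the function's source and recurse into project-local functions it references through its globals. This is a sketch, not production code: `inspect.getsource` requires the source file to be available, and this only follows direct global references, not methods, closures, or attribute lookups:

```python
import hashlib
import inspect
import types

def impl_hash(fn, _seen=None):
    """Hash a function's source plus, transitively, the source of any
    same-module functions it references through its globals."""
    _seen = _seen if _seen is not None else set()
    if fn in _seen:
        return ""
    _seen.add(fn)
    h = hashlib.sha256(inspect.getsource(fn).encode())
    for name in fn.__code__.co_names:
        dep = fn.__globals__.get(name)
        if isinstance(dep, types.FunctionType) and dep.__module__ == fn.__module__:
            h.update(impl_hash(dep, _seen).encode())
    return h.hexdigest()

def helper(x):
    return x * 2

def foo(x):
    return helper(x) + 1
```

Editing `helper` changes `impl_hash(foo)` even though `foo`'s own source is untouched, which is the property wanted for cache keys.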
I explored a similar idea once (also implemented in Python, via decorators) to help speed up some neuroscience research that involved a lot of hyperparameter sweeps. It's named after a Borges story about a man cursed to remember everything: https://github.com/taliesinb/funes
Maybe one day we'll have a global version of this, where all non-private computations are cached on a global distributed store somehow via content-based hashing.
We wrote our own version of this (I think many or all quant firms do; I know such a thing existed at $prev_job), but we use type annotation inspection to make things efficient (I had ~1-2 days to write it, so I had to keep the design simple as well). It's a contract: if you write the type annotation, we can store it. Socially, this incentivizes all the complex code to become typed. We generally work with timeseries data, which makes some things easier and some things harder, and the output is largely storable in parquet format (we handle objects, but inefficiently).
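The annotation contract could be enforced with something like the following sketch; `storable` and `vwap` are hypothetical names for illustration, not the firm's actual code:

```python
import typing

def storable(fn):
    """Contract: only fully annotated functions may be persisted.
    Returns fn unchanged if every parameter and the return type are
    annotated; raises TypeError otherwise."""
    hints = typing.get_type_hints(fn)
    params = fn.__code__.co_varnames[:fn.__code__.co_argcount]
    missing = [p for p in params if p not in hints]
    if missing or "return" not in hints:
        raise TypeError(f"{fn.__name__} is missing annotations: {missing or ['return']}")
    return fn

@storable
def vwap(prices: list, volumes: list) -> float:
    return sum(p * v for p, v in zip(prices, volumes)) / sum(volumes)
```

In a real system the hints would also drive storage layout, e.g. mapping annotated timeseries types onto parquet columns.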
One interesting subproblem that is relevant to us is the idea of "computable now", which adds a kind of third variant from the usual None/Some (i.e. is something intrinsically missing, or is it just not knowable yet?). For example, if you call total_airline_flights(20240901), that should (a) return something like an NA, and (b) not cache that value, since we will presumably be able to compute it on 20240902. But if total_airline_flights(20240101) returns NA, we might want to cache that, so we don't pay the cost again.
We sidestep this problem in our own implementation (time constraints), but I think the version at $prev_job had a better solution to avoid wasting compute.
(side note: hey Alex! I took 125 under you at Harvard; very neat that you're working on this now)
I believe this path can be supported as-is right now, and the next step would be to store a computation on some server. If an uncaught exception is raised, store the whole computation along with its state, transfer it to your local machine, and restore the machine state as it was when the exception was thrown. That way you can debug the program with all the live variables as it was being run.
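The local-state capture part can already be prototyped with the traceback attached to the exception; a minimal sketch (serialization and transfer to another machine left out):

```python
def run_and_capture(fn, *args):
    """Run fn; on an uncaught exception, walk the traceback and return
    the local variables of the innermost frame for later debugging."""
    try:
        return True, fn(*args)
    except Exception as exc:
        tb = exc.__traceback__
        while tb.tb_next is not None:
            tb = tb.tb_next                  # descend to the failing frame
        return False, dict(tb.tb_frame.f_locals)

def buggy(x):
    y = x * 2
    return y / 0

ok, state = run_and_capture(buggy, 21)
# state holds the live locals of buggy at the ZeroDivisionError
```

Pickling `state` is where it gets hard in practice: live variables often reference unpicklable things like open files or sockets.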
I’m experimenting with Python CAD programming using the CadQuery and Build123d libraries. I’d like to speed up iteration time and intelligent caching would help.
These libraries are pretty opinionated, which makes it a bit challenging to imagine how to squeeze cache decorators in there. They have a couple of different APIs, all of which ultimately use the Open CASCADE (OCCT) kernel via https://github.com/CadQuery/OCP
CadQuery uses a fluent design that relies on long method chains [0]. It also has an experimental free-function API [1].
Build123d iterates on CadQuery [2] with the goal of integrating better with the Python programming language. It has two supported APIs. The Builder API uses Python’s context manager (‘with’ blocks) heavily. The secondary Algebraic API is more functional, using arithmetic operators to define geometric operations [3].
The simplest way to integrate Mandala would probably be to use Build123d’s Algebraic API, wrapping subassemblies in functions decorated with @op.
However, it would be even better to proactively cache function/argument pairs provided by the underlying APIs. For example, if I change 50% of the edges passed to a Fillet() call, it would be nice to have it complete in half the time. I guess this would require me to fork the underlying library and integrate Mandala at that level.
[0] https://cadquery.readthedocs.io/en/latest/intro.html
[1] https://cadquery.readthedocs.io/en/latest/free-func.html
[2] https://build123d.readthedocs.io/en/latest/introduction.html...
[3] https://build123d.readthedocs.io/en/latest/key_concepts_alge...
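As a sketch of that per-edge granularity, with a hypothetical `fillet_edge` standing in for the real OCCT call (which would need a hashable content key for edges), memoizing at the edge level means changing half the edges only recomputes that half:

```python
import functools

@functools.lru_cache(maxsize=None)
def fillet_edge(edge, radius):
    """Hypothetical per-edge operation standing in for an expensive
    kernel call; real geometry would need a content-based edge key."""
    return (edge, radius, "filleted")

def fillet(edges, radius):
    # Each unchanged (edge, radius) pair is a cache hit.
    return [fillet_edge(e, radius) for e in edges]
```

Pushing the cache down to this level is exactly why it would likely require forking the library: the public APIs operate on whole shapes, not individual cacheable operations.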
7 years ago I made a project with 100 calculation dependencies (in Python and SQL scripts), and the only thing that kept me from losing track was Makefile + GraphViz.
I wanted to make something similar in Rust: a static visualizer of dependencies between structs. It turned out much harder than expected.
Is mandala designed only for notebook-style interactive environments, or also for running Python scripts more traditionally? In the latter case, could it be integrated into a GitOps-like environment (push to git, new run in CI/CD, export logged metrics as an artifact with an easy way to compare to previous runs)?