Launch HN: Outerport (YC S24) – Instant hot-swapping for AI model weights
Towaki and Allen are developing Outerport, a distribution network that optimizes AI model deployment by caching and orchestrating model weights, enabling models to be swapped in seconds and potentially reducing GPU costs by 40%.
Towaki and Allen are developing Outerport, a distribution network for AI model weights that enables 'hot-swapping' of models to reduce GPU costs. Different models can be served from the same GPU machine with swap times of approximately two seconds, far faster than loading a model from scratch.

The motivation stems from the high cost of running AI models on cloud GPUs, which are billed by usage time. Loading large models into GPU memory is slow, so teams overprovision machines to avoid start-up delays, leaving hardware underutilized. Outerport addresses this with a hierarchical caching system that manages model weights across storage tiers, paired with a dedicated daemon process for model management and orchestration, so weights can be loaded into GPU memory rapidly while keeping data transfer costs down.

Initial simulations indicate that Outerport's multi-model serving scheme can cut GPU running-time costs by 40%, since it improves horizontal scaling and reduces the need for additional machines. The founders, whose backgrounds span machine learning and operations research, plan to release Outerport under an open-core model.
- Outerport enables 'hot-swapping' of AI models, reducing GPU costs and improving efficiency.
- The system addresses long start-up times and overprovisioning issues associated with large AI models.
- Initial simulations show a potential 40% reduction in GPU running time costs.
- The founders aim to release Outerport as an open-core model.
- The project combines expertise in machine learning and operations research.
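The hierarchical caching idea above can be illustrated with a toy sketch. This is not Outerport's actual implementation; the tier sizes, names, and eviction policy (LRU, demote-to-RAM-on-eviction) are hypothetical assumptions chosen to show why a warm swap from CPU memory is so much cheaper than a cold fetch from remote storage.

```python
from collections import OrderedDict

class WeightCache:
    """Toy two-tier cache: a small 'gpu' tier backed by a larger 'ram' tier.

    Illustrative only -- a real system like Outerport manages pinned host
    memory, disk, and actual GPU transfers. All names here are hypothetical.
    """
    def __init__(self, gpu_slots=1, ram_slots=4):
        self.gpu = OrderedDict()   # model name -> weights, in LRU order
        self.ram = OrderedDict()
        self.gpu_slots = gpu_slots
        self.ram_slots = ram_slots

    def load(self, name, fetch):
        # Hot path: model already resident in the GPU tier.
        if name in self.gpu:
            self.gpu.move_to_end(name)
            return self.gpu[name], "gpu-hit"
        # Warm path: promote from RAM -- the fast 'hot-swap' case.
        if name in self.ram:
            weights = self.ram.pop(name)
            source = "ram-hit"
        else:
            # Cold path: fetch from remote storage (slow and costly).
            weights = fetch(name)
            source = "cold-miss"
        self._evict_gpu_if_full()
        self.gpu[name] = weights
        return weights, source

    def _evict_gpu_if_full(self):
        # Demote the least-recently-used model to RAM instead of discarding it,
        # so a later request for it is a warm swap, not a cold download.
        while len(self.gpu) >= self.gpu_slots:
            evicted_name, evicted = self.gpu.popitem(last=False)
            self.ram[evicted_name] = evicted
            while len(self.ram) > self.ram_slots:
                self.ram.popitem(last=False)

cache = WeightCache(gpu_slots=1)
_, src = cache.load("model-a", lambda n: f"{n}-weights")  # cold-miss
_, src = cache.load("model-b", lambda n: f"{n}-weights")  # cold-miss; model-a demoted to RAM
_, src = cache.load("model-a", lambda n: f"{n}-weights")
print(src)  # ram-hit: the hot swap, no remote fetch needed
```

Serving requests for two models on one GPU this way avoids the overprovisioning the post describes: the second machine exists only to hide cold-load latency, and a warm tier removes that latency.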
Related
20x Faster Background Removal in the Browser Using ONNX Runtime with WebGPU
Using ONNX Runtime with WebGPU and WebAssembly in browsers achieves 20x speedup for background removal, reducing server load, enhancing scalability, and improving data security. ONNX models run efficiently with WebGPU support, offering near real-time performance. Leveraging modern technology, IMG.LY aims to enhance design tools' accessibility and efficiency.
On Open-Weights Foundation Models
The FTC Technology Blog explores open-weights foundation models in AI, likening them to open-source software for innovation and cost-effectiveness. It notes benefits but warns of licensing restrictions and misuse risks.
Show HN: We made glhf.chat – run almost any open-source LLM, including 405B
The platform allows running various large language models via Hugging Face repo links using vLLM and GPU scheduler. Offers free beta access with plans for competitive pricing post-beta using multi-tenant model running.
Four co's are hoarding billions worth of Nvidia GPU chips. Meta has 350K of them
Meta has launched Llama 3.1, a large language model outperforming GPT-4o on some benchmarks. The model's development involved significant investment in Nvidia GPUs, reflecting high demand for AI training resources.
Show HN: Attaching to a Virtual GPU over TCP
Thunder Compute provides a flexible, cost-efficient cloud-based GPU service with instant scaling, pay-per-use billing, high utilization rates, and strong security, benefiting enterprises by minimizing idle GPU time.
- Several users question how Outerport differs from existing methods of model deployment and management.
- There is interest in the integration of Outerport with various frameworks and inference servers.
- Some commenters express excitement about the potential cost savings and efficiency improvements for AI model deployment.
- Concerns are raised about competition with established companies in the model management space.
- Users inquire about the technical specifications and compatibility of different model architectures with Outerport.
import torch

def predict_proba(model_config, model_path, X):
    model = get_model(*model_config)  # get_model assumed defined elsewhere
    state_dict = torch.load(model_path, weights_only=True)
    # Strip the '_orig_mod.' prefix that torch.compile adds to checkpoint keys
    new_state_dict = {k.replace('_orig_mod.', ''): v for k, v in state_dict.items()}
    model.load_state_dict(new_state_dict)
    model.eval()
    with torch.no_grad():
        output = model(torch.FloatTensor(X))
        probabilities = torch.softmax(output, dim=1)  # softmax over the class dimension
    return probabilities.numpy()
This is really cool. Are the costs to run this mainly storage or how much compute is actually tied up in it?
The time and cost to download models onto a GPU cloud instance really add up when you are paying per second.
Do you imagine Outerport being a better fit for OSS model hosts like Replicate, Anyscale, etc. or for companies that are trying to host multiple models themselves?
Your use case mentioned speaks more to the latter, but it seems like the value at scale is with model hosting as a service companies.
Our inference stack is built using candle in Rust, how hard would it be to integrate?
Or can they be different types of models with different number of layers, etc?