Launch HN: Outerport (YC S24) – Instant hot-swapping for AI model weights
Towaki and Allen are developing Outerport, a distribution network that optimizes AI model deployment by caching and orchestrating model weights, enabling models to be swapped in seconds and potentially reducing GPU costs by 40%.
Towaki and Allen are developing Outerport, a distribution network for AI model weights that enables 'hot-swapping' of models to reduce GPU costs. Different models can be served from the same GPU machine with swap times of approximately two seconds, far faster than loading a model from scratch.

The motivation stems from the high cost of running AI models on cloud GPUs, which are billed by usage time. Loading large models into GPU memory is slow, so teams overprovision machines to avoid start-up delays, leaving hardware underutilized. Outerport addresses this with a hierarchical caching system that manages model weights across storage tiers, paired with a dedicated daemon process for model management and orchestration, so weights can be loaded into GPU memory rapidly while keeping data transfer costs down.

Initial simulations indicate that Outerport's multi-model serving scheme can cut GPU running-time costs by 40%, since it improves horizontal scaling and reduces the need for additional machines. The founders, whose backgrounds span machine learning and operations research, plan to release Outerport under an open-core model.
- Outerport enables 'hot-swapping' of AI models, reducing GPU costs and improving efficiency.
- The system addresses long start-up times and overprovisioning issues associated with large AI models.
- Initial simulations show a potential 40% reduction in GPU running time costs.
- The founders aim to release Outerport as an open-core model.
- The project combines expertise in machine learning and operations research.
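The hierarchical caching idea above can be illustrated with a toy sketch. This is not Outerport's actual implementation; the tier sizes, names, and eviction policy (LRU, demote-to-RAM-on-eviction) are hypothetical assumptions chosen to show why a warm swap from CPU memory is so much cheaper than a cold fetch from remote storage.

```python
from collections import OrderedDict

class WeightCache:
    """Toy two-tier cache: a small 'gpu' tier backed by a larger 'ram' tier.

    Illustrative only -- a real system like Outerport manages pinned host
    memory, disk, and actual GPU transfers. All names here are hypothetical.
    """
    def __init__(self, gpu_slots=1, ram_slots=4):
        self.gpu = OrderedDict()   # model name -> weights, in LRU order
        self.ram = OrderedDict()
        self.gpu_slots = gpu_slots
        self.ram_slots = ram_slots

    def load(self, name, fetch):
        # Hot path: model already resident in the GPU tier.
        if name in self.gpu:
            self.gpu.move_to_end(name)
            return self.gpu[name], "gpu-hit"
        # Warm path: promote from RAM -- the fast 'hot-swap' case.
        if name in self.ram:
            weights = self.ram.pop(name)
            source = "ram-hit"
        else:
            # Cold path: fetch from remote storage (slow and costly).
            weights = fetch(name)
            source = "cold-miss"
        self._evict_gpu_if_full()
        self.gpu[name] = weights
        return weights, source

    def _evict_gpu_if_full(self):
        # Demote the least-recently-used model to RAM instead of discarding it,
        # so a later request for it is a warm swap, not a cold download.
        while len(self.gpu) >= self.gpu_slots:
            evicted_name, evicted = self.gpu.popitem(last=False)
            self.ram[evicted_name] = evicted
            while len(self.ram) > self.ram_slots:
                self.ram.popitem(last=False)

cache = WeightCache(gpu_slots=1)
_, src = cache.load("model-a", lambda n: f"{n}-weights")  # cold-miss
_, src = cache.load("model-b", lambda n: f"{n}-weights")  # cold-miss; model-a demoted to RAM
_, src = cache.load("model-a", lambda n: f"{n}-weights")
print(src)  # ram-hit: the hot swap, no remote fetch needed
```

Serving requests for two models on one GPU this way avoids the overprovisioning the post describes: the second machine exists only to hide cold-load latency, and a warm tier removes that latency.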
Related
20x Faster Background Removal in the Browser Using ONNX Runtime with WebGPU
Using ONNX Runtime with WebGPU and WebAssembly in browsers achieves 20x speedup for background removal, reducing server load, enhancing scalability, and improving data security. ONNX models run efficiently with WebGPU support, offering near real-time performance. Leveraging modern technology, IMG.LY aims to enhance design tools' accessibility and efficiency.
On Open-Weights Foundation Models
The FTC Technology Blog explores open-weights foundation models in AI, likening them to open-source software for innovation and cost-effectiveness. It notes benefits but warns of licensing restrictions and misuse risks.
Show HN: We made glhf.chat – run almost any open-source LLM, including 405B
The platform allows running various large language models via Hugging Face repo links using vLLM and GPU scheduler. Offers free beta access with plans for competitive pricing post-beta using multi-tenant model running.
Four co's are hoarding billions worth of Nvidia GPU chips. Meta has 350K of them
Meta has launched Llama 3.1, a large language model outperforming GPT-4o on some benchmarks. The model's development involved significant investment in Nvidia GPUs, reflecting high demand for AI training resources.
Show HN: Attaching to a Virtual GPU over TCP
Thunder Compute provides a flexible, cost-efficient cloud-based GPU service with instant scaling, pay-per-use billing, high utilization rates, and strong security, benefiting enterprises by minimizing idle GPU time.
- Several users question how Outerport differs from existing methods of model deployment and management.
- There is interest in the integration of Outerport with various frameworks and inference servers.
- Some commenters express excitement about the potential cost savings and efficiency improvements for AI model deployment.
- Concerns are raised about competition with established companies in the model management space.
- Users inquire about the technical specifications and compatibility of different model architectures with Outerport.
import torch

def predict_proba(model_config, model_path, X):
    model = get_model(*model_config)  # get_model assumed defined elsewhere
    state_dict = torch.load(model_path, weights_only=True)
    # Strip the '_orig_mod.' prefix that torch.compile adds to checkpoint keys
    new_state_dict = {k.replace('_orig_mod.', ''): v for k, v in state_dict.items()}
    model.load_state_dict(new_state_dict)
    model.eval()
    with torch.no_grad():
        output = model(torch.FloatTensor(X))
        probabilities = torch.softmax(output, dim=1)  # softmax over the class dimension
    return probabilities.numpy()
This is really cool. Are the costs to run this mainly storage or how much compute is actually tied up in it?
The time and cost to download models onto a GPU cloud instance really add up when you are paying per second.
Do you imagine Outerport being a better fit for OSS model hosts like Replicate, Anyscale, etc. or for companies that are trying to host multiple models themselves?
Your use case mentioned speaks more to the latter, but it seems like the value at scale is with model hosting as a service companies.
Our inference stack is built using candle in Rust, how hard would it be to integrate?
Or can they be different types of models with different number of layers, etc?