August 19th, 2024

Show HN: Hotshot – 4 Person Team Builds a State of the Art Video Model

A four-person team developed Hotshot, a text-to-video model that generates 10-second videos at 720p, with a reported 70% user preference rate over other text-to-video models. The project took four months and faced significant data and infrastructure challenges.

A four-person team has developed Hotshot, a large-scale diffusion transformer model for text-to-video generation. Hotshot is noted for its prompt alignment, consistency, and ability to produce videos of up to 10 seconds at 720p resolution. The team has trained three models over the past 13 months, starting with Hotshot-XL, which generated 1-second videos, and progressing to Hotshot Act-One, which produced 3-second videos. The latest model, Hotshot, has shown a 70% preference rate among users compared to other text-to-video models.

The team faced significant data engineering challenges, scaling its dataset to 600 million video clips and 1 billion images. It also developed a video captioner to improve temporal understanding, which involved managing thousands of GPUs and optimizing training processes.

The training phase revealed complexities in infrastructure management and optimization, with the team likening the experience to launching a rocket. The team implemented various strategies to handle GPU failures and data streaming issues, ultimately compressing the dataset to ensure efficient training. The entire process took four months and millions of GPU hours.

- Hotshot is a text-to-video model capable of generating 10-second videos at 720p.

- The team trained three models over 13 months, with Hotshot showing a 70% user preference over competitors.

- Significant challenges included scaling datasets and managing thousands of GPUs.

- The training process involved complex infrastructure and optimization strategies.

- The project took four months and millions of GPU hours to complete.

Related

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision

FlashAttention-3 is a faster attention implementation for Transformers on Hopper GPUs. By exploiting asynchrony and low-precision computation, it runs 1.5-2x faster than FlashAttention-2 with FP16 and reaches up to 75% of the H100's theoretical maximum FLOPS, while FP8 support allows quicker computation at reduced cost. It is optimized for new hardware features, and integration into PyTorch is planned.

Four co's are hoarding billions worth of Nvidia GPU chips. Meta has 350K of them

Meta has launched Llama 3.1, a large language model that outperforms GPT-4o on some benchmarks. The model's development involved significant investment in Nvidia GPUs, reflecting high demand for AI training resources.

Diffusion Training from Scratch on a Micro-Budget

The paper presents a cost-effective method for training text-to-image generative models by masking image patches and using synthetic images, achieving competitive performance at significantly lower costs.

Black Forest Labs – FLUX.1 open weights SOTA text to image model

Black Forest Labs has launched to develop generative deep learning models for media, securing $31 million in funding. Their FLUX.1 suite includes three model variants, outperforming competitors in image synthesis.

Forget Midjourney – Flux is the new king of AI image generation

Flux, an open-source AI image generator by Black Forest Labs, competes with Midjourney and Stable Diffusion, offering three versions and a developing text-to-video model for enhanced media production.

6 comments

By @souravdxb - 8 months

The quality of output is best so far

https://optimus.hotshot.co/shot/PFhq

By @FractalHQ - 8 months

Wait why does the title say you built “Sora”? Isn’t that the OpenAI project?

By @juliawu - 8 months

This is mindblowing, both in terms of quality and how quickly it was built with lean resources. Congrats on the launch!

By @vectoral - 8 months

Great to see this launch -- excited to see what you guys do!

By @sonny3690 - 8 months

i’ve tried a bunch of the models and this one is unbelievably good! congrats - can’t believe this was done by a 4 person team

By @souravdxb - 8 months

Kudos on this launch. Much awaited!!
