August 3rd, 2024

AiOla open-sources ultra-fast 'multi-head' speech recognition model

aiOla has launched Whisper-Medusa, an open-source AI model that makes speech recognition more than 50% faster. It understands industry jargon in real time and operates in over 100 languages.


aiOla has introduced Whisper-Medusa, an open-source AI model that combines OpenAI’s Whisper with aiOla’s own modifications to make automatic speech recognition more than 50% faster without sacrificing accuracy. Whisper-Medusa predicts ten tokens at a time, compared with Whisper’s single-token prediction, which significantly accelerates processing, particularly for long-form audio. It currently ships as a 10-head version, with a 20-head version planned, and its weights and code are available on Hugging Face and GitHub.
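The mechanism behind the speedup resembles Medusa-style speculative decoding for language models. The toy PyTorch sketch below is not aiOla's code; every name and size in it (ToyDecoder, K_HEADS, the GRU decoder) is invented purely to illustrate the idea: extra heads guess several future tokens from one decoder state, the standard head verifies those guesses, and each step can therefore accept more than one token.

```python
import torch
import torch.nn as nn

# Toy illustration of Medusa-style multi-head decoding (not aiOla's code).
# A base decoder produces one hidden state per step; K extra linear heads each
# guess the token k positions further ahead from that same state. The guesses
# are then checked against the base head, so a single step can accept several
# tokens instead of one.

VOCAB, HIDDEN, K_HEADS = 1000, 64, 10

class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)  # standard next-token head
        self.medusa_heads = nn.ModuleList(
            nn.Linear(HIDDEN, VOCAB) for _ in range(K_HEADS)  # heads for offsets +2 .. +K+1
        )

    def hidden(self, tokens):
        out, _ = self.rnn(self.embed(tokens))
        return out[:, -1]  # hidden state after the last token

def medusa_step(model, tokens):
    """Propose 1 + K_HEADS tokens from one decoder pass, then verify them."""
    h = model.hidden(tokens)
    proposal = [model.lm_head(h).argmax(-1)]                        # offset +1
    proposal += [head(h).argmax(-1) for head in model.medusa_heads]

    accepted, ctx = [], tokens
    for tok in proposal:
        # Real implementations verify all candidates in one batched forward
        # pass; a loop is used here only to keep the idea readable.
        check = model.lm_head(model.hidden(ctx)).argmax(-1)
        if not torch.equal(check, tok):
            accepted.append(check)   # keep the token the base model agrees on
            break
        accepted.append(tok)
        ctx = torch.cat([ctx, tok.unsqueeze(1)], dim=1)
    return torch.cat([tokens, torch.stack(accepted, dim=1)], dim=1)

model = ToyDecoder()
seq = torch.randint(0, VOCAB, (1, 5))        # fake 5-token prompt
print(medusa_step(model, seq).shape)         # grows by at least one token per step
```

With trained heads that usually agree with the base decoder, most of the ten guesses are accepted, which is where a large reduction in decoding passes would come from; with the untrained toy above, typically only the first token survives verification.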

Whisper-Medusa is designed to support businesses by streamlining operations and improving efficiency: it understands industry-specific jargon in real time without prior retraining. The technology lets frontline workers complete tasks by voice or touch, turning unstructured speech data into actionable insights. This is useful across sectors such as aviation, food manufacturing, logistics, and healthcare, since the model comprehends over 100 languages and adapts to different accents and acoustic environments.

With a reported accuracy of over 95%, Whisper-Medusa empowers businesses to optimize processes, reduce costs, and enhance resource allocation, all while maintaining existing workflows. The introduction of this model marks a significant advancement in speech recognition technology, providing organizations with a powerful tool to improve operational efficiency.

5 comments
By @BetterWhisper - 9 months
Does it do speaker recognition/diarization? Can't see it in the repo README.
By @gronky_ - 9 months
By @Doohickey-d - 9 months
I'm curious which of the Whisper derivatives is actually the fastest?

faster-whisper claims a 4x speedup over base Whisper, and I've found WhisperX to be faster still (for longer audio, where it can do batch inference), at least on consumer GPUs.

So with AiOla saying "50% speedup", is that actually noteworthy?
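Published speedup figures for the various Whisper forks are measured under different hardware, batch sizes, and audio lengths, so the only comparison that really answers this is timing them on your own files. A minimal sketch of such a timing run, where "audio.wav", the "base" model size, and the CPU/int8 settings are placeholders to swap for your setup (WhisperX or Whisper-Medusa could be added the same way):

```python
import time

AUDIO = "audio.wav"  # placeholder: any long-form recording you have locally

def bench(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.1f}s")

def run_openai_whisper():
    import whisper                               # pip install openai-whisper
    model = whisper.load_model("base")
    model.transcribe(AUDIO)

def run_faster_whisper():
    from faster_whisper import WhisperModel      # pip install faster-whisper
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _ = model.transcribe(AUDIO, beam_size=5)
    list(segments)                               # transcription is lazy; iterate to finish it

bench("openai-whisper", run_openai_whisper)
bench("faster-whisper", run_faster_whisper)
```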

By @phkahler - 9 months
IIRC Whisper works on wave files. Can this do real-time, low-latency continuous ASR?
By @qwertox - 9 months
Nothing of interest here; it's an ad.

If you're interested, you might as well check out Gladia; at least they have a pricing section and let you use it as a developer, rather than just asking you to "Request a Demo".

And while a sibling comment links to the GitHub repository, their entire website does not contain such a link.

---

Edit: My bad, for some reason I first checked the website instead of the blog post. Looks much more interesting now.