August 3rd, 2024

AiOla open-sources ultra-fast 'multi-head' speech recognition model

aiOla has launched Whisper-Medusa, an open-source AI model that makes speech recognition more than 50% faster. It understands industry jargon in real time and operates in over 100 languages.


aiOla has introduced Whisper-Medusa, an open-source AI model that combines OpenAI’s Whisper with aiOla’s own modifications to make automatic speech recognition more than 50% faster without sacrificing accuracy. Whisper-Medusa predicts ten tokens at a time, compared with Whisper’s single-token prediction, which significantly accelerates processing, particularly for long-form audio. It currently ships as a 10-head version, with a 20-head version planned, and its weights and code are available on Hugging Face and GitHub.
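The mechanism behind the speedup resembles Medusa-style speculative decoding for language models. The toy PyTorch sketch below is not aiOla's code; every name and size in it (ToyDecoder, K_HEADS, the GRU decoder) is invented purely to illustrate the idea: extra heads guess several future tokens from one decoder state, the standard head verifies those guesses, and each step can therefore accept more than one token.

```python
import torch
import torch.nn as nn

# Toy illustration of Medusa-style multi-head decoding (not aiOla's code).
# A base decoder produces one hidden state per step; K extra linear heads each
# guess the token k positions further ahead from that same state. The guesses
# are then checked against the base head, so a single step can accept several
# tokens instead of one.

VOCAB, HIDDEN, K_HEADS = 1000, 64, 10

class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)  # standard next-token head
        self.medusa_heads = nn.ModuleList(
            nn.Linear(HIDDEN, VOCAB) for _ in range(K_HEADS)  # heads for offsets +2 .. +K+1
        )

    def hidden(self, tokens):
        out, _ = self.rnn(self.embed(tokens))
        return out[:, -1]  # hidden state after the last token

def medusa_step(model, tokens):
    """Propose 1 + K_HEADS tokens from one decoder pass, then verify them."""
    h = model.hidden(tokens)
    proposal = [model.lm_head(h).argmax(-1)]                        # offset +1
    proposal += [head(h).argmax(-1) for head in model.medusa_heads]

    accepted, ctx = [], tokens
    for tok in proposal:
        # Real implementations verify all candidates in one batched forward
        # pass; a loop is used here only to keep the idea readable.
        check = model.lm_head(model.hidden(ctx)).argmax(-1)
        if not torch.equal(check, tok):
            accepted.append(check)   # keep the token the base model agrees on
            break
        accepted.append(tok)
        ctx = torch.cat([ctx, tok.unsqueeze(1)], dim=1)
    return torch.cat([tokens, torch.stack(accepted, dim=1)], dim=1)

model = ToyDecoder()
seq = torch.randint(0, VOCAB, (1, 5))        # fake 5-token prompt
print(medusa_step(model, seq).shape)         # grows by at least one token per step
```

With trained heads that usually agree with the base decoder, most of the ten guesses are accepted, which is where a large reduction in decoding passes would come from; with the untrained toy above, typically only the first token survives verification.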

Whisper-Medusa is designed to support businesses by streamlining operations and improving efficiency: it understands industry-specific jargon in real time without prior retraining. The technology lets frontline workers complete tasks by voice or touch, turning unstructured speech data into actionable insights. This is useful across sectors such as aviation, food manufacturing, logistics, and healthcare, since the model comprehends over 100 languages and adapts to different accents and acoustic environments.

With a reported accuracy of over 95%, Whisper-Medusa empowers businesses to optimize processes, reduce costs, and enhance resource allocation, all while maintaining existing workflows. The introduction of this model marks a significant advancement in speech recognition technology, providing organizations with a powerful tool to improve operational efficiency.

5 comments
By @BetterWhisper - 9 months
Does it do speaker recognition/diarization? Can't see it in the repo README.
By @gronky_ - 9 months
By @Doohickey-d - 9 months
I'm curious which of the Whisper derivatives is actually the fastest?

faster-whisper claims a 4x speedup over base Whisper, and I've found WhisperX to be faster still (for longer audio, where it can do batch inference), at least on consumer GPUs.

So with AiOla saying "50% speedup", is that actually noteworthy?
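Published speedup figures for the various Whisper forks are measured under different hardware, batch sizes, and audio lengths, so the only comparison that really answers this is timing them on your own files. A minimal sketch of such a timing run, where "audio.wav", the "base" model size, and the CPU/int8 settings are placeholders to swap for your setup (WhisperX or Whisper-Medusa could be added the same way):

```python
import time

AUDIO = "audio.wav"  # placeholder: any long-form recording you have locally

def bench(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.1f}s")

def run_openai_whisper():
    import whisper                               # pip install openai-whisper
    model = whisper.load_model("base")
    model.transcribe(AUDIO)

def run_faster_whisper():
    from faster_whisper import WhisperModel      # pip install faster-whisper
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _ = model.transcribe(AUDIO, beam_size=5)
    list(segments)                               # transcription is lazy; iterate to finish it

bench("openai-whisper", run_openai_whisper)
bench("faster-whisper", run_faster_whisper)
```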

By @phkahler - 9 months
IIRC Whisper works on wave files. Can this do real-time, low-latency continuous ASR?
By @qwertox - 9 months
Nothing of interest here; it's an ad.

If you're interested, you might as well check out Gladia; at least they have a pricing section and let you use it as a developer, rather than just asking you to "Request a Demo".

And while a sibling comment links to the GitHub repository, their entire website does not contain such a link.

---

Edit: My bad, for some reason I first checked the website instead of the blog post. Looks much more interesting now.