Language model can listen while speaking
Recent advancements in speech language models have led to the listening-while-speaking language model (LSLM), which enables real-time interaction and robust performance in interactive speech dialogue systems.
Recent advancements in speech language models (SLMs) have improved speech-based conversational AI, but these models typically operate in a turn-based manner, which limits real-time interaction. To overcome this, researchers have introduced the listening-while-speaking language model (LSLM), which enables full duplex communication in interactive speech language models (iSLM). This end-to-end system integrates listening and speaking capabilities, using a token-based decoder-only text-to-speech (TTS) model for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. The LSLM explores three fusion strategies (early, middle, and late fusion), with middle fusion yielding the best balance between speech generation and real-time interaction. Experimental results on command-based and voice-based full duplex modeling demonstrate the LSLM's robustness to noise and its sensitivity to diverse instructions. The findings suggest that the LSLM can add duplex communication with minimal disruption to existing systems, advancing the development of interactive speech dialogue systems for practical applications.
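To make the three fusion points concrete, here is a minimal PyTorch sketch of where a listening channel could be merged into a decoder-only generator. Everything in it (the `FusionLSLM` name, the simple additive fusion, the dimensions, and the smoke test) is an illustrative assumption for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class FusionLSLM(nn.Module):
    """Decoder-only speech LM with a fused streaming listening channel.

    `fusion` selects where the listening features enter the model:
    "early" (input embeddings), "middle" (every Transformer block),
    or "late" (output logits).
    """

    def __init__(self, vocab_size=1024, d_model=512, n_layers=6, fusion="middle"):
        super().__init__()
        self.fusion = fusion
        self.speak_embed = nn.Embedding(vocab_size, d_model)  # speaking-channel tokens
        self.listen_proj = nn.Linear(d_model, d_model)        # streaming SSL features
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, speak_tokens, listen_feats):
        # speak_tokens: (B, T) token ids; listen_feats: (B, T, d_model) SSL features
        h = self.speak_embed(speak_tokens)
        z = self.listen_proj(listen_feats)
        # Causal mask keeps generation autoregressive over the speaking channel.
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1)).to(h.device)

        if self.fusion == "early":
            h = h + z  # fuse once, before any Transformer block
        for block in self.blocks:
            if self.fusion == "middle":
                h = h + z  # fuse inside the stack, at every block
            h = block(h, src_mask=mask)
        logits = self.head(h)
        if self.fusion == "late":
            logits = logits + self.head(z)  # fuse at the output distribution
        return logits


# Tiny smoke test with random inputs.
model = FusionLSLM(fusion="middle")
tokens = torch.randint(0, 1024, (1, 20))
feats = torch.randn(1, 20, 512)
print(model(tokens, feats).shape)  # torch.Size([1, 20, 1024])
```

In this sketch, middle fusion lets the listening features condition every layer of the stack, which is consistent with the paper's finding that it gives the best trade-off between generation quality and real-time responsiveness.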
- The listening-while-speaking language model (LSLM) enables real-time interaction in speech-based AI.
- LSLM integrates both listening and speaking channels for full duplex communication.
- The middle fusion strategy best balances speech generation quality and real-time interaction.
- Experimental results show LSLM's robustness to noise and sensitivity to diverse instructions.
- The study aims to improve the applicability of interactive speech dialogue systems in real-world contexts.