FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Jiaqi Li¹, Chaoren Wang¹, Xiaohai Tian², Mingjie Chen¹, Xinyu Liang¹, Xu Li¹, Yufan Lin¹, Junwen Qiu¹, Jun Zhang², Lu Lu², Haizhou Li¹, Zhizheng Wu¹

¹ The Chinese University of Hong Kong, Shenzhen  ² ByteDance

FlexiSLM — Overview

Existing spoken language models (SLMs) typically use a fixed speech-token frame rate (for example, 25 Hz or 12.5 Hz). This fixed-rate design cannot adapt to time-varying speech complexity and does not offer a direct speed-quality trade-off at inference time. We introduce FlexiSLM, the first SLM that supports dynamic and controllable frame rates on both speech input and output. A single trained model can be steered from 12.5 Hz down to 4.0 Hz without retraining.
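To make the speed-quality trade-off concrete, the sketch below estimates how many speech tokens a clip occupies at the frame rates mentioned above. It is illustrative arithmetic only: the helper name `tokens_for` and the ceiling rounding are assumptions, and with dynamic frame rates the given rate should be read as an average over the utterance rather than a fixed value.

```python
import math


def tokens_for(duration_s: float, frame_rate_hz: float) -> int:
    """Approximate token count for a clip (hypothetical helper).

    Assumes one token per frame and rounds up partial frames; the real
    tokenizer may pad or round differently.
    """
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    # Frame rates quoted in the overview: 25 Hz and 12.5 Hz (typical fixed
    # rates) and 4.0 Hz (the lowest rate FlexiSLM is steered to).
    for rate in (25.0, 12.5, 4.0):
        print(f"{rate:>5} Hz -> {tokens_for(10.0, rate)} tokens for a 10 s clip")
```

Under these assumptions, a 10-second clip shrinks from 125 tokens at 12.5 Hz to 40 tokens at 4.0 Hz, which is where the inference-time speed gain comes from.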

Key contributions

FlexiSLM architecture
Overall FlexiSLM architecture: a Thinker-Talker model with dynamic frame-rate compression on speech input and controllable frame-rate generation on speech output.
The figure above shows the FlexiSLM architecture. Training proceeds in three stages:
  1. Talker pre-training. Freeze the LLM backbone and train only the randomly initialized Talker end to end on about 100K hours of English TTS data.
  2. Multi-task LoRA fine-tuning. Activate the input-side Frame Merging Module, Thinker, and Talker; apply LoRA to the Thinker and train on mixed speech tasks.
  3. Full fine-tuning. Continue from Stage 2, merge the LoRA updates into the LLM, train all parameters, and enable the Talker-to-Thinker connection to improve speech perception and generation quality.
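The three stages above can be summarized as a small schedule. This is a sketch, not the authors' training code: the module names (`frame_merging`, `thinker_lora`, and so on) are hypothetical labels, and it assumes that the modules "activated" in Stage 2 are also updated there and that the Talker-to-Thinker connection stays off before Stage 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset      # hypothetical module labels updated in this stage
    lora_on_thinker: bool     # True while the Thinker is adapted through LoRA
    talker_to_thinker: bool   # True once the Talker-to-Thinker connection is on


STAGES = (
    # Stage 1: LLM backbone frozen, only the Talker is trained.
    Stage("talker_pretraining", frozenset({"talker"}), False, False),
    # Stage 2: LoRA adapters on the Thinker; base Thinker weights stay frozen
    # (standard LoRA behaviour, assumed here).
    Stage(
        "multitask_lora",
        frozenset({"frame_merging", "thinker_lora", "talker"}),
        True,
        False,
    ),
    # Stage 3: LoRA updates merged into the LLM, all parameters trained.
    Stage(
        "full_finetuning",
        frozenset({"frame_merging", "thinker", "talker"}),
        False,
        True,
    ),
)

if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, sorted(stage.trainable))
```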

Comparison with Baseline Systems — Speech QA Performance

Each card is one evaluation case. By default we show one sample per dataset; pick a specific dataset in the filter to see all of its samples.

Notation. FlexiSLM A → B means A Hz input and B Hz output frame rates. For example, FlexiSLM 6.25 → 12.5 accepts speech encoded at 6.25 Hz and emits speech tokens at 12.5 Hz.
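As a small illustration of this notation, the following sketch parses a label such as `FlexiSLM 6.25 → 12.5` into its input and output frame rates. The function and type names are hypothetical and not part of any released FlexiSLM tooling; accepting an ASCII `->` as an alternative arrow is also an assumption.

```python
import re
from typing import NamedTuple


class FrameRates(NamedTuple):
    input_hz: float
    output_hz: float


# "FlexiSLM A → B": A is the input frame rate, B the output frame rate (Hz).
_LABEL = re.compile(
    r"^FlexiSLM\s+(\d+(?:\.\d+)?)\s*(?:→|->)\s*(\d+(?:\.\d+)?)$"
)


def parse_label(label: str) -> FrameRates:
    """Split a system label into (input_hz, output_hz)."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a FlexiSLM A → B label: {label!r}")
    return FrameRates(float(match.group(1)), float(match.group(2)))


if __name__ == "__main__":
    rates = parse_label("FlexiSLM 6.25 → 12.5")
    print(rates.input_hz, rates.output_hz)
```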

FlexiSLM-Data Showcase

FlexiSLM-Data is the large-scale speech-to-speech dialogue corpus we use to train FlexiSLM. This section presents curated user-assistant speech pairs to give an intuitive view of the data style and interaction quality. Each card shows the user's spoken prompt and the assistant's spoken response, together with the transcript text and approximate duration and audio-token statistics from the manifest.

The corpus is built with a single-turn construction pipeline:

TTS Demo

This section shows repeat-after-text TTS samples from LibriSpeech prompts. Each card includes the prompt text, the target repeated sentence, and generated audio at five output frame rates (4.0, 5.0, 6.25, 8.0, 12.5 Hz).

Audio Understanding Demo

This section shows seven random samples from the LLaSO-Eval understanding benchmark, one per task. Each card shows the prompt, the FlexiSLM text response, the ground-truth label, and the source input audio.