May 5, 2026 | arXiv:2605.03937

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

AI Summary

MiniMind-O is a 0.1B-scale open omni model that accepts text, speech, and image inputs and produces text and streaming speech outputs. It uses a full MiniMind backbone as the "Thinker" and a lightweight four-layer "Talker" built from MiniMind blocks, with frozen SenseVoice-Small and SigLIP2 encoders for speech and image feature extraction. The paper identifies three key design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface. The reported models reach average CERs around 0.09 and voice-cloning similarities around 0.59.

Key Contribution

Open-sourcing a 0.1B-scale speech-native omni model lets you directly inspect the complete interaction loop and reveals critical design choices for building effective small multimodal models.

Abstract

MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio-code context rather than a separate TTS module. With a 768-dimensional Talker, the dense and MoE variants reach average CERs of 0.0897 and 0.0900 in Thinker-Talker consistency evaluation, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface.
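For readers unfamiliar with the CER figures quoted above, the following is a minimal sketch of the standard character error rate: the character-level Levenshtein distance between a reference and a hypothesis, divided by the reference length. It assumes, without confirmation from the abstract, that the consistency evaluation compares the Thinker's text with a transcript of the Talker's speech. The function names and example strings are hypothetical, and the report's actual text normalization and transcription setup are not specified here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: text from the Thinker versus a transcript
    # of the speech produced by the Talker (illustrative strings only).
    thinker_text = "the weather is sunny today"
    talker_transcript = "the weather is sunny to day"
    print(f"CER = {cer(thinker_text, talker_transcript):.4f}")
```

Under this definition, an average CER near 0.09 corresponds to roughly nine character-level edits per hundred reference characters. Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference.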

Multimodal Models | Open-Source Models & Weights | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
