Marvis-AI

Marvis-AI/marvis-tts-100m-v0.2


Model Documentation

Introduction


Marvis is a conversational speech model designed for real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, it addresses the growing need for high-quality, real-time voice synthesis that can run on consumer devices such as Apple Silicon Macs, iPhones, and iPads.

Key Features



  • Real-time Streaming: Stream audio chunks as text is processed, enabling natural conversational flow
  • Compact Size: Only 300 MB when quantized, enabling on-device inference
  • Edge Deployment: Optimized for real-time Speech-to-Speech (STS) on mobile devices (e.g., iPad and iPhone)
  • Natural Audio Flow: Process entire text context for coherent speech synthesis without chunking artifacts
  • Multimodal Architecture: Seamlessly handles interleaved text and audio tokens
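To make the "chunking artifacts" point concrete, here is a minimal sketch of the kind of regex-based sentence splitting that chunk-based TTS pipelines often apply before synthesis. This illustrates the approach Marvis avoids and is not part of Marvis; the function name `regex_chunks` and the splitting pattern are assumptions chosen for illustration.

```python
import re

def regex_chunks(text):
    """Split text after sentence-ending punctuation (illustrative pattern only)."""
    # Lookbehind keeps the punctuation with the preceding chunk.
    return [c for c in re.split(r"(?<=[.!?])\s+", text.strip()) if c]

text = "Is it ready? Yes. Ship it today!"
chunks = regex_chunks(text)

# A chunked pipeline would synthesize each piece independently, so the
# intonation of "Yes." cannot depend on the question that precedes it.
for chunk in chunks:
    print(repr(chunk))
```

A model that conditions on the full text sees all three sentences at once, which is what the feature above refers to.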


    Supported Languages



    Currently optimized for English, French, and German.

    Quick Start



    Using MLX



    Real audio streaming:

    bash
    pip install -U mlx-audio
    mlx_audio.tts.generate --model Marvis-AI/marvis-tts-100m-v0.2  --stream \
     --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
    


    Voice cloning:

    bash
    mlx_audio.tts.generate --model Marvis-AI/marvis-tts-100m-v0.2  --stream \
     --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." --ref_audio ./conversational_a.wav
    


    You can pass any audio file to clone the voice from, or select a sample audio file from here.
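If you want to drive the same commands from a script, a minimal sketch follows. It only assembles the command line from the flags shown above (`--model`, `--stream`, `--text`, `--ref_audio`); the helper name `build_command` is hypothetical, and running the result assumes `mlx-audio` is installed and on your PATH.

```python
import shlex

MODEL = "Marvis-AI/marvis-tts-100m-v0.2"

def build_command(text, ref_audio=None, stream=True):
    """Assemble the mlx_audio.tts.generate argument list (hypothetical helper)."""
    cmd = ["mlx_audio.tts.generate", "--model", MODEL]
    if stream:
        cmd.append("--stream")
    cmd += ["--text", text]
    if ref_audio is not None:
        # Reference audio enables voice cloning, as in the example above.
        cmd += ["--ref_audio", str(ref_audio)]
    return cmd

cmd = build_command("Hello from Marvis.", ref_audio="./conversational_a.wav")
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To execute it (requires mlx-audio):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when the text contains quotes or punctuation.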

    Model Description



    Marvis is built on the Sesame CSM-1B (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses Kyutai's Mimi codec. The architecture enables end-to-end training while maintaining low-latency generation, and it employs a dual-transformer approach:

  • Multimodal Backbone (100M parameters): Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.


  • Audio Decoder (60M parameters): A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.


    Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.
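To illustrate what "RVQ tokens" and the split between the zeroth and remaining codebook levels mean, here is a toy residual quantizer. It is a sketch under loud assumptions: it uses scalars instead of vectors, three hand-picked codebooks instead of the 32 learned levels described above, and none of the names come from the Marvis or Mimi code.

```python
def nearest(codebook, x):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def rvq_encode(x, codebooks):
    """Quantize x level by level; each level encodes the previous residual."""
    tokens = []
    residual = x
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual -= codebook[idx]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected entry from every level to reconstruct the value."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

# Hypothetical codebooks, ordered coarse to fine (a real codec learns these).
CODEBOOKS = [
    [-1.0, 0.0, 1.0],
    [-0.25, 0.0, 0.25],
    [-0.0625, 0.0, 0.0625],
]

tokens = rvq_encode(0.8, CODEBOOKS)
zeroth, remaining = tokens[0], tokens[1:]

# Level 0 alone gives a coarse estimate; the later levels refine it.
coarse = CODEBOOKS[0][zeroth]
refined = rvq_decode(tokens, CODEBOOKS)
print(tokens, coarse, refined)
```

In the architecture described above, the backbone would play the role of predicting `zeroth`, and the audio decoder the role of predicting `remaining`, one frame at a time.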

    Use Cases



  • Real-time Voice Assistants: Deploy natural-sounding voice interfaces with custom voices
  • Content Creation: Generate voiceovers and narration with personalized voices
  • Accessibility Tools: Create personalized speech synthesis for communication aids
  • Interactive Applications: Build conversational AI with consistent voice identity
  • Podcast & Media: Generate natural-sounding speech for automated content


    Legal and Ethical Considerations



  • Users are responsible for complying with local laws regarding voice synthesis and impersonation
  • Consider intellectual property rights when cloning voices of public figures
  • Respect privacy laws and regulations in your jurisdiction
  • Obtain appropriate consent and permissions before deployment


    License & Agreement



    * Apache 2.0

    Citation



    If you use Marvis in your research or applications, please cite:

    bibtex
    @misc{marvis-tts-2025,
      title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
      author={Prince Canuma and Lucas Newman},
      year={2025}
    }
    


    Acknowledgments



    Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.

    ---

    Version: 0.2

    Release Date: 20/10/2025

    Creators: Prince Canuma & Lucas Newman

    Files & Weights

    Filename            Size
    model.safetensors   1.37 GB