Vita 1.5 – how to train an end-to-end speech model

GitHub – VITA-MLLM/VITA: ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

VITA-1.5 is a decent MLLM with speech and vision capabilities. It comes with a good paper that gives a clear recipe for taking a solid LLM (in this case Qwen 2.5) and adding audio and vision to it, enabling end-to-end speech interaction (you can see a video of it in practice!).

It can handle images and speech audio as input, and it handles video by sampling frames and treating them as sequential images, which lets it manage video inputs without a dedicated video processing module. The interesting part of the paper is the training recipe they provide. They used generally available datasets, plus some they sourced themselves: 11k hours of speech-transcription data and 3k hours of text-speech data. The recipe goes in multiple stages of fine-tuning:
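The frame-sampling trick is simple enough to sketch. Here is a minimal, illustrative version (not VITA's actual code): pick a fixed number of uniformly spaced frame indices, and each selected frame is then treated as an ordinary image input.

```python
def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Uniformly pick `num_samples` frame indices from a video with
    `num_frames` frames, so the video can be fed to the model as a
    sequence of ordinary images."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    # split the video into `num_samples` equal segments and take the
    # midpoint frame of each segment
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

The exact sample count and spacing are a design choice; the point is just that the video path reuses the image path rather than adding a video encoder.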

First, train the image understanding and question answering capability:

  1. Take 20% of the image-text caption data, pass images through a pretrained visual encoder and a vision adapter that connects the encoder to the LLM, and send the text to the LLM directly. Freeze the encoder and the LLM, and train just the adapter.
  2. Take all the image-text caption data and train the LLM, adapter, and vision encoder.
  3. Mix 20% of the caption data with all of the image question/answer data and train all three components again.
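The staged freezing above boils down to toggling `requires_grad` on different submodules between stages. A minimal PyTorch sketch (the module names and sizes are illustrative stand-ins, not the actual VITA architecture):

```python
import torch.nn as nn

class VLM(nn.Module):
    """Toy stand-in for the pipeline: vision encoder -> adapter -> LLM."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(32, 16)
        self.adapter = nn.Linear(16, 8)
        self.llm = nn.Linear(8, 8)

def set_trainable(model: nn.Module, parts: set[str]) -> None:
    """Freeze everything, then unfreeze only the listed submodules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(part) for part in parts)

model = VLM()
set_trainable(model, {"adapter"})                            # stage 1
# set_trainable(model, {"vision_encoder", "adapter", "llm"}) # stages 2-3
```

An optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` then only updates the unfrozen parts in each stage.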

Next, train the audio input:

  1. Take the speech-transcription data and train a speech encoder and a task guidance token, keeping the LLM frozen.
  2. Then take a subset of the caption and QA data and train the speech encoder, speech adapter, vision encoder, vision adapter, the state head (which helps the model keep track of which modality it's working in), and the LLM all at the same time. Half the time they replace the text in the caption or QA with the speech equivalent generated by a TTS system, to exercise the speech path.
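That 50/50 text-vs-speech substitution is easy to picture as a batch-building step. A hedged sketch, where `tts` stands in for whatever text-to-speech system generates the speech equivalents (the function and field names are assumptions, not VITA's API):

```python
import random

def build_batch(texts, tts, p_speech=0.5, rng=None):
    """For each caption/QA text, swap in its TTS-generated speech
    version with probability `p_speech`, so the speech encoder and
    adapter see the same supervision as the text path."""
    rng = rng or random.Random()
    batch = []
    for text in texts:
        if rng.random() < p_speech:
            batch.append({"modality": "speech", "input": tts(text)})
        else:
            batch.append({"modality": "text", "input": text})
    return batch
```

The appeal of this trick is that it needs no new annotation: every existing text example doubles as a speech example for free.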

Finally, they train the audio output:

  1. Use the text-speech data to train a codec model: the encoder maps speech into discrete speech tokens, and the decoder reconstructs speech from those tokens.
  2. Encode the text with the LLM embeddings and run it through two speech decoders to produce speech tokens, which the codec decoder turns back into audio.
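The output path above can be sketched as a simple pipeline. The function names here are illustrative placeholders for the components the recipe describes, not actual VITA interfaces:

```python
def generate_speech(text_embeddings, decoder_1, decoder_2, codec_decoder):
    """Sketch of the audio output path: LLM text embeddings pass
    through the two speech decoders in sequence, and the resulting
    speech tokens are decoded to audio by the codec decoder trained
    in step 1."""
    hidden = decoder_1(text_embeddings)      # first speech decoder
    speech_tokens = decoder_2(hidden)        # second speech decoder
    return codec_decoder(speech_tokens)      # speech tokens -> waveform
```

Because the codec decoder was already trained in step 1, the speech decoders only have to learn to emit valid speech tokens, not raw audio.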
