Build a Frontier Model
🎨
Step 8 of 11

Multimodal

Native any-to-any is the new default.

2025-2026 frontier models are natively multimodal: text, images, audio, and (increasingly) video share a unified token space and a single backbone. GPT-5.5 unifies all four modalities in one model. Gemini 2.5 Pro handles 1M-token multimodal contexts. Vision encoders evolved from CLIP → SigLIP → SigLIP 2. Real-time voice (Moshi, GPT-4o) and generative video (Sora, Veo 3) crossed into product reality.

Why it matters

  • Native multimodal training (Chameleon, GPT-4o, Gemini, Llama 4) outperforms bolt-on vision adapters.
  • Real-time voice (200ms latency) requires interleaved audio-text tokenization — not a separate ASR + TTS pipeline.
  • Video generation (Sora, Veo 3, Movie Gen) reached minute-length, audio-synchronized output, reshaping creative tooling.
  • Spatial intelligence and world models (Genie 2, Cosmos, World Labs) are the substrate for embodied AI and robotics.

State of the art

2025-2026
  • GPT-5.5 (2025) — unified text/image/audio/video in one architecture, end-to-end.
  • Gemini 2.5 Pro — 1M-token multimodal context, ranked #1 on long-context multimodal benchmarks.
  • Sora (OpenAI), Veo 3 (Google, native audio sync), Movie Gen (Meta) — minute-length video generation.
  • Moshi (Kyutai, Oct 2024) — full-duplex 200ms-latency voice via Mimi RVQ tokenization.
  • SigLIP 2 (Feb 2025) — current-best open vision encoder for downstream multimodal training.
  • Genie 2 (DeepMind, Dec 2024) and NVIDIA Cosmos (Jan 2025) — generative world models for robotics.

The recipe

A frontier-grade implementation, in order.

1

Vision encoder

SigLIP 2 for late fusion. NaViT (native resolution) for any-aspect-ratio images. Unified VQ-VAE/FSQ tokens for early fusion.
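
A minimal sketch of pulling patch tokens out of a SigLIP 2 encoder for late fusion, following Hugging Face transformers conventions for SigLIP-family models. The checkpoint id and the exact processor/model interface are assumptions; verify them against your transformers version.

```python
# Sketch: patch-level features from a SigLIP 2 vision tower for late fusion.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip2-base-patch16-224"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt).vision_model.eval()

image = Image.open("example.jpg").convert("RGB")
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = encoder(pixel_values=pixels)

patch_tokens = out.last_hidden_state  # (1, num_patches, hidden_dim)
print(patch_tokens.shape)             # these feed the fusion projector (step 2)
```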

2

Fusion strategy

Late fusion (LLaVA-style) is the cheap retrofit path for an existing LLM. Early/native fusion (Chameleon, GPT-4o) for new pretraining runs; strictly better at scale.
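
A minimal late-fusion connector sketch in the spirit of LLaVA-1.5's two-layer MLP projector; the class name and all dimensions are illustrative, not any specific model's.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer GELU MLP that maps vision patch tokens into the LLM
    embedding space (LLaVA-1.5-style mlp2x_gelu connector)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(patch_tokens)  # (batch, num_patches, llm_dim)

# Toy dimensions (assumptions, not any specific model's sizes).
projector = VisionProjector(vision_dim=768, llm_dim=4096)
patches = torch.randn(1, 196, 768)      # patch tokens from the vision encoder
text_embeds = torch.randn(1, 32, 4096)  # from the LLM's embedding table

# Late fusion: prepend projected image tokens to the text sequence.
sequence = torch.cat([projector(patches), text_embeds], dim=1)
print(sequence.shape)  # (1, 196 + 32, 4096)
```

LLaVA-1.5 reported that this two-layer connector beats a single linear projection; it remains the default cheap retrofit choice.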

3

Audio

Mimi or EnCodec RVQ at 12.5-25 Hz. Interleave audio + text tokens for real-time dialogue (Moshi-style). Whisper for transcription-only paths.
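
A sketch of Moshi-style interleaving, assuming a Mimi-like 12.5 Hz frame rate with one text token per audio frame. The codebook count, vocab offset, and the interleave helper are illustrative assumptions, not Moshi's actual layout.

```python
import torch

FRAME_HZ = 12.5        # Mimi-style frame rate
NUM_CODEBOOKS = 8      # RVQ depth (assumption)
AUDIO_OFFSET = 32_000  # shift audio codes past the text vocab (assumption)

def interleave(text_ids: torch.Tensor, rvq_codes: torch.Tensor) -> torch.Tensor:
    """text_ids: (frames,) one text token per audio frame (pad when silent).
    rvq_codes: (frames, NUM_CODEBOOKS) Mimi/EnCodec codes per frame.
    Returns one stream: [text_t, audio_t_cb0 .. cb7, text_t+1, ...]."""
    frames = text_ids.shape[0]
    audio_ids = rvq_codes + AUDIO_OFFSET
    merged = torch.cat([text_ids.unsqueeze(1), audio_ids], dim=1)
    return merged.reshape(frames * (1 + NUM_CODEBOOKS))

text = torch.randint(0, 32_000, (25,))             # 2 s of dialogue at 12.5 Hz
audio = torch.randint(0, 2048, (25, NUM_CODEBOOKS))
stream = interleave(text, audio)
print(stream.shape)  # (225,) = 25 frames * (1 text + 8 audio) tokens
```

One backbone over this single stream is what makes full-duplex, low-latency dialogue possible; a chained ASR + TTS pipeline cannot hit the same latency budget.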

4

Video

DiT (Diffusion Transformer) backbone. Spacetime VAE / MAGVIT-v2 tokenizer. Joint audio synthesis (Veo 3) requires audio-aware training.
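
A sketch of the DiT input side: turning a spacetime-VAE latent into a token sequence via tubelet patchification. Latent channels, patch sizes, and model width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpacetimePatchify(nn.Module):
    """Turn a spacetime-VAE latent (B, C, T, H, W) into a token sequence
    for a DiT backbone. All sizes are illustrative assumptions."""
    def __init__(self, latent_ch=16, d_model=1024, pt=2, ph=2, pw=2):
        super().__init__()
        # A Conv3d with stride == kernel size is a linear embedding of
        # non-overlapping spacetime tubelets.
        self.embed = nn.Conv3d(latent_ch, d_model,
                               kernel_size=(pt, ph, pw),
                               stride=(pt, ph, pw))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.embed(z)                    # (B, d, T/pt, H/ph, W/pw)
        return x.flatten(2).transpose(1, 2)  # (B, tokens, d_model)

latent = torch.randn(1, 16, 16, 32, 32)  # a short clip after the spacetime VAE
tokens = SpacetimePatchify()(latent)
print(tokens.shape)  # (1, 8 * 16 * 16, 1024) = (1, 2048, 1024)
```

The DiT then runs attention (factorized or full) over this token grid with diffusion-timestep conditioning; the tokenizer choice (spacetime VAE vs. MAGVIT-v2 discrete codes) decides whether the backbone diffuses or autoregresses.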

5

3D / world models

Generative video conditioned on agent actions (Genie 2). Cosmos for physics-grounded generation aimed at robotics.
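
A sketch of action conditioning in the Genie 2 spirit: each timestep's frame tokens are shifted by an embedding of the agent's action. The action vocabulary, additive conditioning, and dimensions are assumptions; Genie 2's actual mechanism is not public in this detail.

```python
import torch
import torch.nn as nn

NUM_ACTIONS, D_MODEL = 16, 1024
action_embed = nn.Embedding(NUM_ACTIONS, D_MODEL)  # learned action codes

frame_tokens = torch.randn(1, 8, 256, D_MODEL)   # (B, T, tokens/frame, d)
actions = torch.randint(0, NUM_ACTIONS, (1, 8))  # one action per frame

# Broadcast each step's action embedding over that frame's tokens, so the
# video generator predicts the next frame *given* the action taken.
conditioned = frame_tokens + action_embed(actions)[:, :, None, :]
print(conditioned.shape)  # (1, 8, 256, 1024)
```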

6

Multimodal eval

MMMU, MMBench, ChartQA for vision QA; LiveAudioBench for voice; VBench for video. New: spatial-reasoning evals (3DSRBench, World Labs benchmarks).
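
A minimal multiple-choice harness in the MMMU/MMBench style; `MCQItem` and the `model.answer` interface are hypothetical stand-ins for your own model wrapper, not a real library API.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    image_path: str
    question: str
    choices: list[str]   # e.g. ["A) ...", "B) ...", ...]
    answer: str          # gold letter, e.g. "B"

def accuracy(model, items: list[MCQItem]) -> float:
    """Exact-match letter accuracy over image-grounded multiple choice."""
    correct = 0
    for item in items:
        prompt = (f"{item.question}\n" + "\n".join(item.choices)
                  + "\nAnswer with a single letter.")
        pred = model.answer(image=item.image_path, text=prompt).strip()[:1]
        correct += pred.upper() == item.answer
    return correct / max(len(items), 1)
```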

⚠️

Common pitfalls

Multimodal training is data-bound — clean image-text pairs are scarcer than text. Synthetic captioning (e.g., Llama-rewrite) helps.
Tokenization mismatch across modalities causes severe imbalance: text dominates the loss unless you reweight per modality (see the sketch after this list).
Vision encoders developed for CLIP-style retrieval underperform for VLM tasks — use SigLIP/SigLIP 2.
Video generation evaluation is unsolved; VBench helps, but human preference still rules.
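
One way to counter the imbalance is per-modality loss reweighting, sketched below. The weights and the modality-id convention are illustrative assumptions to tune against your data mix.

```python
import torch
import torch.nn.functional as F

# Illustrative weights for {0: text, 1: image, 2: audio} tokens.
MODALITY_WEIGHT = torch.tensor([1.0, 4.0, 2.0])

def weighted_lm_loss(logits, targets, modality_ids):
    """logits: (B, L, V); targets: (B, L); modality_ids: (B, L) in {0, 1, 2}.
    Upweights scarce modalities so text does not dominate the gradient."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")  # (B, L)
    w = MODALITY_WEIGHT[modality_ids]
    return (per_token * w).sum() / w.sum()  # weight-normalized mean
```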