Build a Frontier Model
🎨
Step 8 of 11

Multimodal

Native any-to-any is the new default.

2025-2026 frontier models are natively multimodal: text, images, audio, and (increasingly) video share a unified token space and a single backbone. GPT-5.5 unifies all four modalities in one model. Gemini 2.5 Pro handles 1M-token multimodal contexts. Vision encoders evolved from CLIP → SigLIP → SigLIP 2. Real-time voice (Moshi, GPT-4o) and generative video (Sora, Veo 3) crossed into product reality.

Why it matters

  • Native multimodal training (Chameleon, GPT-4o, Gemini, Llama 4) outperforms bolt-on vision adapters.
  • Real-time voice (200ms latency) requires interleaved audio-text tokenization — not a separate ASR + TTS pipeline.
  • Video generation (Sora, Veo 3, Movie Gen) reached minute-length, audio-synchronized output, reshaping creative tooling.
  • Spatial intelligence and world models (Genie 2, Cosmos, World Labs) are the substrate for embodied AI and robotics.

State of the art

2025-2026
  • GPT-5.5 (2025) — unified text/image/audio/video in one architecture, end-to-end.
  • Gemini 2.5 Pro — 1M-token multimodal context, ranked #1 on long-context multimodal benchmarks.
  • Sora (OpenAI), Veo 3 (Google, native audio sync), Movie Gen (Meta) — minute-length video generation.
  • Moshi (Kyutai, Oct 2024) — full-duplex 200ms-latency voice via Mimi RVQ tokenization.
  • SigLIP 2 (Feb 2025) — current-best open vision encoder for downstream multimodal training.
  • Genie 2 (DeepMind, Dec 2024) and NVIDIA Cosmos (Jan 2025) — generative world models for robotics.

The recipe

A frontier-grade implementation, in order.

1

Vision encoder

SigLIP 2 for late fusion. NaViT (native resolution) for any-aspect-ratio images. Unified VQ-VAE/FSQ tokens for early fusion.
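
A minimal sketch of pulling patch tokens out of a SigLIP 2 encoder for late fusion, following Hugging Face transformers conventions for SigLIP-family models. The checkpoint id and the exact processor/model interface are assumptions; verify them against your transformers version.

```python
# Sketch: patch-level features from a SigLIP 2 vision tower for late fusion.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "google/siglip2-base-patch16-224"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt).vision_model.eval()

image = Image.open("example.jpg").convert("RGB")
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = encoder(pixel_values=pixels)

patch_tokens = out.last_hidden_state  # (1, num_patches, hidden_dim)
print(patch_tokens.shape)             # these feed the fusion projector (step 2)
```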

2

Fusion strategy

Late fusion (LLaVA-style) is the cheap retrofit path for an existing LLM. Early/native fusion (Chameleon, GPT-4o) for new pretraining runs; strictly better at scale.
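
A minimal late-fusion connector sketch in the spirit of LLaVA-1.5's two-layer MLP projector; the class name and all dimensions are illustrative, not any specific model's.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer GELU MLP that maps vision patch tokens into the LLM
    embedding space (LLaVA-1.5-style mlp2x_gelu connector)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(patch_tokens)  # (batch, num_patches, llm_dim)

# Toy dimensions (assumptions, not any specific model's sizes).
projector = VisionProjector(vision_dim=768, llm_dim=4096)
patches = torch.randn(1, 196, 768)      # patch tokens from the vision encoder
text_embeds = torch.randn(1, 32, 4096)  # from the LLM's embedding table

# Late fusion: prepend projected image tokens to the text sequence.
sequence = torch.cat([projector(patches), text_embeds], dim=1)
print(sequence.shape)  # (1, 196 + 32, 4096)
```

LLaVA-1.5 reported that this two-layer connector beats a single linear projection; it remains the default cheap retrofit choice.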

3

Audio

Mimi or EnCodec RVQ at 12.5-25 Hz. Interleave audio + text tokens for real-time dialogue (Moshi-style). Whisper for transcription-only paths.
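
A sketch of Moshi-style interleaving, assuming a Mimi-like 12.5 Hz frame rate with one text token per audio frame. The codebook count, vocab offset, and the interleave helper are illustrative assumptions, not Moshi's actual layout.

```python
import torch

FRAME_HZ = 12.5        # Mimi-style frame rate
NUM_CODEBOOKS = 8      # RVQ depth (assumption)
AUDIO_OFFSET = 32_000  # shift audio codes past the text vocab (assumption)

def interleave(text_ids: torch.Tensor, rvq_codes: torch.Tensor) -> torch.Tensor:
    """text_ids: (frames,) one text token per audio frame (pad when silent).
    rvq_codes: (frames, NUM_CODEBOOKS) Mimi/EnCodec codes per frame.
    Returns one stream: [text_t, audio_t_cb0 .. cb7, text_t+1, ...]."""
    frames = text_ids.shape[0]
    audio_ids = rvq_codes + AUDIO_OFFSET
    merged = torch.cat([text_ids.unsqueeze(1), audio_ids], dim=1)
    return merged.reshape(frames * (1 + NUM_CODEBOOKS))

text = torch.randint(0, 32_000, (25,))             # 2 s of dialogue at 12.5 Hz
audio = torch.randint(0, 2048, (25, NUM_CODEBOOKS))
stream = interleave(text, audio)
print(stream.shape)  # (225,) = 25 frames * (1 text + 8 audio) tokens
```

One backbone over this single stream is what makes full-duplex, low-latency dialogue possible; a chained ASR + TTS pipeline cannot hit the same latency budget.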

4

Video

DiT (Diffusion Transformer) backbone. Spacetime VAE / MAGVIT-v2 tokenizer. Joint audio synthesis (Veo 3) requires audio-aware training.
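
A sketch of the DiT input side: turning a spacetime-VAE latent into a token sequence via tubelet patchification. Latent channels, patch sizes, and model width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpacetimePatchify(nn.Module):
    """Turn a spacetime-VAE latent (B, C, T, H, W) into a token sequence
    for a DiT backbone. All sizes are illustrative assumptions."""
    def __init__(self, latent_ch=16, d_model=1024, pt=2, ph=2, pw=2):
        super().__init__()
        # A Conv3d with stride == kernel size is a linear embedding of
        # non-overlapping spacetime tubelets.
        self.embed = nn.Conv3d(latent_ch, d_model,
                               kernel_size=(pt, ph, pw),
                               stride=(pt, ph, pw))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.embed(z)                    # (B, d, T/pt, H/ph, W/pw)
        return x.flatten(2).transpose(1, 2)  # (B, tokens, d_model)

latent = torch.randn(1, 16, 16, 32, 32)  # a short clip after the spacetime VAE
tokens = SpacetimePatchify()(latent)
print(tokens.shape)  # (1, 8 * 16 * 16, 1024) = (1, 2048, 1024)
```

The DiT then runs attention (factorized or full) over this token grid with diffusion-timestep conditioning; the tokenizer choice (spacetime VAE vs. MAGVIT-v2 discrete codes) decides whether the backbone diffuses or autoregresses.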

5

3D / world models

Generative video conditioned on agent actions (Genie 2). Cosmos for physics-grounded generation aimed at robotics.
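
A sketch of action conditioning in the Genie 2 spirit: each timestep's frame tokens are shifted by an embedding of the agent's action. The action vocabulary, additive conditioning, and dimensions are assumptions; Genie 2's actual mechanism is not public in this detail.

```python
import torch
import torch.nn as nn

NUM_ACTIONS, D_MODEL = 16, 1024
action_embed = nn.Embedding(NUM_ACTIONS, D_MODEL)  # learned action codes

frame_tokens = torch.randn(1, 8, 256, D_MODEL)   # (B, T, tokens/frame, d)
actions = torch.randint(0, NUM_ACTIONS, (1, 8))  # one action per frame

# Broadcast each step's action embedding over that frame's tokens, so the
# video generator predicts the next frame *given* the action taken.
conditioned = frame_tokens + action_embed(actions)[:, :, None, :]
print(conditioned.shape)  # (1, 8, 256, 1024)
```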

6

Multimodal eval

MMMU, MMBench, ChartQA for vision QA; LiveAudioBench for voice; VBench for video. New: spatial-reasoning evals (3DSRBench, World Labs benchmarks).
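
A minimal multiple-choice harness in the MMMU/MMBench style; `MCQItem` and the `model.answer` interface are hypothetical stand-ins for your own model wrapper, not a real library API.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    image_path: str
    question: str
    choices: list[str]   # e.g. ["A) ...", "B) ...", ...]
    answer: str          # gold letter, e.g. "B"

def accuracy(model, items: list[MCQItem]) -> float:
    """Exact-match letter accuracy over image-grounded multiple choice."""
    correct = 0
    for item in items:
        prompt = (f"{item.question}\n" + "\n".join(item.choices)
                  + "\nAnswer with a single letter.")
        pred = model.answer(image=item.image_path, text=prompt).strip()[:1]
        correct += pred.upper() == item.answer
    return correct / max(len(items), 1)
```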

⚠️

Common pitfalls

Multimodal training is data-bound — clean image-text pairs are scarcer than text. Synthetic captioning (e.g., Llama-rewrite) helps.
Tokenization mismatch across modalities causes severe imbalance: text dominates the loss unless you reweight per modality (see the sketch after this list).
Vision encoders developed for CLIP-style retrieval underperform for VLM tasks — use SigLIP/SigLIP 2.
Video generation evaluation is unsolved; VBench helps, but human preference still rules.
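
One way to counter the imbalance is per-modality loss reweighting, sketched below. The weights and the modality-id convention are illustrative assumptions to tune against your data mix.

```python
import torch
import torch.nn.functional as F

# Illustrative weights for {0: text, 1: image, 2: audio} tokens.
MODALITY_WEIGHT = torch.tensor([1.0, 4.0, 2.0])

def weighted_lm_loss(logits, targets, modality_ids):
    """logits: (B, L, V); targets: (B, L); modality_ids: (B, L) in {0, 1, 2}.
    Upweights scarce modalities so text does not dominate the gradient."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")  # (B, L)
    w = MODALITY_WEIGHT[modality_ids]
    return (per_token * w).sum() / w.sum()  # weight-normalized mean
```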