Native any-to-any is the new default.
2025-2026 frontier models are natively multimodal: text, images, audio, and (increasingly) video share a unified token space and a single backbone. GPT-5.5 unifies all four modalities in one model. Gemini 2.5 Pro handles 1M-token multimodal contexts. Vision encoders evolved from CLIP → SigLIP → SigLIP 2. Real-time voice (Moshi, GPT-4o) and generative video (Sora, Veo 3) crossed into product reality.
A frontier-grade implementation, in order.
Vision encoding. SigLIP 2 for late fusion. NaViT (native-resolution patching) for any-aspect-ratio images. Unified VQ-VAE/FSQ tokens for early fusion.
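To make the encoder choice concrete, here is a minimal sketch of the SigLIP-style pairwise sigmoid loss over precomputed image and text embeddings; the tensor names, dimensions, and toy batch are illustrative, not any released model's setup.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid image-text loss (SigLIP-style).

    img_emb, txt_emb: (N, D) L2-normalized embeddings; t, b: learnable
    log-temperature and bias scalars. Shapes here are toy values."""
    logits = img_emb @ txt_emb.T * t.exp() + b      # (N, N) pairwise scores
    labels = 2 * torch.eye(logits.size(0)) - 1      # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit), averaged over images
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# toy usage with random embeddings
N, D = 8, 64
img = F.normalize(torch.randn(N, D), dim=-1)
txt = F.normalize(torch.randn(N, D), dim=-1)
t = torch.tensor(2.3).requires_grad_()    # log-temperature (roughly log 10)
b = torch.tensor(-10.0).requires_grad_()  # bias initialized strongly negative
print(siglip_loss(img, txt, t, b))
```

Unlike a softmax contrastive loss, every image-text pair is scored independently, which is what lets the sigmoid formulation scale to very large batches.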
Fusion. Late fusion (LLaVA-style) when retrofitting an existing LLM on the cheap. Early/native fusion (Chameleon, GPT-4o) for new pretrains, and strictly better at scale.
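A late-fusion retrofit usually amounts to a small projector between a frozen vision encoder and the LLM's embedding space. The sketch below shows the LLaVA-style pattern; `VisionProjector` and every dimension here are hypothetical, not taken from a real checkpoint.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping frozen vision-encoder patch features into the
    LLM's token-embedding space (LLaVA-1.5-style connector). Dims are toy."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, num_patches, vision_dim) -> (B, num_patches, llm_dim)
        return self.proj(patch_feats)

# Late fusion: prepend projected image "tokens" to the text embeddings,
# then run the unchanged LLM over the concatenated sequence.
B, P, T = 2, 729, 32                    # batch, patches, text tokens (toy sizes)
patch_feats = torch.randn(B, P, 1152)   # stand-in for vision-encoder features
text_embeds = torch.randn(B, T, 4096)   # stand-in for LLM token embeddings
image_tokens = VisionProjector()(patch_feats)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)  # (B, P+T, 4096)
print(llm_inputs.shape)
```

Early/native fusion skips the projector entirely: images become discrete tokens up front and the same transformer predicts them in the shared vocabulary.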
Audio. Mimi (12.5 Hz) or an EnCodec-style RVQ codec for discrete audio tokens. Interleave audio and text tokens for real-time dialogue (Moshi-style). Whisper for transcription-only paths.
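The RVQ idea is what makes audio look like a short token sequence: each codebook quantizes the residual the previous one left behind. Below is a toy, untrained residual quantizer; the codebook count, sizes, and 12.5 Hz framing are illustrative assumptions, not a real codec's configuration.

```python
import torch

def rvq_encode(latents: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Residual vector quantization: each codebook encodes what the previous
    levels left over. latents: (T, D); each codebook: (K, D).
    Returns codes of shape (T, num_codebooks)."""
    residual = latents
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (T, K) distances to codewords
        idx = dists.argmin(dim=-1)          # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]       # pass the remainder to the next level
    return torch.stack(codes, dim=-1)

# roughly 1 second of audio at a 12.5 Hz frame rate -> 13 frames of latents
T, D, K, levels = 13, 256, 1024, 8
latents = torch.randn(T, D)
codebooks = [torch.randn(K, D) for _ in range(levels)]
audio_codes = rvq_encode(latents, codebooks)   # (13, 8) discrete tokens
print(audio_codes.shape)
```

In a Moshi-style dialogue stack, per-frame codes like these are modeled alongside text tokens by a single autoregressive backbone, which is what enables full-duplex, low-latency speech.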
Video generation. DiT (Diffusion Transformer) backbone. Spacetime VAE or MAGVIT-v2 tokenizer. Joint audio synthesis (Veo 3) requires audio-aware training.
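A DiT backbone is a plain transformer over spacetime latent patches, with diffusion-timestep and prompt conditioning injected through adaptive LayerNorm. The block below is a cartoon of that pattern under made-up dimensions, loosely following adaLN-Zero; it is not a faithful reproduction of any production video model.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One DiT-style block: self-attention + MLP, with scale/shift/gate
    parameters regressed from the conditioning vector (timestep + prompt
    embedding). Dimensions are illustrative."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)   # shift/scale/gate for attn and mlp branches

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) spacetime patch tokens; cond: (B, dim) conditioning vector
        s1, g1, sh1, s2, g2, sh2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + sh1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + sh2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)

# toy usage: 4 frames x 64 latent patches = 256 spacetime tokens
tokens = torch.randn(2, 256, 512)
cond = torch.randn(2, 512)
print(DiTBlock()(tokens, cond).shape)
```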
World models. Generative video conditioned on actions (Genie 2). Cosmos for robotics-targeted, physics-grounded generation.
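At a cartoon level, a token-space world model is next-frame prediction conditioned on an action. The toy model below assumes a discrete frame tokenizer and a small action vocabulary, all invented for illustration; real systems like Genie 2 and Cosmos are vastly larger and differ in detail.

```python
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Dynamics over discrete frame tokens, conditioned on a per-step action
    embedding. Purely illustrative sizes and action set."""
    def __init__(self, vocab: int = 1024, n_actions: int = 8, dim: int = 256):
        super().__init__()
        self.frame_emb = nn.Embedding(vocab, dim)
        self.action_emb = nn.Embedding(n_actions, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, frame_tokens: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, N) codes for the current frame; action: (B,) action ids
        x = self.frame_emb(frame_tokens) + self.action_emb(action).unsqueeze(1)
        return self.head(self.backbone(x))   # (B, N, vocab) logits for the next frame's codes

model = ToyWorldModel()
frame = torch.randint(0, 1024, (2, 64))   # 8x8 grid of latent codes per frame
action = torch.tensor([3, 5])             # toy action ids, e.g. "move left", "jump"
print(model(frame, action).shape)
```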
Evals. MMMU, MMBench, and ChartQA for vision QA; LiveAudioBench for voice; VBench for video. Newer: spatial-reasoning evals (3DSRBench, World Labs benchmarks).
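A minimal harness for the multiple-choice vision-QA benchmarks is a loop over records plus exact-match accuracy; the record schema and `model_answer` callable below are placeholders, not the actual MMMU or MMBench format.

```python
from typing import Callable

def score_mcq(records: list[dict], model_answer: Callable[[dict], str]) -> float:
    """Exact-match accuracy over multiple-choice items.

    Each record is assumed to hold an image reference, a question, lettered
    options, and a gold answer letter; this is a stand-in schema."""
    correct = 0
    for rec in records:
        pred = model_answer(rec).strip().upper()[:1]   # keep the first letter, e.g. "B"
        correct += pred == rec["answer"]
    return correct / len(records)

# toy run with a dummy model that always answers "A"
demo = [
    {"image": "img_001.png", "question": "What is plotted?",
     "options": ["A. Revenue", "B. Temperature"], "answer": "A"},
    {"image": "img_002.png", "question": "Which bar is tallest?",
     "options": ["A. 2023", "B. 2024"], "answer": "B"},
]
print(score_mcq(demo, lambda rec: "A"))   # 0.5
```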
Key papers, in rough order from foundational to recent. Click any title to open the arXiv abstract.
CLIP · Radford et al. (OpenAI) · 2021
SigLIP · Zhai et al. (Google) · 2023
SigLIP 2 · Tschannen et al. (Google) · 2025
Chameleon · Meta · 2024
LLaVA · Liu et al. · 2023
Li et al. · 2023
Moshi · Kyutai · 2024
MAGVIT-v2 · Yu et al. (Google) · 2023
DiT · Peebles, Xie · 2022
Meta · 2024
Cosmos · NVIDIA · 2025
NaViT · Dehghani et al. · 2023