A comprehensive tutorial on the architecture design, representation learning, training dynamics, and evaluation of unified multimodal models that integrate understanding and generation within a single framework.
A systematic taxonomy of UMM architectures — External Expert Integration, Modular Joint Modeling, and End-to-End Unified Modeling — with trade-off analysis between autoregressive, diffusion, and hybrid approaches.
The "Unified Tokenizer" debate: continuous representations (e.g., CLIP) vs. discrete tokens (e.g., VQ-VAE), and hybrid encoding strategies balancing semantic understanding with generative fidelity.
The full training lifecycle, from constructing interleaved image-text data to unified pre-training objectives and advanced post-training alignment methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO).
Tracing the evolution of multimodal AI from isolated, single-task understanding and generation systems to Unified Multimodal Models. We introduce the core motivations driving unification, particularly the mutual reinforcement between understanding and generation, and provide a rigorous definition of UMMs.
A systematic taxonomy including External Expert Integration, Modular Joint Modeling, and End-to-End Unified Modeling, with a deep dive into the trade-offs among autoregressive, diffusion, and emerging AR-Diffusion hybrid approaches.
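To make the end-to-end pattern concrete, below is a minimal, hypothetical PyTorch sketch (all class names and sizes are ours, not drawn from any particular model): a single decoder-only transformer autoregressively predicts the next token over a shared vocabulary in which text tokens and discrete image tokens are interleaved in one sequence.

```python
import torch
import torch.nn as nn

class UnifiedARModel(nn.Module):
    """Toy end-to-end unified model: one decoder-only transformer
    autoregressively predicts the next token over a shared vocabulary
    that concatenates text ids and discrete image-token ids."""

    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512, layers=8):
        super().__init__()
        self.vocab = text_vocab + image_vocab          # shared id space
        self.embed = nn.Embedding(self.vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, self.vocab)         # one head, both modalities

    def forward(self, tokens):
        # tokens: (batch, seq) of interleaved text/image ids
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)                            # next-token logits

model = UnifiedARModel()
mixed = torch.randint(0, model.vocab, (2, 16))         # fake interleaved batch
logits = model(mixed)                                  # (2, 16, vocab)
```

Under the External Expert Integration and Modular Joint Modeling patterns, by contrast, part of this pathway (for example, the image decoder) lives in a separate expert model rather than in one shared backbone.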
Comparing continuous representations with discrete tokenization schemes. A review of encoding/decoding strategies and state-of-the-art hybrid approaches that bridge semantic richness and generative fidelity, including cascade and dual-branch designs.
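To anchor the discrete side of the comparison, the sketch below shows VQ-VAE-style vector quantization under toy assumptions (names and sizes are illustrative): continuous encoder features are snapped to their nearest codebook entries, yielding the discrete image tokens an autoregressive backbone can consume, with a straight-through estimator keeping the encoder trainable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """VQ-VAE-style quantizer: snaps continuous features to their nearest
    codebook entry, producing discrete image-token ids. Illustrative only."""

    def __init__(self, num_codes=8192, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta                                # commitment weight

    def forward(self, z):
        # z: (batch, n, dim) continuous encoder features
        flat = z.reshape(-1, z.size(-1))
        # squared L2 distance from each feature to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        ids = d.argmin(dim=1).view(z.shape[:-1])        # (batch, n) token ids
        zq = self.codebook(ids)                         # quantized features
        # codebook loss pulls codes toward features; commitment loss the reverse
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()                      # straight-through gradient
        return zq, ids, loss

vq = VectorQuantizer()
feats = torch.randn(2, 64, 256)                         # e.g., an 8x8 patch grid
zq, ids, vq_loss = vq(feats)                            # ids feed an AR backbone
```

Continuous pipelines such as CLIP skip this quantization step, which preserves semantic detail but requires a separate decoder conditioned on continuous embeddings for generation; the cascade and dual-branch hybrids above aim to retain both strengths.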
Constructing high-quality modality-interleaved datasets, unified pre-training objectives, and advanced post-training alignment methods, from preference optimization with DPO to reinforcement learning with GRPO.
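As a concrete example of the preference-optimization stage, here is a minimal sketch of the DPO loss (function and variable names are ours): it pushes the policy to prefer the chosen response over the rejected one by a margin measured against a frozen reference model, with no separate reward model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Inputs are sequence log-probabilities of each response under the trained
    policy (logp_*) and a frozen reference model (ref_*); beta controls how
    far the policy may drift from the reference."""
    chosen_margin = logp_chosen - ref_chosen        # policy gain on winner
    rejected_margin = logp_rejected - ref_rejected  # policy gain on loser
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# toy usage: random log-probabilities for 4 preference pairs
lp_c, lp_r = torch.randn(4), torch.randn(4)
rf_c, rf_r = torch.randn(4), torch.randn(4)
loss = dpo_loss(lp_c, lp_r, rf_c, rf_r)
```

In a UMM setting, the same objective can be applied when the responses contain image tokens, e.g., a preferred versus a rejected generation for the same prompt.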
Reviewing existing benchmarks for standardized evaluation, discussing real-world applications in robotics and autonomous driving, and highlighting open challenges including scalable unified tokenizers and unified world models.
A practical walkthrough of our unified multimodal codebase, explaining how core components — tokenizers, multimodal encoders, and generative backbones — are organized and connected in practice.
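As a preview of that organization, the sketch below shows one plausible registry-and-config pattern for wiring the three component families together; all module names and keys are hypothetical, not the actual repository API.

```python
from dataclasses import dataclass

# Hypothetical registries for the three component families; the real
# repository's names and layout may differ.
TOKENIZERS, ENCODERS, BACKBONES = {}, {}, {}

def register(table, name):
    """Decorator that files a component class under a string key."""
    def deco(cls):
        table[name] = cls
        return cls
    return deco

@register(TOKENIZERS, "vq")
class VQTokenizer:
    """Discrete image tokenizer (generation-oriented representation)."""

@register(ENCODERS, "clip")
class CLIPEncoder:
    """Continuous semantic encoder (understanding-oriented representation)."""

@register(BACKBONES, "ar")
class ARBackbone:
    """Autoregressive generative backbone shared by both tasks."""

@dataclass
class ModelConfig:
    tokenizer: str = "vq"
    encoder: str = "clip"
    backbone: str = "ar"

def build(cfg: ModelConfig):
    """Assemble the three core components named by the config."""
    return (TOKENIZERS[cfg.tokenizer](),
            ENCODERS[cfg.encoder](),
            BACKBONES[cfg.backbone]())

tok, enc, bb = build(ModelConfig())   # swap a key to swap a design choice
```

Swapping a single config key then lets the same training loop compare, say, a discrete tokenizer against a continuous encoder front-end.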
From isolated multimodal understanding or generation systems to unified multimodal foundation models capable of handling both tasks simultaneously.
A taxonomy of architectures including External Expert Integration, Modular Joint Modeling, and End-to-End Unified Modeling, with comparisons between autoregressive, diffusion, and hybrid approaches.
Continuous versus discrete representations, their advantages and limitations, and emerging hybrid encoding strategies that balance semantic understanding and generative fidelity.
Construction of modality-interleaved datasets, unified pre-training objectives, and post-training alignment methods such as DPO and GRPO.
Evaluation protocols, real-world applications in robotics and autonomous driving, and future directions such as scalable unified tokenizers and unified world models.
All presentation slides will be made publicly available on this website following the event. (Coming Soon)
An annotated compilation of all references discussed in the tutorial as a comprehensive reading list. (Coming Soon)
Open-source unified multimodal codebase with annotated pointers to models (e.g., Emu, Janus) and datasets. (Coming Soon)