Lumina-mGPT 2.0: A resurgence of autoregressive models, rivaling state-of-the-art diffusion models

The Shanghai Artificial Intelligence Laboratory and collaborating teams have proposed Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that unifies a wide range of tasks, including text-to-image generation, image pair generation, subject-driven generation, multi-turn image editing, controllable generation, and dense prediction.

Core Technologies and Breakthroughs

Completely Independent Training Architecture

Unlike traditional approaches that rely on pre-trained weights, Lumina-mGPT 2.0 uses a decoder-only Transformer architecture trained entirely from scratch, starting from parameter initialization. This offers three key advantages: unrestricted architectural design (2-billion- and 7-billion-parameter versions are available), freedom from licensing restrictions (such as those attached to Chameleon's weights), and avoidance of the inherent biases of pre-trained models.
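
For orientation, here is a minimal sketch of a decoder-only Transformer operating over a shared text-and-image token vocabulary. Every name and hyperparameter below is an illustrative assumption, not the released 2B/7B design (which may, for instance, use different positional encodings and normalization).

```python
import torch
import torch.nn as nn

class DecoderOnlyLM(nn.Module):
    """Illustrative decoder-only Transformer over a shared text + image token vocabulary.
    All hyperparameters are placeholders, not the actual Lumina-mGPT 2.0 configuration."""

    def __init__(self, vocab_size=65536, d_model=2048, n_layers=24, n_heads=16, max_len=4096):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)          # simple learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):                                   # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.tok_embed(tokens) + self.pos_embed(positions)
        causal = torch.triu(                                     # forbid attending to future tokens
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)                                   # next-token logits over the vocabulary
```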

For image tokenization, after comparing solutions such as VQGAN and ViT-VQGAN, the team ultimately chose SBER-MoVQGAN, which achieved the best reconstruction quality on the MS-COCO dataset, laying the foundation for high-quality generation.
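
The core of any VQGAN-family tokenizer is nearest-neighbour lookup against a learned codebook, which turns an image's latent grid into discrete token indices. The snippet below sketches only that step with made-up shapes; it is not the SBER-MoVQGAN API.

```python
import torch

def quantize(latents, codebook):
    """Nearest-neighbour vector quantization: replace each spatial latent vector with
    the index of its closest codebook entry, yielding a grid of discrete image tokens.

    latents:  (B, h, w, D) continuous encoder outputs
    codebook: (K, D) learned embedding table
    returns:  (B, h, w) integer token indices
    """
    flat = latents.reshape(-1, latents.shape[-1])     # (B*h*w, D)
    dists = torch.cdist(flat, codebook)               # pairwise L2 distances, (B*h*w, K)
    indices = dists.argmin(dim=-1)                    # closest codebook entry per position
    return indices.view(latents.shape[:-1])

# Illustrative shapes only: a 512x512 image downsampled to a 64x64 latent grid becomes
# a 4096-token sequence after raster-scan flattening.
latents = torch.randn(1, 64, 64, 256)
codebook = torch.randn(16384, 256)
tokens = quantize(latents, codebook)                  # (1, 64, 64)
sequence = tokens.flatten(1)                          # (1, 4096) discrete image tokens
```

At generation time, the process is reversed: predicted token indices are looked up in the codebook and passed through the tokenizer's decoder to reconstruct pixels.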

 

Unified Multi-Task Processing Framework

 

An innovative unified tokenization scheme handles image-to-image tasks by stitching the input and target images into a single image, with the task type controlled through prompt descriptions. This keeps multi-task training consistent with text-to-image training and enables a single model to seamlessly support the following tasks:

  • Text-to-image generation
  • Subject-driven generation
  • Image editing
  • Controllable generation (e.g., edge- or depth-conditioned generation)
  • Dense prediction

This design avoids the tedious switching between different frameworks required by traditional models, allowing flexible control of task types through system prompts.
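
A minimal sketch of the stitching idea for image-to-image tasks, assuming a simple side-by-side concatenation; the helper names, prompt wording, and dummy tokenizers are hypothetical, not the project's actual preprocessing code.

```python
import torch

def build_multitask_sequence(system_prompt, condition_img, target_img,
                             tokenize_text, tokenize_image):
    """Format an image-to-image example like an ordinary text-to-image example:
    the condition image (e.g. a depth map or subject reference) and the target image
    are stitched into one wide image, and the task is identified only by the prompt."""
    stitched = torch.cat([condition_img, target_img], dim=-1)   # concat along width: (3, H, 2W)
    text_tokens = tokenize_text(system_prompt)
    image_tokens = tokenize_image(stitched)
    return torch.cat([text_tokens, image_tokens])               # one autoregressive training sequence

# Dummy stand-ins so the sketch runs end to end (real code would use the text tokenizer
# and the SBER-MoVQGAN image tokenizer).
tokenize_text = lambda s: torch.tensor([ord(c) % 256 for c in s])
tokenize_image = lambda img: torch.randint(0, 16384, ((img.shape[-2] // 8) * (img.shape[-1] // 8),))

depth_map = torch.rand(3, 512, 512)      # condition image
photo = torch.rand(3, 512, 512)          # target image
seq = build_multitask_sequence("Depth-to-image generation.", depth_map, photo,
                               tokenize_text, tokenize_image)
```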


Efficient Inference Strategy

To address the slow generation speed of autoregressive models, the team introduced two optimizations:

  • Model Quantization: Model weights are quantized to 4-bit integers while activation tensors remain in bfloat16. This is achieved without changing the model architecture, using PyTorch 2.0's native compilation tools and the reduce-overhead mode of torch.compile.
  • Speculative Jacobi Decoding (SJD): Using a static KV cache and a static causal attention mask, SJD remains compatible with static compilation frameworks, accelerating sampling while avoiding dynamic cache adjustments. Combined with 4-bit quantization, it reduces GPU memory usage by 60% and accelerates generation through parallel decoding (a simplified sketch follows this list).
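
As a rough illustration of the parallel-decoding idea, here is a simplified greedy Jacobi loop, assuming a model that returns next-token logits for the whole sequence; the actual SJD adds probabilistic (speculative) acceptance, the static KV cache, and 4-bit quantized weights, all omitted here.

```python
import torch

@torch.no_grad()
def jacobi_decode(model, prefix, n_new, vocab_size, max_iters=16):
    """Greedy Jacobi decoding sketch: draft a whole block of future tokens at once,
    then re-predict every position in parallel until the draft stops changing.
    `model(tokens)` is assumed to return next-token logits of shape (B, L, vocab)."""
    draft = torch.randint(0, vocab_size, (1, n_new), device=prefix.device)  # random initial guess
    for _ in range(max_iters):
        logits = model(torch.cat([prefix, draft], dim=1))        # one parallel forward pass
        # Token i is predicted from logits at position i-1, so predictions for the new
        # tokens live at positions [len(prefix)-1, len(prefix)+n_new-1).
        new_draft = logits[:, prefix.size(1) - 1:-1].argmax(dim=-1)
        if torch.equal(new_draft, draft):                        # fixed point: all tokens stable
            break
        draft = new_draft
    return torch.cat([prefix, draft], dim=1)

# The fixed sequence shapes make this loop friendly to static compilation, e.g.:
# model = torch.compile(model, mode="reduce-overhead")
# output = jacobi_decode(model, prompt_tokens, n_new=64, vocab_size=16384)
```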

Experiments show that the optimized model significantly improves generation efficiency while maintaining quality.


Text-to-Image Experimental Results

In the field of text-to-image generation, Lumina-mGPT 2.0 achieved outstanding performance across multiple benchmarks, matching or even surpassing diffusion and autoregressive models such as SANA and Janus Pro. It particularly excelled in the “Two Objects” and “Color Attributes” tests, achieving a GenEval score of 0.80, which places it among the top generative models.


In addition, in terms of actual visual quality, Lumina-mGPT 2.0 outperforms its predecessor Lumina-mGPT as well as Janus Pro in realism, detail, and coherence, producing results that are more visually appealing and natural.

 


Multi-Task Experimental Results

Lumina-mGPT 2.0 achieved excellent performance on the Graph200K multi-task benchmark (covering controllable generation and subject-driven generation), demonstrating the feasibility of addressing multimodal generation tasks within a single, purely autoregressive framework.


In addition, the team conducted practical comparisons with other multi-task generation models. Lumina-mGPT 2.0 performed strongly in controllable generation and subject-driven generation, demonstrating excellent generation quality and flexibility compared with models such as Lumina-mGPT, OneDiffusion, and OmniGen.


Future Directions

Even after the inference optimizations, Lumina-mGPT 2.0 still suffers from long sampling times, as other autoregressive generative models do; this affects the user experience and will be further optimized in future work. Lumina-mGPT 2.0 currently focuses on multimodal generation, but there are plans to expand its scope to multimodal understanding, improving its overall functionality and making it more comprehensive in meeting user needs.
