Mar 30, 2026 · arXiv:2603.28088

GEMS: Agent-Native Multimodal Generation with Memory and Skills

Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, T. Zhu, Tong Zhu, Yu Cheng, Yang Yang

AI Summary

GEMS, an agent-native multimodal generation framework, is introduced to address the limitations of foundation models in complex instruction following and specialized downstream tasks. It combines a multi-agent loop for iterative refinement, a hierarchical memory that stores factual states and compressed experiential summaries, and an extensible skill library that loads domain-specific expertise on demand. Experiments across multiple generative backends show that GEMS consistently improves performance, enabling the 6B model Z-Image-Turbo to outperform the state-of-the-art Nano Banana 2 on GenEval2.

Key Contribution

A lightweight 6B model (Z-Image-Turbo), when run inside the GEMS agent framework, surpasses the state-of-the-art Nano Banana 2 on GenEval2, suggesting that agent-level design can partly compensate for raw parameter count.

Abstract

Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of the agent harness in extending model capabilities beyond their original limits.
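
The three components named in the abstract can be pictured with a small Python sketch. This is an illustration only, not the authors' implementation: the names `Step`, `Memory`, `SkillLibrary`, and `run_loop`, the one-line summary format, and the `[skill]` / `[revise]` prompt tags are all hypothetical, and the generator and evaluator are passed in as plain callables rather than real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """Factual state for one iteration (hypothetical schema)."""
    prompt: str
    output: str
    score: float
    feedback: str


@dataclass
class Memory:
    """Two-level trajectory memory: raw steps plus compressed summaries."""
    steps: List[Step] = field(default_factory=list)
    summaries: List[str] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)
        # Assumed compression: one short line per step.
        self.summaries.append(f"score={step.score:.2f}: {step.feedback}")

    def context(self, last_k: int = 2) -> str:
        """Return only the most recent summaries to limit redundancy."""
        return "\n".join(self.summaries[-last_k:])


class SkillLibrary:
    """Registry whose skills are loaded only when first requested."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], str]] = {}
        self._loaded: Dict[str, str] = {}

    def register(self, name: str, loader: Callable[[], str]) -> None:
        self._loaders[name] = loader

    def load(self, name: str) -> Optional[str]:
        if name not in self._loaders:
            return None
        if name not in self._loaded:
            # On-demand loading: the loader runs once, then is cached.
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]


def run_loop(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Tuple[float, str]],
    memory: Memory,
    skills: SkillLibrary,
    skill_name: Optional[str] = None,
    max_iters: int = 4,
    threshold: float = 0.9,
) -> Step:
    """Closed-loop refinement: generate, evaluate, record, revise."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    guidance = skills.load(skill_name) if skill_name else None
    prompt = task if guidance is None else f"{task}\n[skill] {guidance}"
    best: Optional[Step] = None
    for _ in range(max_iters):
        output = generate(prompt, memory.context())
        score, feedback = evaluate(task, output)
        step = Step(prompt, output, score, feedback)
        memory.record(step)
        if best is None or score > best.score:
            best = step
        if score >= threshold:
            break
        # Fold the evaluator's feedback into the next prompt.
        prompt = f"{prompt}\n[revise] {feedback}"
    assert best is not None  # guaranteed because max_iters >= 1
    return best
```

In this sketch the loop stops as soon as the evaluator's score reaches `threshold`, and otherwise returns the best-scoring step after `max_iters` attempts. How GEMS actually scores outputs, compresses experience, or selects skills is not specified in the abstract.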

Code Generation & Program Synthesis · Multimodal Models · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
