NTU, PolyU | May 27, 2026 | arXiv:2605.28579

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

AI Summary

The paper introduces MUSE, a benchmark for Text-to-CAD generation that evaluates manufacturability, functionality, and assemblability rather than geometric similarity alone. It uses a three-stage evaluation protocol: a code check, a geometric check, and a design-intent alignment stage scored with design-specific rubrics. Experiments with closed-source and open-source LLMs reveal a failure cascade from executable code to valid geometry to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria.

Key Contribution

Current Text-to-CAD models can generate shapes, but the MUSE benchmark shows that even the strongest ones achieve limited success at producing functional, manufacturable, and assemblable designs.

Abstract

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.
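The three-stage protocol described in the abstract can be read as a cascade in which each stage gates the next. The following is a minimal sketch of that idea, not the authors' implementation: the names `EvalResult`, `evaluate`, and `rubric_fraction` are hypothetical, the three stage functions are supplied by the caller as placeholders for a real CAD executor, geometry validator, and VLM judge, and scoring the rubric as the fraction of satisfied items is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class EvalResult:
    code_ok: bool
    geometry_ok: bool
    rubric_score: Optional[float]  # None when an earlier stage failed


def rubric_fraction(verdicts: Dict[str, bool]) -> float:
    """Fraction of rubric items judged satisfied (assumed scoring rule)."""
    return sum(verdicts.values()) / len(verdicts) if verdicts else 0.0


def evaluate(
    program: str,
    run_code: Callable[[str], Any],
    check_geometry: Callable[[Any], bool],
    judge: Callable[[Any], Dict[str, bool]],
) -> EvalResult:
    # Stage 1: code check. Any exception raised by the executor is a failure.
    try:
        model = run_code(program)
    except Exception:
        return EvalResult(False, False, None)

    # Stage 2: geometric check on the model the program produced.
    if not check_geometry(model):
        return EvalResult(True, False, None)

    # Stage 3: design-intent alignment, one boolean verdict per rubric item.
    return EvalResult(True, True, rubric_fraction(judge(model)))
```

Because a sample that fails the code check never reaches the geometric check, and one that fails the geometric check is never judged, pass rates can only stay flat or fall from stage to stage, which is consistent with the failure cascade the abstract reports.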

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
