This paper introduces BIM-Edit, a benchmark specifically designed to evaluate large language models (LLMs) on their ability to edit Building Information Models (BIM) in the Industry Foundation Classes (IFC) format. The benchmark comprises 324 editing tasks across realistic and synthetic building models, assessing LLM outputs on geometric accuracy, semantic validity, and topological consistency. Results reveal that even the best-performing model achieves only 49.5% on average across these metrics, highlighting a significant performance gap in LLMs when applied to structured engineering design tasks.
LLMs struggle with BIM editing: even the best-performing model averages below 50% across geometric, semantic, and topological metrics, revealing a major shortfall in their practical application to engineering workflows.
Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories (direct, spatial, and topological), covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves an average score of only 49.5% across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.
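The abstract reports two different aggregates: an average score across the three metrics and a fraction of fully solved tasks. The sketch below illustrates how the two can diverge. It is not the paper's evaluation code. The `TaskScore` class, the unweighted mean, and the definition of "fully solved" as every metric reaching 1.0 are all assumptions made for illustration, and the example scores are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskScore:
    # Hypothetical per-task scores in [0, 1]; field names are illustrative.
    geometric: float
    semantic: float
    topological: float


def average_score(scores):
    """Mean over tasks of the unweighted mean of the three metrics.

    Assumption: the benchmark's exact aggregation is not given in the abstract.
    """
    return mean((s.geometric + s.semantic + s.topological) / 3 for s in scores)


def fully_solved_rate(scores, threshold=1.0):
    """Fraction of tasks where every metric reaches the threshold.

    Assumption: "fully solved" is taken to mean all three metrics are perfect.
    """
    solved = sum(
        1
        for s in scores
        if min(s.geometric, s.semantic, s.topological) >= threshold
    )
    return solved / len(scores)


# Invented scores for illustration only, not results from the paper.
example = [
    TaskScore(1.0, 1.0, 1.0),
    TaskScore(1.0, 0.5, 0.0),
    TaskScore(0.5, 0.5, 0.5),
    TaskScore(0.0, 1.0, 0.5),
]

print(f"average score: {average_score(example):.3f}")
print(f"fully solved:  {fully_solved_rate(example):.3f}")
```

Under these assumptions, partial credit on individual metrics raises the average score without raising the fully-solved rate, which is one way a model could average near 50% while fully solving only a few percent of tasks.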