Ohio State · Apr 2, 2026 · arXiv:2604.01554

EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild

Yi Fan, Junsuh Won, Ding Zhu, Melih Sirlanci, Mahdi Khalili, Carter Yagemann

AI Summary

The paper introduces EXHIB, a new benchmark for Binary Function Similarity Detection (BFSD) comprising five realistic, diverse datasets collected from real-world applications. Evaluation of 9 representative BFSD models on EXHIB reveals significant performance degradations (up to 30%) on firmware and semantic datasets compared to standard settings, highlighting generalization gaps. The study demonstrates that robustness to low- and mid-level binary variations does not guarantee robustness to high-level semantic differences.

Key Contribution

Current binary function similarity detection models falter when faced with the semantic diversity of real-world applications, exposing a critical blind spot in existing evaluation practices.

Abstract

Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. In the past few decades, numerous models and tools have been developed for this application; however, due to the lack of a comprehensive universal benchmark in this field, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real-world applications. We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low- and mid-level binary variations does not generalize to high-level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.
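The abstract does not say which metric or model interface the evaluation uses. As a minimal sketch, assuming one common BFSD setup in which each function is mapped to an embedding vector and a query is matched against a pool by cosine similarity, a top-1 retrieval score could be computed as below. The function names, labels, and toy vectors are hypothetical and are not taken from EXHIB.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_1(queries, pool):
    """Fraction of queries whose most similar pool entry carries the same label.

    queries and pool are lists of (label, vector) pairs.
    """
    hits = 0
    for q_label, q_vec in queries:
        best_label, _ = max(pool, key=lambda item: cosine(q_vec, item[1]))
        if best_label == q_label:
            hits += 1
    return hits / len(queries)

# Toy embeddings invented for illustration (assumption, not benchmark data).
pool = [
    ("memcpy", [1.0, 0.0, 0.1]),
    ("strlen", [0.0, 1.0, 0.0]),
    ("parse_hdr", [0.1, 0.2, 1.0]),
]
queries = [
    ("memcpy", [0.9, 0.1, 0.0]),
    ("strlen", [0.1, 0.8, 0.1]),
    ("parse_hdr", [0.7, 0.1, 0.3]),  # drifted embedding: retrieved as memcpy
]

score = recall_at_1(queries, pool)
print(f"recall@1 = {score:.3f}")
```

Under this assumed setup, a generalization gap of the kind the abstract reports would show up as a lower score when the same model is run on a harder query set.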

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
