HKU, ZJU | Apr 6, 2026 | arXiv:2604.04812

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

Yuchen Cao, Hanlin Zhang, J. Keung, Jacky Wai Keung, Yang Chen

AI Summary

SysTradeBench (SysTB) is introduced as a novel benchmark to evaluate LLM-generated trading systems by iteratively building, testing, and patching code based on natural language strategy specifications. The benchmark incorporates drift-aware diagnostics, determinism checks, and anti-leakage measures, providing evidence bundles to guide constrained patching. Evaluation of 17 models across 12 strategies reveals that while top models achieve high validity, iterative patching leads to code convergence, highlighting the need for human oversight to ensure solution diversity and robustness.

Key Contribution

LLMs excel at rapid prototyping of trading strategies, but SysTradeBench reveals that iterative patching leads to code convergence, suggesting human oversight is still needed for critical strategies requiring solution diversity.

Abstract

Large language models (LLMs) are increasingly used as quantitative research copilots to translate natural-language strategy specifications into executable trading code. Yet most existing evaluations either focus on static financial knowledge or summarize performance with a single profitability metric, leaving a gap for benchmarking strategy-to-code trading systems as governed, auditable software. We introduce SysTradeBench (SysTB), an iterative build-test-patch benchmark that evaluates LLM-generated trading systems under drift-aware diagnostics. Given a standardized Base Strategy Doc and frozen semantics, each model must produce (i) a strategy card, (ii) executable code, and (iii) mandatory audit logs. A sandboxed harness runs determinism and anti-leakage checks, detects rule drift across iterations, and returns evidence bundles to support constrained patches. SysTradeBench reports multi-dimensional scorecards for spec fidelity, risk discipline, reliability, and out-of-sample robustness indicators, together with cost-effectiveness signals. We evaluate 17 models across 12 strategies. Top models achieve validity above 91.7 percent with strong aggregate scores, but evidence-driven iteration also induces code convergence by Iter2. These findings suggest that LLM iteration complements rather than replaces human quantitative researcher governance: LLMs excel at rapid prototyping and shallow bug fixes, while human oversight remains essential for critical strategies requiring solution diversity and ensemble robustness.
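The abstract names three harness checks (determinism, anti-leakage, and rule-drift detection) without specifying how they are implemented. The sketch below is one plausible reading under loudly stated assumptions, not the SysTradeBench harness itself: a strategy is modeled as a hypothetical function from closing prices to one integer signal per bar, determinism is checked by hashing repeated runs, leakage by withholding future bars, and drift by comparing signal series between two iterations on identical data. All names (`Strategy`, `is_deterministic`, `has_lookahead_leak`, `drift_indices`, and the two toy rules) are illustrative and do not come from the paper.

```python
import hashlib
import json
from typing import Callable, List, Sequence

# Hypothetical interface (not from the paper): a strategy maps a series of
# closing prices to one integer signal per bar (1 = long, 0 = flat).
Strategy = Callable[[Sequence[float]], List[int]]


def fingerprint(signals: Sequence[int]) -> str:
    """Stable hash of a signal series, used to compare runs."""
    payload = json.dumps(list(signals)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_deterministic(strategy: Strategy, closes: Sequence[float], runs: int = 3) -> bool:
    """True if repeated runs on identical input give identical signals."""
    return len({fingerprint(strategy(closes)) for _ in range(runs)}) == 1


def has_lookahead_leak(strategy: Strategy, closes: Sequence[float]) -> bool:
    """True if the signal at bar t changes when bars after t are withheld."""
    full = strategy(closes)
    for t in range(len(closes)):
        truncated = strategy(closes[: t + 1])
        if truncated[t] != full[t]:
            return True
    return False


def drift_indices(previous: Sequence[int], patched: Sequence[int]) -> List[int]:
    """Bars where a patched iteration's signals differ from the previous one."""
    if len(previous) != len(patched):
        raise ValueError("signal series must cover the same bars")
    return [i for i, (a, b) in enumerate(zip(previous, patched)) if a != b]


def trailing_mean_signal(closes: Sequence[float]) -> List[int]:
    """Clean toy rule: long when the close exceeds its trailing 3-bar mean."""
    out = []
    for t in range(len(closes)):
        window = closes[max(0, t - 2): t + 1]
        out.append(1 if closes[t] > sum(window) / len(window) else 0)
    return out


def peeking_signal(closes: Sequence[float]) -> List[int]:
    """Leaky toy rule: long when the NEXT close is higher (uses future data)."""
    n = len(closes)
    return [1 if t + 1 < n and closes[t + 1] > closes[t] else 0 for t in range(n)]


if __name__ == "__main__":
    closes = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
    print("deterministic:", is_deterministic(trailing_mean_signal, closes))
    print("clean rule leaks:", has_lookahead_leak(trailing_mean_signal, closes))
    print("peeking rule leaks:", has_lookahead_leak(peeking_signal, closes))
```

The prefix-truncation test only catches leakage that changes emitted signals on the supplied data, so a real harness would presumably combine it with other safeguards. Treating drift as a per-bar signal diff is likewise a simplification; the paper's drift diagnostics may operate on rules or audit logs rather than outputs.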

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
