GSO Leaderboard

Challenging Software Optimization Tasks for Evaluating SWE-Agents

The GSO benchmark evaluates language models' ability to develop high-performance software through 102 challenging optimization tasks across 10 codebases. The benchmark measures runtime efficiency improvements against expert developer optimizations.

Visit the official GSO website for complete details, research paper, and dataset downloads.

Evaluation Metrics

Opt@1: An estimator of the fraction of tasks where a single attempt achieves ≥95% of the human speedup and passes the correctness tests.

Opt@K: An estimator of the fraction of tasks where at least one of K attempts achieves ≥95% of the human speedup and passes the correctness tests.
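As a sketch of how such a metric is typically computed, the snippet below uses the standard unbiased pass@k-style estimator (sample n attempts per task, count the c successful ones, then estimate the probability that at least one of k draws succeeds). This is an assumption about GSO's exact estimator, not a statement of its official implementation; the function names are illustrative.

```python
from math import comb

def opt_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimator of Opt@K.

    n: total attempts sampled for the task
    c: attempts that achieve >=95% of the human speedup
       and pass the correctness tests
    k: attempt budget being estimated
    """
    if n - c < k:
        # Every size-k subset of attempts contains a success.
        return 1.0
    # 1 - P(all k sampled attempts are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_opt_at_k(results, k):
    """Benchmark-level Opt@K: mean over (n, c) pairs, one per task."""
    return sum(opt_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a task with 1 success in 2 attempts contributes an Opt@1 of 0.5, and averaging these per-task values over all 102 tasks yields the leaderboard score.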

Leaderboard columns: Rank, Model, Scaffold, Setting, Score, Date