GSO Leaderboard

Challenging Software Optimization Tasks for Evaluating SWE-Agents

The GSO benchmark evaluates language models' ability to develop high-performance software through 102 challenging optimization tasks across 10 codebases. The benchmark measures runtime efficiency improvements against expert developer optimizations.

Visit the official GSO website for complete details, research paper, and dataset downloads.

Evaluation Metrics

Opt@1: An estimator of the fraction of tasks where a single attempt achieves ≥95% of the human speedup and passes the correctness tests.

Opt@K: An estimator of the fraction of tasks where at least one of K attempts achieves ≥95% of the human speedup and passes the correctness tests.
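As a sketch of how such a metric is typically computed, the snippet below uses the standard unbiased pass@k-style estimator (sample n attempts per task, count the c successful ones, then estimate the probability that at least one of k draws succeeds). This is an assumption about GSO's exact estimator, not a statement of its official implementation; the function names are illustrative.

```python
from math import comb

def opt_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimator of Opt@K.

    n: total attempts sampled for the task
    c: attempts that achieve >=95% of the human speedup
       and pass the correctness tests
    k: attempt budget being estimated
    """
    if n - c < k:
        # Every size-k subset of attempts contains a success.
        return 1.0
    # 1 - P(all k sampled attempts are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_opt_at_k(results, k):
    """Benchmark-level Opt@K: mean over (n, c) pairs, one per task."""
    return sum(opt_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a task with 1 success in 2 attempts contributes an Opt@1 of 0.5, and averaging these per-task values over all 102 tasks yields the leaderboard score.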

Leaderboard columns: Rank, Model, Scaffold, Setting, Score, Date