Evaluation Metrics
Opt@1: An estimate of the fraction of tasks on which a single attempt passes the correctness tests and achieves at least 95% of the human speedup.
Opt@K: An estimate of the fraction of tasks on which at least one of K attempts passes the correctness tests and achieves at least 95% of the human speedup.
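The definitions above do not spell out the estimator itself. A minimal sketch follows, assuming it takes the same form as the widely used unbiased pass@k estimator: with n attempts per task, of which c succeed, the per-task estimate is 1 - C(n-c, k) / C(n, k), averaged over tasks. The function names, the `threshold` parameter, and the reading of "≥95% human speedup" as `model_speedup >= 0.95 * human_speedup` are illustrative assumptions, not the benchmark's actual code.

```python
from math import comb


def is_success(passes_tests: bool, model_speedup: float,
               human_speedup: float, threshold: float = 0.95) -> bool:
    # Assumed success criterion: correct AND at least 95% of the human speedup.
    return passes_tests and model_speedup >= threshold * human_speedup


def opt_at_k(n: int, c: int, k: int) -> float:
    # Assumed pass@k-style estimate for one task:
    # probability that a random size-k subset of the n attempts
    # contains at least one of the c successful attempts.
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_opt_at_k(outcomes: dict[str, list[bool]], k: int) -> float:
    # outcomes maps a (hypothetical) task id to per-attempt success flags.
    # The benchmark-level score is the mean of the per-task estimates.
    per_task = [opt_at_k(len(flags), sum(flags), k) for flags in outcomes.values()]
    return sum(per_task) / len(per_task)
```

For example, `benchmark_opt_at_k({"task_a": [True, False, False], "task_b": [False, False, False]}, k=1)` averages a per-task estimate of 1/3 with one of 0. With k equal to the number of attempts, a task counts as solved exactly when any attempt succeeded.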
Rank | Model | Scaffold | Setting | Score | Date |
---|---|---|---|---|---|