New benchmarks show semantic code graphs helping coding agents find change locations faster and complete updates more ...
Migrated PaperBench code-only grading that runs entirely on a local machine (1×node, 8×AMD MI300X), using a local SGLang-served model as the judge over an OpenAI-compatible API — instead of TRAPI / ...
You can now configure and run Evals directly in the OpenAI Dashboard. Get started → Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results