SWE-bench & SWE-bench Verified Benchmarks

SWE-bench

In their 2023 paper "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", researchers from Princeton University, Princeton Language and Intelligence, and University of Chicago wrote:

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.

Since then, SWE-bench has become one of the most popular benchmarks for evaluating LLMs' performance in software engineering. In a nutshell, SWE-bench tests an AI's ability to automatically solve GitHub issues. More precisely, the benchmark is executed by giving an AI coding agent a code repository and an issue description, and asking it to generate a patch that resolves the issue.

Test Data

The SWE-bench benchmark contains a dataset of 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Each sample in the test set contains the Pull Request (PR) with the solution code and unit tests that verify the code's correctness.

These unit tests are called FAIL_TO_PASS because they fail before the solution code in the PR is added but pass afterwards. In addition, each sample contains PASS_TO_PASS tests. These tests pass both before and after the PR is merged and are used to verify that the PR does not cause regressions (i.e., does not break existing functionality).
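
For a concrete picture of the data, a sample can be inspected with the Hugging Face datasets library. This is a minimal sketch: the field names below (repo, problem_statement, patch, FAIL_TO_PASS, PASS_TO_PASS) reflect the published dataset schema as I understand it, but check the dataset card for the authoritative list.

```python
from datasets import load_dataset

# Load the full SWE-bench test split from Hugging Face.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

sample = swe_bench[0]
print(sample["repo"])               # repository the issue comes from
print(sample["problem_statement"])  # GitHub issue text given to the agent
print(sample["patch"])              # gold solution from the merged PR
print(sample["FAIL_TO_PASS"])       # tests that fail before the fix and pass after
print(sample["PASS_TO_PASS"])       # regression tests that pass before and after
```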

Benchmark Execution

During the evaluation, the AI coding agents are given the original text from the GitHub issue and access to the codebase. The agents must edit the files in the codebase to resolve the issue. The tests are not shown to the agents.

Once the AI coding agents have finished editing the codebase, the changes are evaluated by executing both the FAIL_TO_PASS and PASS_TO_PASS tests. Both test sets must pass for the changes to count as resolving the original GitHub issue.
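
Conceptually, the grading step reduces to a simple rule. The sketch below is schematic rather than the evaluation harness's actual code; it only illustrates how the two test sets combine into a resolved/not-resolved verdict.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Schematic grading rule: an issue counts as resolved only if every
    FAIL_TO_PASS test now passes and no PASS_TO_PASS test regresses."""
    fixed = all(test_results.get(test, False) for test in fail_to_pass)
    no_regressions = all(test_results.get(test, False) for test in pass_to_pass)
    return fixed and no_regressions
```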

All in all, resolving SWE-bench tasks is challenging because it requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously.

Coding agents have made impressive progress on SWE-bench. At the time the paper was written (its latest version, v3 on arXiv, is dated 11 November 2024), the then best-performing model, Claude 2, was able to solve a mere 1.96% of the issues. As of April 2025, the best-performing agent, based on the Claude 3.7 Sonnet model, solves 33.83% of the issues (SWE-bench full).

The leaderboard, which is regularly updated, can be found at www.swebench.com.

Screenshot of the SWE-bench leaderboard for the SWE-bench full benchmark variant

SWE-bench Verified

On 13 August 2024, OpenAI published the blog post Introducing SWE-bench Verified. While working with SWE-bench, the OpenAI team discovered that some issues from the test set were hard or even impossible to solve, leading the SWE-bench benchmark to systematically underestimate LLMs' software engineering capabilities. OpenAI then collaborated with the SWE-bench authors to address this problem in a new release of the benchmark.

The major areas for improvement identified by the team were:

  • The unit tests used to evaluate the correctness of a solution were often overly specific. In some cases, the unit tests were even unrelated to the GitHub issue, potentially causing correct solutions to be rejected
  • Many samples had an issue description that was underspecified, leading to ambiguity about what the problem was and how it should be solved
  • Sometimes, it was challenging to reliably set up the SWE-bench development environments for the agents, causing unit tests to fail regardless of the solution. Thus, perfectly valid solutions might have been graded as incorrect

To address this, the OpenAI team and the researchers worked with 93 experienced Python developers who manually screened SWE-bench test samples for quality: appropriately scoped unit tests and well-specified issue descriptions. They annotated 1,699 randomly selected samples from the SWE-bench test set to produce SWE-bench Verified.

Together with the original authors of SWE-bench, the OpenAI team released SWE-bench Verified, a subset of the original test set composed of 500 samples verified to be non-problematic by the human annotators. SWE-bench Verified supersedes the original SWE-bench and SWE-bench Lite test sets, and can be downloaded from Hugging Face.
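
Loading the Verified subset works the same way as loading the full dataset. This is a sketch; the dataset ID below matches the card published on Hugging Face to the best of my knowledge, but verify it before use.

```python
from datasets import load_dataset

# SWE-bench Verified: 500 human-screened samples from the original test set.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(verified))  # expected: 500
```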

In addition, OpenAI released the human annotations for all SWE-bench test samples. These annotations make it possible to slice the dataset by difficulty:

  • The 'easy' subset is composed of 196 tasks taking up to 15 minutes to fix
  • The 'hard' subset is composed of 45 tasks taking more than 1 hour to fix
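
As an illustration of how these difficulty labels might be used, the dataset could be filtered along the following lines. This is a hypothetical sketch: the local file name swe_bench_annotations.csv, the difficulty column, and the "<15 min fix" label value are assumptions, so consult OpenAI's released annotations for the actual schema.

```python
import pandas as pd
from datasets import load_dataset

# Hypothetical: assumes the released annotations were saved locally as a CSV
# with one row per sample, an "instance_id" column, and a "difficulty" label.
annotations = pd.read_csv("swe_bench_annotations.csv")
easy_ids = set(
    annotations.loc[annotations["difficulty"] == "<15 min fix", "instance_id"]
)

# Keep only the 'easy' tasks from SWE-bench Verified.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
easy_subset = verified.filter(lambda sample: sample["instance_id"] in easy_ids)
print(len(easy_subset))
```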