AI Reasoning Benchmark: MathR-Eval

We designed a new benchmark, Mathematical Reasoning Eval (MathR-Eval), to test LLMs' reasoning abilities on 100 logical mathematics questions.

Benchmark results

Results show that OpenAI's o1 and o3-mini are the best-performing LLMs on our benchmark.

Methodology

Our dataset consists of 100 mathematics questions that do not require advanced calculus but do demand reasoning and problem-solving.
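The post does not publish the evaluation harness itself, so the following is only a minimal sketch of how a benchmark like this might be scored: a question set, a pluggable model call, and exact-match grading of the final numeric answer. The dataset schema, the `extract_answer` heuristic, and the `ask_model` callable are all illustrative assumptions, not the authors' actual pipeline.

```python
import json
import re
from typing import Callable


def load_questions(path: str) -> list[dict]:
    """Load benchmark items; each is assumed to look like
    {"question": "...", "answer": "42"} (hypothetical schema)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def extract_answer(response: str) -> str:
    """Take the last number in the model's response as its final answer.
    Real graders are usually stricter (e.g. requiring a fixed answer format)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""


def score(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Return accuracy: the fraction of exact matches against the reference answer."""
    correct = 0
    for item in questions:
        response = ask_model(item["question"])
        if extract_answer(response) == str(item["answer"]).strip():
            correct += 1
    return correct / len(questions)


if __name__ == "__main__":
    # Toy stand-in for an LLM call; swap in a real API client here.
    def echo_model(prompt: str) -> str:
        return "The answer is 4."

    items = [{"question": "What is 2 + 2?", "answer": "4"}]
    print(f"Accuracy: {score(items, echo_model):.0%}")
```

With 100 questions and exact-match grading, each question contributes one percentage point to the final score, which keeps leaderboard comparisons straightforward.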
