May 21, 2024 – Google Unveils Math-Focused Gemini 1.5 Pro, Achieving Remarkable Results in Mathematical Challenges
Google has announced that a version of its Gemini 1.5 Pro model, given specialized training on mathematics, shows markedly improved mathematical performance and has solved problems from the International Mathematical Olympiad.
Google trained this variant specifically for mathematical tasks and evaluated it on the MATH benchmark, the American Invitational Mathematics Examination (AIME), and Google’s internal HiddenMath benchmark. According to the results, the math-focused Gemini 1.5 Pro performs at a level “comparable to human experts” on mathematical benchmarks.
Compared with the standard Gemini 1.5 Pro, the math-specialized variant solved significantly more AIME problems and scored higher on the other benchmarks. Google shared three example problems: the math-specialized Gemini 1.5 Pro solved two of them correctly, while the third was answered incorrectly by a standard Gemini 1.5 Pro variant. The problems typically required recalling basic algebraic formulas and applying piecewise definitions and other mathematical rules to reach the correct answer.
Beyond these examples, Google also published details of Gemini 1.5 Pro’s benchmark results. The data indicates that Gemini 1.5 Pro outperforms both GPT-4 Turbo and Anthropic’s Claude on all five benchmarks.
Specifically, the math-specialized version of Gemini 1.5 Pro reached 80.6% accuracy on the MATH benchmark with a single sample. When 256 solutions were sampled and one candidate answer was selected from them (rm@256), accuracy rose to 91.1%.
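To illustrate what a sample-and-select setting such as rm@256 involves, the following is a minimal Python sketch of best-of-n selection. It assumes that "rm" refers to a reward model that scores each sampled solution, which the announcement does not spell out. The function names `best_of_n`, `toy_sample`, and `toy_score` are hypothetical stand-ins for illustration, not Google's implementation or any real API.

```python
import itertools
from typing import Callable, List


def best_of_n(
    problem: str,
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 256,
) -> str:
    """Draw n candidate answers and return the one the scorer ranks highest."""
    candidates: List[str] = [sample(problem) for _ in range(n)]
    return max(candidates, key=lambda answer: score(problem, answer))


# Toy stand-ins (assumptions): a real system would call a language model
# to sample solutions and a learned reward model to score them.
_guesses = itertools.cycle(["40", "41", "42", "43"])


def toy_sample(problem: str) -> str:
    """Pretend model: cycles through a fixed set of guesses."""
    return next(_guesses)


def toy_score(problem: str, answer: str) -> float:
    """Pretend reward model: prefers answers closer to 42."""
    return -abs(int(answer) - 42)


print(best_of_n("What is 6 * 7?", toy_sample, toy_score, n=256))
```

With a single sample the result depends entirely on that one draw; with many samples, a reasonable scorer only needs the correct answer to appear once among the candidates, which is one plausible reason accuracy is higher in the 256-sample setting.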
With these results, Gemini 1.5 Pro emerges as a strong contender among AI models in mathematical domains, narrowing the gap between human and machine performance on these benchmarks.