Google Gemini 3 Benchmarks (Explained) — Vellum

Summary

An analysis of Google Gemini 3 Pro's performance across reasoning, math, multimodal, and agentic benchmarks, comparing it to GPT-5.1 and Claude 4.5.

Key quotes

Gemini 3 Pro scores 91.9% on GPQA Diamond (and 93.8% with Deep Think), giving it a nearly 4-point lead over GPT-5.1 (88.1%) on advanced scientific questions.

The results on Vending-Bench 2, where Gemini 3 Pro’s mean net worth is $5,478.16 (272% higher than GPT-5.1), are arguably the most indicative of practical utility.

The article breaks down Gemini 3 Pro’s performance by benchmark category and notes the model’s 1M-token context window and 64K-token output limit. It highlights the model’s strengths in abstract visual reasoning (ARC-AGI-2) and long-horizon planning.
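
The two headline comparisons quoted above rest on different kinds of arithmetic: the GPQA Diamond gap is an absolute difference in percentage points, while the Vending-Bench 2 gap is a relative increase. The sketch below is a minimal illustration using only the figures quoted in this summary. The function names are hypothetical, and the GPT-5.1 net worth it derives is merely implied by the rounded "272% higher" figure; it is not a value reported here.

```python
# Figures quoted in this summary.
GEMINI_3_PRO_GPQA = 91.9         # % on GPQA Diamond
GEMINI_3_DEEP_THINK_GPQA = 93.8  # % on GPQA Diamond with Deep Think
GPT_5_1_GPQA = 88.1              # % on GPQA Diamond
GEMINI_3_PRO_NET_WORTH = 5478.16 # USD, Vending-Bench 2 mean net worth
PERCENT_HIGHER = 272.0           # quoted relative lead over GPT-5.1


def point_lead(score: float, baseline: float) -> float:
    """Absolute lead in percentage points (hypothetical helper)."""
    return round(score - baseline, 1)


def implied_baseline(value: float, percent_higher: float) -> float:
    """Baseline b such that value == b * (1 + percent_higher / 100).

    This only inverts the quoted percentage; because that percentage
    is rounded, the result is an approximation, not a reported figure.
    """
    return round(value / (1 + percent_higher / 100), 2)


if __name__ == "__main__":
    print(point_lead(GEMINI_3_PRO_GPQA, GPT_5_1_GPQA))         # 3.8
    print(point_lead(GEMINI_3_DEEP_THINK_GPQA, GPT_5_1_GPQA))  # 5.7
    print(implied_baseline(GEMINI_3_PRO_NET_WORTH, PERCENT_HIGHER))
```

The first result matches the "nearly 4-point lead" in the quote. The last call suggests a GPT-5.1 mean net worth of roughly $1,470, which is another way of saying that "272% higher" means about 3.7 times the baseline, not 2.7 times.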