Simon Willison's Pelican Test Marks November 2025 as AI Inflection Point
The informal benchmark provides a concrete, visual measure of how quickly image generation quality improved across both proprietary and open-weight models in just six months.
Reporting from 1 source: GIGAZINE.
Developer Simon Willison presented results from his 'pelican riding a bicycle' benchmark at PyCon US 2026, tracking model improvements from November 2025 to May 2026. He identified November 2025 as an inflection point: models such as Gemini 3, GPT-5.1, and Claude Sonnet 4.5 drew increasingly natural pelican-bicycle relationships. Open-weight models such as Qwen3.6-35B-A3B matched top-tier performance.
Simon Willison has been running the same test since June 2025: ask a large language model to draw a pelican riding a bicycle. At PyCon US 2026, he presented the results from November 2025 through May 2026, calling the start of that period an inflection point. Outputs from Claude Sonnet 4.5 and GPT-5.1 early in that period showed distorted pelican-bicycle relationships. By February 2026, Gemini 3.1 Pro produced a natural image, though it added an unprompted fish in the basket.
Open-weight models also improved. Qwen3.6-35B-A3B, a 20.9 GB model that runs on a laptop, generated a more natural image than Anthropic's top-tier Claude Opus 4.7. Willison noted that GLM-5.1, a 1.51 TB open-weight model, handled the still image well but produced distorted motion when asked to animate it.
Synthesized by Yomimono from the 1 cited source below, including Japanese-language reporting where cited, then editorially reviewed before publishing.