June 5, 2026

Estonian Government Benchmark Ranks Claude Opus 4.7 Best at Resisting Russian Propaganda

The benchmark, developed by a government institute, provides a structured evaluation of how large language models handle propaganda narratives on topics central to Russian strategic communication.

Key Facts

Claude Opus 4.7 scored highest on 77% of questions and averaged 94.9 out of 100 on the Estonian Institute of Language's propaganda resistance benchmark.
The benchmark evaluated 75 questions in three languages across 14 types of Russian propaganda narratives, with answers scored from 1 to 5.
OpenAI's GPT-5.4 scored highest on 54% of questions with an average of 88.9, while GPT-3.5 Turbo ranked last.
Google's Gemini 2.5 Pro scored 66.1 on malicious prompts and 75.5 on Russian-language questions.
The judging model came within 1 point of human expert ratings 88% to 100% of the time.

Reporting from 1 source: GIGAZINE.

The Estonian Institute of Language released a "Propaganda Resistance" benchmark measuring how well large language models resist Russian propaganda. Anthropic's Claude Opus 4.7 ranked first overall, with NVIDIA's Nemotron 3 Super 120B and Alibaba's Qwen 3.6 Plus also scoring high. Among OpenAI's models, GPT-5.4 performed best, while GPT-3.5 Turbo ranked last overall.

The benchmark evaluated 75 questions in three languages across 14 types of Russian propaganda narratives. Questions fell into three categories: neutral prompts, biased prompts built on false premises, and malicious prompts that attempt to elicit explicit disinformation. Answers received scores from 1 to 5, with 5 indicating a balanced and insightful response and 1 indicating a response that amplifies propaganda.

Claude Opus 4.7 received the highest score on 77% of questions and averaged 94.9 out of 100. Anthropic's Sonnet and Opus models occupied six of the top 10 spots. Among open-weight models, NVIDIA's Nemotron 3 Super 120B and Alibaba's Qwen 3.6 Plus approached the top model's level. OpenAI's GPT-5.4 scored highest on 54% of questions with an average of 88.9, while GPT-3.5 Turbo ranked at the bottom of the table.

Google's Gemini models showed weaknesses on malicious prompts and Russian-language questions. Gemini 2.5 Pro scored 66.1 on malicious prompts and 75.5 on Russian-language questions. The judging model used for evaluation came within 1 point of human expert ratings 88% to 100% of the time.
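The reporting does not say how the judge-versus-human agreement figure was computed. One plausible reading, offered here only as an illustrative sketch, is the share of answers where the judging model's score and the human expert's score differ by at most 1 point on the 1-to-5 scale. The function name and the sample ratings below are hypothetical and are not taken from the benchmark.

```python
def within_one_agreement(judge_scores, human_scores, tolerance=1):
    """Share of answers where judge and human scores differ by <= tolerance.

    Assumption: both lists hold 1-to-5 ratings for the same answers, in the
    same order. This is an illustrative metric, not the institute's code.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("score lists must not be empty")
    hits = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= tolerance
    )
    return hits / len(judge_scores)


# Hypothetical ratings for eight answers (made up for illustration).
judge = [5, 4, 2, 1, 3, 5, 4, 2]
human = [5, 3, 4, 1, 3, 4, 4, 2]

rate = within_one_agreement(judge, human)
print(f"{rate:.1%}")  # 7 of 8 pairs are within 1 point
```

Under this reading, a reported range such as 88% to 100% would correspond to the metric being computed separately on different subsets of answers, though the article does not specify which subsets.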

Synthesized by Yomimono from the 1 cited source below, including Japanese-language reporting where cited, then editorially reviewed before publishing.

Sources

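GIGAZINE, "Estonian government releases a benchmark showing which LLMs are best at countering Russian propaganda" (original title: 「どのLLMがロシアのプロパガンダに対抗するのに優れているか？」がわかるベンチマークをエストニア政府が発表)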
