Fact-checking test of five latest AI models finds disagreement on 67 percent of claims

The results show that even the latest AI may not reach consistent fact-checking conclusions on the same claim. [Photo: Shutterstock]

Can AI take over fact-checking? When five of the latest artificial intelligence models were asked to verify the same 1,000 claims, their verdicts diverged in more than two out of three cases. As AI is increasingly adopted as a fact-checking tool, the results show that conclusions can vary widely from model to model.

On June 1 (local time), online media outlet Gigazine reported that fact-checking service Lenz recently analysed how often major large language models delivered matching verdicts on 1,000 claims submitted by users.

The test involved five models: GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search and Sona Pro. Each model rated each claim with one of four options: "true," "mostly true," "misleading" or "false."

The results showed larger gaps than expected. Of the 1,000 claims, all five models reached the same conclusion on 328. On the other 672, at least one model delivered a different verdict. In 132 cases, the verdicts were split in so many directions that no rating secured a majority.
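The three categories above can be sketched in a few lines of Python. This is an illustrative sketch, not Lenz's actual method: the function names and the sample verdicts are made up, and "majority" is assumed to mean more than half of the five models, which the article does not spell out.

```python
from collections import Counter

def classify_agreement(verdicts):
    """Label one claim's verdicts as 'unanimous', 'majority' or 'no_majority'.

    Assumption: a majority means strictly more than half of the models.
    """
    top_count = Counter(verdicts).most_common(1)[0][1]
    if top_count == len(verdicts):
        return "unanimous"
    if top_count > len(verdicts) / 2:
        return "majority"
    return "no_majority"

def summarize(claims):
    """Count how many claims fall into each agreement category."""
    tally = Counter(classify_agreement(v) for v in claims)
    return {k: tally[k] for k in ("unanimous", "majority", "no_majority")}

# Made-up verdicts from five models on three claims (not Lenz's data).
example_claims = [
    ["false", "false", "false", "false", "false"],      # all five agree
    ["true", "true", "mostly true", "true", "true"],    # 4 of 5 agree
    ["false", "misleading", "false", "true", "true"],   # 2-2-1 split
]
print(summarize(example_claims))

# Arithmetic on the figures reported in the article (out of 1,000 claims).
TOTAL, UNANIMOUS, NO_MAJORITY = 1000, 328, 132
not_unanimous = TOTAL - UNANIMOUS            # 672 claims, or 67.2 percent
majority_only = not_unanimous - NO_MAJORITY  # 540 claims: a majority, but not unanimity
print(not_unanimous / TOTAL, majority_only)
```

Read this way, the reported figures imply that in 540 of the 672 disputed cases a majority verdict still existed, while the remaining 132 had none.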

This goes beyond one or two models giving a different answer. In many cases no shared conclusion formed at all, even though the claim was the same.

Differences also showed up in specific examples. One case Lenz released concerned whether Ukrainian President Volodymyr Zelenskiy had been nominated for the 2026 Nobel Peace Prize. GPT-5.4 and Gemini 3 Pro judged the claim "false," while Gemini 3 Pro+Search and Sona Pro rated it "true." A later check found that Zelenskiy had in fact been nominated.

Models also disagreed on matters that are relatively easy to verify, such as whether a well-known figure made a particular statement, generalised claims about psychology, and World Bank statistics.

The models also differed clearly in their judgment tendencies. GPT-5.4, Claude Opus 4.7 and Sona Pro chose middle ratings such as "mostly true" or "misleading" relatively often. By contrast, the Gemini 3 Pro line tended more strongly toward categorical conclusions such as "true" or "false."

In other words, even on the same fact-checking task, results can differ depending on whether a model approaches it conservatively or leans toward binary judgments.

Lenz explained that the purpose of the study was not to determine which model is best. The company said it is conducting additional research in which humans directly assign correct labels to the same claims and then evaluate each model's accuracy against that benchmark. It also said what matters is revealing the mismatches among models themselves, and that there is value in checking which types of claims trigger differences in opinion.
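The follow-up evaluation Lenz describes, scoring each model against human-assigned labels, might look roughly like the sketch below. It is a sketch under stated assumptions: Lenz has not published its scoring rule, so exact-match accuracy is assumed here, and the model names and ratings are placeholders rather than real results.

```python
def accuracy_by_model(model_verdicts, human_labels):
    """Return each model's share of verdicts that exactly match the human label.

    Assumption: exact match on the four-level rating; the real study may
    score partial agreement (e.g. 'true' vs 'mostly true') differently.
    """
    scores = {}
    for model, verdicts in model_verdicts.items():
        if len(verdicts) != len(human_labels):
            raise ValueError(f"{model}: expected {len(human_labels)} verdicts")
        matches = sum(v == h for v, h in zip(verdicts, human_labels))
        scores[model] = matches / len(human_labels)
    return scores

# Placeholder data: four claims, two hypothetical models.
human_labels = ["true", "false", "misleading", "false"]
model_verdicts = {
    "model_a": ["true", "false", "mostly true", "false"],  # 3 of 4 match
    "model_b": ["false", "false", "misleading", "true"],   # 2 of 4 match
}
print(accuracy_by_model(model_verdicts, human_labels))
```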

The results also show the limits of AI-based search and fact-checking services. That model judgments differed even on matters that can be checked relatively objectively, such as public data or facts about individuals, suggests users cannot safely take a single AI model's answer as fact. Models combined with search functions also did not always deliver more accurate or more consistent conclusions.

Industry analysts see follow-up work grounded in human evaluation, examining which types of claims attract the most disagreement and which models diverge most often from human judgment, as an important yardstick for assessing the reliability of AI-based fact-checking services.

AI is emerging as a new tool for verifying information, but the findings are seen as showing that, at least for now, results still need to be cross-checked across multiple models and put through final human verification.

Jinju Hong hongjj@d-today.co.kr
