Google AI Overviews wrong 1 in 10 times, raising risk of tens of millions of errors a day

Accuracy assessments can fluctuate more if different models are used for each query even within the same service. [Photo: Google]

Google’s AI Overviews, the AI-generated summaries shown at the top of Google Search results, were found to be about 90 percent accurate in a recent assessment. Analysts said that, at the scale of Google Search, the remaining error rate could translate into millions of incorrect answers an hour and tens of millions a day.

Technology outlet Ars Technica, citing an analysis by The New York Times, reported on April 7 on the controversy over the factual accuracy of AI Overviews and on Google’s response.

The New York Times, working with AI startup Oumi, checked the accuracy of AI Overviews using the SimpleQA benchmark. SimpleQA is an evaluation set of more than 4,000 verifiable questions used to measure the factual accuracy of generative AI.

The test found that AI Overviews were about 85 percent accurate when running on Gemini 2.5. After the update to Gemini 3, accuracy improved to about 91 percent. That still means roughly 1 in 10 answers is wrong, and critics said applying that rate to overall search traffic could produce errors on a significant scale.
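The scale argument is simple arithmetic, and a minimal sketch makes it concrete. The two accuracy rates below come from the reported test; the daily answer volume is a purely hypothetical placeholder chosen for illustration, not a figure from the reporting or from Google, and the function name is likewise invented here.

```python
def expected_errors(accuracy: float, answers: int) -> int:
    """Expected number of wrong answers for a given accuracy rate and answer count."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round((1.0 - accuracy) * answers)


# ASSUMPTION: hypothetical daily number of AI Overviews answers.
# This is an illustrative placeholder, not a reported or official figure.
ASSUMED_DAILY_ANSWERS = 500_000_000

# Accuracy rates as reported for the SimpleQA test.
for label, accuracy in [("Gemini 2.5", 0.85), ("Gemini 3", 0.91)]:
    per_day = expected_errors(accuracy, ASSUMED_DAILY_ANSWERS)
    per_hour = per_day // 24
    print(f"{label}: about {per_day:,} wrong answers a day, {per_hour:,} an hour")
```

Under that assumed volume, even the improved 91 percent rate lands in the range of tens of millions of errors a day, which is the order of magnitude the critics describe. A different volume scales the result proportionally.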

The analysis also identified specific incorrect answers. When asked when singer Bob Marley’s former home was converted into a museum, AI Overviews cited multiple sources, but some of them contained no relevant information, and it picked the wrong year from among conflicting sources. When asked whether cellist Yo-Yo Ma had been inducted into a "Classical Music Hall of Fame", it cited related sites but contradicted them by answering that the hall of fame does not exist.

Google countered that the benchmark itself has reliability problems. Spokesperson Ned Adriance said SimpleQA includes inaccurate data and that Google internally uses an evaluation approach similar to the more strictly validated "SimpleQA Verified". He said, "This study has serious flaws and does not reflect actual user search patterns."

Ars Technica also pointed to structural difficulties in evaluating generative AI. Results can change from run to run even for the same question, and the tools used for evaluation can themselves produce errors. Another variable is that AI Overviews runs on multiple models rather than a single one. Google explained that it selects a model according to the query type and in some cases uses lighter models that prioritize speed and cost efficiency over high-performance models.
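To see why routing queries across several models complicates any single accuracy figure, consider a rough sketch of a traffic-weighted average. Only the 91 percent figure comes from the reported test; the routing shares and the lighter model's accuracy are invented for illustration, since Google has not disclosed them, and the function name is hypothetical.

```python
def blended_accuracy(routes: list[tuple[float, float]]) -> float:
    """Traffic-weighted accuracy across models.

    Each route is (share_of_queries, model_accuracy); shares must sum to 1.
    """
    total_share = sum(share for share, _ in routes)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("query shares must sum to 1")
    return sum(share * accuracy for share, accuracy in routes)


# ASSUMPTION: hypothetical routing mix, not disclosed by Google.
routes = [
    (0.70, 0.91),  # high-performance model (accuracy as reported in the test)
    (0.30, 0.80),  # lighter, faster model (accuracy invented for illustration)
]

print(f"Blended accuracy: {blended_accuracy(routes):.1%}")
```

The point of the sketch is only that the measured rate depends on which models happen to answer the benchmark questions, so a test sample routed differently from real traffic could overstate or understate real-world accuracy.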

The core of the dispute ultimately lies in how search itself is changing. Critics said that, unlike traditional search built around "blue links", placing summarized AI answers at the top increases the risk that users accept wrong answers at face value. For its part, Google displays a notice at the bottom of AI Overviews that says, "AI can make mistakes, so recheck the answer."

Jinju Hong hongjj@d-today.co.kr
