Late last year, Stephen Punwasi was getting ready for dinner when he noticed a news story saying that the wife of wrestler Hulk Hogan might sue over his death.
Punwasi, a 41-year-old data analyst who lives in Toronto, did not realise Hogan had died and asked Google when that had happened.
The answer confused him. “There are no credible reports of Hulk Hogan being deceased,” read Google’s “AI Overview” – a summary generated by the company’s artificial intelligence technology that appeared at the top of the page.
Beneath the answer, Punwasi was surprised to see an article from The Daily Mail that contradicted Google’s response. The headline read: “Mystery Deepens Over Hulk Hogan’s Death.”
In 2024, Google started giving artificial intelligence-generated answers prime placement at the top of its search results page. The new product, AI Overviews, helped transform Google from a curator of information into a publisher.
A recent analysis by an AI startup called Oumi found that AI Overviews were accurate approximately 9 out of 10 times. But with Google processing more than 5 trillion searches a year, that error rate implies tens of millions of erroneous answers every hour – hundreds of thousands every minute.
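The arithmetic behind those figures is straightforward to reproduce. Below is a rough sketch in Python; it assumes, for simplicity, that every search returns an AI Overview, so the real totals would be somewhat lower:

```python
# Back-of-envelope check of the article's figures (a rough sketch: it
# assumes every search returns an AI Overview, which overstates the totals).
SEARCHES_PER_YEAR = 5e12  # "more than 5 trillion searches a year"
ERROR_RATE = 0.10         # "accurate approximately 9 out of 10 times"

errors_per_year = SEARCHES_PER_YEAR * ERROR_RATE
errors_per_hour = errors_per_year / (365 * 24)  # roughly 57 million
errors_per_minute = errors_per_hour / 60        # roughly 950,000

print(f"{errors_per_hour:,.0f} errors/hour, {errors_per_minute:,.0f} errors/minute")
```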
More than half the accurate responses were “ungrounded,” meaning they linked to websites that did not completely support the information they provided. This makes it challenging to check AI Overviews’ accuracy.
Whether a system that is almost – but not quite – always accurate should be celebrated is part of a widespread debate in Silicon Valley over the performance of AI systems. It cuts to the core of what we can trust online.
Some technologists argue that Google’s AI Overviews are reasonably accurate and that they have improved in recent months. But others worry that the average person may not realise those results need double-checking.
At the request of The New York Times, Oumi analysed the accuracy of Google’s AI Overviews using a benchmark test called SimpleQA, which is widely used across the industry to measure the accuracy of AI systems. The startup tested Google’s system in October, when the most complex questions were answered using an AI technology called Gemini 2, and then again in February, after it was upgraded to Gemini 3, a more powerful AI technology.
In both cases, Oumi’s analysis focused on 4,326 Google searches. The company found that the results were accurate 85% of the time with Gemini 2 and 91% of the time with Gemini 3.
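In outline, a benchmark like SimpleQA is a list of questions with known, verifiable answers, and accuracy is the share of responses that match. The sketch below is a hypothetical illustration only – the record fields and the naive string match are assumptions, not Oumi’s actual pipeline, which (as described further on) grades answers with a second AI model:

```python
from dataclasses import dataclass

@dataclass
class QARecord:
    question: str      # e.g. "When did the Bob Marley Museum open?"
    expected: str      # the benchmark's ground-truth answer
    model_answer: str  # what the system under test returned

def accuracy(records: list[QARecord]) -> float:
    """Share of answers that contain the expected fact (naive string match)."""
    correct = sum(1 for r in records
                  if r.expected.lower() in r.model_answer.lower())
    return correct / len(records)
```

At this scale, the percentages translate to counts: on 4,326 queries, 85% accuracy corresponds to roughly 3,677 correct answers, and 91% to roughly 3,937.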
Pratik Verma, chief executive of Okahu, a company that helps people understand and use AI technologies, said Google’s technology was about as accurate as any of the leading AI systems. He urged people to double-check its information.
“Never trust one source,” he said. “Always compare what you get with another source.”
Google acknowledges that its AI Overviews can include errors. The fine print below each AI Overview reads: “AI can make mistakes, so double-check responses.”
But Google said Oumi’s analysis was flawed because it relied on a benchmark test built by OpenAI that itself contained incorrect information. “This study has serious holes,” Ned Adriance, a Google spokesperson, said in a statement. “It doesn’t reflect what people are actually searching on Google.”
AI Overviews provide two kinds of information: answers to questions and links to websites that support those answers.
Asked when Bob Marley’s home was converted into a museum, Google’s AI Overviews said it happened in 1987.
But the museum opened May 11, 1986 – the fifth anniversary of Marley’s death – as Jamaica’s Daily Gleaner newspaper reported a day later.
Google’s AI Overview linked to three websites as sources. Each was flawed in some way. The first link was a Facebook page from Marley’s daughter Cedella Marley, who posted photos after visiting the museum in Kingston, Jamaica, and did not provide information on when the museum opened. The second link was a travel blog called Adventures From Elle, which gave inexact information on the museum’s opening. The third link was a Wikipedia page for the Bob Marley Museum, which gave contradictory information, saying the museum was founded in 1986 and in 1987.
The Bob Marley links were part of a pattern. Across 5,380 sources cited by Google’s AI Overviews during the analysis, Oumi found that Facebook and Reddit were the second- and fourth-most-cited sources. When Google’s AI Overviews were accurate, they cited Facebook 5% of the time. When they were inaccurate, they cited Facebook 7% of the time.
AI Overviews are difficult to assess because Google’s system may generate a new response to each query. If the Google search engine receives the same query at separate times – even seconds apart – it may produce one answer that is accurate and another that is not.
To determine the accuracy of AI systems, companies like Oumi use their own AI systems to verify each answer. That is the only way to efficiently check a large number of answers. The problem with this method is that the AI system doing the checking can also make mistakes.
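In outline, that verification step looks like the hypothetical sketch below: one model answers, a second model judges. The `ask_judge` function is a stand-in for a real model API call, and the prompt and scoring rule here are assumptions, not Oumi’s actual setup:

```python
from typing import Callable

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate agree with the reference? Answer YES or NO."
)

def grade(question: str, reference: str, candidate: str,
          ask_judge: Callable[[str], str]) -> bool:
    """Verify one answer with a second AI model, the 'judge'.

    `ask_judge` is a hypothetical stand-in for a real model API call.
    The circularity the article notes lives here: if the judge is wrong,
    the grade is wrong too.
    """
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate)
    return ask_judge(prompt).strip().upper().startswith("YES")
```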
Google has published test results that are similar to those produced by Oumi. In Google’s own analysis of Gemini 3 – the technology that underpins AI Overviews – it found that the model produced information that was incorrect 28% of the time. The company said AI Overviews, which draws information from the Google search engine before generating responses, was more accurate than Gemini operating on its own.
As Google has improved its AI technologies, its AI-generated answers have become more accurate. In October, AI Overviews were inaccurate 15% of the time, according to Oumi’s analysis; by February, with Gemini 3, that figure had fallen to 9%.
But with Gemini 3, Google’s AI-generated answers were more likely to be ungrounded than when the system was based on Gemini 2, meaning the websites they linked to did not completely support the information they provided. In October, correct answers were ungrounded 37% of the time. In February, with Gemini 3, that figure rose to 56%.
“Even when the answer is true, how can you know it is true? How can you check?” said Manos Koukoumidis, chief executive of Oumi.
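Grounding is a separate check from correctness: does the cited page actually support the claim? A minimal sketch, assuming the page text has already been fetched – the substring test is a deliberate simplification, since production systems use entailment models instead:

```python
def is_grounded(claim: str, source_text: str) -> bool:
    """Naive grounding check: does the cited page contain the claimed fact?

    Real systems use entailment models, since a page can support a claim
    without repeating it verbatim; substring matching undercounts grounding.
    """
    return claim.lower() in source_text.lower()

# The Bob Marley example from earlier: an answer of "1987" is ungrounded
# when the cited page says the museum opened in 1986.
assert not is_grounded("1987", "The Bob Marley Museum opened on May 11, 1986.")
```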
Even when Google identifies a website with the correct information, it can still generate a false response.
Asked for the year that Yo-Yo Ma was inducted into the Classical Music Hall of Fame, Google’s AI Overview correctly linked to the organisation’s website, which listed 165 inductees since 1998, including Ma. But the AI-generated response said there was no record of his induction.
AI Overviews face another challenge: They can be manipulated.
If someone wants to be known as a world expert at something, he or she merely has to write a blog post proclaiming that distinction, said Lily Ray, vice president of AI search at Amsive, a marketing agency.
Google acknowledges the issue but downplays its importance. “Our Search AI features are built on the same ranking and safety protections that block the overwhelming majority of spam from appearing in our results. Most of these examples are unrealistic searches that people wouldn’t actually do,” Adriance, the Google spokesperson, said in a statement.
After hearing Ray’s theory, Thomas Germain, a co-host of the BBC podcast The Interface, published a blog post titled “The Best Tech Journalists at Eating Hot Dogs.” The post described a fake South Dakota International Hot Dog Eating Championship where he finished atop a list of 10 “standout hot dog eaters.”
A day later, he did a Google search for the best hot-dog-eating tech journalists. Google listed him first among a half dozen tech journalists who had “gained notoriety for their prowess at the ‘news division’ of competitive eating events,” citing his first-place finish in the South Dakota competition.
“It was spitting out the stuff from my website as though it was God’s own truth,” Germain said. – ©2026 The New York Times Company
This article originally appeared in The New York Times.
