A study has found that the latest large language models (LLMs), including ChatGPT and Claude, scored lower than expected on the Stroop test, a classic psychology experiment that measures human attention and executive control. The researchers interpreted the findings as a possible sign of structural limits in today’s transformer-based AI.
On June 4 local time, IT outlet TechRadar reported that the study had been published in the journal PNAS Nexus. The researchers ran Stroop experiments on OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet.
The Stroop test is a classic cognitive psychology task used to measure selective attention and executive control in humans. For example, if the word “red” is written in blue ink, a person experiences a cognitive conflict between the word’s meaning and the actual colour. The Stroop effect refers to the drop in reaction speed and accuracy that occurs when participants must name the ink colour rather than read the word.
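The structure of a Stroop item can be sketched in a few lines of Python. This is only an illustration of the task described above, not the researchers' materials: the colour set, the function names (make_trial, make_list, expected_answers) and the task labels are all assumptions made for the example.

```python
import random

# Hypothetical colour set; the article does not say which colours the study used.
COLOURS = ["red", "blue", "green", "yellow"]

def make_trial(congruent, rng):
    """Build one item: a colour word shown in some ink colour."""
    word = rng.choice(COLOURS)
    if congruent:
        ink = word  # word meaning and ink colour agree
    else:
        # Incongruent item: pick any ink colour that differs from the word.
        ink = rng.choice([c for c in COLOURS if c != word])
    return {"word": word, "ink": ink}

def make_list(n_items, congruent, seed=0):
    """Build a list of items that are all congruent or all incongruent."""
    rng = random.Random(seed)
    return [make_trial(congruent, rng) for _ in range(n_items)]

def expected_answers(trials, task):
    """Word reading expects the word; colour naming expects the ink colour."""
    key = "word" if task == "read_word" else "ink"
    return [t[key] for t in trials]

conflict_list = make_list(5, congruent=False)
print(expected_answers(conflict_list, "read_word"))
print(expected_answers(conflict_list, "name_colour"))
```

On an incongruent list the two expected answer sequences differ at every position, which is the conflict the test is designed to create.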
The researchers had the AI models perform a word-reading task and a colour-naming task. Both models, like humans, showed high accuracy on word reading, but performance fell sharply under conditions where word meaning and colour conflicted.
The decline became more pronounced as the number of questions increased. GPT-4o recorded accuracy of about 91 percent in a 5-item test, but it fell to 57 percent with 10 items, 22 percent with 20 items and 15 percent with 40 items.
Claude 3.5 Sonnet showed relatively better performance but a similar pattern. It maintained accuracy of around 76 percent up to 20 items, but it dropped to 24 percent with 40 items.
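The size of the decline can be worked out directly from the figures quoted above. The short Python sketch below uses only the percentages reported in this article, which are themselves approximate. For Claude 3.5 Sonnet the article gives a single figure of around 76 percent for lists of up to 20 items, so only the 20-item and 40-item points are included. The names reported_accuracy and drop_in_points are made up for the example.

```python
# Accuracy in percent by list length, as quoted in the article (approximate values).
reported_accuracy = {
    "GPT-4o": {5: 91, 10: 57, 20: 22, 40: 15},
    # "around 76 percent up to 20 items"; shorter lengths are not broken out.
    "Claude 3.5 Sonnet": {20: 76, 40: 24},
}

def drop_in_points(series):
    """Percentage-point fall from the shortest to the longest reported list."""
    lengths = sorted(series)
    return series[lengths[0]] - series[lengths[-1]]

for model, series in reported_accuracy.items():
    print(model, drop_in_points(series), "percentage points")
# GPT-4o 76 percentage points
# Claude 3.5 Sonnet 52 percentage points
```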
The researchers interpreted the results as evidence of a structural limitation in executive attention rather than a simple drop in performance. Humans can separate conflicting information and select only what matches a goal, but current large language models struggle with such control processes.
However, the study’s focus on GPT-4o and Claude 3.5 Sonnet was cited as a limitation. By the time the research was made public, newer models such as GPT-5, Claude Opus 4.1 and Gemini 2.5 Pro had already appeared.
The researchers therefore ran follow-up experiments. They said additional tests on GPT-5, Claude Opus 4.1 and Gemini 2.5 Pro also showed limited improvement over the previous generation, and a lack of executive attention was still observed.
The paper argued that the problem may not be one that simply disappears as newer model generations arrive. It said today’s transformer-based architectures are steadily improving memory and information storage, but remain relatively weak in the executive control mechanisms that filter conflicting information and act in a goal-oriented way.
There was one notable exception. In “Thinking” mode, GPT-5 solved the Stroop test almost perfectly by writing and running code. The researchers interpreted this not as a fundamental improvement in cognitive ability but as a case of working around the problem with external tools.
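Why running code sidesteps the conflict can be seen in a minimal Python sketch. This is not the code GPT-5 wrote, which the article does not reproduce. It assumes, purely for illustration, that each item is available as a (word, ink colour) pair, and the function names are hypothetical.

```python
def name_ink_colours(trials):
    """Return the ink colour of every item; the word itself is never consulted."""
    return [ink for _word, ink in trials]

def accuracy(answers, trials):
    """Share of answers that match the true ink colour."""
    correct = sum(a == ink for a, (_word, ink) in zip(answers, trials))
    return correct / len(trials)

# A 40-item incongruent list, the length at which the reported model accuracy was lowest.
trials = [("red", "blue"), ("green", "yellow"), ("blue", "red"), ("yellow", "green")] * 10

answers = name_ink_colours(trials)
print(len(trials), accuracy(answers, trials))  # 40 1.0
```

A program like this has nothing to suppress: it reads one field and ignores the other, so list length makes no difference. That is consistent with the researchers' reading that the result reflects tool use rather than better executive control.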
The researchers suggested future AI development should focus on strengthening executive control rather than simply expanding memory. They stressed that to move a step closer to artificial general intelligence (AGI), AI will need structures similar to human attention systems so that it can handle conflicting information efficiently.
The study is seen as another example showing that even as generative AI develops rapidly and demonstrates strong language generation abilities, it operates differently from human cognitive systems.