ChatGPT Images 2.0 rated best in crowded illustration test, beating Nano Banana and Opus

An image generated by ChatGPT Images 2.0 [Photo: Simon Willison’s Weblog]

OpenAI has unveiled a new image generation feature, ChatGPT Images 2.0, and a test comparing the ability to create complex crowd illustrations found it delivered more complete results than rival models.

On April 22, online media outlet Gigazine reported that software engineer Simon Willison ran an experiment generating Where's Wally?-style images with multiple image generation AI models.

The prompt was to depict a scene in which viewers must find a raccoon holding an amateur radio in a crowd. The focus was not on simple image generation, but on whether a specific target could be hidden naturally among many people and elements.

OpenAI's existing model, gpt-image-1, captured some of the original style's overall mood but fell short on detail. Faces and bodies appeared smudged or distorted. The key element, a raccoon holding a radio, was also not clearly identifiable. The assessment said the target was difficult to find even when examining the image closely.

Anthropic's Claude Opus 4.7 was also used for image analysis, but the result was not much different. The model mentioned the possibility that a raccoon was included, but it did not clearly identify an individual holding an amateur radio. It was cited as an example showing limits not only in generation but also in how interpretable the output is.

Google-linked models showed similar problems. Gemini-based Nano Banana 2 placed an "amateur radio club" booth in the center and included a raccoon inside it, but failed to hide it naturally in the crowd. Nano Banana Pro placed a large raccoon in a striped outfit at the center, producing an output that emphasized a main character rather than a hidden-object search.

ChatGPT Images 2.0, by contrast, stood out. In an image generated at 3840x2160 resolution, it placed a raccoon holding an amateur radio naturally in the lower-left corner. The raccoon was not oversized relative to the surrounding people, and the assessment said the target was hidden well enough to fit the crowd scene while still being findable.

Willison rated the result as "a fairly high level of completeness" compared with other image generation AI, and said that, for now, ChatGPT Images 2.0 appears to be ahead of rival models. He added that complex compositions such as Where's Wally? are a somewhat demanding way to test model performance, but are useful for checking how precisely text instructions can be translated into visual structure.

Cost information was also disclosed. Generating one image used about 13,342 output tokens, putting the per-image cost at about $0.40.
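The two figures above imply a rate of roughly $30 per million output tokens. That rate is an assumption back-calculated from the article's numbers, not a published price, and the function name below is purely illustrative. A minimal sketch of the arithmetic:

```python
# Hypothetical rate inferred from the article's figures (about $0.40 for
# about 13,342 output tokens). It is NOT an official price.
ASSUMED_USD_PER_MILLION_OUTPUT_TOKENS = 30.0


def image_cost_usd(
    output_tokens: int,
    usd_per_million: float = ASSUMED_USD_PER_MILLION_OUTPUT_TOKENS,
) -> float:
    """Estimate the cost in USD of generating output_tokens tokens."""
    return output_tokens * usd_per_million / 1_000_000


# Token count reported for the single test image.
cost = image_cost_usd(13_342)
print(f"Estimated cost per image: ${cost:.2f}")
```

Under that assumed rate, the estimate rounds to the roughly 40 cents per image cited in the report. A different rate would scale the result proportionally.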

The comparison suggests that competition in image generation AI is shifting from simple image quality or style reproduction to how accurately complex instructions can be realized as scene structure. As the ability to place specific elements naturally in scenes with many objects and people emerges as a new evaluation standard, some observers say OpenAI's latest model has taken an early lead.

Jinju Hong hongjj@d-today.co.kr
