Nvidia unveils fast object-detection AI that reads photos, UI and documents

Nvidia unveiled the object-detection AI "LocateAnything". [Photo: Nvidia]

[DigitalToday reporter Jinju Hong (홍진주)] Nvidia has unveiled an artificial intelligence (AI) model called LocateAnything that quickly finds objects in photos and screenshots. Because it was trained not only on general photos but also on application screens and documents, the model can recognise UI elements and the locations of text.

On May 29 local time, online media outlet Gigazine reported that Nvidia introduced LocateAnything as a vision-language model (VLM) specialised for high-speed object detection.

Its key strengths are speed and versatility. In a demonstration video released by Nvidia, objects on the screen are identified very quickly. Unlike existing object-recognition models, which are trained mainly on general photos, LocateAnything was also trained on application screenshots and documents. As a result, it can find not only items in an image but also UI elements such as app menus, buttons and text areas.

Nvidia said performance comparisons showed that LocateAnything distinguishes objects at a finer level than existing models. While Qwen3-VL and REX-Omni struggled to separate repeated objects such as windows or pieces of wood into individual units, LocateAnything detected them accurately, it said. It also claimed that its text-recognition accuracy was higher than that of the two models.

Potential uses in robotics and PC automation are drawing particular attention. Tasks such as finding and clicking a specific button on a screen or extracting needed items from a document depend on technology that pinpoints object locations quickly and accurately. Nvidia said LocateAnything could be used in areas such as robot control and automated software operation.
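To illustrate the button-clicking task described above, the following is a minimal sketch of how a detected bounding box could be turned into a click position. It does not use LocateAnything's actual interface or output format, which the article does not describe. The `Detection` class, its pixel-coordinate fields and the helper functions are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection result: a label plus a box in pixel coordinates.

    This layout is an assumption for illustration, not the model's real output.
    """
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def click_point(det: Detection) -> Tuple[int, int]:
    """Return the centre of the bounding box, rounded to whole pixels."""
    return (
        round((det.x_min + det.x_max) / 2),
        round((det.y_min + det.y_max) / 2),
    )


def find_click_target(
    detections: Iterable[Detection], label: str
) -> Optional[Tuple[int, int]]:
    """Return the click point of the first detection matching `label`, else None."""
    for det in detections:
        if det.label == label:
            return click_point(det)
    return None


# Made-up boxes standing in for menu items found on a screenshot.
menu = [
    Detection("File", 10, 5, 50, 25),
    Detection("Edit", 60, 5, 100, 25),
]
target = find_click_target(menu, "Edit")
```

An automation tool would pass the resulting coordinates to whatever input library it uses. Clicking the centre of the box is one simple choice; it assumes the detected region is entirely clickable.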

A working demo was also released. Users enter what they want to find, supply an image and press "Run Inference", and the object's location is displayed immediately. In one example, entering "video-game" for a photo detected every game package in it, and on a Notepad screenshot the locations of the "File", "Edit" and "View" menus were identified at the same time.

The way it was released is also drawing attention. Nvidia distributed LocateAnything as an open model that can be downloaded from Hugging Face, and it provides a separate demo application.

LocateAnything extends beyond simple image recognition to cover screen understanding and document processing. In particular, because it can handle UI elements and text at the same time, its use is expected to grow in the PC agent and software automation markets.

This #CVPR2026 paper from our research team is trending #1 on @HuggingFace Meet LocateAnything: a vision-language detection model that rethinks bounding box prediction. For AI agents and robots, “seeing” is only useful if a model can pinpoint where something is fast enough to… pic.twitter.com/2OGaQnUCnX

Jinju Hong hongjj@d-today.co.kr
