Nvidia has unveiled ArtiFixer, an artificial intelligence (AI) model that generates 3D scenes based on multiple photos.
Online media outlet Gigazine reported on June 23 (local time) that the model aims to reduce the quality degradation and scene inconsistencies seen in existing 3D reconstruction methods by generating content for areas that do not appear in the reference images.
ArtiFixer is a model with about 16.9 billion parameters built on Wan 2.1, a video-generation AI. Its core approach is to generate the parts of a scene that the photos did not capture and fill them in, which Nvidia said allowed it to produce high-quality 3D scenes.
The research targets the limitations of 3D Gaussian Splatting, a technique that creates 3D scenes from multiple photos. Existing methods often fail to keep a scene consistent, and shapes can easily collapse in areas that were not photographed. Nvidia explained that ArtiFixer addresses these issues with a structure that "generates and inserts parts not in the photos."
Training was conducted in two stages. Nvidia first trained a model capable of generating and filling in missing parts, then distilled it in the second stage into an autoregressive model that produces hundreds of frames from a single frame. The process focused not on simple reconstruction but on stitching together a broader scene from only a limited set of input viewpoints.
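As a rough illustration of what "autoregressive" means in this context, the sketch below rolls out a sequence of frames from a single starting frame, with each new frame predicted from a short window of the frames before it. This is not Nvidia's code: `rollout`, `toy_step`, the `context` window size, and the use of a list of numbers as a "frame" are all assumptions made for illustration, and the toy step function merely stands in for a learned generator.

```python
from collections import deque
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for an image; a real frame would be a pixel array


def rollout(first_frame: Frame,
            step_fn: Callable[[Sequence[Frame]], Frame],
            num_frames: int,
            context: int = 4) -> List[Frame]:
    """Generate num_frames frames autoregressively from a single starting frame."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    frames = [first_frame]
    # Keep only the most recent `context` frames as conditioning input.
    window = deque([first_frame], maxlen=context)
    while len(frames) < num_frames:
        nxt = step_fn(list(window))  # predict the next frame from the window
        frames.append(nxt)
        window.append(nxt)           # the new frame becomes part of the context
    return frames


def toy_step(history: Sequence[Frame]) -> Frame:
    """Hypothetical stand-in for a learned generator: add 1 to the last frame."""
    return [v + 1.0 for v in history[-1]]


frames = rollout([0.0, 0.0], toy_step, num_frames=200)
print(len(frames), frames[-1])  # prints: 200 [199.0, 199.0]
```

Because every frame depends on earlier outputs, errors can accumulate over a long rollout, which is one reason consistency across hundreds of frames is hard to maintain.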
Nvidia released the model in three forms. The base ArtiFixer is an autoregressive model that generates images from multiple viewpoints. ArtiFixer3D distills ArtiFixer's output into a 3D representation. The third form, ArtiFixer3D+, applies the autoregressive model again as a post-processing step.
Nvidia also released comparative results. It explained that ArtiFixer produces relatively sharp scenes, while ArtiFixer3D is more consistent but somewhat blurry. ArtiFixer3D+, it said, generates scenes that are both sharp and highly consistent. In examples comparing it with other approaches such as 3DGUT, GenFusion and GSFixer, Nvidia presented ArtiFixer3D+ as producing higher-quality 3D scenes.
Nvidia is also looking to broaden practical use. It said the model can generate high-quality 3D scenes even in complex environments, such as indoor photos with many objects. This suggests it could be used for tasks that require reconstructing indoor spaces or complex arrangements of objects from a limited number of photos.
Nvidia also released related information through a research page and Hugging Face. The announcement shows that photo-based 3D reconstruction is moving beyond simple re-creation toward filling in unseen areas through generation. The key questions going forward are how reliably this approach can be applied to real 3D production pipelines and whether it can maintain both consistency and sharpness in complex scenes.