Nvidia CEO Jensen Huang speaks at APEC CEO Korea 2025, held last year. [Photo: Nvidia]

Within the span of a month, Nvidia has released 7 million Korean-language synthetic personas for free and followed up with the multimodal model Nemotron3 Nano Omni. The releases mark the arrival in South Korea of a four-step lock-in package linking models, data, frameworks and hardware. Some see Nvidia replicating at the model layer the way CUDA came to dominate the graphics processing unit (GPU) market.

On April 20, Nvidia published the 'Nemotron-Personas-Korea' dataset on Hugging Face: 7 million synthetic Korean-language personas generated from statistics held by Statistics Korea's Korean Statistical Information Service (KOSIS), the Supreme Court, the National Health Insurance Service, the Korea Rural Economic Institute and Naver Cloud. It is the first large-scale Korean-language persona dataset, and it is released under a CC BY 4.0 licence that allows commercial use.

The decision to release Korean-language data for free resembles how CUDA came to dominate the GPU market. CUDA removed the cost barrier for developers and shut out alternatives to Nvidia's GPUs. Nemotron likewise removes cost barriers for models, data and frameworks, leaving no optimal execution environment outside Nvidia hardware.

Nvidia's release of the persona dataset is read as a signal that it is entering South Korea with a four-step lock-in package spanning models, datasets, frameworks and hardware. On the intent behind this open-source strategy, Brian Catanzaro, Nvidia vice president of applied research, said in a lecture at Seoul National University on the 28th, "Whenever good things happen in AI, it is an opportunity for Nvidia to grow."

First comes the model. Nvidia plans to release Nemotron3 Ultra (about 500B) within weeks, following Nemotron3 Nano (30B) and Super (320B). On the 28th, it also added Nemotron3 Nano Omni, a multimodal reasoning model. Omni processes text, images, audio and video in a single system, and Nvidia said its throughput is 9 times higher than comparable open omni models.

Catanzaro also said the Super model ranked No. 1 among open models on the MMLU-Pro benchmark without pre-optimisation. Cumulative downloads of the Nemotron3 family have already topped 50 million over the past year.

Model competitiveness leads to the data stage. The 7 million Korean personas can be seen as the Korea edition of Nvidia's synthetic dataset series produced on a global scale. In his Seoul National University lecture, Catanzaro said, "These synthetic datasets were very useful in products in other languages, and we judged they should be made in Korea as well."

Attributes including name, gender, age, marital status, education level, occupation and region of residence were synthesised to match the actual distribution of the Korean population. Catanzaro added that refining the pretraining dataset improved training efficiency on the same hardware fourfold within a year.
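The matching described above amounts to sampling each attribute from target marginal distributions. A minimal sketch of the idea, using placeholder distributions that are illustrative only (the real dataset is calibrated to actual KOSIS statistics, which these numbers are not):

```python
import random

# Placeholder marginals -- illustrative only, not real KOSIS figures.
GENDER = {"female": 0.50, "male": 0.50}
AGE_BAND = {"20s": 0.15, "30s": 0.16, "40s": 0.18, "50s": 0.20, "60s+": 0.31}
REGION = {"Seoul": 0.18, "Gyeonggi": 0.26, "Busan": 0.06, "Other": 0.50}

def weighted_choice(dist, rng):
    """Draw one value according to the given marginal distribution."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

def synthesize_personas(n, seed=0):
    """Sample n personas whose attributes follow the target marginals."""
    rng = random.Random(seed)
    return [
        {
            "gender": weighted_choice(GENDER, rng),
            "age_band": weighted_choice(AGE_BAND, rng),
            "region": weighted_choice(REGION, rng),
        }
        for _ in range(n)
    ]

personas = synthesize_personas(100_000)
# With enough samples the empirical share converges on the target marginal.
seoul_share = sum(p["region"] == "Seoul" for p in personas) / len(personas)
print(f"Seoul share: {seoul_share:.3f}")
```

Because each attribute is drawn independently here, joint correlations (say, between age and region) are not preserved; a production pipeline would sample from joint or conditional distributions instead.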

With powerful models and precise datasets in place, the next step is tools to use them. Nvidia's Nemo framework fills that role. Nvidia released its entire post-training pipeline (supervised fine-tuning → reinforcement learning based on reward models → coding-specialised reinforcement learning → reinforcement learning from human feedback → rejection sampling), along with 'Pivot RL', a 5.5-times acceleration algorithm, and a multi-domain policy distillation technique. Catanzaro said, "If you can start from open-source technology and customise it, the option value (for developers) increases."
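The rejection-sampling stage of a pipeline like this can be sketched as best-of-k selection: sample several candidate responses per prompt, score them with a reward model, and keep only the highest-scoring one for the next fine-tuning round. The policy and reward model below are trivial stand-ins, not Nvidia's actual components:

```python
import random

def policy_sample(prompt, rng):
    """Stand-in for drawing a response from the current policy model."""
    return f"{prompt} -> draft #{rng.randint(0, 999)}"

def reward_model(response):
    """Stand-in reward; a real reward model would score response quality."""
    return sum(map(ord, response)) % 97 / 97

def rejection_sample(prompts, k=8, seed=0):
    """Keep the best-of-k response per prompt for the next SFT round."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        candidates = [policy_sample(prompt, rng) for _ in range(k)]
        best = max(candidates, key=reward_model)
        kept.append((prompt, best))
    return kept

pairs = rejection_sample(["Explain NVLink", "Summarise MoE"])
for prompt, response in pairs:
    print(prompt, "=>", response)
```

In practice the kept (prompt, best response) pairs are fed back into supervised fine-tuning, which is why rejection sampling sits at the end of the pipeline described above.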

The final stage is hardware dependence, the one element not given away for free. Catanzaro said, "If Nvidia did not deeply understand neural network architectures, it would not have been able to make Blackwell." NVLink 72, introduced in the Blackwell generation, is a structure that lets 72 GPUs access one another's memory; it was designed to maximise the efficiency of mixture-of-experts (MoE) models, and Nemotron3's 'Latent MoE' structure was in turn built on the premise of using NVLink 72. Four-bit (FP4) pretraining was likewise designed to match the characteristics of Blackwell tensor cores.
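Why MoE models reward a fast GPU-to-GPU fabric can be seen in a toy top-1 router: each token is dispatched to exactly one expert, and when experts live on different GPUs that dispatch becomes cross-device traffic. A NumPy sketch of the routing step (illustrative only; Nemotron3's 'Latent MoE' is not implemented here):

```python
import numpy as np

# Toy top-1 mixture-of-experts layer: a router picks one expert per token,
# and only that expert's weights run. In a real deployment the experts sit
# on different GPUs, so routing implies cross-GPU traffic -- the access
# pattern NVLink 72's shared-memory design is built to make cheap.
rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 16, 4, 8

router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
tokens = rng.standard_normal((n_tokens, d_model))

logits = tokens @ router_w                    # (n_tokens, n_experts)
choice = logits.argmax(axis=1)                # top-1 expert per token
gates = np.exp(logits - logits.max(axis=1, keepdims=True))
gates /= gates.sum(axis=1, keepdims=True)     # softmax router probabilities

out = np.empty_like(tokens)
for e in range(n_experts):
    mask = choice == e                        # tokens dispatched to expert e
    out[mask] = (tokens[mask] @ experts[e]) * gates[mask, e:e + 1]

print("tokens per expert:", np.bincount(choice, minlength=n_experts))
```

Only the chosen expert's matrix multiply runs per token, which is how MoE keeps compute low relative to parameter count; the price is the scatter/gather of tokens across experts that the interconnect must absorb.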

◆South Korean companies face choice between short-term efficiency and long-term dependence

In the end, the most efficient place to run the models, data and framework Nvidia has released for free is Nvidia GPUs. The coalition system Catanzaro described can be seen as a stage aimed at forming de facto standards through joint development with global large companies. It is the opposite of the monetisation model of the OpenAI and Anthropic camp, which recoups its investment through API sales of closed models.

This lock-in structure creates friction with the premise of the sovereign AI policy promoted by the government. Sovereign AI is the concept of securing AI sovereignty with a country's own language, its own data and its own infrastructure. But a structure in which a global GPU company synthesises Korean-language data and distributes it for free, and the most efficient place to train and run inference ends up being that same company's hardware, conflicts with the definition of sovereign AI.

As the Korean-ness of data and the foreign nature of infrastructure separate, the range of choices for South Korea's AI sector narrows. The Korean-language persona dataset is a resource that can be put immediately into training in-house models, but as data use accumulates, pressure also grows to shift in-house training and inference environments to the Nemo framework and the Nvidia GPU stack.

Going forward, groups with their own foundation models, such as Naver, Kakao and LG AI Research, will have to make decisions between the short-term efficiency of customising Nemotron bases and long-term dependence. An industry official said, "The structure itself, where a foreign company reconstructs Korean data and then distributes it back domestically, suggests the need to reorganise the sovereign AI initiative."

Keywords

#Nvidia #CUDA #Nemotron #Hugging Face #Blackwell
Copyright © DigitalToday. All rights reserved. Unauthorized reproduction and redistribution are prohibited.