South Korea's Ministry of Science and ICT said on Wednesday that, together with the National Information Society Agency (NIA), it has issued a public notice for an "AI training data upcycling" project, which will reprocess existing AI Hub training data to suit generative AI technology environments.
For the project, all 691 AI Hub datasets built through 2022 were analysed, and 30 were selected after an external expert review: 15 for large language models (LLMs) and 15 for physical AI. The total budget is 3 billion won. The ministry said reprocessing existing data yields greater policy impact for the money spent than building new datasets.
The LLM data will be restructured by adding reasoning processes to existing text data, including question setting, evidence review, error verification and answer finalisation. The ministry plans to expand the data so that models can be trained on diverse decision paths and self-verification processes, rather than on a single correct answer.
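As a rough illustration of what such a restructured record might look like, the minimal Python sketch below attaches the four reasoning stages named above to an existing text sample. The class names, field names and stage identifiers are hypothetical, chosen only for illustration; the ministry has not published a schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stage identifiers mirror the four stages named in the article.
# The identifiers themselves are hypothetical, not an official schema.
STAGES = (
    "question_setting",
    "evidence_review",
    "error_verification",
    "answer_finalisation",
)


@dataclass
class ReasoningStep:
    stage: str  # one of STAGES
    text: str   # the annotation written for this stage


@dataclass
class ReasoningRecord:
    source_text: str  # the original, pre-upcycling text sample
    steps: List[ReasoningStep] = field(default_factory=list)

    def add_step(self, stage: str, text: str) -> None:
        """Append one reasoning step, rejecting unknown stage names."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.steps.append(ReasoningStep(stage, text))

    def is_complete(self) -> bool:
        """True when every stage appears at least once, in any order."""
        return {s.stage for s in self.steps} == set(STAGES)

    def final_answer(self) -> Optional[str]:
        """Return the most recent finalised answer, or None if absent."""
        for step in reversed(self.steps):
            if step.stage == "answer_finalisation":
                return step.text
        return None
```

Storing the steps as a list rather than as four fixed fields would let one source text carry several alternative paths or repeated verification passes, which is one way to read the article's "diverse decision paths".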
The physical AI data will be upgraded from existing image and video data into a structure that integrates visual information (V), language commands (L), and action and control (A). The data will be expanded so that models can go beyond object recognition to understand how situations change over time and how objects interact, and to generate goal-based actions.
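One way to picture the V-L-A structure described above is as an episode that pairs a single language command with a time-ordered sequence of frames and control actions. The Python sketch below is an assumption made for illustration only: the class names, the field names, and the choice of a flat tuple of floats for the action are all hypothetical and do not reflect any published AI Hub format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VLAStep:
    timestamp_s: float          # time of the frame, in seconds
    frame_path: str             # V: path to the image or video frame
    action: Tuple[float, ...]   # A: control vector applied at this step


@dataclass
class VLAEpisode:
    instruction: str            # L: the language command for the episode
    steps: List[VLAStep]

    def validate(self) -> None:
        """Check that the episode is usable as a time-ordered sample."""
        if not self.instruction.strip():
            raise ValueError("instruction must not be empty")
        if not self.steps:
            raise ValueError("episode has no steps")
        # Every step must use the same action dimensionality.
        if len({len(s.action) for s in self.steps}) != 1:
            raise ValueError("inconsistent action dimensions")
        # Timestamps must strictly increase so that changes over time
        # can be read off consecutive steps.
        times = [s.timestamp_s for s in self.steps]
        if any(later <= earlier for earlier, later in zip(times, times[1:])):
            raise ValueError("timestamps must be strictly increasing")
```

Keeping the frames in time order is what separates this layout from a set of independently labelled images: consecutive steps record how the scene and the commanded action evolve together.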
Choi Dong-won (최동원), director general for AI infrastructure policy at the ministry, said, "Through this upcycling project, we will be able to secure AI training data suited to the latest generative AI technology environments at low cost." He added, "We will make the data more useful so that the data assets already accumulated do not go to waste."