A future AI evaluation system (AI-generated image). [Photo: KAIST]

KAIST said on Monday that a research team led by Professor Eui-jong Hwang (황의종) of the School of Electrical Engineering has developed a system that automatically evaluates and diagnoses the temporal reasoning of large language models, through joint research with Microsoft Research.

Artificial intelligence must be able to accurately understand real-world information that changes from moment to moment. Existing evaluation methods fall short, either because they only check whether an answer matches the correct one or because they fail to sufficiently reflect complex temporal relationships.

To address this, the team applied temporal database design theory to AI evaluation for the first time. By exploiting the temporal flow of data and its relational structure, the system lets a database automatically generate 13 types of complex time-based questions, so people no longer have to write evaluation questions one by one.

A key feature is the shift from having people write questions by hand to generating evaluation questions automatically from data. Using the database, the system automates the entire process, from question generation to answer derivation and verification, removing the need to revise questions individually as before.
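As a rough illustration of the idea, and not the team's actual system, the sketch below derives questions and their answers from a small table of facts that carry validity periods. The `Fact` record, the sample rows, and the two question templates are all hypothetical; the article does not describe the 13 question types in detail.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """One row of a hypothetical temporal table: who held a role, and when."""
    entity: str
    role: str
    holder: str
    start: date
    end: Optional[date]  # None means the fact is still valid


# Made-up sample data for illustration only.
FACTS = [
    Fact("Acme", "CEO", "Alice", date(2015, 1, 1), date(2019, 6, 30)),
    Fact("Acme", "CEO", "Bob", date(2019, 7, 1), None),
]


def holder_at(facts, entity, role, when):
    """Look up who held the role on a given day, using the validity periods."""
    for f in facts:
        if (f.entity, f.role) == (entity, role) and f.start <= when:
            if f.end is None or when <= f.end:
                return f.holder
    return None


def generate_questions(facts):
    """Build (question, answer) pairs from the table instead of writing them by hand."""
    ordered = sorted(facts, key=lambda f: (f.entity, f.role, f.start))
    qa = []
    # Template 1: point-in-time question, answered by querying the table itself.
    for f in ordered:
        question = f"Who was the {f.role} of {f.entity} on {f.start.isoformat()}?"
        qa.append((question, holder_at(facts, f.entity, f.role, f.start)))
    # Template 2: succession question, derived from consecutive rows for the same role.
    for prev, nxt in zip(ordered, ordered[1:]):
        if (prev.entity, prev.role) == (nxt.entity, nxt.role):
            question = f"Who succeeded {prev.holder} as {prev.role} of {prev.entity}?"
            qa.append((question, nxt.holder))
    return qa


for question, answer in generate_questions(FACTS):
    print(question, "->", answer)
```

Because every answer is computed from the same rows that produced the question, the question and its reference answer cannot drift apart, which is the property the article attributes to the database-driven approach.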

When real-world information changes, updating the database automatically carries the changes through to the evaluation questions, answers and verification criteria. The latest information itself is entered from external data sources or by administrators; once the data are updated, the system carries out the rest of the evaluation automatically.
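The update behavior described above can be sketched with Python's built-in `sqlite3` module. This is a minimal example under stated assumptions: the `leads` table, its columns, the names in it, and the helper functions are invented for illustration and are not taken from the research.

```python
import sqlite3


def build_db():
    """Create an in-memory table whose rows carry a validity period."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leads (team TEXT, person TEXT, valid_from TEXT, valid_to TEXT)"
    )
    # valid_to is NULL while the row is still current (hypothetical sample row).
    conn.execute("INSERT INTO leads VALUES ('Team A', 'Kim', '2020-01-01', NULL)")
    return conn


def current_qa(conn, team):
    """Derive a question and its reference answer from the current table state."""
    row = conn.execute(
        "SELECT person FROM leads WHERE team = ? AND valid_to IS NULL", (team,)
    ).fetchone()
    return (f"Who currently leads {team}?", row[0] if row else None)


def record_change(conn, team, new_person, change_date):
    """Close the current row and open a new one; no question text is edited."""
    conn.execute(
        "UPDATE leads SET valid_to = ? WHERE team = ? AND valid_to IS NULL",
        (change_date, team),
    )
    conn.execute(
        "INSERT INTO leads VALUES (?, ?, ?, NULL)", (team, new_person, change_date)
    )


conn = build_db()
before = current_qa(conn, "Team A")
record_change(conn, "Team A", "Lee", "2024-03-01")
after = current_qa(conn, "Team A")
print(before)
print(after)
```

The question text stays the same before and after the change, while the reference answer follows the data. Only the table was touched, which is why maintenance cost falls on the data rather than on the questions.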

The team also introduced a new metric that verifies the logical validity of the dates or periods a model cites while answering. The team found that the metric detected temporal hallucinations, in which the temporal basis for an answer is incorrect, 21.7 percent more accurately on average than before. Because only the database needs updating when information changes, the approach can significantly reduce the cost of maintaining evaluations, and the volume of input data is also reduced by an average of 51 percent compared with before.
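The article does not define the metric, so the following is only a simplified stand-in for the general idea: an answer can name the right person or value while citing dates that fall outside the period in which the supporting fact was valid. The function names, the ISO date format, and the sample sentences are assumptions made for this sketch.

```python
import re
from datetime import date


def cited_dates(text):
    """Extract ISO-formatted dates (YYYY-MM-DD) mentioned in a model's explanation."""
    found = []
    for match in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text):
        try:
            found.append(date.fromisoformat(match))
        except ValueError:
            # Skip strings that look like dates but are not real calendar days.
            continue
    return found


def temporally_grounded(explanation, valid_from, valid_to):
    """Return True only if the explanation cites dates and all lie in the valid period."""
    dates = cited_dates(explanation)
    if not dates:
        return False  # no temporal basis given at all
    return all(valid_from <= d <= valid_to for d in dates)


# Hypothetical validity period of the fact that supports the answer "Kim".
period = (date(2020, 1, 1), date(2024, 2, 29))

good = "Kim led the team from 2020-01-01 to 2024-02-29."
bad = "Kim took over the team on 2018-05-01."

print(temporally_grounded(good, *period))
print(temporally_grounded(bad, *period))
```

Both sample explanations name the same person, so a plain answer-match check would accept both. Only the date check separates the one whose temporal basis agrees with the database from the one whose basis does not.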

Hwang said the work shows that classical database design theory can play an important role in solving reliability issues in the latest AI. He added that he expects it to become a basis for verifying AI performance in fields such as healthcare and law by converting vast amounts of specialised data into evaluation resources.

Supported by Microsoft Research, the National Research Foundation of Korea and the Institute of Information & Communications Technology Planning & Evaluation's (IITP) Global AI Frontier Lab project, the research included KAIST doctoral student So-yeon Kim (김소연) as first author. Microsoft Research's Jindong Wang (진동 왕) and Xing Xie (싱 시에) participated as co-authors. The results are due to be presented later this month at ICLR 2026, a top AI academic conference.

Copyright © DigitalToday. All rights reserved. Unauthorized reproduction and redistribution are prohibited.