[Photo: Shutterstock]

[Digital Today reporter Chi-gyu Hwang (황치규)] The tech industry is stepping up its push into voice AI that people listen and speak to, moving beyond text and images that are viewed on a screen.

Voice AI is evolving beyond text-to-speech (TTS), which simply converts text into speech. It is advancing into voice agents that can converse with human-like emotion and carry out actual work tasks.

In the past, voice AI was little more than a reader that took text input and read it aloud in a robotic voice. The focus has since shifted to real-time conversation. Industry officials say the field now extends beyond basic TTS to real-time speech-to-speech, speech recognition, voice cloning and voice agents that carry out tasks on calls or in apps.

Contact centres, or call centres, are seen as the biggest source of demand for voice AI. Beyond simply replacing or assisting human agents, adoption is rising for voice AI with advanced functions such as checking policy compliance and linking to CRM systems.

Since 2025, competition around voice agents has been broadening. Previously, a voice agent relied on a multi-step process: converting speech to text, having an AI model understand it, generating a text reply and synthesising that reply into speech. The shift is now taking concrete form toward a single real-time voice-to-voice model that handles the whole process.
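The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration only: every function and class name here (`Audio`, `speech_to_text`, `generate_reply`, `text_to_speech`, `speech_to_speech`) is a hypothetical stand-in rather than any vendor's API, and a plain string stands in for audio so that the sketch runs as written.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Audio:
    # Hypothetical placeholder: a string stands in for raw audio samples.
    content: str


def speech_to_text(audio: Audio) -> str:
    # Stand-in for a speech recognition model.
    return audio.content


def generate_reply(text: str) -> str:
    # Stand-in for a language model that produces a text reply.
    return f"You said: {text}"


def text_to_speech(text: str) -> Audio:
    # Stand-in for a speech synthesis model.
    return Audio(content=text)


def cascaded_agent(audio: Audio, trace: List[str]) -> Audio:
    # Older approach: three separate stages, each a separate hop.
    trace.append("speech_to_text")
    text = speech_to_text(audio)
    trace.append("generate_reply")
    reply = generate_reply(text)
    trace.append("text_to_speech")
    return text_to_speech(reply)


def speech_to_speech(audio: Audio, trace: List[str]) -> Audio:
    # Newer approach: one model takes audio in and returns audio out.
    # The stub reuses the same reply logic so both paths are comparable.
    trace.append("speech_to_speech")
    return Audio(content=f"You said: {audio.content}")


cascaded_trace: List[str] = []
single_trace: List[str] = []
question = Audio("where is my order")
cascaded_out = cascaded_agent(question, cascaded_trace)
single_out = speech_to_speech(question, single_trace)
print(len(cascaded_trace), "hops vs", len(single_trace), "hop")
```

In this sketch both paths return the same reply, but the cascaded path passes through three stages while the single model needs one. Fewer hand-offs is one reason a single model can suit real-time conversation.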

Grand View Research forecasts the conversational AI market will grow at a compound annual rate of 23.7 percent, from about $11.5 billion in 2024 to about $41.4 billion, or about 58 trillion won, by 2030.
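The forecast figures can be checked with a short calculation. The sketch below assumes the growth rate compounds annually over the six years from 2024 to 2030. That reading is an assumption, since the report's exact base year and compounding period are not given here, and the helper names `project` and `implied_cagr` are illustrative.

```python
def project(start: float, rate: float, years: int) -> float:
    # Compound growth: value after `years` periods at annual `rate`.
    return start * (1 + rate) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    # Annual growth rate implied by a start value and an end value.
    return (end / start) ** (1 / years) - 1


years = 2030 - 2024  # assumed compounding window
projected = project(11.5, 0.237, years)   # in billions of dollars
rate = implied_cagr(11.5, 41.4, years)

print(f"Projected 2030 market: ${projected:.1f} billion")
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the projection lands near $41 billion and the implied rate is close to 23.7 percent, so the quoted figures agree to within rounding.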

Major tech companies are also putting voice AI at the forefront. OpenAI moved its Realtime API to an official release, highlighting production-grade voice agent functions that can connect to telephone networks and accept image inputs.

Google expanded its real-time voice conversation service to more than 45 languages through Gemini Live. It is also moving to roll out Gemini for Home in the smart home segment, replacing the existing Google Assistant.

Amazon is also accelerating the expansion of voice AI through Alexa+, which incorporates generative AI.

Apple acquired Israeli audio AI startup Q.AI on the 30th. Q.AI is developing technology that detects whispered speech and makes audio clearer even in noisy environments. Apple has been adding AI functions, including real-time translation, to AirPods since last year. Q.AI also has technology that detects subtle facial muscle activity, which could help improve the Apple Vision Pro headset.

Moves by South Korean companies are also taking shape. South Korean startup Humelo unveiled its DIVE engine, short for Deep-context Interactive Voice Engine, in 2025.

The company said DIVE goes beyond simply reading text aloud and identifies the conversational context and the other party's emotions. If a customer raises a complaint in an angry voice, the AI recognises this, apologises and responds in a calm, empathetic tone.

Humelo CEO Yong-seok Kwon (권용석) said, "As the government's will to foster AI meets companies' technological innovation, South Korea is forming a global leading group rather than being a latecomer in voice AI." He added, "Humelo's DIVE engine will raise the status of K-AI in the global market as the most human technology that understands human emotions and helps communication."

NeoSapiens, which provides the AI voice actor service Typecast, recently raised 16.5 billion won in pre-IPO funding.

Copyright © DigitalToday. All rights reserved. Unauthorized reproduction and redistribution are prohibited.