Data platform company Cloudera is promoting synthetic data as a way for companies to reduce the privacy risks of using large language models (LLMs).
The company said that as AI becomes deeply integrated across business operations, LLMs are being used for a range of tasks, including customer support, data analysis, developer productivity and knowledge management. With the rise of AI agents, AI is also evolving beyond information retrieval and reasoning towards performing practical work.
As AI use expands, concerns about privacy risks are also being raised. Data needed to improve AI model performance often includes personally identifiable information (PII), regulated information and a company's own business context, such as support chat records, transaction histories and operational logs.
Synthetic data is algorithmically generated data that reflects key patterns of real datasets while not reproducing actual records. Companies can use it to proceed with AI development and testing while reducing exposure of sensitive information.
Synthetic data has evolved beyond the stage of generating simple tabular data. Companies can now generate synthetic instruction data, synthetic conversation data, synthetic incident tickets and synthetic question-and-answer data that mirror real workflow structures without drawing on the original records.
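The basic idea can be sketched in a few lines. The example below generates synthetic incident tickets that keep a realistic schema while containing no real customer data; the field names and value pools are illustrative assumptions, not Cloudera's method or any real dataset.

```python
import random

# Hypothetical value pools standing in for a real support-ticket schema.
PRODUCTS = ["router", "modem", "mesh-extender"]
ISSUES = ["no connectivity", "slow speeds", "firmware update failed"]
RESOLUTIONS = ["reboot device", "reset to factory settings", "escalate to tier 2"]

def synthetic_ticket(rng: random.Random) -> dict:
    """Generate one synthetic incident ticket: realistic structure, no real PII."""
    return {
        "ticket_id": f"SYN-{rng.randrange(100000, 999999)}",
        "customer": f"customer_{rng.randrange(1, 10000)}",  # placeholder, not a real identity
        "product": rng.choice(PRODUCTS),
        "issue": rng.choice(ISSUES),
        "resolution": rng.choice(RESOLUTIONS),
    }

rng = random.Random(42)  # seeded so the run is reproducible
tickets = [synthetic_ticket(rng) for _ in range(3)]
for ticket in tickets:
    print(ticket)
```

Production systems typically fit a generative model to the statistics of the real data rather than sampling from hand-written pools, but the privacy property is the same: downstream teams work with records that were never tied to a real person.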
Cloudera cited three AI development areas where synthetic data plays an important role.
First is supervised fine-tuning (SFT) and domain adaptation. Companies want AI models to operate in specific domains. That means understanding and reflecting an organisation's own terminology, policy rules, product catalogue structures and escalation logic. But training data needed for such fine-tuning often contains sensitive information, limiting its use. Cloudera said synthetic datasets can provide a safe training environment that reflects real work intent and formats while minimising the risk of personal data exposure.
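A synthetic SFT dataset of this kind is often serialised as JSONL instruction/response pairs. The sketch below assumes a common chat-style format with a "messages" list; the terminology and escalation rule are invented placeholders for an organisation's own domain content.

```python
import json
import random

# Illustrative domain vocabulary and policy rule (assumptions, not real data).
TERMS = {"MRR": "monthly recurring revenue", "SLA": "service-level agreement"}
ESCALATION_RULE = "Refunds above 500 EUR must be escalated to a supervisor."

def make_sft_example(rng: random.Random) -> dict:
    """Build one synthetic instruction/response pair in a chat-style SFT format."""
    term, definition = rng.choice(sorted(TERMS.items()))
    return {
        "messages": [
            {"role": "user",
             "content": f"What does {term} mean in our support policy?"},
            {"role": "assistant",
             "content": f"{term} stands for {definition}. {ESCALATION_RULE}"},
        ]
    }

rng = random.Random(0)
with open("synthetic_sft.jsonl", "w", encoding="utf-8") as f:
    for _ in range(5):
        f.write(json.dumps(make_sft_example(rng)) + "\n")
```

Because every example is generated from templates and policy text rather than real support transcripts, the fine-tuning set carries the organisation's terminology and rules without exposing customer records.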
Next is large-scale AI model evaluation. Bottlenecks in enterprise AI programmes frequently occur at the model evaluation stage. Teams must test models across varied situations, including routine queries, edge cases, error scenarios and compliance-sensitive topics.
Cloudera said synthetic task generation can help build broad, repeatable evaluation sets faster than manual methods. If done effectively, it can raise confidence in model behaviour before real-world deployment and reduce the need to handle sensitive original data during testing.
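One way to build such a repeatable evaluation set is to expand tagged prompt templates across the scenario categories the article lists. The categories and templates below are illustrative assumptions; a real set would be derived from an organisation's actual query distribution.

```python
import itertools

# Illustrative scenario categories and prompt templates (assumptions).
CATEGORIES = {
    "routine": ["How do I reset my {product} password?"],
    "edge_case": ["My {product} account shows -1 items. What happened?"],
    "compliance": ["Can you email me another customer's {product} invoice?"],
}
PRODUCTS = ["billing portal", "mobile app"]

def build_eval_set() -> list[dict]:
    """Expand every (category, template, product) combination into a tagged test case."""
    cases = []
    for category, templates in CATEGORIES.items():
        for template, product in itertools.product(templates, PRODUCTS):
            cases.append({"category": category,
                          "prompt": template.format(product=product)})
    return cases

eval_set = build_eval_set()
print(f"{len(eval_set)} test cases across {len(CATEGORIES)} categories")
```

Because the set is generated rather than hand-collected, it can be regenerated identically for every model candidate, which is what makes evaluations comparable across runs.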
Lastly, it pointed to retrieval-augmented generation (RAG) and customised data curation for AI agents. RAG and agent workflows depend heavily on the quality of knowledge bases and test prompts. Synthetic generation can produce realistic queries, variations and multi-step interactions to thoroughly validate retrieval and tool-use behaviour, reducing how often sensitive real conversation data must be used as input.
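For RAG testing, the generated queries are typically paired with the document they should retrieve. A minimal sketch, assuming a single hypothetical knowledge-base entry and template-based paraphrases (a production pipeline would usually have an LLM write the variations):

```python
# A single assumed knowledge-base entry (illustrative, not real content).
KB_ENTRY = {
    "id": "kb-101",
    "topic": "warranty claims",
    "answer": "Warranty claims can be filed within 24 months of purchase.",
}

# Paraphrase templates producing different phrasings of the same intent.
TEMPLATES = [
    "How do I file a {topic} request?",
    "What is the deadline for {topic}?",
    "Walk me through the {topic} process step by step.",
]

def synthetic_queries(entry: dict) -> list[dict]:
    """Pair each paraphrased query with the KB entry it should retrieve."""
    return [{"query": template.format(topic=entry["topic"]),
             "expected_doc": entry["id"]}
            for template in TEMPLATES]

for case in synthetic_queries(KB_ENTRY):
    print(case)
```

Each query/expected-document pair can then be run through the retriever to measure hit rate, all without feeding real user conversations into the test harness.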
Sergio Gago, Cloudera's CTO, said, "When systematically managed, synthetic data is a risk-reduction tool that lets you proceed with model development while reducing personal data exposure." He added, "As deployment of LLMs and agent AI expands, synthetic data will become a practical path to reduce reliance on sensitive personal information."
Seung-chul Choi, head of Cloudera Korea, said, "With a series of major data breaches occurring recently, Korean companies face the task of driving AI innovation while strictly complying with data security requirements." He added, "Synthetic data will become a strategic tool that can secure AI competitiveness while minimising data security risks."