FEATURE

Rosemary J Thomas, Senior Technical Consultant, AI Labs, Version 1

Data quality, diversity and volume

One of the most urgent challenges in AI development today is ensuring that models are not just accurate but fair, explainable and robust. That requires data that is representative across a wide range of demographics, scenarios and environments. Synthetic data meets this demand because it can be created at scale, tailored to specific use cases, generated free from personal identifiers and used to model rare or nonexistent real-world scenarios.

It can also fill gaps for underrepresented groups and rare scenarios in datasets. For example, in healthcare AI, if an algorithm is designed to detect skin cancer from imagery but most training images are from lighter-skinned patients, the model may underperform for darker skin tones. Synthetic data generation can produce realistic images that simulate variations in skin tone, lighting and lesion presentation, improving fairness.
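As a rough illustration of filling a gap for an underrepresented group, the sketch below creates extra rows for a minority class by interpolating between pairs of real minority rows, in the spirit of SMOTE-style oversampling. This is a swapped-in, much simpler technique than the image generation described above, and the function name `interpolate_minority`, the four-feature rows and the 900/100 class split are all hypothetical.

```python
import random

def interpolate_minority(rows, n_new, seed=0):
    """Return n_new synthetic rows, each a random blend of two real minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.choice(rows), rng.choice(rows)
        t = rng.random()  # mixing weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical feature vectors: 900 majority rows, 100 minority rows.
rng = random.Random(42)
majority = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(900)]
minority = [[rng.gauss(2.0, 1.0) for _ in range(4)] for _ in range(100)]

# Generate enough synthetic minority rows to match the majority count.
synthetic = interpolate_minority(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + synthetic
print(len(majority), len(balanced_minority))
```

Because every synthetic row lies between two real rows, this approach cannot invent variation the real minority data lacks. That is one reason generative models are used for richer data such as images.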

Where real data is authentic but limited by privacy, availability and bias, synthetic data is flexible, scalable and targeted to specific scenarios. It excels when modelling rare safety incidents in manufacturing, testing financial systems against emerging fraud schemes, preparing AI for climate disaster scenarios or building multilingual datasets quickly. Recent failures, such as Google's Gemini generating historically inaccurate images, underscore the risks of not achieving balance in data diversity.

Volume is also critical. Large language models require trillions of tokens, yet real-world data at this magnitude is expensive and restricted. Synthetic data can expand datasets without breaching compliance rules.

Simulations deliver proven solutions

At the heart of synthetic data generation are simulations. These digital environments mimic real-world dynamics, creating controlled scenarios from which synthetic data can be drawn. They are particularly valuable in regulated sectors like healthcare and financial services, where real data is both scarce and sensitive.

Simulations are already being applied in industries such as automotive, healthcare, finance and retail. In automotive, self-driving car simulations replicate rare road conditions. In healthcare, synthetic patient records support safe research. In finance, synthetic fraud data trains detection systems. In retail, predictive modelling of shopping behaviour can be tested at scale. Synthetic datasets can be 100% free of personal identifiers, ensuring regulatory readiness.
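To make the finance example concrete, here is a minimal sketch of a simulation that emits synthetic transactions with a fraud label and no personal identifiers. The field names, the 2% fraud rate and the rule that fraud skews toward larger amounts and night-time hours are illustrative assumptions, not figures from any real system.

```python
import random

def simulate_transactions(n, fraud_rate=0.02, seed=0):
    """Simulate n transactions; every field is generated, none is copied from a person."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        is_fraud = rng.random() < fraud_rate
        # Assumption: fraudulent amounts are drawn from a higher log-normal distribution.
        amount = rng.lognormvariate(5.0 if is_fraud else 3.5, 0.8)
        # Assumption: fraud happens between 00:00 and 05:59; normal activity at any hour.
        hour = rng.randrange(6) if is_fraud else rng.randrange(24)
        rows.append({
            "transaction_id": f"txn-{i:06d}",  # sequential ID, not tied to a customer
            "amount": round(amount, 2),
            "hour": hour,
            "is_fraud": is_fraud,
        })
    return rows

transactions = simulate_transactions(10_000, fraud_rate=0.02)
fraud_count = sum(r["is_fraud"] for r in transactions)
print(f"{fraud_count} fraudulent rows out of {len(transactions)}")
```

Raising `fraud_rate` is how a simulation like this can supply rare events in whatever quantity a detection model needs, although the model will only learn the patterns the simulation was told to produce.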

Advanced techniques like GANs and VAEs push further. GANs, through competitive training between generator and discriminator models, can produce highly realistic synthetic data. VAEs, meanwhile, are more stable and interpretable, making them particularly valuable when explainability is a priority. In some cases, MIT studies have shown that models trained on high-quality synthetic data outperform those trained solely on real-world data. The best-performing models combine both real data for grounding and synthetic data for diversity, scale and rare events.
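For readers who want the mathematics behind these two techniques, the standard textbook training objectives are given below. The article itself does not state them, so the notation is an assumption: G is the generator, D the discriminator, z a latent noise vector, and q and p the VAE's encoder and decoder distributions.

```latex
% GAN: the discriminator D maximises this value while the generator G minimises it.
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% VAE: maximise the evidence lower bound (ELBO) for each data point x.
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The GAN objective is a two-player game with no single loss to track, which is one common explanation for its training instability. The VAE optimises one bound with an explicit latent distribution, which fits the stability and interpretability noted above.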

Responsible innovation

Synthetic data supports not only more powerful AI but also more responsible AI. It removes personally identifiable information, making it


Intelligent CXO Issue 54 | Page 26
