Synthetic data defined
Synthetic data is artificially generated information that can stand in for real historical data to train AI models when actual data sets lack quality, volume, or variety. It can also be a vital tool for enterprise AI efforts when available data doesn’t meet business needs, or when using that data to train machine learning models or test software could create privacy issues.
According to Gartner analyst Svetlana Sicular, by 2024, 60% of the data used for the development of AI and analytics solutions will be synthetically generated, up from 1% in 2021.
Synthetic data use cases
Artificial data has many uses in enterprise AI strategies. As a stand-in for real data, synthetic data can be helpful in the following scenarios:
For training models when real-world data is lacking: AI and ML systems require massive amounts of data. For some use cases, there simply isn’t enough available, either because the underlying events occur very infrequently or because the use case is new and little historical data exists yet. Synthetic data can also lower costs when collecting or buying real-world data is prohibitively expensive.
To fill gaps in training data: Some data sets don’t fully reflect a company’s use cases. For example, the training data for a system that recognizes phone numbers may not include enough international numbers.
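As a rough illustration of the phone-number example, the Python sketch below generates synthetic numbers only for the countries that are underrepresented in an existing data set. The country table, the digit lengths, and the function names (`synthetic_phone_number`, `fill_gaps`) are hypothetical assumptions made for this illustration; real numbering plans are more varied, and a production generator would need to follow them.

```python
import random

# Hypothetical formats: (country calling code, national number length).
# Illustrative assumptions only, not an authoritative numbering plan.
FORMATS = {
    "US": ("1", 10),
    "GB": ("44", 10),
    "DE": ("49", 10),
    "IN": ("91", 10),
}


def synthetic_phone_number(country, rng):
    """Build one random number in +<code><digits> form for the given country."""
    code, length = FORMATS[country]
    first = str(rng.randint(1, 9))  # avoid a leading zero in the national part
    rest = "".join(str(rng.randint(0, 9)) for _ in range(length - 1))
    return f"+{code}{first}{rest}"


def fill_gaps(real_counts, target_per_country, seed=0):
    """Return (country, number) pairs that top each country up to the target."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    synthetic = []
    for country in FORMATS:
        missing = max(0, target_per_country - real_counts.get(country, 0))
        for _ in range(missing):
            synthetic.append((country, synthetic_phone_number(country, rng)))
    return synthetic


# Example: plenty of US numbers already, very few from anywhere else.
extra = fill_gaps({"US": 100, "GB": 3}, target_per_country=5)
print(len(extra), "synthetic numbers generated")
```

In this sketch no US numbers are generated because the real data already exceeds the target, while the other countries are topped up to the same level.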