Synthetic Data and Respondents: The Devil is in the Details
Synthetic data and respondents are often associated with AI and advanced modeling, and they offer innovative solutions to real challenges in survey research. However, as Joel Anderson and Kevin Karty recently discussed on LinkedIn, not all synthetic data is created equal, and some approaches fail to deliver on the promise of providing new insights. This article examines one flawed approach to synthetic data generation, duplicating or resampling existing data, and explains why it falls short of the rigorous standards synthetic data should meet.
It is important to note that this article critiques one specific method; the critique should not detract from the excellent work being done by AI experts to develop innovative, effective, and reliable synthetic data solutions. Addressing these challenges only enhances the value and potential of synthetic data in advancing research.
What is the Problem?
Imagine you want to use synthetic respondents to bolster a survey dataset. Maybe you are dealing with a low response rate or a particularly hard-to-reach population. One approach some have used is taking existing data, duplicating it randomly, and calling it synthetic data or respondents. Voilà! The results appear statistically valid: the numbers align with the original dataset, and trends remain consistent.
But here is the problem: duplicating or resampling data does not create any new information. It is like copying a recipe word for word and claiming you have invented a new dish. The results may look convincing, but they fail to offer any new understanding about the population under study.
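To make this concrete, here is a minimal sketch in Python of what duplication-based "synthetic" respondents look like, using a hypothetical two-question survey with made-up numbers. The headline statistics match the original almost perfectly, yet every boosted record is a literal copy of an existing one:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical survey: 500 real respondents, two 1-5 rating questions.
real = pd.DataFrame({
    "q1": rng.integers(1, 6, size=500),
    "q2": rng.integers(1, 6, size=500),
})

# "Synthetic" respondents created by randomly duplicating existing rows.
boost = real.sample(n=500, replace=True, random_state=42)
combined = pd.concat([real, boost], ignore_index=True)

# Headline statistics barely move...
print(real.mean().round(2), combined.mean().round(2), sep="\n")

# ...because every boosted record is a literal copy of a real one:
# no new response patterns, and therefore no new information.
real_rows = set(map(tuple, real.values))
print(all(tuple(row) in real_rows for row in boost.values))  # True
```

The numbers "align with the original dataset" precisely because they are the original dataset, counted twice.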
Worse still, poorly implemented synthetic boosts can actually reduce the effective sample size of a dataset. Introducing synthetic respondents without proper rigor adds noise rather than value, degrading statistical power and diminishing the quality of the research. In practical terms, you spend time and money to make your data less reliable.
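One standard way to quantify this loss is Kish's effective sample size, n_eff = (Σw)² / Σw², where w_i is the weight a real respondent carries in the analysis (here, the number of times their record appears in the boosted file). A minimal sketch, assuming a hypothetical 500-person survey "boosted" to a nominal 1,000 by random duplication:

```python
import numpy as np

rng = np.random.default_rng(7)

n_real = 500
# Draw 500 duplicates at random to "boost" the file to a nominal 1,000.
dup_draws = rng.integers(0, n_real, size=500)

# w[i] = how many times respondent i appears in the boosted dataset.
w = np.ones(n_real)
np.add.at(w, dup_draws, 1)

nominal_n = int(w.sum())                     # 1,000 rows on paper
kish_n_eff = w.sum() ** 2 / (w ** 2).sum()   # Kish: (sum w)^2 / sum w^2

print(nominal_n, round(kish_n_eff))
# The file claims 1,000 respondents, but the effective sample size comes
# out around 400: below even the original 500, because uneven duplication
# adds noise without adding any independent observations.
```

The boosted file looks twice as large, yet an analysis that treats each row as independent has less precision-equivalent information than the unboosted survey it started from.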
Inserting “Missing” Data at Scale
In survey research, replacing missing data with the mean, median, or some other calculated value is a common practice to address small gaps in a dataset. This approach allows researchers to preserve respondents’ data without discarding them from the dataset. Researchers typically use this approach sparingly, perhaps only for one or two questions when a respondent has answered the vast majority of the survey. When used sparingly, this approach has minimal impact because it does not drastically change the overall data distribution or the validity of the data.
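For context, here is a minimal sketch of that conventional, sparing use of imputation, assuming a small pandas DataFrame of numeric ratings (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical responses; NaN marks a skipped question.
df = pd.DataFrame({
    "q1": [4, 5, np.nan, 3, 4],
    "q2": [2, np.nan, 3, 3, 2],
    "q3": [5, 4, 4, np.nan, 5],
})

# Impute only for respondents who answered most of the survey
# (here, at least 2 of 3 questions), filling each gap with the
# column mean so the respondent can be kept in the dataset.
mostly_complete = df.notna().sum(axis=1) >= 2
df.loc[mostly_complete] = df.loc[mostly_complete].fillna(df.mean())

print(df)
```

Used at this scale, the handful of imputed values barely moves the overall distributions.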
However, when this kind of substitution is applied at scale, as happens when existing data is randomly duplicated to create "synthetic" respondents, a number of problems arise:
- The biases or limitations of the original dataset are magnified. For example, if people aged 65+ are overrepresented in the sample, synthetic respondents created by duplication will carry that overrepresentation forward (see the sketch after this list).
- The data may look statistically valid but are, in fact, artificial and misleading.
- The dataset no longer reflects real-world variability; it is distorted in ways that undermine its validity and utility.
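A toy sketch of the first point, using hypothetical numbers: suppose 40% of a 500-person sample is aged 65+, versus roughly 17% of the target population. Duplication simply reproduces the skew:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical sample in which 40% of respondents are 65+,
# versus roughly 17% in the target population.
real = pd.DataFrame({
    "age_group": rng.choice(["65+", "under 65"], size=500, p=[0.40, 0.60])
})

# "Boost" the sample by randomly duplicating existing respondents.
boosted = pd.concat(
    [real, real.sample(n=500, replace=True, random_state=0)],
    ignore_index=True,
)

print(real["age_group"].value_counts(normalize=True))
print(boosted["age_group"].value_counts(normalize=True))
# Both show about 40% aged 65+: duplication locks in the original
# skew instead of moving the sample toward the real population.
```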
What Should We Expect of Synthetic Data or Respondents?
True synthetic data or respondents should:
- Introduce new insights, especially for hard-to-reach populations.
- Be generated in a way that maintains statistical rigor while avoiding biases from over-represented or duplicated information.
- Add value beyond what the existing data can provide, rather than simply copying and pasting it.
To achieve this, it will be necessary to:
- Use advanced models to simulate responses that align with likely opinions, behaviors, preferences, or circumstances (a toy sketch follows this list).
- Counteract biases when simulating data or respondents.
- Ensure diversity in the dataset by introducing the natural variability found in the population.
- Ensure that synthetic data is statistically defensible and adds meaningful nuance to the dataset.
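By contrast with duplication, here is a deliberately simple, toy illustration of model-based generation: fit a distribution to the real responses and sample new, non-identical records from it. Production systems use far more sophisticated models; a multivariate normal on made-up numbers is just the smallest example that shows the idea:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for two correlated numeric survey measures, 500 real respondents.
real = rng.multivariate_normal([3.5, 2.8], [[1.0, 0.6], [0.6, 1.0]], size=500)

# Fit a simple parametric model to the real data...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...then sample genuinely new records that preserve the joint structure
# while introducing natural variability, rather than copying rows.
synthetic = rng.multivariate_normal(mu, cov, size=500)

# No synthetic row is an exact copy of a real row.
print(bool((synthetic[:, None] == real[None, :]).all(axis=2).any()))  # False
```

The point is not that a normal distribution is the right model for survey data; it is that a model, unlike a photocopier, can produce records that respect the structure of the real data without repeating it.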
Researchers and developers must validate synthetic data generation methods rigorously to ensure they increase, rather than decrease, the statistical power and representativeness of the dataset.
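As one small example of what such validation can look like, the sketch below compares the marginal distribution of a single question in the real and synthetic data with a two-sample Kolmogorov-Smirnov test from SciPy. A real validation suite would go much further, checking joint distributions, subgroup representation, and effective sample size; the data here are simulated stand-ins:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)

# Stand-ins for one numeric question in the real and synthetic datasets.
real_col = rng.normal(3.5, 1.0, size=500)
synth_col = rng.normal(3.5, 1.0, size=500)

# Do the marginal distributions agree? A large KS statistic (small p)
# would flag a synthetic column that fails to track the real one.
stat, p_value = ks_2samp(real_col, synth_col)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")
```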
The Devil is in the Details
It is important to note that the extent to which duplication or resampling is actually used in practice is currently unknown. This concern should not overshadow the excellent and groundbreaking work being done by AI experts in the field of synthetic data and respondents. Their innovations are driving meaningful progress, and addressing these challenges only strengthens the credibility and value of their contributions.
In Conclusion
While imputing missing values can be useful in small doses, scaling this approach to fill datasets with synthetic respondents based on duplication or resampling introduces significant problems. True synthetic data should enhance a dataset by introducing new, meaningful variability and insight, not inflate the numbers with recycled content. Synthetic boosts hold real promise, but they must be tested and validated thoroughly. Unvalidated synthetic data methods risk not only failing to provide new insights but also undermining the integrity of the research.
Kirsty Nunez is the President and Chief Research Strategist at Q2 Insights, a research and innovation consulting firm with international reach and offices in San Diego. Q2 Insights specializes in many areas of research and predictive analytics, and actively uses AI products to enhance the speed and quality of insights delivery while still leveraging human researcher expertise and experience.