October 20, 2024
Abstract
Purpose: Reproducibility is a core principle of science, and access to a study’s data is essential to reproduce its findings. However, data sharing is uncommon in the field of Communication Sciences and Disorders (CSD), often due to concerns related to privacy and disclosure risks. Synthetic data offers a potential solution to this barrier by generating artificial datasets that do not represent real individuals yet retain statistical properties and relationships from the original data. This study evaluates the performance of synthetic data generation using open data from previously published studies across the American Speech-Language-Hearing Association (ASHA) ‘Big Nine’ domains.
Method: Open datasets were obtained from previously published research within the ASHA domains of articulation, cognition, communication, fluency, hearing, language, social communication, voice and resonance, and swallowing. Synthetic datasets were generated with the synthpop R package. Inferential statistics (p-values) and effect sizes from synthetic datasets were compared to those from the original datasets.
Results: Synthetic datasets maintained the direction of p-values in six of the nine studies and effect size categorizations in five of the studies. In cases where synthetic datasets did not maintain 95% of the inferential or effect size results, the absolute mean difference between synthetic and original datasets was relatively low, suggesting that the distribution of results from synthetic datasets closely approximated the alpha or effect size categorization threshold.
Conclusion: Findings suggest that synthetic data can effectively maintain statistical properties and relationships across a wide range of data commonly seen in the field of CSD. While some studies with fewer observations than recommended (i.e., n < 130) showed lower agreement and greater variability in p-values and effect size estimates, this pattern was not consistently observed. Therefore, researchers who use synthetic data should assess its stability in preserving their results. This study concludes with a general framework on sharing open data to facilitate computational reproducibility and foster a cumulative science in the field of CSD.
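The comparison step described in the Method can be illustrated with a minimal sketch. This is not the authors' code: the study generated its synthetic data with the synthpop R package, whereas the Python below only shows one plausible way to score agreement once p-values and effect sizes are in hand. The function names, the alpha of 0.05, the use of Cohen's conventional cutoffs (0.2, 0.5, 0.8), and the toy numbers are all assumptions made for illustration and are not taken from the study.

```python
ALPHA = 0.05  # assumed significance threshold, not stated in the abstract


def effect_size_category(d):
    """Label |d| with Cohen's conventional cutoffs (an assumption here)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


def agreement(original_p, original_d, synthetic_results, alpha=ALPHA):
    """Compare one original result with results from many synthetic datasets.

    synthetic_results is a list of (p_value, effect_size) pairs, one pair
    per synthetic dataset.
    """
    n = len(synthetic_results)
    # Does each synthetic p-value fall on the same side of alpha?
    same_decision = sum(
        (p < alpha) == (original_p < alpha) for p, _ in synthetic_results
    )
    # Does each synthetic effect size land in the same category?
    same_category = sum(
        effect_size_category(d) == effect_size_category(original_d)
        for _, d in synthetic_results
    )
    return {
        "p_agreement": same_decision / n,
        "category_agreement": same_category / n,
        "mean_abs_p_diff": sum(abs(p - original_p) for p, _ in synthetic_results) / n,
        "mean_abs_d_diff": sum(abs(d - original_d) for _, d in synthetic_results) / n,
    }


# Toy numbers invented for illustration only; they are not study results.
toy_synthetic = [(0.01, 0.60), (0.04, 0.52), (0.07, 0.45), (0.02, 0.70)]
result = agreement(original_p=0.03, original_d=0.55, synthetic_results=toy_synthetic)
print(result)
```

Under these assumptions, a result could be treated as maintained when the agreement proportion reaches a chosen cutoff such as the 95% figure mentioned in the Results, while the mean absolute differences show how far synthetic estimates drift from the original even when they cross a threshold.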
Read on OSF Preprints