Synthetic Data Generation for Cancer Prognosis Prediction
Project information
- Category: Synthetic Data - GANs & CNNs
- Done for: Dissertation Project
- Project date: 2024
- Project URL: https://github.com/Daniel2tio/SyntheticPrognosis
(Showcase of data used not possible due to data privacy agreement)
Limited data and patient privacy are key problems in medical imaging. This project explores the effectiveness of a custom Generative Adversarial Network (GAN) for augmenting limited datasets to improve accuracy in cancer type classification. Using a state-of-the-art GAN architecture (StyleGAN2-ADA), we synthesized samples of diverse cancer types to expand the training set, addressing the data scarcity common in medical imaging datasets. A Convolutional Neural Network (CNN) based on ResNet was then trained to classify cancer types using three variations of the training data: original data alone, synthetic data alone, and the two combined. Our results demonstrate that augmenting limited data with GAN-generated samples substantially improves classification accuracy, highlighting the effectiveness of this approach.
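The three training configurations described above (original only, synthetic only, and combined) can be sketched as follows. This is a minimal illustration and not the project's actual code: the function name `build_training_sets`, the `synthetic_ratio` parameter, and the file names are hypothetical, and samples are assumed to be (path, label) pairs.

```python
import random
from typing import Dict, List, Tuple

# Assumption: each sample is a (file path, class label) pair.
Sample = Tuple[str, int]


def build_training_sets(
    real: List[Sample],
    synthetic: List[Sample],
    synthetic_ratio: float = 1.0,
    seed: int = 0,
) -> Dict[str, List[Sample]]:
    """Return the three training-set variants compared in the text.

    synthetic_ratio (hypothetical parameter) caps how many synthetic
    samples are mixed into the combined set, relative to the real set.
    """
    rng = random.Random(seed)
    # Never request more synthetic samples than are available.
    n_synth = min(len(synthetic), int(round(len(real) * synthetic_ratio)))
    synth_subset = rng.sample(synthetic, n_synth)

    combined = list(real) + synth_subset
    rng.shuffle(combined)  # interleave real and synthetic samples

    return {
        "real_only": list(real),
        "synthetic_only": list(synthetic),
        "combined": combined,
    }


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    real = [(f"real_{i}.png", i % 2) for i in range(4)]
    synth = [(f"synth_{i}.png", i % 2) for i in range(10)]
    for name, samples in build_training_sets(real, synth, 0.5).items():
        print(name, len(samples))
```

In a comparison of this kind, the evaluation set would normally contain only original samples, so that the reported accuracy reflects performance on authentic data.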
The synthetic data generation strategy not only improved predictive modeling while upholding both statistical accuracy and clinical relevance, but also mitigated privacy concerns. The findings demonstrate the effectiveness of the augmentation technique in generating medical data and significantly improving the quality of results, particularly in scenarios where training data is limited. It is important to note that while augmentation proves valuable in such contexts, it should not serve as a replacement for authentic data. Collecting extensive, high-quality training datasets remains the priority, with augmentation serving as a supplementary strategy to address data deficiencies. As potential future work, it would be beneficial to investigate the optimal combination of augmentation techniques, including recent advancements such as the U-Net discriminator or a multi-modal generator, to further mitigate the challenges posed by limited data availability. Synthetic data has the potential to revolutionize data-driven problems across various domains, from healthcare to finance and beyond. Organizations can unlock valuable insights while safeguarding sensitive information, thereby fostering innovation and progress in a privacy-conscious era.