Bootcamp briefing 2023

Motivation

Rare events, high data-acquisition costs and privacy restrictions all lead to data scarcity. Every machine learning tuning and optimisation approach eventually reaches its limits; in the end, the data matters most. AI practitioners now have powerful tools for synthesizing data. This task explores the potential of synthetic data: it is an opportunity to investigate whether creating our own data can extend AI capabilities and improve machine learning model performance.

Task

Explore the data augmentation potential of synthetic data: create a mixed dataset (original data plus synthetic data) and measure its effect on the performance of an existing, working downstream ML pipeline.

Approach

  1. Go to Kaggle and find a competition with a tabular dataset.

  2. Take one of the participants' solutions that already includes a working downstream ML implementation. This will be your baseline.

  3. Take the data from the competition.

  4. Train a CTGAN model on it and generate a synthetic dataset.

  5. Blend the original data with the synthetic copy and use the result to train the solution you found on Kaggle.

  6. Evaluate the results of the retrained solution.
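
Step 5 can be sketched as follows. This is a minimal illustration, not the provided pipeline: it assumes rows are plain dictionaries (with pandas you would work on DataFrames instead), and the function name `blend`, the `synthetic_fraction` parameter and the `is_synthetic` marker are hypothetical names chosen for this example.

```python
import random


def blend(real_rows, synthetic_rows, synthetic_fraction, seed=0):
    """Return all real rows plus a random sample of synthetic rows.

    synthetic_fraction is the number of synthetic rows to add, expressed
    as a fraction of the number of real rows (0.5 adds half as many).
    The is_synthetic key is an assumption of this sketch; it only records
    where each row came from.
    """
    if synthetic_fraction < 0:
        raise ValueError("synthetic_fraction must be non-negative")
    rng = random.Random(seed)  # fixed seed so the blend can be replicated
    n_extra = min(len(synthetic_rows), round(len(real_rows) * synthetic_fraction))
    sampled = rng.sample(synthetic_rows, n_extra)
    # Copy each row so the input lists are left untouched.
    blended = [dict(row, is_synthetic=False) for row in real_rows]
    blended += [dict(row, is_synthetic=True) for row in sampled]
    rng.shuffle(blended)
    return blended
```

Because the baseline solution must stay unchanged, the provenance marker would need to be dropped before the blended data is passed to it. Fixing the seed keeps each blend reproducible across runs.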

Limitations

  1. You cannot change the hyperparameters of the original solution.

  2. You can change the hyperparameters of the CTGAN model.

  3. The only thing you may change in the Kaggle solution is the dataset, which must be replaced with the “new dataset”. Here “new dataset” means a mix of real training data and synthetic data.

  4. The tricky part is blending the real data with the synthetic copy smartly, so that the downstream pipeline produces better predictions.

    Some hints for the experiment:

    • Find under-represented observations in the real data and add synthetic copies so that the downstream machine learning model can generalize better.

    • Train the CTGAN model with different hyperparameters.

    • Run several “data blending” approaches in parallel under different conditions.
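
The first hint could look like the sketch below, which tops up rare classes with synthetic rows until each class reaches a minimum count. It is one possible blending strategy under stated assumptions, not a prescribed method: rows are plain dictionaries, and `synthetic_rows_needed`, `top_up`, `label_key` and `min_count` are hypothetical names introduced for this example.

```python
from collections import Counter


def synthetic_rows_needed(labels, min_count):
    """Map each under-represented label to the number of rows it is missing."""
    counts = Counter(labels)
    return {label: min_count - n for label, n in counts.items() if n < min_count}


def top_up(real_rows, synthetic_rows, label_key, min_count):
    """Add synthetic rows only for labels that have fewer than min_count real rows."""
    needed = synthetic_rows_needed([row[label_key] for row in real_rows], min_count)
    added = []
    for row in synthetic_rows:
        label = row[label_key]
        if needed.get(label, 0) > 0:
            added.append(row)
            needed[label] -= 1
    # If the synthetic pool runs out, some labels may stay below min_count.
    return real_rows + added
```

Varying `min_count` gives a simple family of blending conditions to compare against the baseline.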

Expected outcome

Methods of mixing original data with synthetic data that:

  1. Have a negative impact on the downstream ML pipeline performance, with examples that can be replicated

  2. Have a positive impact on the downstream ML pipeline performance, with examples that can be replicated
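
To sort experiments into these two groups, each blended run can be compared with the baseline score. The helper below is only a sketch: `classify_impact`, `higher_is_better` and `tolerance` are hypothetical names, and the right tolerance depends on how noisy the competition metric is.

```python
def classify_impact(baseline_score, blended_score, higher_is_better=True, tolerance=0.0):
    """Label a blended run as positive, negative or neutral versus the baseline.

    Set higher_is_better=False for error metrics such as RMSE or log loss.
    Differences within +/- tolerance are reported as neutral.
    """
    delta = blended_score - baseline_score
    if not higher_is_better:
        delta = -delta
    if delta > tolerance:
        return "positive"
    if delta < -tolerance:
        return "negative"
    return "neutral"
```

Recording the random seed, the CTGAN hyperparameters and the blending settings alongside each label is what makes an example replicable.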

What is given

  • An example CTGAN pipeline for a tabular dataset that can be reused to generate a synthetic copy.

  • An example of an MLOps tool (WandB) for tracking the results of your experiments.

Contact

The main contact persons for the project are Max Fediushkin, Anna Chechulina and Mike Shubov from AITAU.


About AITAU

AITAU is a start-up working with artificial intelligence (AI), focusing on synthetic data. Synthetic data is a statistical twin of real data, mimicking its patterns while maintaining privacy and anonymity. Given the growing need for diverse and accurate data, we offer solutions to create, evaluate, and share synthetic data to drive the evolution of data-driven decisions and accelerate AI innovation worldwide. The ultimate goal is to unlock the full potential of synthetic data in data-driven industries.

More at www.aitau.org