Data Augmentation
- Overview
Data augmentation is a technique that creates new training samples by applying transformations to existing data. It increases the size and diversity of a dataset, which can help machine learning (ML) models generalize better.
Data augmentation is useful for addressing challenges like:
- Limited training data: It can be difficult to source large, diverse datasets from the real world.
- Class imbalance: In classification problems, some classes may be underrepresented in the training data. Augmenting samples from those classes can improve the model's ability to classify them.
- Overfitting: Training on varied versions of each sample makes it harder for a model to memorize individual examples, which can reduce overfitting and improve robustness.
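As a rough illustration of the class-imbalance case, the sketch below tops up each minority class with augmented copies until every class is as large as the biggest one. The names `balance_with_augmentation` and `jitter` are hypothetical helpers made up for this example, and small Gaussian jitter on numeric features is only one assumed choice of augmentation. It is not a recommended recipe for any particular dataset.

```python
import random
from collections import Counter


def jitter(x, rng, sigma=0.05):
    """Hypothetical augmentation: add small Gaussian noise to each feature."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def balance_with_augmentation(samples, labels, augment, rng):
    """Append augmented copies of minority-class samples until every
    class has as many samples as the largest class."""
    if not samples:
        return [], []
    # Group the original samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())

    out_samples, out_labels = list(samples), list(labels)
    for y, xs in by_class.items():
        for _ in range(target - len(xs)):
            # Augment a randomly chosen original sample of this class.
            out_samples.append(augment(rng.choice(xs), rng))
            out_labels.append(y)
    return out_samples, out_labels


rng = random.Random(0)
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [1.0, 1.1]]
labels = ["a", "a", "a", "a", "b"]
bal_x, bal_y = balance_with_augmentation(samples, labels, jitter, rng)
counts = Counter(bal_y)
```

In this toy input, class "b" starts with one sample and class "a" with four, so three jittered copies of the "b" sample are appended and both classes end up with four samples.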
Some examples of data augmentation techniques include:
- Random cropping: Randomly cropping images to create new examples with different scales and aspect ratios
- Flipping and rotation: Flipping images horizontally or vertically, or rotating them, to show objects in new orientations
- Text transformations: Randomly replacing words with synonyms, or inserting, deleting, or swapping words in a sentence
- Time stretching: Altering the speed of audio without changing its pitch
- Pitch shifting: Modifying the pitch of audio while maintaining the same speed
- Adding noise: Introducing background noise to simulate real-world environments
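Several of the techniques above can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration and not a production pipeline. It assumes images are 2-D lists of pixel values, sentences are lists of words, and audio is a list of float samples, and all function names are hypothetical. Time stretching and pitch shifting are left out because they need real signal processing that is usually delegated to an audio library.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Return a crop_h x crop_w window taken at a random position."""
    h, w = len(image), len(image[0])
    if crop_h > h or crop_w > w:
        raise ValueError("crop size must not exceed image size")
    top = rng.randint(0, h - crop_h)    # randint is inclusive at both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def random_delete(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, rng):
    """Swap two randomly chosen word positions."""
    words = list(words)
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def add_gaussian_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [s + rng.gauss(0.0, sigma) for s in signal]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]
crop = random_crop(image, 2, 3, rng)
flipped = horizontal_flip(image)
rotated = rotate90(image)

sentence = "the quick brown fox jumps".split()
shorter = random_delete(sentence, 0.2, rng)
swapped = random_swap(sentence, rng)

signal = [0.0, 0.5, 1.0, 0.5, 0.0]
noisy = add_gaussian_noise(signal, 0.01, rng)
```

Passing an explicit `random.Random` instance keeps the transformations reproducible, which makes it easier to debug a training run. In practice, each transformation is usually applied with some probability at load time, so the model sees a different variant of a sample on each pass.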
Data augmentation is different from synthetic data generation, which automatically produces entirely artificial data rather than transforming existing samples.
[More to come ...]