Weak Supervision vs Semi-Supervised ML Methods
- Overview
Semi-supervised learning (SSL) is a machine learning (ML) method that combines supervised and unsupervised learning. It uses a small amount of labeled data and a large amount of unlabeled data to train a model.
SSL uses labeled data to ground predictions and unlabeled data to learn the shape of the larger data distribution. It can help practitioners achieve strong results with a fraction of the labeled data that full supervision would require, which saves time and money.
SSL provides the benefits of both unsupervised and supervised learning while avoiding the challenges of finding a large amount of labeled data.
Some limitations of SSL include:
- Lower accuracy: On some tasks, especially when plenty of labeled data is available, SSL may be less accurate than fully supervised learning.
- Limited prediction power from unlabeled data alone: Unsupervised techniques such as clustering or anomaly detection are exploratory and not designed for prediction, so SSL still depends on its labeled data to ground predictions.
Semi-supervised learning is a way of training ML models when you only have a small amount of labeled data. By training the model on just the labeled subset and then using its most confident predictions to label the rest, you can avoid having a human label everything.
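As a rough illustration of this idea, the sketch below trains a classifier on a small labeled subset, keeps only its most confident predictions on the unlabeled points as pseudo-labels, and retrains on the combined set. The toy dataset, the logistic-regression model, and the 0.95 confidence threshold are assumptions made for the example, not part of any particular published method.

  # Minimal sketch of "train on the labeled subset, then label the rest".
  # The dataset, model, and confidence threshold are illustrative assumptions.
  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression

  # Toy data: pretend only 50 of 1,000 points are labeled.
  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  rng = np.random.default_rng(0)
  labeled_idx = rng.choice(len(X), size=50, replace=False)
  unlabeled_mask = np.ones(len(X), dtype=bool)
  unlabeled_mask[labeled_idx] = False

  # 1) Train on the small labeled subset.
  model = LogisticRegression(max_iter=1000)
  model.fit(X[labeled_idx], y[labeled_idx])

  # 2) Predict labels for the unlabeled points; keep only confident ones.
  probs = model.predict_proba(X[unlabeled_mask])
  confident = probs.max(axis=1) > 0.95
  pseudo_labels = model.classes_[probs.argmax(axis=1)[confident]]

  # 3) Retrain on the labeled data plus the confidently pseudo-labeled data.
  X_aug = np.vstack([X[labeled_idx], X[unlabeled_mask][confident]])
  y_aug = np.concatenate([y[labeled_idx], pseudo_labels])
  model.fit(X_aug, y_aug)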
Please refer to Wikipedia: Weak Supervision for more details.
- Why Semi-Supervised Learning?
Data deluge and data drought: as ML practitioners, we are often drowning in data we cannot use and despairing over data that does not exist.
On the one hand, supervised learning is the foundation of most ML techniques, but it is powered by labeled data, which is tedious and expensive to produce. On the other hand, unsupervised learning uses unlabeled data, which is often plentiful because it requires no human annotation.
When used alone, either strategy is often impractical for training a model to a deployment-ready baseline: labeling an entire dataset is time-consuming and expensive, and unlabeled data alone may not deliver the required accuracy.
Semi-supervised learning (SSL) is a broad ML technique that uses both labeled and unlabeled data; as the name suggests, it is a hybrid of supervised and unsupervised learning.
The following is a summary of SSL:
- SSL is a broad category of ML that uses labeled data as the basis for prediction and unlabeled data to learn the shape of a larger data distribution.
- Practitioners can achieve great results with a small fraction of labeled data, saving valuable time and money.
- The intuition of popular SSL techniques is based on continuity, cluster, and manifold assumptions.
- Consistency regularization forces the model to make similar predictions for a data point and slightly perturbed versions of it (see the sketch after this list).
- Converting confident model predictions on unlabeled data into one-hot pseudo-labels has helped achieve state-of-the-art results.
- More recently, holistic approaches and techniques involving other areas of machine learning such as self-supervision have become popular and successful.
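To make the consistency-regularization point above more concrete, the sketch below combines an ordinary supervised loss on a labeled batch with a penalty whenever the model's predictions on an unlabeled point and a slightly perturbed copy of it disagree. The tiny network, the Gaussian noise scale, and the loss weighting are assumptions made for the example, not a specific published method.

  # Sketch of consistency regularization: supervised loss on labeled data plus
  # a penalty when predictions change under a small input perturbation.
  # The network, noise scale, and loss weight are illustrative assumptions.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
  optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

  x_labeled = torch.randn(32, 20)           # small labeled batch (toy data)
  y_labeled = torch.randint(0, 2, (32,))
  x_unlabeled = torch.randn(256, 20)        # larger unlabeled batch (toy data)

  for step in range(100):
      # Supervised term: ordinary cross-entropy on the labeled batch.
      sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

      # Consistency term: predictions on an unlabeled point and a noisy copy
      # of it should agree.
      noisy = x_unlabeled + 0.1 * torch.randn_like(x_unlabeled)
      p_clean = F.softmax(model(x_unlabeled), dim=1)
      p_noisy = F.softmax(model(noisy), dim=1)
      cons_loss = F.mse_loss(p_noisy, p_clean.detach())

      loss = sup_loss + 1.0 * cons_loss     # 1.0 is an assumed weighting factor
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()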
Often the hardest part of an ML problem is building the dataset: labeled data is expensive, while unlabeled data is cheap. By combining both types of data, you can expose your model to much more of the target sample space, which can yield very high accuracy from a small amount of labeled data.
- SSL Methods
Semi-supervised learning (SSL) methods include:
- Self-training: A supervised classification or regression model can be adapted to the semi-supervised setting. The model is first trained on the labeled data, then used to generate pseudo-labels for the unlabeled data; its most confident predictions are added to the training set and the model is retrained (see the sketch after this list).
- Inductive learning: A model is trained on labeled data, and unlabeled data is used to improve its accuracy. Inductive approaches consider all data points and try to learn a single function that maps any data point, labeled or unlabeled, to its label.
- Text classification: A small amount of labeled data and a large amount of unlabeled text data can be used to train a text classification model.
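For the self-training entry above, scikit-learn ships a ready-made wrapper, sklearn.semi_supervised.SelfTrainingClassifier, which treats samples labeled -1 as unlabeled and iteratively adds confident pseudo-labels. The toy dataset, the SVC base model, and the 0.9 threshold below are assumptions made for the example.

  # Self-training with scikit-learn's built-in wrapper: samples marked -1 are
  # treated as unlabeled, and confident pseudo-labels are added iteratively.
  # The dataset, base model, and threshold are illustrative assumptions.
  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.semi_supervised import SelfTrainingClassifier
  from sklearn.svm import SVC

  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  rng = np.random.default_rng(0)
  y_train = y.copy()
  y_train[rng.random(len(y)) > 0.05] = -1   # hide ~95% of the labels

  base = SVC(probability=True, gamma="auto")
  self_training = SelfTrainingClassifier(base, threshold=0.9)
  self_training.fit(X, y_train)

  pred = self_training.predict(X)
  print("accuracy vs. true labels:", (pred == y).mean())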
Other semi-supervised techniques include:
- Pseudo-labeling
- Co-training
- Multi-view learning
- SSL using graph models
- Graph-based label propagation (see the sketch below)
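As one concrete instance of graph-based label propagation, scikit-learn's LabelSpreading builds a similarity graph over all points and diffuses the few known labels along its edges (again, -1 marks unlabeled points). The two-moons toy data and the kNN kernel settings below are assumptions made for the example.

  # Graph-based label propagation with scikit-learn's LabelSpreading:
  # a similarity graph is built over all points and the known labels are
  # spread along its edges. Toy data and kernel settings are assumptions.
  import numpy as np
  from sklearn.datasets import make_moons
  from sklearn.semi_supervised import LabelSpreading

  X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
  rng = np.random.default_rng(0)
  y_train = y.copy()
  y_train[rng.random(len(y)) > 0.1] = -1    # keep labels for only ~10% of points

  label_spread = LabelSpreading(kernel="knn", n_neighbors=7)
  label_spread.fit(X, y_train)

  # transduction_ holds the labels inferred for every point in X.
  print("accuracy vs. true labels:", (label_spread.transduction_ == y).mean())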
SSL is used in many industries, including fintech, education, and entertainment. Examples of semi-supervised learning include:
[More to come ...]