Data Labeling in ML
- Overview
Data labeling, also known as data annotation, is the process of identifying raw data and adding labels to provide context for machine learning (ML) models. The labels give the data meaning and make it identifiable within a specific context. For example, labels might indicate whether a photo contains a bird or car, which words were uttered in an audio recording, or if an x-ray contains a tumor.
Data labeling is a fundamental requirement for supervised ML applications. It is essential to applications such as computer vision, natural language processing (NLP), and speech recognition.
Data labeling can generally refer to tasks that include data tagging, annotation, classification, moderation, transcription, or processing. For example, an annotator might mark that an image contains a person or a tree, or transcribe the words spoken in an audio recording.
When choosing a labeling approach, data science practitioners suggest considering factors such as setup complexity, labeling speed, and accuracy.
- Data Labeling Approaches
Data labeling can be expensive and time-consuming, and it is also prone to human error, which can reduce the quality of the data.
Labeled data is fundamental because it forms the basis for supervised learning, a popular approach to training more accurate and effective ML models. For example, if images have been labeled with the words "person" or "tree", the ML model learns by example and can then spot a person or a tree in an unlabeled photo.
Some common data labeling approaches include:
- Internal manual labeling: This approach involves manually examining each data point and using subject-matter expertise to label it.
- Bounding boxes: For many applications, bounding boxes provide sufficient accuracy for a machine learning model with minimal effort.
- Polygons: Some applications require the increased accuracy of polygons, at the expense of slower and more costly annotation.
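The trade-off between bounding boxes and polygons can be seen in how each annotation is stored. The records below are a minimal sketch in an illustrative (non-standard) format; the file name, field names, and coordinates are all invented for the example.

```python
# Illustrative annotation records for one hypothetical image. A bounding box
# needs only two corner points, while a polygon traces the object's outline
# with many vertices (tighter fit, more annotator effort).
bbox_annotation = {
    "image": "street_001.jpg",   # hypothetical file name
    "label": "car",
    "bbox": [34, 50, 180, 122],  # [x_min, y_min, x_max, y_max] in pixels
}

polygon_annotation = {
    "image": "street_001.jpg",
    "label": "car",
    # One (x, y) vertex per point; more vertices mean a tighter outline.
    "polygon": [(34, 80), (60, 52), (150, 50), (180, 95), (150, 122), (50, 120)],
}

def bbox_area(b):
    """Area of an [x_min, y_min, x_max, y_max] box in square pixels."""
    x0, y0, x1, y1 = b
    return (x1 - x0) * (y1 - y0)

print(bbox_area(bbox_annotation["bbox"]))  # 10512
```

A box is cheap to draw but includes background pixels; a polygon excludes them at the cost of several extra clicks per object.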
- Data Labeling and Fully Labeled Training Data
Fully labeled means that every example in the training dataset is labeled with the answer the algorithm should come up with on its own. So a dataset of labeled flower images tells the model which photos are roses, daisies, and daffodils. When shown a new image, the model compares it to the training examples to predict the correct label.
In supervised learning, the machine is taught by examples. An operator provides an ML algorithm with a known dataset containing desired inputs and outputs, and the algorithm must learn how to map those inputs to those outputs. Because the operator knows the correct answers, the algorithm can identify patterns in the data, learn from observations, and make predictions.
The algorithm makes predictions and the operator makes corrections - a process that continues until the algorithm reaches a high level of accuracy and performance.
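The "compare a new example to the labeled training examples" idea above can be sketched as a minimal nearest-neighbor classifier. The 2-D feature values and flower labels below are invented purely for illustration.

```python
# Minimal 1-nearest-neighbor sketch: predict a label for a new example by
# finding the most similar labeled training example. Features are toy
# 2-D measurements invented for illustration.
training_data = [
    ((1.4, 0.2), "rose"),
    ((4.5, 1.5), "daisy"),
    ((5.8, 2.1), "daffodil"),
]

def predict(features):
    """Return the label of the closest labeled training example."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training_data, key=lambda ex: sq_dist(ex[0], features))
    return label

print(predict((4.3, 1.4)))  # "daisy" - closest to the (4.5, 1.5) example
```

Real supervised models generalize rather than memorize, but the principle is the same: the labels in the training set define what a correct answer looks like.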
- Data Labeling, ML Models, and Model Training
Most practical ML models today use supervised learning, which applies an algorithm to map an input to an output. For supervised learning to work, you need a set of labeled data from which the model can learn to make correct decisions.
Data labeling often begins by asking humans to make judgments about unlabeled data. For example, a tagger might be asked to tag all images in a dataset where "does the photo contain a bird" is true. Labels can be as coarse as a simple yes/no, or as fine as identifying the specific pixels in an image that belong to a bird.
ML models use human-supplied labels to learn latent patterns in a process called "model training." The result is a trained model that can be used to make predictions on new data.
In ML, the correctly labeled dataset that you use as an objective standard for training and evaluating a given model is often called the "ground truth". The accuracy of the trained model will depend on the accuracy of the ground truth, so it is critical to spend time and resources ensuring highly accurate data labeling.
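Evaluating a trained model against the ground truth amounts to counting how often its predictions match the trusted labels. A minimal sketch, with invented labels and predictions:

```python
# Compare a model's predictions against ground-truth labels.
# Both lists below are invented for illustration.
ground_truth = ["bird", "car", "bird", "car", "bird"]
predictions = ["bird", "car", "car", "car", "bird"]

correct = sum(gt == pred for gt, pred in zip(ground_truth, predictions))
accuracy = correct / len(ground_truth)
print(accuracy)  # 0.8
```

If the ground-truth labels themselves are wrong, this accuracy figure is misleading, which is why careful labeling matters as much as the model itself.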
[More to come ...]