Data Collection Layer
- Overview
Machine learning starts with data. But to make this data useful, several processes must be performed. One of them is data collection.
Simply put, data collection is the process of gathering data relevant to the goals of an AI project. You end up with a dataset, which is essentially a collection of your data that will be fed into the ML model to train it. How hard can it be?
It may seem simple at first, but data collection is actually the first and most fundamental step in the machine learning pipeline. It is part of the complex data processing phase of the ML lifecycle. This leads to another important point: data collection directly affects the performance and final result of the ML model.
- The Role of Data Collection
As new technologies unfold in an era of exciting innovations, collecting data is undoubtedly important for any organization. Data fuels analytical insights and artificial intelligence that are difficult to achieve otherwise.
As a society, we are generating data at an unprecedented rate. This data can be numeric (temperature, loan amount, customer retention rate), categorical (gender, skin color, highest degree earned), or even free text (think doctor's notes or opinion surveys). Data collection is the process of collecting and measuring information from countless different sources. In order to use the data we collect to develop practical artificial intelligence (AI) and machine learning solutions, it must be collected and stored in a way that makes sense for the business problem at hand.
Collecting data allows you to capture records of past events so that you can use data analytics to find recurring patterns. Based on these patterns, you can use machine learning algorithms to build predictive models that look for trends and predict future changes.
- Data Collection for AI/ML/DL
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning becomes more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community, due to the importance of handling large amounts of data.
Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. The integration of machine learning and data management for data collection is part of a larger trend of big data and artificial intelligence (AI) integration and opens many opportunities for new research.
- Data Collection and the Model Accuracy in Machine Learning
The accuracy of the predictions or recommendations produced by a machine learning system depends on the training data. However, several issues that can affect accuracy can arise during ML data collection:
- Bias. Data bias is difficult to prevent and eliminate because the people who build ML models tend to carry their own biases.
- Inaccurate or irrelevant data. The collected data may contain errors or may not be relevant to the ML problem statement.
- Missing data. Values may be missing, for example as nulls in columns or as missing images for some classes of predictions.
- Unbalanced data. Certain groups or categories may be underrepresented or overrepresented because the corresponding samples are too small or too large.
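Two of the issues above, missing data and unbalanced data, can be measured directly. The following is a minimal sketch, not a definitive method: the function name `audit_dataset`, the column names, and the toy records are all assumptions made up for illustration.

```python
from collections import Counter


def audit_dataset(rows, label_key):
    """Return (missing-value rate per column, share of each label)."""
    if not rows:
        raise ValueError("rows must not be empty")
    n = len(rows)
    # Collect every column name that appears in at least one record.
    columns = sorted({key for row in rows for key in row})
    # A value counts as missing if the key is absent or the value is None.
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) / n
        for col in columns
    }
    # Count labels, ignoring records whose label is itself missing.
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    total = sum(labels.values())
    balance = {label: count / total for label, count in labels.items()}
    return missing, balance


# Toy records, invented for illustration only.
rows = [
    {"loan_amount": 1200.0, "degree": "BSc", "defaulted": "no"},
    {"loan_amount": None, "degree": "MSc", "defaulted": "no"},
    {"loan_amount": 800.0, "degree": None, "defaulted": "no"},
    {"loan_amount": 4300.0, "degree": "BSc", "defaulted": "yes"},
]

missing, balance = audit_dataset(rows, label_key="defaulted")
print("missing rate per column:", missing)
print("label distribution:", balance)
```

A high missing rate in a column, or a label that covers only a small share of the records, is a signal to collect more data for that column or class before training. What counts as "too high" or "too small" depends on the problem and is not fixed by this sketch.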
Understanding data collection tools and methods is critical to understanding how to create datasets and to gaining a deeper understanding of machine learning as a whole. High-quality data does not guarantee success, but it is a prerequisite for it in almost any ML task.
- Data Preparation for Automated Machine Learning
Predictive models are only as good as the data used to build them, so good data collection practices are critical to developing high-performance models. The data needs to be free of errors (garbage in, garbage out) and contain relevant information for the task at hand. For example, a loan default model would not benefit from tiger population size, but would benefit from natural gas prices over time.
The quality of predictive output relies on the quality of input -- if you put good in, you’ll get good out. That’s why proper data preparation is such a critical success factor for achieving optimal machine learning results. The iterative process of preparing data for automated machine learning is both an art and a science.
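One way to act on the "garbage in, garbage out" point is to check each record against simple rules before it reaches the model. The sketch below is an assumption-laden illustration: `validate_record`, the field names, the rules, and the sample records are hypothetical and do not come from any particular AutoML tool.

```python
def validate_record(record, rules):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, check in rules.items():
        if record.get(field) is None:
            problems.append(f"{field}: missing")
        elif not check(record[field]):
            problems.append(f"{field}: invalid value {record[field]!r}")
    return problems


# Hypothetical rules for a loan default dataset.
rules = {
    "loan_amount": lambda v: isinstance(v, (int, float)) and v > 0,
    "gas_price": lambda v: isinstance(v, (int, float)) and v >= 0,
    "defaulted": lambda v: v in {"yes", "no"},
}

# Toy records, invented for illustration only.
records = [
    {"loan_amount": 1200.0, "gas_price": 3.1, "defaulted": "no"},
    {"loan_amount": -50.0, "gas_price": 3.4, "defaulted": "maybe"},
    {"loan_amount": 900.0, "defaulted": "yes"},
]

clean = []
rejected = {}
for index, record in enumerate(records):
    problems = validate_record(record, rules)
    if problems:
        rejected[index] = problems  # keep the reasons for later review
    else:
        clean.append(record)

print("clean records:", len(clean))
print("rejected:", rejected)
```

Keeping the rejection reasons, rather than silently dropping bad records, supports the iterative nature of data preparation: the reasons show which collection step needs fixing.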