Building ML Applications
- Putting the Machine Learning Pieces Together
Reading through a data science book or taking a course, it can feel like you have the individual pieces but don’t quite know how to put them together.
Taking the next step and solving a complete machine learning problem can be daunting, but persevering through and completing a first project will give you the confidence to tackle any data science problem.
The following introduces a complete ML solution, built around a real-world process, so that you can see how all the parts fit together:
- Data cleaning and formatting
- Exploratory data analysis
- Feature engineering and selection
- Comparing several machine learning models on a performance metric
- Performing hyperparameter tuning on the best model
- Evaluating the best model on the test set
- Interpreting the model results
- Drawing conclusions and documenting the work
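As a rough sketch, the model-comparison, tuning, and test-evaluation steps above might look like the following scikit-learn snippet. The synthetic dataset, the two candidate models, and the parameter grids are illustrative assumptions, not a prescribed setup:

```python
# Sketch of the workflow: compare models, tune the best one, evaluate on test.
# Dataset, candidate models, and grids are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for a cleaned, feature-engineered dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Compare several models on one performance metric (accuracy, via cross-validation).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}
cv_scores = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
             for name, model in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# Hyperparameter tuning on the best model (grids are examples only).
grids = {
    "logistic_regression": {"C": [0.1, 1.0, 10.0]},
    "random_forest": {"n_estimators": [50, 100]},
}
search = GridSearchCV(candidates[best_name], grids[best_name], cv=5)
search.fit(X_train, y_train)

# Final evaluation on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"best model: {best_name}, test accuracy: {test_accuracy:.3f}")
```

The interpretation and documentation steps are then carried out on `search.best_params_` and the test-set results.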
- Building an ML Application
- Frame your core ML question based on what you observed and the answer you want your model to predict.
- Collect, clean, and prepare data for use with ML model training algorithms. Visualize and analyze data to run sanity checks to verify the quality of the data and understand it.
- Often, the raw data (input variables) and answers (targets) are not represented in a way that can be used to train highly predictive models. Therefore, you should generally try to build more predictive input representations or features from the original variables.
- The generated features are fed into a learning algorithm to build a model, and the quality of the model is evaluated on data held out from model building.
- Use the model to generate predictions of target answers for new data instances.
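The feature-building step above can be as simple as deriving new columns from the raw variables. A minimal sketch with pandas, where the columns and transformations are hypothetical examples:

```python
# Building more predictive features from raw variables (hypothetical columns).
import pandas as pd

raw = pd.DataFrame({
    "height_m": [1.70, 1.60, 1.85],
    "weight_kg": [70.0, 80.0, 90.0],
    "smoker": ["yes", "no", "yes"],
})

features = raw.copy()
# Derived feature: BMI is often more predictive than height or weight alone.
features["bmi"] = features["weight_kg"] / features["height_m"] ** 2
# Encode a categorical variable numerically for the learning algorithm.
features = pd.get_dummies(features, columns=["smoker"])
print(features.columns.tolist())
```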
- Data
Data is a key component of machine learning and provides the foundation for its algorithms. Machines need large amounts of data to learn from in order to function and make informed decisions. Any unprocessed information, value, sound, image, or text can be considered data.
The accuracy and effectiveness of an ML model depend largely on the quality and quantity of the data used for training.
When building a data set, make sure it has the 5V characteristics:
- Volume: The amount of information required for a model to be accurate and effective matters. The accuracy of machine learning models generally increases with the amount of data collected.
- Velocity: The speed at which data is generated and processed is also critical. In some cases, on-the-fly data processing may be required to obtain accurate results.
- Variety: The data set should include diverse formats, such as structured, unstructured, and semi-structured data.
- Veracity: Cleanliness, consistency, and freedom from errors are aspects of data quality and accuracy. Only accurate data can produce accurate output.
- Value: The information in the data must be valuable before any conclusions can be drawn.
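The veracity check in particular is straightforward to start in code. A minimal sketch using pandas to count missing values and duplicate rows; the columns and values are made up for illustration:

```python
# A quick veracity check: count missing values and duplicate rows
# before training (columns and values are fabricated for illustration).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 51, np.nan],
    "glucose": [110.0, 145.0, 145.0, 98.0],
})

missing_per_column = df.isna().sum()          # missing entries per column
duplicate_rows = int(df.duplicated().sum())   # exact duplicate rows
print(missing_per_column.to_dict(), duplicate_rows)
```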
- Models
The model is the core component of ML, representing the learned link between inputs and outputs so that it can produce accurate predictions on new data. It is trained on data sets to identify underlying patterns and produce accurate results.
After training, the model is tested to determine whether it can make accurate predictions on unseen data; if the tests are successful, it is deployed in real-world applications.
Let us take an example to understand this further. You want to build a model that takes into account characteristics such as age, body mass index (BMI), and blood sugar levels to identify whether a person has diabetes.
We first compile a dataset of diabetes patients and their related health indicators. The algorithm analyzes this dataset, considering the patients' health indicators, to find patterns and relationships in the data.
It identifies potential relationships between the outcome (diabetes status) and the input characteristics (blood glucose level, BMI, and age). After training, the model can use a new patient's blood sugar level, BMI, and age to predict whether that patient has diabetes.
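A toy version of this diabetes example, assuming scikit-learn is available; the handful of training records below is fabricated for illustration, not real clinical data:

```python
# Toy diabetes classifier: the records below are invented for illustration.
from sklearn.linear_model import LogisticRegression

# Features: [age, bmi, blood_glucose]; target: 1 = diabetic, 0 = not diabetic.
X = [
    [25, 21.0, 85], [31, 23.5, 90], [40, 24.0, 95], [35, 22.0, 88],
    [55, 31.0, 160], [62, 33.5, 175], [48, 29.0, 150], [58, 30.0, 165],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predict for a new patient from age, BMI, and blood glucose.
new_patient = [[50, 32.0, 170]]
print(model.predict(new_patient)[0])
```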
- Algorithms
Algorithms train models by learning hidden patterns from data, predicting outputs, and improving performance with experience. They are an important component of ML because they power the learning process and affect the accuracy and effectiveness of the model.
The training data set consists of input data and associated output values. Once patterns and associations in the data are identified, a variety of mathematical and statistical techniques are used to determine the underlying relationships between inputs and outputs.
For example, suppose we have a dataset of animal photos and their matching species labels, and we want to train an ML model to identify the species of the animals in the photos. Convolutional Neural Networks (CNNs) can be used for this purpose.
A CNN passes incoming visual data through multiple layers of mathematical operations to identify features such as edges, shapes, and patterns. These features are then used to classify the image into one of the species categories.
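The core operation a CNN layer performs, convolution, can be sketched in plain NumPy. The hand-written loop and Sobel kernel below are a minimal illustration of how a kernel responds to edges, not how a real framework implements it:

```python
# Minimal sketch of 2D convolution, the core CNN operation (illustrative only).
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in most
    deep learning libraries): slide the kernel over the image and sum."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 6x6 image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel kernel that responds strongly to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

response = convolve2d(image, sobel_x)
print(response)
```

The response is large exactly where the image brightness changes from dark to bright, which is how early CNN layers pick out edges before deeper layers combine them into shapes and patterns.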
However, there are several alternatives, including decision trees, logistic regression, and k-nearest neighbors. The dataset at hand and the problem to be solved determine which algorithm to choose.
[More to come ...]