
Building ML Models



- Overview

Building a machine learning (ML) model is a multi-step process involving data collection and preparation, training, evaluation, and continuous iteration. 

Even for those with ML experience, building ML models can be complex and requires diligence, experimentation, and creativity.

At a high level, however, designing, deploying, and managing ML models often follows a common pattern. Understanding these steps will help you apply best practices and guide your projects.

A functional and useful system requires at least the following steps:

  1. Ideating and defining your problem statement
  2. Acquiring (or labelling) a dataset
  3. Exploring your data to understand its characteristics
  4. Building a training pipeline for an initial version of your model
  5. Testing your model and performing error analysis on its failure modes
  6. Iterating on this error analysis to build improved models
  7. Repeating steps 4-6 until you get the model performance you need
  8. Building the infrastructure to deploy your model with the runtime characteristics your users want
  9. Monitoring your model continuously and using what you learn to repeat any of steps 2-8

 

- Understand the Business Problem and Define Success Criteria

The first stage of any ML project is understanding the business needs: you need to know what the problem is before you try to solve it. 

First, work with the project owner to determine the goals and requirements of the project. The aim is to translate this knowledge into a problem definition suitable for an ML project and to develop a preliminary plan for achieving the project goals.

Key questions to answer include: 

  • What are the business goals? What parts of achieving this goal require machine learning methods?
  • What are the heuristic options - in other words, quick and dirty methods that don't require machine learning - and how much better does the model need to be than the heuristic?
  • What type of algorithm is best suited to the problem at hand, such as classification, regression, or clustering?
  • Have the relevant teams addressed all necessary technical, business, and deployment issues?
  • What are the success criteria for the project? How will the organization measure the benefits of the model?
  • How will the team phase the project into iteration sprints?
  • Are there requirements for transparency, explainability, or bias reduction?
  • What are the ethical considerations?
  • What are the acceptable thresholds for accuracy, precision, and other values derived from the confusion matrix?
  • What are the expected inputs and outputs?
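The heuristic question above can be made concrete with a trivial baseline. The following is a minimal sketch, assuming a classification task; the helper names (majority_baseline, accuracy) and the toy labels are hypothetical and not part of any library. It always predicts the most common training label, which gives a floor that any ML model should beat.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label


def accuracy(predict, features, labels):
    """Fraction of examples for which the predictor returns the true label."""
    correct = sum(1 for x, y in zip(features, labels) if predict(x) == y)
    return correct / len(labels)


# Hypothetical toy data: label 1 could mean "customer churned".
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_features = [[3], [7], [1], [9]]
test_labels = [0, 1, 0, 0]

baseline = majority_baseline(train_labels)
baseline_accuracy = accuracy(baseline, test_features, test_labels)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

If a trained model cannot clearly outperform this number, the added complexity of ML is hard to justify.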

Setting specific, quantifiable goals will help you achieve measurable ROI from your ML projects, rather than implementing a proof-of-concept that will later be discarded. 

[Chart: "Is Your Machine Learning Project Feasible or Not Feasible?" Three criteria: business feasibility, data feasibility, and implementation feasibility.]

These goals should be tied to business outcomes, not just ML performance. While you can include typical ML metrics such as precision, accuracy, recall, and mean squared error, it's critical to prioritize specific KPIs that are relevant to your business.
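For reference, the ML metrics named above can be computed in a few lines. This is an illustrative sketch in plain Python with made-up labels; the function names are hypothetical, and in practice a library such as scikit-learn provides equivalents.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when there are no positive predictions/labels.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
mse = mean_squared_error([2.0, 3.0], [2.5, 2.0])
print(metrics, mse)
```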

 

- Training an ML Model

Here are the main steps for training a machine learning (ML) model:

  • Data preparation: Collect, clean, and organize data before using it to train the model. The quality of the data affects the accuracy of the model's predictions.
  • Choose a model: Select the model architecture and algorithm that suit the problem.
  • Training: Model training is a key step in the development process for ML algorithms. Data scientists use tools to find the weights and biases that minimize the algorithm's loss function.
  • Iteration: Training proceeds iteratively over the dataset. In each iteration, the model makes predictions, compares them with the known answers, and adjusts its parameters to reduce the error.
  • Evaluation: Model evaluation assesses the quality of the model and helps users decide whether to trust it on a particular dataset.
  • Testing: Select the data (for example, document sets) to use and specify the percentages to use as training data, test data, and blind (held-out) data. Explore the performance metrics to identify ways to improve the model.
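The idea of finding weights and biases that minimize a loss function can be sketched with plain gradient descent. The example below is an illustration under stated assumptions, not a production training loop: it uses synthetic data generated from y = 3x + 0.5 plus noise, a single weight and bias, mean squared error as the loss, and an arbitrarily chosen learning rate.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 0.5 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.05, size=200)

w, b = 0.0, 0.0        # weight and bias, starting from zero
learning_rate = 0.1    # hypothetical value chosen for this toy problem
losses = []

for step in range(500):
    predictions = w * x + b              # the model makes a prediction
    error = predictions - y              # compare with the known answers
    losses.append(float(np.mean(error ** 2)))  # mean squared error loss
    grad_w = 2.0 * np.mean(error * x)    # gradient of the loss w.r.t. w
    grad_b = 2.0 * np.mean(error)        # gradient of the loss w.r.t. b
    w -= learning_rate * grad_w          # adjust parameters to reduce the error
    b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}, final loss={losses[-1]:.4f}")
```

The learned values should land close to the 3.0 and 0.5 used to generate the data, and the recorded loss should fall as training proceeds.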

- Building an ML Model

Here are some more steps for building an ML model:

  • Data collection: Gather and measure information on targeted variables in an established system.
  • Data preparation: Transform raw data so that an ML algorithm can learn from it, discover insights, and make predictions. Data preparation involves six steps: accessing, ingesting, cleansing, formatting, combining, and analyzing the data.
  • Model evaluation: Provides an unbiased estimate of the model's ability to generalize to new, unseen data. The choice of evaluation metrics depends on the specific problem type.
  • Hyperparameter tuning: Try different hyperparameter values (settings chosen before training, such as learning rate or tree depth) to see whether they improve the trained model.
  • Data preprocessing: An important step before applying ML methods, for example in energy or load prediction. Common steps include data imputation, data resolution processing, data normalization, outlier detection, and data smoothing.
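Some of the preprocessing steps listed above can be sketched with pandas. This is a minimal illustration on a made-up load_kw column: it uses median imputation, the interquartile-range (IQR) rule for outlier detection, and min-max normalization. These are common choices assumed here for the example; the right techniques depend on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical load readings with one missing value and one obvious spike.
df = pd.DataFrame({"load_kw": [10.0, 12.0, np.nan, 11.0, 95.0, 13.0]})

# Imputation: fill missing values with the column median.
df["load_kw"] = df["load_kw"].fillna(df["load_kw"].median())

# Outlier detection: flag values outside 1.5 * IQR from the quartiles.
q1 = df["load_kw"].quantile(0.25)
q3 = df["load_kw"].quantile(0.75)
iqr = q3 - q1
df["is_outlier"] = (df["load_kw"] < q1 - 1.5 * iqr) | (df["load_kw"] > q3 + 1.5 * iqr)

# Normalization: min-max scale the remaining rows to the range [0, 1].
clean = df[~df["is_outlier"]].copy()
low, high = clean["load_kw"].min(), clean["load_kw"].max()
clean["load_scaled"] = (clean["load_kw"] - low) / (high - low)

print(clean)
```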

 

Other steps for building an ML model include:

  • Contextualizing ML in your organization
  • Exploring the data and choosing the type of algorithm
  • Preparing and cleaning the dataset
  • Splitting the prepared dataset and performing cross-validation
  • Performing ML optimization
  • Deploying the model
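Cross-validation, mentioned in the list above, can be sketched with scikit-learn's cross_val_score. The dataset (the bundled iris data), the model, and the choice of five folds are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Bundled example dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# max_iter is raised so the solver converges on this data.
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on four folds, score on the fifth, and rotate.
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Averaging over folds gives a more stable performance estimate than a single train/test split, at the cost of training the model several times.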

 


- Training and Evaluating an ML Model in Python

Here are the steps to train and evaluate a model in Python:

Step 1. Load the data
The first step is to load the data that you want to train the model on. You can use the pandas library to load the data into a DataFrame.
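A minimal sketch of loading data with pandas follows. To keep it self-contained, the CSV content is an inline string with hypothetical columns; with a real file you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """sepal_length,petal_length,species
5.1,1.4,setosa
7.0,4.7,versicolor
6.3,6.0,virginica
4.9,1.5,setosa
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first rows
print(df.dtypes)    # inferred column types
```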

Step 2. Split the data into training and test sets
Once the data is loaded, you need to split it into training and test sets. The training set will be used to train the model, and the test set will be used to evaluate the model's performance. You can use the train_test_split() function from the scikit-learn library to split the data.
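The split can be sketched as follows. The dataset (scikit-learn's bundled iris data), the 80/20 ratio, and the random seed are assumptions for illustration; stratify=y keeps the class proportions the same in both sets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled example dataset with 150 rows, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```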

Step 3. Choose a model
Next, you need to choose a model that you want to train. There are many different models available, so you need to choose one that is appropriate for the task that you are trying to solve.

Step 4. Train the model
Once you have chosen a model, you need to train it on the training data. You can use the fit() method to train the model.

Step 5. Evaluate the model
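Once the model is trained, you need to evaluate its performance on the test set. You can use the score() method to evaluate the model; for scikit-learn estimators it returns mean accuracy for classifiers and R² (the coefficient of determination) for regressors.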
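Steps 3 to 5 can be sketched together in a few lines. The choice of a decision tree and of the bundled iris dataset is an assumption made for the example, not a recommendation for any particular task.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: choose a model (a decision tree, picked only for illustration).
model = DecisionTreeClassifier(random_state=42)

# Step 4: train the model on the training set.
model.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set (mean accuracy for classifiers).
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Evaluating on the test set rather than the training set matters because a model can memorize its training data and still generalize poorly.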

Step 6. Deploy the model
Once the model is trained and evaluated, you can deploy it to production. This means making the model available to users so that they can use it to make predictions.
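Deployment details vary widely, but a common first step is persisting the trained model so that a separate serving process can load it. The sketch below assumes joblib (installed alongside scikit-learn) and a temporary directory; the file name model.joblib is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)      # save the trained model to disk
    restored = joblib.load(model_path)  # what a serving process would do

# The restored model should make the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Only load model files from sources you trust, since joblib and pickle files can execute arbitrary code when loaded.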

 

- Top ML Algorithms

ML algorithms are procedures implemented in code and run on data. An ML model is the output of an algorithm and consists of model data and a prediction procedure. In this sense, ML algorithms provide a type of automated programming in which the resulting models act as programs.

Here are some common ML algorithms and learning paradigms:

  • K-means clustering: An unsupervised clustering algorithm that groups similar data points into clusters. The number of clusters is called K.
  • Reinforcement learning: A learning paradigm (rather than a single algorithm) in which an agent learns, through trial and error, the behavior that maximizes a reward.
  • Supervised learning: A learning paradigm in which a model is trained on a labeled dataset and then makes predictions for new inputs.
  • K-Nearest Neighbor (KNN): A supervised algorithm that classifies a new data point by assigning it the most common category among its K closest training points.
  • Logistic regression: A classification algorithm that predicts the probability of each class from the independent variables (features).
  • Support Vector Machines (SVMs): A supervised learning algorithm that is used for classification, regression, and outlier detection tasks.
  • Naive Bayes: A classification algorithm that applies Bayes' theorem under the assumption that features are conditionally independent given the class.
  • Random forest: A supervised learning algorithm that combines many decision trees and is used for classification and regression.
  • Decision tree: A supervised learning algorithm that splits the data on feature values in a tree structure and is used for classification and regression problems.
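To make one of these concrete, here is a minimal K-Nearest Neighbor sketch in plain Python. The function name knn_predict and the toy points are hypothetical; it uses Euclidean distance and a simple majority vote, which is one common formulation rather than the only one.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two hypothetical clusters in 2-D, labeled "a" and "b".
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(train_points, train_labels, (1.1, 1.0))
near_b = knn_predict(train_points, train_labels, (5.1, 5.0))
print(near_a, near_b)
```

KNN has no separate training phase: all the work happens at prediction time, which makes it simple but potentially slow on large datasets.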

 

 

[More to come ...]

