
Workflows and Pipelines in AI and ML

[Duke University]

- Overview

In artificial intelligence (AI) and machine learning (ML), workflows and pipelines are structured processes that help manage and develop models. Workflows define the phases of a project, while pipelines are a series of components that automate those phases. 

ML pipelines automate the workflow by breaking down the ML task into multiple, connected steps. Each step can receive input data, perform a calculation or transformation, and then pass the output to the next step. This allows data scientists to focus on other tasks, such as model evaluation and data exploration. Pipelines can also help standardize best practices, improve model building efficiency, and reduce training costs. 
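
The following is a minimal Python sketch of this step-chaining idea, where each step receives the previous step's output; every step function and its data are hypothetical placeholders, not part of any particular pipeline framework.

def load_data():
    # Stand-in for reading raw records from a source.
    return [" 42 ", "7", " 13 "]

def clean(records):
    # Transformation step: strip whitespace and convert to integers.
    return [int(r.strip()) for r in records]

def featurize(values):
    # Transformation step: pair each value with a derived feature.
    return [(v, v ** 2) for v in values]

def run_pipeline(data, steps):
    # Pass each step's output as the next step's input.
    for step in steps:
        data = step(data)
    return data

print(run_pipeline(load_data(), [clean, featurize]))
# -> [(42, 1764), (7, 49), (13, 169)]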

A typical ML workflow includes the following phases (a code sketch of these phases follows the list):

  • Data collection: Gathering raw data from various sources to train the model's algorithms.
  • Data preprocessing: Cleaning and preparing the raw data, and checking the data sources for quality issues.
  • Building datasets: Splitting the prepared data into training, validation, and test sets.
  • Model training and improvements: Fitting the model to the training data and iteratively tuning it.
  • Evaluation: Measuring the trained model's performance on held-out data.
  • Deployment and production: Releasing the model to serve predictions and monitoring it in production.
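
As a minimal sketch of these phases, assuming scikit-learn is installed, the example below chains data preprocessing and model training into one pipeline and then evaluates it on held-out data; the bundled Iris dataset stands in for real collected data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data collection / building datasets: load a toy dataset and split it.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data preprocessing and model training as connected pipeline steps.
pipeline = Pipeline([
    ("scale", StandardScaler()),                   # preprocessing
    ("model", LogisticRegression(max_iter=200)),   # training
])
pipeline.fit(X_train, y_train)

# Evaluation: score the trained pipeline on held-out data.
print("test accuracy:", pipeline.score(X_test, y_test))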

 

Developing production-ready AI and ML systems involves a structured workflow that governs how models are developed, deployed, monitored, and maintained. Pipelines provide this structure, offering a repeatable, scalable development process comprising a series of interconnected stages.

 

- Data Pipelines vs. ML Pipelines

Data pipelines and machine learning (ML) pipelines both link multiple processing steps into a single automated flow, and both are essential to data-driven organizations. However, they serve different purposes and are built by different people.

Data pipelines are used for reporting and analytics. They transport data to a warehouse or lake and are typically built by data engineers for business users. Data pipelines often process data at regular intervals, such as every hour or every 30 minutes, and store the results. They also need to be scalable, secure, and commonly hosted in the cloud, and they require regular monitoring and maintenance.
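
Below is a minimal Python sketch of such a batch data pipeline; the source records, warehouse file, and table name are hypothetical, and a production system would use a scheduler or orchestrator and a real warehouse rather than a loop over a local SQLite file.

import sqlite3
import time
from datetime import datetime, timezone

def extract():
    # Stand-in for pulling new records from a corporate system.
    return [("order-1", 19.99), ("order-2", 5.50)]

def transform(rows):
    # Attach a load timestamp so the warehouse keeps historical data.
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [(order_id, amount, loaded_at) for order_id, amount in rows]

def load(rows):
    # A local SQLite file stands in for the warehouse or lake.
    with sqlite3.connect("warehouse.db") as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(order_id TEXT, amount REAL, loaded_at TEXT)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Run at a regular interval; three short runs here for demonstration
# (a production job would sleep 30 * 60 seconds or use a scheduler).
for _ in range(3):
    load(transform(extract()))
    time.sleep(1)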

ML pipelines are used to train models and make predictions. They automate the process of building and deploying ML models and are typically built and used by data scientists. ML pipelines involve building, training, and deploying ML models, which may include offline predictions or batch jobs. A crucial part of ML pipelines is data cleaning and standardization, which may include tasks like removing null values, binning ages, and ensuring consistent date formats.
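
A minimal sketch of these cleaning tasks using pandas is shown below; the column names and values are hypothetical, and the format="mixed" option requires pandas 2.0 or later.

import pandas as pd

# Hypothetical raw records with nulls, unbinned ages, and mixed date formats.
df = pd.DataFrame({
    "age": [23, None, 47, 35],
    "signup_date": ["2024-01-05", "2024-02-05", "03/20/2024", None],
})

# Remove rows with null values.
df = df.dropna()

# Bin ages into labeled ranges.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120], labels=["<=30", "31-50", "51+"]
)

# Parse dates into one consistent type (format="mixed" needs pandas >= 2.0).
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed")

print(df)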

Data pipelines and ML pipelines have many similarities, including:

  • Both access data from corporate systems and intelligent devices
  • Both store collected data in data stores
  • Both transform the collected data to prepare it for analysis or learning
  • Both keep historical data


Joining a data pipeline and an ML pipeline can create a collaborative platform where data engineers and data scientists work side by side, while giving business users the benefit of predictive models.

 

[More to come ...]
