Workflows and Pipelines in AI and ML
- Overview
A data pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next. The elements of a pipeline are often executed in parallel or in a time-sliced fashion.
Just as aqueducts carry water from reservoirs to taps, data pipelines move data from collection points to storage: a data pipeline abstracts data transformation functions to integrate data sets from disparate sources. A data pipeline extracts data from a source, transforms it, and saves it to a specific destination.
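As a rough sketch of the idea, the snippet below chains an extract, a transform, and a load step with Python generators so that each element's output feeds the next; the file names and field names are hypothetical placeholders, not a prescribed design.

```python
# Minimal data-pipeline sketch: processing elements connected in series,
# where each stage's output is the next stage's input. File and field
# names are hypothetical placeholders.
import csv
import json

def extract(path):
    """Read raw rows from a CSV source, one at a time."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Clean and reshape each row before passing it downstream."""
    for row in rows:
        if not row.get("user_id"):            # drop incomplete records
            continue
        row["amount"] = float(row["amount"])  # normalize types
        yield row

def load(rows, path):
    """Write the transformed rows to the destination."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    # Records stream lazily through the chain rather than being held
    # in memory all at once.
    load(transform(extract("raw_events.csv")), "clean_events.jsonl")
```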
In artificial intelligence (AI) and machine learning (ML), workflows and pipelines are structured processes that help manage and develop models. Workflows define the phases of a project, while pipelines are a series of components that automate those phases.
Please refer to the following for more information:
- Wikipedia: Data Pipeline
- Wikipedia: MLOps
- Data Wrangling
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
It involves gathering, selecting, and transforming raw data into a more useful format for analysis, and is commonly broken into six steps: data discovery, data structuring, data cleaning, data enriching, data validating, and data publishing.
Data wrangling is important because it ensures that data is reliable before it's analyzed.
Data wrangling is a manual process that's exploratory and iterative. Some say that data wrangling costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling.
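As a rough illustration of these steps, the sketch below wrangles a small, hypothetical customer table with pandas; the column names and cleaning rules are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw extract; the columns and rules below are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "country":     ["us", "US", "US", "ca"],
    "spend":       ["10.5", "7.0", "7.0", "3.2"],
})

# Structuring and cleaning: drop rows without a key, deduplicate, fix types.
clean = (
    raw.dropna(subset=["customer_id"])
       .drop_duplicates()
       .assign(
           country=lambda df: df["country"].str.upper(),
           spend=lambda df: df["spend"].astype(float),
       )
)

# Enriching: derive a new column from existing data.
clean["high_value"] = clean["spend"] > 5.0

# Validating: fail fast if a basic expectation is violated.
assert clean["customer_id"].is_unique

# Publishing: hand the wrangled table to downstream analysis.
clean.to_csv("customers_clean.csv", index=False)
```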
- ML Pipelines
ML pipelines automate the workflow by breaking down the ML task into multiple, connected steps. Each step can receive input data, perform a calculation or transformation, and then pass the output to the next step. This allows data scientists to focus on other tasks, such as model evaluation and data exploration. Pipelines can also help standardize best practices, improve model building efficiency, and reduce training costs.
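A minimal sketch of this step-chaining idea, here using scikit-learn's Pipeline (one possible tool among many); the dataset and model choices are illustrative assumptions.

```python
# Minimal sketch: each pipeline step transforms its input and passes the
# result to the next step; the final step fits a model. Assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),                   # step 1: normalize features
    ("model", LogisticRegression(max_iter=1000)),  # step 2: fit on step 1's output
])

pipeline.fit(X, y)              # runs every step in order
print(pipeline.predict(X[:5]))  # inference reuses the same chain
```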
A typical ML workflow includes the following phases:
- Data collection: Gathering raw data from various sources to train the model's algorithms.
- Data preprocessing: Cleaning, preparing, and evaluating the collected data to identify quality issues.
- Building datasets: Splitting the prepared data into training, validation, and test sets.
- Model training and improvements: Fitting the model to the training data and tuning it to improve performance.
- Evaluation: Measuring the trained model's performance on held-out data.
- Deployment and production: Deploying the model and putting it into production.
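The compressed sketch below walks through these phases on a bundled toy dataset; the dataset, model, and file names are illustrative assumptions, not a recommended setup.

```python
# Compressed sketch of the workflow phases above on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import joblib

# Data collection / preprocessing: here a bundled, already-clean dataset.
X, y = load_breast_cancer(return_X_y=True)

# Building datasets: split into training and held-out evaluation sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model training and improvement.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Evaluation against data the model has not seen.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Deployment: persist the trained model so a serving process can load it.
joblib.dump(model, "model.joblib")
```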
Developing production-ready AI and ML systems involves a structured workflow that governs how models are developed, deployed, monitored and maintained. Pipelines provide this structure, offering a repeatable, scalable development process comprising a series of interconnected stages.
- Data Pipelines
A data pipeline is a series of processing steps that prepare enterprise data for analysis. Organizations have vast amounts of data from a variety of sources, including applications, Internet of Things (IoT) devices, and other digital channels.
However, raw data has little value on its own; it must be moved, sorted, filtered, reformatted, and analyzed before it yields business intelligence. Data pipelines include techniques for validating, summarizing, and finding patterns in data to inform business decisions. Well-organized data pipelines support a variety of big data projects such as data visualization, exploratory data analysis, and machine learning tasks.
Data pipelines are used to generate business insights. For example, data pipelines can be used for:
- ETL (Extract-Transform-Load) Processes
- Data Warehousing and Analytics
- Data Science and Machine Learning
- eCommerce Recommendation Engine
- Social Media Sentiment Analysis
- Fraud Detection in Financial Transactions
- IoT Data Processing
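As a concrete illustration of the ETL case above, here is a minimal sketch that extracts rows from a CSV file, transforms them in memory, and loads them into a SQLite table standing in for a warehouse; the paths, table, and column names are hypothetical.

```python
# Minimal ETL sketch: extract from a CSV source, transform in memory, and
# load into a SQLite table acting as a stand-in warehouse.
import csv
import sqlite3

def etl(source_csv, warehouse_db):
    # Extract: read raw rows from the source file.
    with open(source_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: filter bad records and normalize fields.
    cleaned = [
        (row["order_id"], row["country"].upper(), float(row["amount"]))
        for row in rows
        if row.get("order_id") and row.get("amount")
    ]

    # Load: write the cleaned rows into the destination table.
    with sqlite3.connect(warehouse_db) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, country TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)

if __name__ == "__main__":
    etl("orders.csv", "warehouse.db")
```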
- Data Pipelines vs. ML Pipelines
Data pipelines and machine learning (ML) pipelines are both processes that chain multiple processing steps, and both are essential to data-driven organizations. However, they have different purposes and are built by different people.
Data pipelines are used for reporting and analytics. They transport data to a warehouse or lake and are typically built by data engineers for business users. Data pipelines may process data at regular intervals, such as every hour or every 30 minutes, and store the results. They also need to be scalable and secure, are often hosted in the cloud, and require regular monitoring and maintenance.
ML pipelines are used to learn from data and make predictions. They automate the process of building, training, and deploying ML models, which may include offline predictions or batch jobs, and are typically built and used by data scientists. A crucial part of ML pipelines is data cleaning and standardization, which may include tasks such as removing null values, binning ages, and ensuring consistent date formats.
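A short sketch of those cleaning and standardization tasks, using pandas on a hypothetical user table; the column names and bin edges are assumptions.

```python
import pandas as pd

# Hypothetical user table for illustration.
users = pd.DataFrame({
    "age":       [23, None, 41, 67],
    "joined_on": ["2023-01-15", "15/02/2023", "2023-03-10", "2023-04-01"],
})

# Remove null values.
users = users.dropna(subset=["age"])

# Bin ages into coarse groups.
users["age_group"] = pd.cut(users["age"], bins=[0, 30, 50, 120],
                            labels=["young", "middle", "senior"])

# Ensure a consistent date format; non-conforming rows become NaT
# and are dropped (they could also be flagged for review).
users["joined_on"] = pd.to_datetime(users["joined_on"], format="%Y-%m-%d", errors="coerce")
users = users.dropna(subset=["joined_on"])
```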
Data pipelines and ML pipelines have many similarities, including:
- Both access data from corporate systems and intelligent devices
- Both store collected data in data stores
- Both transform data to prepare it for analysis or learning
- Both keep historical data
Joining a data pipeline and an ML pipeline together can create a collaborative platform that allows data engineers and data scientists to work together, and can benefit business users with predictive models.
- MLOps
MLOps is a paradigm designed to reliably and efficiently deploy and maintain ML models in production. The term is a portmanteau of "machine learning" and "operations", borrowing the continuous integration and delivery (CI/CD) practices of DevOps in the software field.
ML models are developed and tested in isolated experimental systems. When an algorithm is ready to launch, MLOps is practiced between data scientists, DevOps engineers, and ML engineers to transition it to production systems.
Similar to DevOps or DataOps approaches, MLOps seeks to increase automation and improve the quality of production models, while also focusing on business and regulatory requirements.
- CI/CD Pipelines
CI/CD (Continuous Integration and Continuous Deployment) pipelines for data are becoming increasingly important in data engineering and data science. They help data science teams deliver high-quality machine learning models to the business in a timely manner.
A CI/CD pipeline is a software development or engineering process that combines automated code building and testing with deployment. A CI/CD pipeline is used to deploy new and updated software safely.
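To make the stages concrete, here is a toy sketch of a pipeline runner that executes build, test, and deploy steps in order and stops at the first failure; the commands are placeholders, and real pipelines are normally defined in a CI system's own configuration format.

```python
# Toy CI/CD sketch: run each stage in order and stop on the first failure.
# The commands are hypothetical placeholders.
import subprocess
import sys

STAGES = [
    ("build",  ["python", "-m", "pip", "install", "-e", "."]),
    ("test",   ["python", "-m", "pytest", "-q"]),
    ("deploy", ["python", "deploy.py", "--env", "staging"]),  # deploy.py is hypothetical
]

for name, command in STAGES:
    print(f"== {name} ==")
    result = subprocess.run(command)
    if result.returncode != 0:
        print(f"stage '{name}' failed; stopping the pipeline")
        sys.exit(result.returncode)

print("pipeline succeeded")
```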
- DevOps, Data Pipelines, and CI/CD Pipelines
DevOps can be defined as the union of people, process, and products to enable continuous delivery of value to the business. It's an iterative process of "Developing", "Building & Testing", "Deploying", "Operating", "Monitoring and Learning" and "Planning and Tracking".
The application of DevOps principles to data can be understood through the concepts of data pipelines and CI/CD pipelines.
In DevOps, a data pipeline is a series of steps that prepare raw data for analysis, and a CI/CD pipeline is a way to continuously update data pipelines as new ideas are developed and tested.
Here's some more information about data pipelines and CI/CD pipelines:
- Data pipelines: Data pipelines are used to prepare data for analysis by moving, sorting, filtering, and reformatting it. They can help improve data quality by removing redundancy and standardizing formats. Data pipelines can be used for exploratory data analysis, data visualizations, and machine learning.
- CI/CD pipelines: CI/CD pipelines are a critical part of DevOps that continuously update data pipelines in different environments. They are often called innovation pipelines because they enable the change process.
- Data pipeline ownership: Data engineers are typically responsible for the data ingestion, transformation, and sharing processes that are part of data pipelines.
- CI/CD pipeline ownership: The platform automation and operations team typically owns the maintenance of CI/CD pipelines.