
Workflows and Data Pipelines in Machine Learning

[Duke University]

- Overview

Data is the oil of our time—the new electricity. It is collected, moved, refined.

Today, it seems every business is looking for ways to integrate data from multiple sources to gain business insights that will lead to a competitive advantage.

A data pipeline is a method of obtaining raw data from various sources and then porting it to a data store (such as a data lake or data warehouse) for analysis. Before the data flows into the data repository, it usually undergoes some data processing.

The data pipeline encompasses how data travels from point A to point B; from collection to refining; from storage to analysis. It covers the entire data movement process: where the data is collected (such as on an edge device), where and how it is moved (such as through data streams or batch processing), and where it is delivered (such as a data lake or application).

Data pipelines should seamlessly transport data to its destination and allow business processes to function smoothly. If the pipeline is blocked, quarterly reports may be missed, key performance indicators (KPIs) cannot be understood, user behavior cannot be processed, advertising revenue may be lost, and more. Good pipelines can be the lifeblood of an organization.

Before you attempt to build or deploy a data pipeline, you must understand your business goals, specify data sources and targets, and have the right tools.


- Components of Data Pipelines

A data pipeline is a process that moves data from one location to another, while simultaneously optimizing and transforming the data. Data pipelines are sometimes called data connectors. 

A data pipeline is essentially the steps involved in aggregating, organizing, and moving data. Modern data pipelines automate many of the manual steps involved in transforming and optimizing continuous data loads. Typically, this involves loading raw data into a staging table for temporary storage and then changing it before final insertion into the target reporting table.

A data pipeline consists of three components: 

  • Source: The source is where the data originates. Common sources include relational database management systems like MySQL, CRMs like Salesforce and HubSpot, ERPs like SAP and Oracle, social media management tools, and even IoT device sensors.
  • Data Transformation Steps: Generally speaking, data is extracted from the source, manipulated and changed according to business needs, and then stored at the destination. Common processing steps include transformation, enhancement, filtering, grouping, and aggregation.
  • Destination: The destination is where the data arrives at the end of processing, typically a data lake or data warehouse for analysis.
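The three components above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not a production design: the records, the deduplication rule, and the in-memory list standing in for a data warehouse are all hypothetical.

```python
# Minimal sketch of a data pipeline's three components:
# a source, transformation steps, and a destination.
# All records and names here are hypothetical examples.

def source():
    """Yield raw records, e.g. from a database, CRM, or IoT sensor."""
    yield {"user": "alice", "amount": "42.50"}
    yield {"user": "bob",   "amount": "17.00"}
    yield {"user": "alice", "amount": "42.50"}   # duplicate record

def transform(records):
    """Deduplicate records and convert types per business rules."""
    seen = set()
    for rec in records:
        key = (rec["user"], rec["amount"])
        if key in seen:
            continue                              # deduplication step
        seen.add(key)
        yield {**rec, "amount": float(rec["amount"])}  # type conversion

def load(records, destination):
    """Append processed records to the destination store."""
    destination.extend(records)

warehouse = []                                    # stands in for a data warehouse
load(transform(source()), warehouse)
print(warehouse)
```

Because each stage is a generator feeding the next, records flow through the pipeline one at a time rather than being materialized all at once, which mirrors how real pipelines stay memory-efficient.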


- Benefits of Data Pipelines

Your organization may need to process large amounts of data. To analyze all your data, you need a single view of the entire data set. When this data resides in multiple systems and services, it needs to be combined in a way that is meaningful for in-depth analysis.

The data flow itself can be unreliable: there are many points along the way from one system to another where corruption or bottlenecks can occur. As the breadth and scope of the role data plays continues to expand, the scale and impact of the problem will only grow. 

This is why data pipelines are critical. They eliminate most manual steps in the process and enable a smooth, automated flow of data from one stage to another. They are critical for real-time analysis, helping you make data-driven decisions faster.

Data pipelines are used to generate business insights. For example, data pipelines can be used for: 

  • ETL (Extract-Transform-Load) Processes
  • Data Warehousing and Analytics
  • Data Science and Machine Learning
  • eCommerce Recommendation Engine
  • Social Media Sentiment Analysis
  • Fraud Detection in Financial Transactions
  • IoT Data Processing

Data pipelines often include processes such as ETL, replication, virtualization, machine learning, and batch processing.
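An ETL process, the first use case listed above, can be sketched with only the Python standard library. The CSV data, table name, and transformation rules below are hypothetical placeholders, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV data standing in for a source system.
RAW = "id,name,revenue\n1,acme,1000\n2,globex,2500\n"

# Extract: parse rows from the source.
rows = list(csv.DictReader(io.StringIO(RAW)))

# Transform: normalize names and convert types.
rows = [(int(r["id"]), r["name"].upper(), float(r["revenue"])) for r in rows]

# Load: insert into the target reporting table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revenue (id INTEGER, name TEXT, revenue REAL)")
db.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)

total = db.execute("SELECT SUM(revenue) FROM revenue").fetchone()[0]
print(total)  # 3500.0
```

The same extract/transform/load shape holds regardless of scale; production systems swap the in-memory pieces for connectors, staging tables, and a real warehouse.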


[Machine Learning Pipeline - Medium]

- Data Pipelines in AI, ML, DL, and Neural Networks

A machine learning (ML) pipeline is a way to automate the workflow of producing machine learning models. It coordinates the data flow input and output of the machine learning model. Pipelines include:

  • Raw data input
  • Features
  • Output
  • Machine learning model and model parameters
  • Prediction output
Pipelines work by transforming and correlating data through a model and producing results via testing and evaluation.

Developing efficient machine learning pipelines is key to successfully leveraging artificial intelligence (AI). Pipelines allow you to:

  • Streamline the process of taking in raw data
  • Train a machine learning model
  • Evaluate its performance
  • Integrate predictions into business applications

Deep learning (DL) is a subset of machine learning, essentially a neural network with three or more layers. These neural networks attempt to simulate the behavior of the human brain, allowing them to "learn" from large amounts of data.
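The chain from raw data input through features to prediction output can be sketched as a sequence of composed steps. This is a hand-rolled illustration, not a real ML framework: the min-max scaling step and the deliberately trivial threshold "model" are assumptions chosen to keep the example self-contained.

```python
# Hand-rolled sketch of an ML pipeline: each stage transforms the data
# before a final (deliberately trivial) model makes predictions.

def scale(xs):
    """Preprocessing: min-max scale raw feature values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def threshold_model(xs, cutoff=0.5):
    """'Model': predict 1 when the scaled feature exceeds the cutoff."""
    return [1 if x > cutoff else 0 for x in xs]

def pipeline(raw, steps):
    """Run the raw data through each step in order."""
    data = raw
    for step in steps:
        data = step(data)
    return data

raw_input = [10, 20, 30, 40]              # raw data input
preds = pipeline(raw_input, [scale, threshold_model])
print(preds)  # [0, 0, 1, 1]
```

Real pipeline libraries follow the same composition pattern, but each stage is fitted on training data and then reused unchanged at prediction time.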

- Data Pipeline Process

To understand how data pipelines work, consider any pipe that receives data from a source and delivers it to a destination. What happens to the data along the way depends on the business use case and the destination itself. A data pipeline can be a simple process of extracting and loading data, or it can be designed to process data in a more advanced way, such as a training dataset for machine learning.

  • Source: Data sources may include relational databases and data from SaaS applications. Most pipelines obtain raw data from multiple sources through push mechanisms, API calls, replication engines that periodically pull data, or webhooks. Additionally, data can be synced instantly or at scheduled intervals.
  • Destination: The destination can be a data store, such as an on-premises or cloud-based data warehouse, data lake, or data mart, or it can be a BI or analytics application.
  • Transformation: Transformation refers to the operation of changing data and may include data standardization, sorting, deduplication, verification and validation. The ultimate goal is to make it possible to analyze data.
  • Processing: There are two data ingestion models: batch processing (source data is collected periodically and sent to target systems) and stream processing (data is acquired, manipulated, and loaded immediately after creation).
  • Workflow: Workflow involves the sequencing and dependency management of processes. Workflow dependencies can be technical or business-oriented.
  • Monitoring: Data pipelines must have monitoring components to ensure data integrity. Examples of potential failure scenarios include network congestion or a source or destination going offline. The pipeline must include a mechanism to alert administrators about such situations.
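The two ingestion models named in the Processing step above can be contrasted in a short sketch. The events, the batch size, and the handler are all hypothetical; real systems would flush batches on time windows as well as size.

```python
# Sketch contrasting the two ingestion models: batch (collect, then send)
# versus streaming (process each record as soon as it arrives).

events = [{"id": i, "value": i * 10} for i in range(5)]  # hypothetical events

def batch_ingest(source, batch_size=3):
    """Batch: collect source data periodically, then send it onward."""
    batch, batches = [], []
    for event in source:
        batch.append(event)
        if len(batch) == batch_size:      # flush on a size (or time) trigger
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)             # final partial batch
    return batches

def stream_ingest(source, handler):
    """Streaming: handle each record immediately after creation."""
    return [handler(event) for event in source]

print([len(b) for b in batch_ingest(events)])           # [3, 2]
print(stream_ingest(events, lambda e: e["value"]))      # [0, 10, 20, 30, 40]
```

The trade-off is latency versus throughput: batching amortizes per-record overhead, while streaming delivers each record to the destination with minimal delay.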


- Workflow Dependencies

Workflow involves the sequencing and dependency management of processes. Workflow is a general term for a carefully planned, repeatable pattern of activity. It may be described as a set of operations, the work of an individual or group, or the work of an organization of employees.

A workflow dependency occurs when one job must wait for another job to start or finish before it can begin.

Workflows are made up of the same three basic components: triggers, jobs or activities, and results.

A data pipeline is a series of processes that move data from a source database to a target database. Data pipelines are particularly useful when working with large amounts of data or when working with data that needs to be constantly updated.

Workflow dependencies can be technical or business-oriented.
  • Technical Dependency: For example, after data is obtained from the source, it may be held in a central queue for further verification before finally being written to the destination.
  • Business Dependency: An example of a business dependency might be that information must be cross-validated from one source against another to maintain accuracy before being merged.
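Sequencing jobs under such dependencies is a topological-sorting problem, and Python's standard library provides `graphlib.TopologicalSorter` for exactly this. The job names below are hypothetical, loosely mirroring the technical and business dependencies just described.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each job maps to the set of jobs it must wait for (its predecessors).
# Job names are hypothetical examples.
deps = {
    "extract":     set(),                      # no prerequisites
    "validate":    {"extract"},                # technical dependency
    "cross_check": {"extract"},                # business dependency
    "load":        {"validate", "cross_check"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # e.g. ['extract', 'validate', 'cross_check', 'load']
```

Workflow orchestrators apply the same idea at scale: jobs whose predecessors have all finished become ready to run, and independent jobs (here, `validate` and `cross_check`) can execute in parallel.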

- In-house or in the Cloud?

Many companies build their own data pipelines. But there are challenges in developing internal pipelines. Different data sources provide different APIs and involve different technologies. Developers must write new code for each data source, which may need to be rewritten if the vendor changes its API or if the organization adopts a different data warehousing target.

Speed and scalability are two other issues data engineers must address. For time-sensitive analytics or business intelligence applications, ensuring low latency is critical to delivering data that drives decisions. As data volume and velocity grow, the solution should be resilient. The high costs involved and the ongoing effort required to maintain them can be major barriers to building data pipelines in-house.


[More to come ...]
