Pipeline Types and Use Cases

- Overview

A data pipeline is an end-to-end sequence of digital processes for collecting, modifying, and delivering data. Organizations use data pipelines to copy or move data from one source to another so that it can be stored, used for analysis, or combined with other data.

Data pipelines ingest, process, prepare, transform and enrich structured, unstructured and semi-structured data in a controlled manner; this is called data integration.
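
As a toy illustration of these stages, the sketch below (in Python) ingests records from a CSV file, transforms and enriches them, and delivers them as JSON. The file names and fields are hypothetical; real pipelines typically read from and write to databases, data lakes, or warehouses.

    import csv
    import json

    def extract(path):
        # Ingest: read raw records from a CSV source (hypothetical file).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(records):
        # Prepare/enrich: normalize a field and derive a new one.
        for r in records:
            r["amount"] = float(r["amount"])
            r["is_large_order"] = r["amount"] > 1000
        return records

    def load(records, path):
        # Deliver: write the integrated records to a destination.
        with open(path, "w") as f:
            json.dump(records, f, indent=2)

    load(transform(extract("orders.csv")), "orders.json")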

Ultimately, data pipelines can help enterprises break down information silos and easily move and derive value from data in the form of insights and analytics.

Data pipelines are classified based on how they are used. Batch and streaming processing are the two most common pipeline types.

- Batch Processing Pipelines

Batch processing is primarily used for traditional analytics, where data is periodically collected, transformed, and moved to a cloud data warehouse for business functions and conventional business intelligence use cases.

Users can quickly move large amounts of data from siloed sources into cloud data lakes or data warehouses and schedule processing jobs with minimal manual intervention. Batch processing lets users collect and store data during events called batch windows, which helps manage large volumes of data and repetitive tasks efficiently.
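
As a minimal sketch of such scheduling, the snippet below uses the third-party Python schedule library to run a hypothetical nightly job during an off-peak batch window; the job body and the 02:00 window are placeholders.

    import time

    import schedule  # third-party: pip install schedule

    def nightly_batch_job():
        # Hypothetical batch job: pull the day's records from siloed
        # sources and bulk-load them into a warehouse.
        print("Extracting yesterday's records...")
        print("Transforming and loading into the warehouse...")

    # Run once per day inside the batch window (time is an example).
    schedule.every().day.at("02:00").do(nightly_batch_job)

    while True:
        schedule.run_pending()
        time.sleep(60)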

- Streaming Pipelines

Streaming data pipelines enable users to ingest structured and unstructured data from a wide range of streaming sources, such as Internet of Things (IoT) devices, social media feeds, sensors, and mobile applications, using a high-throughput messaging system that ensures data is captured accurately.

Data transformation happens in real time using a stream processing engine such as Spark Streaming, driving real-time analytics for use cases such as fraud detection, predictive maintenance, targeted marketing campaigns, and proactive customer care.
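
As a rough sketch of this idea, the snippet below uses Spark Structured Streaming (the current streaming API in Apache Spark) to flag large transactions read from a Kafka topic. The broker address, topic name, schema, and the naive amount threshold are all assumptions for illustration; real fraud detection relies on far richer models. Running it also requires Spark's Kafka connector package.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("fraud-demo").getOrCreate()

    # Hypothetical schema for JSON transaction events.
    schema = StructType([
        StructField("account_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Ingest events from a Kafka topic (broker and topic are assumptions).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "transactions")
           .load())

    # Parse the JSON payload and apply a simplistic fraud rule in real time.
    txns = raw.select(from_json(col("value").cast("string"), schema).alias("t")).select("t.*")
    suspicious = txns.filter(col("amount") > 10000)

    # Act on the stream as it arrives; here we simply print to the console.
    query = suspicious.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()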

Streaming data pipelines are used to:

  • Populate data lakes or data warehouses
  • Publish to a messaging system or data stream (see the sketch after this list)
  • Process, store, analyze, and act upon data streams as they're generated
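
For the publishing case, here is a minimal sketch using the third-party kafka-python client; the broker address, topic, and event fields are assumptions.

    import json

    from kafka import KafkaProducer  # third-party: pip install kafka-python

    # Connect to a (hypothetical) local broker and serialize events as JSON.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Publish a clickstream event to a topic for downstream pipelines to consume.
    producer.send("clickstream", {"user_id": "u42", "action": "page_view"})
    producer.flush()  # block until the event is actually delivered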

Examples include:

  • Mobile banking apps
  • GPS applications that recommend driving routes based on real-time traffic information
  • Smartwatches that track steps and heart rate
  • Personalized instant recommendations in shopping or entertainment apps
  • Factory sensors that monitor temperature or other conditions to prevent safety incidents

Streaming data pipelines are also known as event stream processing. Common use cases for streaming processing include:

  • Fraud detection
  • Detecting anomalous events
  • Tuning business application features
  • Managing location data
  • Personalizing customer experience
  • Stock market trading
  • Analyzing and responding to IT infrastructure events
  • Digital experience monitoring

Streaming data pipelines should have the following features:

  • Real-time data analytics and processing
  • Fault-tolerant architecture (see the checkpointing sketch below)
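
Fault tolerance is commonly achieved through checkpointing: the engine records its progress so a restarted query resumes where it left off instead of losing or duplicating data. Below is a minimal Spark Structured Streaming sketch using the built-in rate source; the sink and checkpoint paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

    # The built-in "rate" source emits rows continuously; handy for demos.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # checkpointLocation persists the query's progress so that, after a
    # crash or restart, processing resumes exactly where it stopped.
    query = (stream.writeStream
             .format("parquet")
             .option("path", "/tmp/demo-output")             # hypothetical sink path
             .option("checkpointLocation", "/tmp/demo-chk")  # hypothetical checkpoint dir
             .start())
    query.awaitTermination()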

To run a pipeline in streaming mode, you can:

  • Set the --streaming flag on the command line when you run your pipeline
  • Set streaming mode programmatically when you construct your pipeline, as in the sketch below
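
The --streaming flag above matches the Apache Beam Python SDK (used by runners such as Google Cloud Dataflow); assuming that context, the programmatic equivalent looks like the sketch below. The Pub/Sub topic is hypothetical.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    # Equivalent to passing --streaming on the command line.
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
         | "PrintEvents" >> beam.Map(print))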

Streaming data pipelines enable continuous data ingestion, processing, and movement from source(s) to destination as soon as the data is generated. Also known as streaming ETL and real-time streaming, this technology is used across countless industries to turn databases into live feeds for streaming ingest and processing, accelerating data delivery, real-time insights, and analytics.
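
Turning a database into a live feed is usually done with change data capture (CDC), where a tool such as Debezium publishes every insert, update, and delete as an event. A rough consumer-side sketch with the kafka-python client, assuming Debezium's default JSON envelope (the topic name and fields are illustrative):

    import json

    from kafka import KafkaConsumer  # third-party: pip install kafka-python

    # Subscribe to a (hypothetical) CDC topic fed by a tool like Debezium.
    consumer = KafkaConsumer(
        "dbserver1.inventory.orders",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for event in consumer:
        payload = event.value.get("payload", {})
        op = payload.get("op")        # "c"=create, "u"=update, "d"=delete
        after = payload.get("after")  # row state after the change
        print(op, after)              # a real pipeline would apply this downstream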

[More to come ...]
