Apache Spark
- Overview
Apache Spark is an open-source data processing engine for large data sets. It's designed for big data workloads such as stream processing, graph processing, machine learning, and artificial intelligence (AI).
Spark can:
- Perform processing tasks on very large data sets
- Distribute data processing tasks across multiple computers
- Handle both batch and real-time analytics and data processing workloads
- Utilize in-memory caching
- Optimize query execution for fast analytic queries against data of any size
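As a rough illustration of the in-memory caching and query features listed above, here is a minimal PySpark sketch. The function name, the sample rows, and the column names are hypothetical and chosen only for this example; it assumes PySpark and a Java runtime are installed and runs Spark on the local machine.

```python
def cached_event_counts(master="local[*]"):
    """Run two queries against one cached DataFrame and return both results."""
    # Imported inside the function so this file still loads where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master(master).appName("cache-sketch").getOrCreate()
    try:
        # Hypothetical sample data; a real job would read from files or tables.
        events = spark.createDataFrame(
            [("click", 3), ("view", 5), ("click", 7), ("view", 1)],
            ["event_type", "duration"],
        )

        # cache() only marks the DataFrame for in-memory reuse;
        # nothing is stored until an action runs.
        events.cache()

        # First action: computes the data and fills the cache.
        total = events.count()

        # Second query: reuses the cached rows instead of recomputing them.
        per_type = {
            row["event_type"]: row["count"]
            for row in events.groupBy("event_type").count().collect()
        }
        return total, per_type
    finally:
        spark.stop()
```

Calling cached_event_counts() returns the total row count together with a per-type count. On data this small the cache makes no measurable difference; the point is only to show where caching fits between repeated queries.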
Spark can run:
- On Apache Hadoop
- On Apache Mesos
- On Kubernetes
- On its own (standalone mode)
- In the cloud
- Against diverse data sources
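In practice, the cluster manager is usually selected through a "master URL". The sketch below maps the deployment targets listed above to the URL formats Spark accepts. The helper name master_url and its target keywords are hypothetical, and the default ports (7077 for standalone, 5050 for Mesos, 443 for the Kubernetes API server) are common defaults assumed here, not values taken from this page.

```python
def master_url(target, host=None, port=None):
    """Return a Spark master URL string for a deployment target."""
    if target == "local":
        # Run on a single machine using all available cores.
        return "local[*]"
    if target == "yarn":
        # Hadoop YARN: the cluster is located through the Hadoop configuration.
        return "yarn"
    if host is None:
        raise ValueError(f"target {target!r} needs a host")
    if target == "standalone":
        # Spark's own built-in cluster manager.
        return f"spark://{host}:{port or 7077}"
    if target == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if target == "kubernetes":
        # Points at the Kubernetes API server; the port is always written out.
        return f"k8s://https://{host}:{port or 443}"
    raise ValueError(f"unknown target: {target!r}")
```

The returned string can be passed to SparkSession.builder.master(...) in application code or to the --master option of spark-submit, so the same application can move between environments without code changes.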
Spark started in 2009 as a research project at the University of California, Berkeley. Thousands of companies, including 80% of the Fortune 500, use Apache Spark.
[More to come ...]