
Open Source Big Data Tools

(Versailles, France - Alvin Wei-Cheng Wong)

 

 

Big Data Platforms and Tools

 

Big data tools bring cost efficiency and better time management to data analysis tasks.

 

- Apache Hadoop Ecosystem

Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale on industry-standard hardware.

Apache Hadoop is an open source framework intended to make interaction with big data easier. Hadoop has earned its place in industries and companies that need to work on large data sets that are sensitive and require efficient handling. Hadoop enables the processing of large data sets that reside across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools and solutions supplement or support these major elements. All of these tools work collectively to provide services such as data ingestion, analysis, storage, and maintenance.

The following components collectively form the Hadoop ecosystem:

  • HDFS: Hadoop Distributed File System
  • YARN: Yet Another Resource Negotiator
  • MapReduce: programming-based data processing (see the word-count sketch after this list)
  • Spark: in-memory data processing
  • Pig, Hive: query-based processing of data services
  • HBase: NoSQL database
  • Mahout, Spark MLlib: machine learning algorithm libraries
  • Solr, Lucene: searching and indexing
  • ZooKeeper: cluster management
  • Oozie: job scheduling
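
To make the MapReduce element concrete, here is a minimal word-count sketch in Python for Hadoop Streaming, the Hadoop utility that lets any executable act as a mapper or reducer over stdin/stdout. It would be submitted to a cluster via the hadoop-streaming jar with its -input, -output, -mapper, and -reducer options; the file name wordcount.py and the invocation details are illustrative assumptions, not part of any particular distribution.

    #!/usr/bin/env python3
    # wordcount.py - a minimal Hadoop Streaming sketch (illustrative only).
    # The mapper emits "word<TAB>1" pairs; Hadoop shuffles and sorts them by
    # key, so the reducer sees all counts for a given word contiguously.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            word, _, n = line.rstrip("\n").partition("\t")
            if word == current:
                count += int(n)
            else:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, int(n)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        # Invoked as "wordcount.py map" for the map phase and
        # "wordcount.py reduce" for the reduce phase.
        mapper() if sys.argv[1:] == ["map"] else reducer()

The same script can be tested locally without a cluster (cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce), which mirrors what the framework does at scale across HDFS blocks.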

 

- Open Source Big Data Tools

Apache Hadoop is designed to support the processing of large data sets in a distributed computing environment. Hadoop can handle big batches of distributed information, but there is often a need for real-time processing of user-generated data such as Twitter or Facebook updates. Financial compliance monitoring is another area where real-time processing is needed, in particular to process market data. Social media and market data are two examples of what we call high-velocity data.

Apache Storm and Apache Spark are two other open source frameworks that handle such real-time data generated at a fast rate. Both Storm and Spark can integrate with any database or data storage technology.

 

  • [Apache Hadoop]: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
  • [Apache Storm]: Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language.
  • [Apache Spark]: Apache Spark is a fast and general engine for large-scale data processing, built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010 as an Apache project. (A streaming word-count sketch follows this list.)
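
As an illustration of the real-time side, below is a minimal word-count sketch using Spark's Structured Streaming API in Python (PySpark). It assumes a local text feed on port 9999 (for example, one started with nc -lk 9999); the host, port, and application name are assumptions made for this example. Storm topologies, by contrast, are typically written in Java, so no Storm code is shown here.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    # Local session for experimentation; on a cluster the master is set
    # by the submission environment instead.
    spark = (SparkSession.builder
             .appName("StreamingWordCount")
             .master("local[*]")
             .getOrCreate())

    # Treat each line arriving on the socket as a row in an unbounded table.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Split lines into words and keep a running count per word.
    words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the full updated counts to the console after each micro-batch.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()

The same transformation chain works in batch mode by replacing readStream with read over files in HDFS, which is part of what makes Spark convenient for mixing batch and streaming workloads on the same data.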

 

 

[More to come ...]

  
