
Programming Models for Big Data

(University of Michigan at Ann Arbor)


A programming model is an abstraction of existing machinery or infrastructure. It is a set of abstract runtime libraries and programming languages that form a model of computation. This abstraction level can be low, as in the machine language of computers, or very high, as in high-level programming languages such as Java. So we can say, if the enabling infrastructure for big data analysis is distributed file systems, as we mentioned, then the programming model for big data should enable the programmability of operations within distributed file systems. By this we mean being able to write computer programs that work efficiently on top of distributed file systems using big data, making it easy to cope with all the potential issues. 

Based on everything we discussed so far, let's describe the requirements for big data programming models. First of all, such a programming model for big data should support common big data operations like splitting large volumes of data. This means support for partitioning and placement of data in and out of computer memory, along with a model to synchronize the datasets later on. The access to data should be achieved in a fast way. It should allow fast distribution to nodes within a rack, and these are, potentially, the data nodes we moved the computation to. This means scheduling of many parallel tasks at once. It should also enable reliability of the computing and fault tolerance from failures. This means it should enable programmable replication and recovery of files when needed. It should be easily scalable to the distributed nodes where the data gets produced. It should also enable adding new resources to take advantage of distributed computers and scale to more or faster data without losing performance. This is called scaling out, if needed. Since there are a variety of different types of data, such as documents, graphs, tables, and key-values, a programming model should enable operations over a particular set of these types. Not every type of data may be supported by a particular model, but the models should be optimized for at least one type. 
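Two of the requirements above, splitting large volumes of data into partitions and replicating those partitions across nodes for fault tolerance, can be sketched in a few lines. This is a toy illustration, not code from any real framework; the function names `partition` and `place_with_replication` and the node names are made up for this example.

```python
def partition(records, chunk_size):
    """Split a list of records into chunks of at most chunk_size."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]

def place_with_replication(chunks, nodes, replicas=3):
    """Assign each chunk to `replicas` distinct nodes, round-robin,
    so a single node failure does not lose any chunk."""
    placement = {}
    for i, _chunk in enumerate(chunks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

chunks = partition(list(range(10)), chunk_size=4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
plan = place_with_replication(chunks, ["node-a", "node-b", "node-c", "node-d"])
print(plan[0])  # ['node-a', 'node-b', 'node-c']
```

Real distributed file systems make these decisions with far more care (rack awareness, load, block size), but the shape of the problem, split then place with redundancy, is the same.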




MapReduce is a big data programming model that supports all the requirements of big data modeling we mentioned. It can model the processing of large data, split computations into different parallel tasks, and make efficient use of large commodity clusters and distributed file systems. In addition, it abstracts out the details of parallelization, fault tolerance, data distribution, monitoring and load balancing. As a programming model, it has been implemented in a few different big data frameworks. 
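The division of labor described above can be made concrete with a minimal, single-process sketch: the user supplies only a map function and a reduce function, while splitting, shuffling, and grouping are handled by the framework. The `run_mapreduce` driver below is a toy stand-in for that framework, written for illustration; a real implementation such as Hadoop distributes each phase across machines.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in an input split.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce phase: combine all counts emitted for the same word.
    return (word, sum(counts))

def run_mapreduce(lines, map_fn, reduce_fn):
    # Apply map_fn to every input split.
    pairs = chain.from_iterable(map_fn(line) for line in lines)
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Apply reduce_fn once per key group.
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_mapreduce(["big data big model", "data model"],
                       map_fn, reduce_fn)
print(result)  # {'big': 2, 'data': 2, 'model': 2}
```

Because `map_fn` runs independently on each split and `reduce_fn` runs independently on each key, both phases parallelize naturally, which is exactly what lets the model hide parallelization and data distribution from the programmer.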

To summarize, programming models for big data are abstractions over distributed file systems. The desired programming models for big data should handle large volumes and varieties of data, support fault tolerance, and provide scale-out functionality. MapReduce is one of these models, implemented in a variety of frameworks including Hadoop. 


