Big data adoption continues to grow. Big Data covers data volumes from petabytes to exabytes and is essentially a distributed processing mechanism. MapReduce is a framework for data processing model. The greatest advantage of Hadoop is the easy scaling of data processing over multiple computing nodes. Many organizations use Hadoop for data storage across large pools of unstructured information called data lakes, and then load the most useful data into a relational warehouse for rapid and repetitive queries.
In this article, we will understand and study the transformation of Data Warehouse to the storage of Big Data into Hadoop and various MapReduce implementations in various file systems.
Sign up for your weekly dose of what's up in emerging technology.
File Systems specialize in managing unstructured information. Databases specialize in managing structured information. Traditionally, a database knows the format of the internal contents of the table. In a File System, data is directly stored in set of files.
Traditional models of data processing are being disrupted by following factors:
1. Increasing data volumes
2. Increasing data complexity
3. Changing analysis models
4. Increasing analytical complexity over wide and broad spectrum of data
5. Increasing capability and availability of cost-effective compute and storage
Data warehouses are essentially large database management systems that are optimized for read-only queries across structured data. They are relational databases and, as such, are very SQL-friendly. They provide fast performance and relatively easy administration. In the 1990s, Bill Inmon defined a design known as a data warehouse. In 2005, Gartner clarified and updated those definitions.
From these we summarize that a data warehouses have following features–
• Subject Oriented – The data is modelled after business concepts, organizing them into subjects’ areas like sales, finance and inventory. Each subject area contains detailed data.
• Integrated – the logical model is integrated and consistent. Data formats and values are standardized.
• Non-volatile – Data is stored in the data warehouse unmodified and retained for longer periods of time.
• Non-Virtual – The data warehouse is a physical, persistent repository.
The biggest negative aspects of Data Warehouse are cost and flexibility. Most data warehouses are built upon proprietary hardware and are many orders of magnitude more expensive than other approaches.
HDFS, the Hadoop Distributed File System, is a distributed file system designed so that it can hold a very large amount of data (terabytes or even petabytes) and provide high throughput access to this information.
HDFS is a master-slave architecture that is a backbone of the computing frameworks such as MapReduce. HDFS layer forms the foundation for data-driven business decisions through Hadoop’s breakthrough features in scalability, flexibility, cost optimizations, ability to handle the greater volume, and veracity of data over traditional platforms. HDFS is a modern data storage engine that can adapt to ever-changing and extremely variable data formats. HDFS effortlessly combines storage and compute workloads which allows combining and collapsing multiple workloads such as batch processing.
MapReduce is a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called as mappers and reducers. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at result. In simple terms, the mapper is used for filter and transform the input into something that the reducer can accumulate over.
The difference between Data Warehouse and Hadoop be as below –
|Parameters||Data Warehouse||Hadoop Ecosystem|
|System||Works well with batched data ingestion||Works well with batch or near real time or real time data ingestion|
|Hardware||Needs expensive hardware including specialized hardware in some cases||Uses commodity hardware|
|Built Type||Built primarily for performance||Built for extreme scalability|
|Data Processing||Static Data||Data on the fly|
|Data Type||Supports primarily modelled and structured data||Supports any type of data|
|Storage/ Capacity||Designed for very high volumes (GBs – TBs)||Designed for extremely high volumes (TBs and more)|
|Paradigm||Write many times read many times paradigm (ETL form)||Write once read many times paradigm|
Various MapReduce Implementations
MapReduce is both, the name of the programming model, the original framework, and the scalable distributed data-intensive application designed and implemented by Jeff Dean and Sanjay Ghemawat at Google. It provides fault tolerance while running on inexpensive commodity hardware, delivers high aggregate performance to large number of clients. Its code isn’t freely available, even though it is only used internally at Google and known to be written in C++, with interfaces in Python and Java. It is used in some of the largest MapReduce clusters to date. It has been studied in the literature that, on any given day, Google used to execute about 100,000 MapReduce jobs; each occupies about 400 servers and used to take about 5 to 10 minutes to finish. A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunk server and a client on the same machine if machine resources permit, and the lower reliability caused by running possibly application code is acceptable.
Disco is an open source implementation of the MapReduce programming model, developed at Nokia Research Centre as a lightweight framework for rapid scripting of distributed data processing used for building a robust and fault-tolerant distributed application. As it is written in Python programming language, it lowers the entry barrier and makes it possible to write data processing code in few lines of code. It supports parallel computations over large data sets on unreliable clusters of computers.
Unlike Hadoop, Disco is only a minimal MapReduce implementation, it has been used in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.
Skynet is an open source implementation of the MapReduce framework written in Ruby. But unlike other implementations, its administration is fully distributed and doesn’t have a single point of failure such as the master servers that can be found in Google MapReduce and Hadoop. It uses a peer recovery system and fault tolerant in which workers look out for each other. Peers track each other and, once failure is detected, can spawn a replica of the failed peer.
It is Microsoft’s research project using MapReduce. Dryad intends to be a more general-purpose environment to execute data-parallel applications. It has been studied in the literature that it includes other computation frameworks, including MapReduce. It is intended to be a super-set of the core Map-Reduce framework. Dryad programs are expressed as directed acyclic graphs (DAG) in which vertices are computations and edges are communication channels. Dryad has been deployed at Microsoft since 2006, where it runs on various clusters of more than 3000 nodes.
It is a general-purpose distributed file system. It is a commodity-based distributed file system that can be spread across hundreds to thousands of compute nodes. Gfarm file system federates local file systems of compute nodes and does not require any special configuration or additional hardware. Using the Hadoop-Gfarm plugin, the Hadoop Distributed File System can be built on Gfarm and the MapReduce can be used.
6. HadoopDB – A Hybrid MapReduce System
The HadoopDB project is a hybrid system that tries to combine the scalability of MapReduce with the performance and efficiency advantages of parallel databases. In this type of project, HadoopDB is to connect multiple single nodes database systems using Hadoop as the task coordinator. Queries are expressed in SQL but their execution is parallelized across nodes using the MapReduce framework.
In concluding remarks, the difference between Hadoop and data warehouse is just similar where an Architecture of Hadoop is a tool for handling and storing Big Data, whereas data warehouse is an architecture for organizing data to ensure integrity. One more point to state is, former makes the best use of relational and structured data whereas later excels in storing and managing unstructured data using MapReduce for data processing.
• Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI’04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association
• Pisoni, A. (2009) Skynet: A Ruby MapReduce Framework https://developpaper.com/skynet-introduction-of-google-map-reduce-framework-in-ruby/
• Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (2007, March). Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS operating systems review (Vol. 41, No. 3, pp. 59-72). ACM. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
• Mikami, S., Ohta, K., & Tatebe, O. (2011, September). Using the Gfarm File System as a POSIX compatible storage platform for Hadoop MapReduce applications. In Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing (pp. 181-189). IEEE Computer Society.
• Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., & Kozyrakis, C. (2007, February). Evaluating mapreduce for multi-core and multiprocessor systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on (pp. 13-24). IEEE 13-24.
• Yang, H. C., Dasdan, A., Hsiao, R. L., & Parker, D. S. (2007, June). Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (pp. 1029-1040). ACM.