mapreduce


MapReduce

“MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.” Fair Use Source: https://wiki.apache.org/hadoop/MapReduce


MapReduce is a programming model and processing technique for handling large-scale data in a distributed computing environment. It was introduced by Google in 2004 as part of its proprietary big data infrastructure for processing vast amounts of data across many machines. The model consists of two primary steps: the “map” function, which processes input data and emits intermediate key-value pairs, and the “reduce” function, which aggregates the intermediate values for each key to produce the final output. Because both steps can run independently on many partitions of the data, MapReduce scales readily across thousands of machines, making it an essential tool for big data processing.

https://en.wikipedia.org/wiki/MapReduce
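
The flow can be illustrated with a minimal, single-process Python sketch of the map, shuffle, and reduce phases, using word counting as the example; it is meant only to show the shape of the model and is not tied to any particular framework’s API.

  from collections import defaultdict

  def map_phase(document):
      # Map: emit an intermediate (key, value) pair for every word.
      for word in document.split():
          yield word.lower(), 1

  def shuffle(pairs):
      # Shuffle: group intermediate values by key, as the framework would.
      groups = defaultdict(list)
      for key, value in pairs:
          groups[key].append(value)
      return groups

  def reduce_phase(key, values):
      # Reduce: aggregate all values for one key into a final result.
      return key, sum(values)

  documents = ["the quick brown fox", "the lazy dog", "the fox"]
  intermediate = [pair for doc in documents for pair in map_phase(doc)]
  counts = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
  print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}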

MapReduce became widely known as the core processing engine of Hadoop, an open-source framework that originated in 2005 and is developed by the Apache Software Foundation for processing large datasets in parallel across a distributed cluster of computers. Hadoop’s implementation of MapReduce provides fault tolerance, data replication, and scalability. A Hadoop MapReduce job splits the input data into chunks that are processed in parallel by map tasks, then uses the reduce step to consolidate the intermediate results into the final output. This architecture is particularly suitable for batch processing and can handle structured, semi-structured, and unstructured data.

https://hadoop.apache.org/
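
Hadoop’s native MapReduce API is Java, but the same word-count job is often sketched as a pair of Hadoop Streaming scripts that read lines from standard input and write tab-separated key-value pairs to standard output. The sketch below assumes that streaming setup; the framework sorts the mapper output by key before it reaches the reducer.

  # mapper.py -- emit "word<TAB>1" for every word in the input
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(word.lower() + "\t1")

  # reducer.py -- input arrives sorted by key, so counts for the same
  # word are adjacent and can be summed in a single pass
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t", 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print(current_word + "\t" + str(current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print(current_word + "\t" + str(current_count))

The two scripts would be passed to the Hadoop Streaming jar as its -mapper and -reducer arguments; the exact job submission command depends on the cluster.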

While MapReduce is effective for batch processing, it can be slower than more modern engines such as Apache Spark. Between the map and reduce phases, and between chained jobs, MapReduce writes intermediate outputs to disk while shuffling and sorting them, which introduces I/O latency. Apache Spark addresses this by keeping intermediate data in memory, greatly reducing the number of disk I/O operations. Despite this limitation, MapReduce continues to be widely used in big data processing, especially where large-scale parallelization and fault tolerance are crucial.

https://en.wikipedia.org/wiki/MapReduce
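
For contrast, the same word count expressed with Apache Spark’s Python API keeps the intermediate data in memory as resilient distributed datasets rather than materializing it to disk between phases. The input path below is an assumed placeholder.

  from pyspark import SparkContext

  sc = SparkContext(appName="wordcount-sketch")
  counts = (sc.textFile("hdfs:///data/corpus.txt")    # assumed placeholder path
              .flatMap(lambda line: line.split())     # map: split lines into words
              .map(lambda word: (word.lower(), 1))    # emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))       # reduce: sum counts per word
  print(counts.take(10))
  sc.stop()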


MapReduce is a programming model and an associated implementation for processing and generating large data sets that can be parallelized across a distributed cluster of computers. Introduced by Google in 2004 in its research on distributed computing for large datasets, MapReduce hides much of the complexity of processing vast amounts of data by dividing the work into small chunks that are processed in parallel. The model consists of two main steps: the Map step, which performs filtering and sorting (such as sorting students by first name), and the Reduce step, which performs a summary operation (such as counting the number of students with the same first name).

MapReduce became a core component of Apache Hadoop, allowing developers and data scientists to write applications that process massive amounts of unstructured data in parallel across a distributed cluster of commodity machines. The model has significantly influenced the design of many big data processing tools and frameworks, underlining its importance in distributed computing and big data analysis.

The core concepts are described in Dean and Ghemawat’s paper “MapReduce: Simplified Data Processing on Large Clusters” (OSDI 2004).
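
The students example from the paragraph above can be sketched in a few lines of Python; the names are made-up sample data, and the grouping dictionary stands in for the distributed shuffle.

  from collections import defaultdict

  students = ["Alice Smith", "Bob Jones", "Alice Chen", "Carol Diaz"]

  # Map step: sort students into queues keyed by first name.
  queues = defaultdict(list)
  for student in students:
      first_name = student.split()[0]
      queues[first_name].append(student)

  # Reduce step: summarize each queue, here by counting its members.
  name_counts = {name: len(queue) for name, queue in queues.items()}
  print(name_counts)  # {'Alice': 2, 'Bob': 1, 'Carol': 1}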

Snippet from Wikipedia: MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel and distributed algorithm on a cluster.

A MapReduce program is composed of a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The model is a specialization of the split-apply-combine strategy for data analysis. It is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 Message Passing Interface standard's reduce and scatter operations), but the scalability and fault-tolerance achieved for a variety of applications due to parallelization. As such, a single-threaded implementation of MapReduce is usually not faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations on multi-processor hardware. The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since become a generic trademark. By 2014, Google was no longer using MapReduce as its primary big data processing model, and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.
