Tuesday, September 27, 2011

MapReduce: Simplified Data Processing on Large Clusters

Various groups within Google have implemented hundreds of special-purpose computations that process large amounts of data.  These are usually parallelized on a large cluster of machines, but running a distributed algorithm on many servers raises many issues, such as how to parallelize the computation and how to deal with failures.  MapReduce was developed by Google to solve these problems and hide the complexity from users.  MapReduce is a programming model and system implementation for processing large amounts of data.  Users just need to define a map() function and a reduce() function, and the system takes care of efficient execution on a large cluster of commodity machines.  The runtime system handles the partitioning, scheduling, and failures so that users can easily utilize large clusters.  The MapReduce abstraction is inspired by the Lisp primitives map and reduce, and the functional model allows effective parallelization and re-execution for fault tolerance.  Most existing computations already had a map phase, where data was transformed into intermediate key/value pairs, and a reduce phase, where a final operation was applied to all the values sharing the same key.
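
The paper's canonical example is word count.  Below is a minimal sketch in Python of what the two user-defined functions might look like; the emit-callback interface is an illustrative assumption on my part (the real Google library is C++ and passes strings, as noted below), but it captures the shape of the model.

    # Word count: the two user-defined functions of the MapReduce model.
    # The emit() callback style is illustrative, not Google's actual API.

    def map_fn(key, value, emit):
        """key: document name, value: document contents."""
        for word in value.split():
            emit(word, 1)  # one intermediate pair per word occurrence

    def reduce_fn(key, values, emit):
        """key: a word, values: all counts emitted for that word."""
        emit(key, sum(values))

Everything else, including splitting the input, grouping the intermediate pairs by key, and writing the output, is the runtime's job.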

The MapReduce programming model is very simple to use.  Only a map() function and a reduce() function need to be defined, and the runtime system handles the rest.  MapReduce passes strings to and from the user-defined functions, so the user must convert them to the appropriate types.  Many computations fit this model well: distributed grep, counting URL access frequency, building a reverse web-link graph, and distributed sort.  Execution starts from the MapReduce library call in the user code.  The input files are split into M pieces for the M map tasks.  A master is also forked, and it assigns work to mappers and reducers.  A mapper reads key/value pairs from its input split and writes the new intermediate key/value pairs to its local disk.  These local locations are passed to the master, which then notifies the reducers.  When the R reducers have read all of their data, they sort it by key, pass each key and its values to the user-defined reduce function, and produce the final R output files.  To tolerate failures, the master periodically pings the workers to determine whether they are still running; on a failure, the work is simply restarted on a different machine.  Most of the user-defined functions are deterministic, so the programs produce the same results in a distributed setting as they would sequentially.

Locality of the GFS data is considered when assigning map tasks, to conserve network bandwidth.  Some tasks can be slow to finish (stragglers), so when the job is near completion, the master starts backup copies of the remaining in-progress tasks and takes whichever copy completes first.  Another optimization is the combiner function, which runs on the map worker to partially reduce the data before the real reducers fetch it; this can cut down the amount of data transferred over the network.  In experiments on a cluster of 1800 machines, grep and sort workloads performed well.  The backup tasks for stragglers reduced the run time of the sort job, and killing 200 workers caused their tasks to be automatically rescheduled, completing the job with just a 5% increase in time.  The MapReduce framework is widely used within Google today, and it makes writing, running, and maintaining distributed computations much easier.
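
To make the execution flow concrete, here is a toy, single-process simulation of it in Python: split the input into M pieces, run the map tasks, partition intermediate pairs into R buckets with hash(key) % R, optionally apply a combiner on the map side, then sort and reduce each bucket.  The function names and driver are assumptions for illustration only; the real system distributes this work across machines, writes intermediate data to local disks, and handles failures.

    from collections import defaultdict

    def run_job(inputs, map_fn, reduce_fn, R, combiner=None):
        # Map phase: each of the M input splits produces intermediate pairs,
        # partitioned into R buckets (one per reduce task).
        buckets = [defaultdict(list) for _ in range(R)]
        for split_id, text in enumerate(inputs):      # M = len(inputs) map tasks
            pairs = []
            map_fn(str(split_id), text, lambda k, v: pairs.append((k, v)))
            if combiner:                              # optional map-side pre-reduction
                grouped = defaultdict(list)
                for k, v in pairs:
                    grouped[k].append(v)
                pairs = []
                for k, vs in grouped.items():
                    combiner(k, vs, lambda ck, cv: pairs.append((ck, cv)))
            for k, v in pairs:
                buckets[hash(k) % R][k].append(v)     # partition function

        # Reduce phase: each of the R tasks sorts its keys and reduces them.
        outputs = []
        for bucket in buckets:
            out = {}
            for k in sorted(bucket):
                reduce_fn(k, bucket[k], lambda rk, rv: out.__setitem__(rk, rv))
            outputs.append(out)
        return outputs                                # stand-in for the R output files

With the word-count map_fn and reduce_fn sketched earlier, run_job(documents, map_fn, reduce_fn, R=4, combiner=reduce_fn) reuses the reduce function as the combiner, which is valid here because summation is associative and commutative.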

The MapReduce model is very popular today and there are even open-source implementations like Hadoop.  I think the future will not be as dependent on MapReduce, but new systems and programming models will certainly retain a lot of the runtime features.  The important runtime features are still fault tolerance, locality, and efficient shuffling.  I think MapReduce will still be used for certain applications, but new programming models will emerge to handle different types of parallel computation.  A key principle for parallel computation on large data is that the data should not be moved much; instead, the computation should go to the data.  However, networking is also improving, which would mitigate the transfer costs.  I think there will be an interesting tradeoff between co-locating computation with the data, and incurring transfer costs but amortizing them over several uses.
