Wednesday, October 12, 2011

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language and FlumeJava: Easy, Efficient Data-Parallel Pipelines

DryadLINQ is a new programming model for large-scale distributed computing, providing imperative and declarative operations on datasets within a traditional high-level language.  Programs are written as sequential transformations on datasets in the .NET framework, and the runtime system executes them in parallel on the Dryad platform.  DryadLINQ makes writing large-scale distributed programs much simpler, since the programmer can write ordinary sequential code and the system compiles it to run reliably on many machines.  By combining imperative and declarative programming, DryadLINQ provides more flexibility than SQL and more optimization opportunities than purely imperative languages.  DryadLINQ exposes datasets as strongly, statically typed enumerable objects and evaluates them lazily.  It also adds two new operators, Apply() and Fork(), to LINQ to make more of the language parallelizable.  Programs are translated into an execution plan graph, which is a DAG, on which both static optimizations and dynamic runtime optimizations are performed.  The static optimizations are pipelining, removing redundancy, eager aggregation, and I/O reduction; at runtime, the system can dynamically re-partition the data, for example to optimize sorting.  DryadLINQ can also leverage PLINQ, which parallelizes programs on multicore machines, and the LINQ-to-SQL system to interoperate with SQL databases.  In the experiments, the Terasort benchmark ran with almost constant scaleup, and the overhead of DryadLINQ over hand-written Dryad programs was about 10%.
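To make the lazy-evaluation idea concrete, here is a minimal, self-contained sketch in Java (the names Dataset, select(), and where() are hypothetical stand-ins, not the real DryadLINQ C# API) showing how each operator can simply add a node to a plan graph, with actual work happening only when the result is finally enumerated:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;
    import java.util.function.Predicate;

    // Hypothetical stand-in for a lazy dataset: each operator returns a
    // new node in a plan graph instead of computing anything.
    abstract class Dataset<T> {
        abstract List<T> evaluate();  // a real system would ship the plan to a cluster

        static <T> Dataset<T> of(List<T> items) {
            return new Dataset<T>() {
                List<T> evaluate() { return items; }
            };
        }

        <R> Dataset<R> select(Function<T, R> f) {  // like LINQ's Select
            Dataset<T> upstream = this;
            return new Dataset<R>() {
                List<R> evaluate() {
                    List<R> out = new ArrayList<>();
                    for (T t : upstream.evaluate()) out.add(f.apply(t));
                    return out;
                }
            };
        }

        Dataset<T> where(Predicate<T> p) {  // like LINQ's Where
            Dataset<T> upstream = this;
            return new Dataset<T>() {
                List<T> evaluate() {
                    List<T> out = new ArrayList<>();
                    for (T t : upstream.evaluate()) if (p.test(t)) out.add(t);
                    return out;
                }
            };
        }
    }

    public class DeferredDemo {
        public static void main(String[] args) {
            // Building the pipeline does no work: it only constructs the DAG.
            Dataset<Integer> plan = Dataset.of(List.of(1, 2, 3, 4, 5))
                                           .where(x -> x % 2 == 1)
                                           .select(x -> x * x);
            // Enumerating the result triggers evaluation of the whole plan.
            System.out.println(plan.evaluate());  // prints [1, 9, 25]
        }
    }

A real implementation would hand the plan graph to a cluster scheduler rather than evaluating it in-process, but the shape of the API is the same.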

FlumeJava is a similar system developed by Google.  Writing large-scale parallel programs over massive amounts of data is hard, and FlumeJava aims to solve that problem.  Many computations need pipelines of MapReduce jobs, so FlumeJava exposes immutable parallel collections that ease programming and allow automatic compilation and optimization into MapReduce jobs.  The two central classes are PCollection and PTable, and the primitive operators are parallelDo(), groupByKey(), combineValues(), and flatten().  More complex operations, like count(), join(), and top(), are implemented on top of these primitives.  Operators are not executed immediately but are evaluated lazily, so a program forms a DAG of deferred operators.  Optimizations on the graph include parallelDo fusion, map-shuffle-combine-reduce (MSCR) fusion, sinking flattens, and lifting combineValues.  At runtime, the system decides whether to run each MSCR operation locally or remotely in parallel, trading off the startup latency of remote execution.  Within Google, FlumeJava is becoming the primary Java-based API for data-parallel computation.  The optimizer typically reduces the number of stages by about 5x, and in some pipelines by 30x.  FlumeJava was compared with hand-written MapReduce and Sawzall on four applications: the FlumeJava code was more concise, the optimizer reduced the number of MapReduce stages down to that of the hand-optimized code, and the running time came very close to that of the hand-optimized MapReduce code.
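The paper's running example is word count; the following sketch, reconstructed from the paper, shows what a pipeline looks like in FlumeJava (the library is internal to Google, so the signatures here should be read as approximate rather than as a compilable public API):

    // Word count in FlumeJava, reconstructed from the paper's example.
    PCollection<String> lines =
        readTextFileCollection("/gfs/data/shakes/hamlet.txt");

    // parallelDo() is a primitive: apply a DoFn to every element in parallel.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
        void process(String line, EmitFn<String> emitFn) {
            for (String word : splitIntoWords(line)) {
                emitFn.emit(word);
            }
        }
    }, collectionOf(strings()));

    // count() is a derived operation built from parallelDo(), groupByKey(),
    // and combineValues().  Nothing has executed yet: all of the calls
    // above only extend the deferred execution DAG.
    PTable<String, Integer> wordCounts = words.count();
    writeToRecordFileTable(wordCounts, "/gfs/data/shakes/hamlet-counts.records");

    // run() optimizes the DAG (e.g., fusing operators into MSCR stages)
    // and then executes it, typically as a sequence of MapReduce jobs.
    FlumeJava.run();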

Future languages for data-intensive computation will look similar to DryadLINQ, FlumeJava, and Pig, combining declarative and imperative programming.  The declarative nature of SQL is very well defined, but at times it is too restrictive and makes complicated algorithms difficult to express.  Programmers are already used to writing imperative code, so it makes sense to let developers write large-scale data computations the same way.  The functional aspects, in particular the immutable-dataset concept, keep the semantics clean.  This will allow more developers to build larger and more involved computations easily, which will be a huge advantage.  Optimizers will improve and adopt more cost-based techniques from traditional databases.  Also, compiling down to MapReduce does not always make sense: Dryad supports arbitrary dataflow DAGs, and there will probably be other low-level primitives, similar to the MapReduce paradigm, that higher-level programs can target.
