Monday, October 31, 2011

Piccolo: Building Fast, Distributed Programs with Partitioned Tables, and Spark

Piccolo is an in-memory programming model for large applications in data centers.  It allows programs to access mutable, distributed in-memory state through a key-value data model.  This distinguishes Piccolo from existing solutions based on MPI, and from data flow models like MapReduce that expose no global state.  With Piccolo, developers just define kernel functions, which are distributed across workers and access the shared distributed data.  By exposing global shared state, Piccolo makes many types of applications easier to write.  Applications access state with simple get() and put() APIs, and can define accumulator functions to resolve conflicting updates to the same key.  Developers can express locality preferences so that kernels run near the data they access most, reducing long network delays.  A master process coordinates the kernels running on the workers, and to achieve load balancing, idle workers can "steal" un-started kernels from other workers.  To handle faults, Chandy-Lamport checkpoints are taken online; simply restarting kernels would not work, since the shared state may already have been mutated.  Many different types of applications have been developed with Piccolo, such as a distributed crawler, PageRank, k-means clustering, and matrix multiplication.  In experiments, scaling was decent when adding workers, and also when growing the input size along with the number of workers on a local cluster; on EC2, scaling was closer to linear.  Piccolo implementations of PageRank and k-means clustering were compared to Hadoop implementations and performed better, mostly because Hadoop has to sort lots of data, serialize keys, and read/write HDFS, overheads that Piccolo largely avoids.  Also, experiments with work stealing showed that performance can improve greatly when load is imbalanced across the cluster.
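To make the model concrete, here is a minimal, single-process Python sketch loosely modeled on the Python-like PageRank kernel shown in the paper.  The Table class and its get()/put()/update() methods only imitate the shape of Piccolo's table API; real Piccolo partitions tables across workers, runs kernels near their data, and implements the runtime in C++.

    # Single-process sketch of Piccolo's key-value table model.
    # Table only simulates the API shape; real Piccolo tables are
    # partitioned across workers.

    class Table:
        def __init__(self, accumulator):
            self.data = {}
            self.accum = accumulator  # merges conflicting updates to one key

        def get(self, key):
            return self.data.get(key, 0.0)

        def put(self, key, value):
            self.data[key] = value

        def update(self, key, value):
            # Concurrent updates to the same key are merged with the
            # user-supplied accumulator instead of being treated as conflicts.
            self.data[key] = self.accum(self.data.get(key, 0.0), value)

    def pagerank_kernel(graph, curr, nxt):
        # In Piccolo this kernel would run on each worker against its
        # local partition of the graph, chosen via locality preferences.
        for node, out_links in graph.items():
            share = curr.get(node) / len(out_links)
            for target in out_links:
                nxt.update(target, share)  # accumulated with addition

    if __name__ == "__main__":
        graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
        curr = Table(accumulator=lambda old, new: old + new)
        for node in graph:
            curr.put(node, 1.0 / len(graph))
        for _ in range(10):  # fixed iteration count, as in the paper's example
            nxt = Table(accumulator=lambda old, new: old + new)
            pagerank_kernel(graph, curr, nxt)
            curr = nxt
        print({k: round(v, 3) for k, v in sorted(curr.data.items())})

The accumulator is the key design point: because conflicting update() calls are merged with a commutative function like addition, kernels on different workers can write to the same table without coordinating on every update.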

Spark and resilient distributed datasets (RDDs) provide a distributed memory abstraction for writing in-memory applications on large clusters.  RDDs achieve fault tolerance for in-memory data by allowing only coarse-grained transformations on it.  RDDs are extremely useful for reusing data and computation and for iterative algorithms, since the working set can be kept in memory across steps.  Existing systems have to provide fault tolerance by copying data or logging updates, but RDDs provide efficient fault tolerance by recording the lineage of transformations and recomputing only lost partitions.  Developers can control both the partitioning of RDDs and their persistence in memory.  The Spark programming interface for RDDs is similar to DryadLINQ and FlumeJava: new RDDs are defined by applying transformations to existing ones.  Since the transformations are coarse-grained, the RDDs are immutable, and the lineage is remembered, RDDs behave very differently from distributed shared memory.  Representing an RDD requires five pieces of information: partitions(), preferredLocations(), dependencies(), iterator(), and partitioner().  There are two classes of dependencies: narrow, where each partition of the parent RDD is used by at most one partition of the child RDD, and wide, where a parent partition may be used by many child partitions.  Narrow dependencies can be pipelined, and allow for more efficient recovery.  In experiments, Spark could perform iterative algorithms up to 20x faster than Hadoop, and analytics programs up to 40x faster.  Recovery was fast, because only the failed partitions of the RDDs had to be regenerated.  Also, a 1 TB dataset could be interactively queried within 5-7 seconds.
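The RDD interface is easy to show in code.  Below is a short PySpark sketch (assuming a local Spark installation; "events.log" is a placeholder input path) that builds a lineage of transformations, persists an intermediate dataset in memory for reuse, and mixes narrow dependencies (filter, map) with a wide one (reduceByKey).

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-sketch")

    # Base RDD defined from stable storage; path is a placeholder.
    lines = sc.textFile("events.log")

    # filter() and map() are narrow dependencies: each child partition
    # reads from at most one parent partition, so they can be pipelined.
    errors = lines.filter(lambda line: "ERROR" in line)
    errors.persist()  # keep this RDD in memory for reuse across queries

    # reduceByKey() is a wide dependency: it shuffles data, so recovering
    # a lost partition may require re-reading many parent partitions.
    counts = (errors
              .map(lambda line: (line.split()[0], 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(5))

    sc.stop()

If a node fails, Spark re-runs only the lineage needed to rebuild the lost partitions of errors or counts; nothing is replicated or logged ahead of time.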

These two distributed in-memory systems solve the problem in different ways: Piccolo focuses on mutable, fine-grained updates, while Spark handles coarse-grained transformations.  There are tradeoffs between the two, but the future will likely need both types of models.  Some applications, such as web applications, need asynchronous, fine-grained updates, while others need batch processing over lots of data.  The fine-grained approach is most similar to traditional RDBMSs, and the coarse-grained approach to OLAP systems.  Going forward, fine-grained approaches will have to incorporate more of the batch and performance techniques found in Spark, while coarse-grained approaches will try to support more asynchronous, immediate updates and processing.
