Sunday, November 6, 2011

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center and Hadoop NextGen

Mesos is a platform for resource sharing in a cluster of commodity servers.  Sharing between different frameworks, like Hadoop and MPI, is beneficial because it can increase utilization and data locality, which reduces costs and improves performance.  The common ways to share a cluster today are to statically partition it or to use VMs, but neither method achieves high utilization or efficient data sharing.  Scheduling and resource sharing could be done with a centralized global scheduler, but that gets very complicated.  Instead, Mesos uses distributed, two-level scheduling: Mesos offers resources to the frameworks, and each framework chooses which offers to accept and how to use those resources.  This lets frameworks implement their own scheduling, which many of them already do.  Pushing more control to the frameworks is beneficial because frameworks can implement more sophisticated policies, and it keeps Mesos simpler and more stable.  Mesos does not require frameworks to specify their requirements up front; it simply allows them to reject offers.  This keeps Mesos simple while still letting frameworks enforce complicated resource constraints.  Isolation between frameworks is achieved with pluggable isolation modules, including Linux containers.  To scale resource offers, frameworks can provide filters for resources, Mesos counts outstanding offers as part of a framework's allocation, and it rescinds offers that have been outstanding too long.  The Mesos master maintains only soft state, which can be reconstructed from the slaves and the framework schedulers.  The decentralized scheduling model incentivizes frameworks to use short, elastic tasks and to accept only the resources they will actually use.  Experiments with 4 different frameworks showed that Mesos scheduling achieved better utilization and faster task completion times than static partitioning.
The overhead of Mesos is about 4%, and it could scale up to 50,000 nodes with sub-second task launch overhead.
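The offer/accept loop described above can be sketched in a few lines. This is a toy model, not the real Mesos API: the `Offer`, `FrameworkScheduler`, and `Master` names and the resource fields are my own simplification of the two-level scheduling idea, where the master offers free resources and the framework decides which to accept.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    slave: str
    cpus: float
    mem_mb: int

class FrameworkScheduler:
    """Hypothetical framework scheduler: accepts only offers it can use."""
    def __init__(self, cpus_needed, mem_needed):
        self.cpus_needed = cpus_needed
        self.mem_needed = mem_needed

    def resource_offers(self, offers):
        accepted, declined = [], []
        for o in offers:
            if o.cpus >= self.cpus_needed and o.mem_mb >= self.mem_needed:
                accepted.append(o)
            else:
                declined.append(o)  # declined resources go back to the allocator
        return accepted, declined

class Master:
    """Toy master: offers all currently free resources to a framework."""
    def __init__(self, offers):
        self.free = list(offers)

    def offer_round(self, framework):
        accepted, declined = framework.resource_offers(self.free)
        self.free = declined  # accepted resources are now allocated
        return accepted

offers = [Offer("slave1", 4.0, 8192), Offer("slave2", 1.0, 1024)]
fw = FrameworkScheduler(cpus_needed=2.0, mem_needed=4096)
master = Master(offers)
launched = master.offer_round(fw)
print([o.slave for o in launched])  # → ['slave1']
```

Because the framework, not the master, holds the acceptance logic, the master never needs to understand any framework's constraints; it only needs to track which resources are free.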


The current Hadoop framework can support only about 4,000 nodes, which limits the size of the clusters that can be used.  The current architecture hits scalability limits, and it also forces a system-wide upgrade for any bug fix or feature.  This makes it very difficult to maintain a large cluster while still providing good uptime.  In next-generation Hadoop, the JobTracker's functionality is split into two separate components: the ResourceManager and the ApplicationMaster.  The ResourceManager manages only the cluster's resources, and it offers fine-grained resources, not slots like the current system.  This allows for better utilization.  The ApplicationMaster manages the scheduling and monitoring of the tasks.  Offloading the management of the task lifecycle to the applications themselves allows the ResourceManager to be much more scalable.  It also makes the actual MapReduce logic essentially part of the user's library code, which makes upgrades easy: the entire system does not have to be upgraded together.  The fine-grained resources are fungible, so utilization improves.  The resource management is similar to Mesos in that it has a two-level scheduling component: application frameworks can accept or reject resources given by the ResourceManager.
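The utilization gain from fine-grained resources over slots can be illustrated with some back-of-the-envelope arithmetic. The node and slot sizes below are assumptions for illustration, not numbers from either system: a fixed slot wastes whatever a small task doesn't use, while fine-grained containers request exactly what they need.

```python
# Hypothetical node and slot sizes, chosen only for illustration.
NODE = {"cpus": 16, "mem_mb": 65536}
SLOT = {"cpus": 2, "mem_mb": 8192}

# Slot model: the node is carved into fixed slots up front.
slots = min(NODE["cpus"] // SLOT["cpus"], NODE["mem_mb"] // SLOT["mem_mb"])

# A small task still consumes a whole slot, wasting the difference.
small_task = {"cpus": 1, "mem_mb": 1024}

# Fine-grained model: each container asks for exactly what it needs,
# so we keep packing small tasks until the node is actually full.
def fits(free, req):
    return all(free[k] >= req[k] for k in req)

free = dict(NODE)
containers = 0
while fits(free, small_task):
    for k in small_task:
        free[k] -= small_task[k]
    containers += 1

print(slots, containers)  # → 8 16
```

With fixed slots the node runs at most 8 of these small tasks; with fine-grained packing the same node fits 16 (bounded here by CPU), which is the fungibility the paragraph above refers to.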

The two-level scheduling of both Mesos and next-generation Hadoop is an efficient, scalable way to manage resources in a large cluster, so it will probably become more popular in the future.  Having a soft-state resource manager will be important for scalability and performance, because allocations can operate in memory and failures are easy to recover from.  However, I think there will have to be some sort of feedback mechanism on what individual frameworks require to run.  Also, some applications will need to express certain kinds of placement restrictions, such as "this server needs to run on the same rack as database shard 4."  These restrictions are important for both performance and reliability.  There will also have to be a way to isolate disk IO and throughput: being able to share and isolate disk IO would be useful for data-intensive applications, but it adds complexity beyond CPU and RAM resources.
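A rack-affinity restriction like the one mentioned above could, in principle, be expressed purely through offer rejection, without the resource manager understanding the constraint at all. A minimal sketch, with an assumed topology map (`RACK_OF`, `wants`, and the host names are all hypothetical):

```python
# Assumed cluster topology, for illustration only.
RACK_OF = {"host-a": "rack1", "host-b": "rack2", "host-c": "rack1"}

def wants(offer_host, target_rack="rack1"):
    """Constraint: only accept resources on the rack holding, say, a DB shard."""
    return RACK_OF.get(offer_host) == target_rack

accepted = [h for h in ["host-a", "host-b", "host-c"] if wants(h)]
print(accepted)  # → ['host-a', 'host-c']
```

The downside, which motivates the feedback mechanism suggested above, is that a framework rejecting most offers gives the allocator no hint of what it is actually waiting for.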
