Tuesday, September 13, 2011

The Datacenter Needs an Operating System

Clusters of commodity servers are becoming very popular, so managing their resources effectively matters more and more.  The datacenter is becoming the new computer, and it needs an "operating system" to provide the abstractions an OS normally would.  The main components for managing the datacenter are resource sharing, data sharing, programming abstractions, and debugging facilities.

Currently, resource sharing is available only at a coarse granularity, which is simple but limits sharing between different types of applications.  Mesos is a step in the right direction, since it tries to provide fine-grained resource allocation.  Data sharing is most commonly done through distributed filesystems, which may limit performance.  RDDs provide read-only partitions of data with known transformations to improve performance.  There are lots of programming frameworks, like MapReduce and Pregel, to help application developers.  However, there is a need for more specialized APIs for systems programming on a large cluster.  Finally, there are currently not many good tools for debugging large distributed programs.
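To make the "read-only partitions with known transformations" idea a bit more concrete, here is a minimal sketch in Python.  This is not the actual RDD API; the Partition class and its map, filter, and compute methods are hypothetical names made up for illustration.  The point is only that each partition is immutable and remembers how it was derived from its parent, so its contents can be recomputed on demand instead of being written back to a distributed filesystem.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Partition:
    """Hypothetical read-only partition that remembers how it was derived."""
    parent: Optional["Partition"] = None
    transform: Optional[Callable[[List[int]], List[int]]] = None
    source: Optional[Tuple[int, ...]] = None  # base data, set only on the root

    def compute(self) -> List[int]:
        # A root partition reads its source; a derived one replays its
        # recorded transformation on top of its parent's data.
        if self.parent is None:
            return list(self.source or ())
        return self.transform(self.parent.compute())

    def map(self, fn: Callable[[int], int]) -> "Partition":
        # Returns a new partition; the existing one is never modified.
        return Partition(parent=self, transform=lambda xs: [fn(x) for x in xs])

    def filter(self, pred: Callable[[int], bool]) -> "Partition":
        return Partition(parent=self, transform=lambda xs: [x for x in xs if pred(x)])


base = Partition(source=(1, 2, 3, 4))
derived = base.map(lambda x: x * 10).filter(lambda x: x > 15)
print(derived.compute())
```

Because derived only stores the chain of transformations, calling compute() again rebuilds the same data from base, which is roughly how a lost partition could be recovered without keeping a replicated copy.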

The datacenter will absolutely need an "operating system" or some other form of resource management.  Clusters may have thousands of machines, so there are many more options to choose from when making resource allocation and sharing decisions.  However, I think simple methods for resource sharing can work for most applications.  Clusters are usually over-provisioned, so utilization is usually far from 100%.  Since there are usually plenty of resources available, the scheduler only needs to avoid the worst options; optimal sharing is not required.  Also, adding resources is simple: add machines, replace slower servers with faster ones, or add more datacenters.  Sharing will not be as much of an issue, but colocation will be.  Great performance benefits can be gained by colocating inter-dependent programs, because communication across the network has much higher latency than local communication.
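As a rough illustration of what I mean by a simple method, here is a sketch of a placement function in Python.  It is not taken from Mesos or any real scheduler; the place function, its parameters, and the host names are all hypothetical.  It does only two things: it prefers the machine that already runs a dependent program when that machine has room, and otherwise it takes the first machine that fits without searching for the best one.

```python
from typing import Dict, Optional


def place(task_cpu: float,
          free_cpu: Dict[str, float],
          peer_host: Optional[str] = None) -> Optional[str]:
    """Pick a host for a task (hypothetical helper, for illustration only)."""
    # Colocation first: reuse the peer's machine if it has enough free CPU.
    if peer_host is not None and free_cpu.get(peer_host, 0.0) >= task_cpu:
        return peer_host
    # Otherwise accept the first machine that fits. This only avoids the
    # worst choice (an overloaded machine); it does not look for the optimum.
    for host, free in free_cpu.items():
        if free >= task_cpu:
            return host
    return None  # nothing fits; the caller could queue the task or add machines


cluster = {"a": 1.0, "b": 4.0, "c": 8.0}
print(place(2.0, cluster))
print(place(2.0, cluster, peer_host="c"))
```

With an over-provisioned cluster the first-fit loop almost always finds a machine quickly, so the only decision that really affects performance here is the colocation check.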

Debugging and monitoring will be very important.  Debugging is already hard for single-machine programs, and debugging across many machines is far more difficult.  Good monitoring and alerting tools must be developed to reduce manual management of the cluster.  Clusters will keep getting bigger and will run an increasing number of applications and frameworks.  Good monitoring tools will allow fewer people to manage a growing number of machines.
