Wednesday, September 7, 2011

Warehouse-Scale Computing: Entering the Teenage Decade

Warehouse-scale computing (WSC) seems like a recent phenomenon, but it is not exactly revolutionary.  In 1993, David Patterson gave a talk on datacenter computing in which the building was the computer.  What distinguishes modern warehouse-scale computing is that the scale is much larger and it runs many more internet services and applications.  A traditional datacenter may hold petabytes of storage in total, but a WSC goes on alert when only petabytes of storage remain free.  Bigger data and bigger problems now require larger-scale computing.

WSC is now entering its teenage years.  About ten years ago, Google released a paper describing its software and hardware, and the paper was only seven pages long.  As the internet grew in popularity, the need to farm out work and services to compute clusters became more apparent and gave rise to more innovations in WSC.  New internet applications were the driving force behind creating and managing larger and larger clusters, in order to provide the interactivity those applications required.  WSC is now past its infancy, and as it becomes more sophisticated, several challenges remain.

Warehouse-scale computing is usually more cost effective with a larger number of weaker CPUs, since aggregate CPU throughput matters more than single-thread performance.  Recently, however, faster CPUs have made a comeback, because the parallelism that is easy to implement and reason about is the embarrassingly parallel kind.  Google mostly relies on request-level parallelism, and with weaker CPUs each individual request takes longer to complete.  To speed requests back up, each one would have to be parallelized internally, which requires considerably more engineering effort, as the sketch below illustrates.
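
A minimal sketch of the two styles of parallelism, assuming hypothetical serve_one, query_shard, and merge callables (this is an illustration, not Google's actual code):

    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=16)

    # Request-level parallelism: each request runs serially on one core;
    # throughput comes from serving many independent requests at once.
    def handle_requests(requests, serve_one):
        return list(pool.map(serve_one, requests))

    # Intra-request parallelism: one request is split across shards and the
    # partial results merged, which costs more engineering effort per request.
    def handle_one_request(request, shards, query_shard, merge):
        partials = pool.map(lambda shard: query_shard(shard, request), shards)
        return merge(list(partials))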

WSCs also have more storage options now.  Flash drives are becoming prevalent and provide far better random-read performance.  However, write performance is sometimes worse than disk, and the latency tail is very long.  This causes problems at Google's scale, because a request that fans out to many devices is delayed whenever any one of them lands in that tail.
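
A quick back-of-the-envelope calculation shows why the tail matters at scale: if each device is slow only 1% of the time, a request that fans out to enough devices is slow most of the time.  The figures below are illustrative, not measurements.

    # If each device answers within its 99th-percentile latency 99% of the
    # time, a request that fans out to n devices is slow whenever any one is.
    def p_request_hits_tail(n_devices, p_fast=0.99):
        return 1.0 - p_fast ** n_devices

    print(p_request_hits_tail(1))    # ~0.01: one device, rarely slow
    print(p_request_hits_tail(100))  # ~0.63: fan out to 100, usually slow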

WSC power efficiency is an important metric of interest.  There have been large improvements in the PUE (power usage effectiveness) of datacenters, and Google has reached a PUE of around 1.09 to 1.16.  Machines have also gotten better in terms of power efficiency relative to peak output.  However, it is difficult for an entire building to use close to the full amount of available power, since there will always be fluctuations, and those could cause an overload.  One solution is for each machine or rack to have a UPS, which has many advantages: it can help survive power outages, let machines ride out power load peaks, and allow the reliable use of fickle renewable energy sources.
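
For reference, PUE is simply total facility power divided by the power delivered to the IT equipment, so a PUE of 1.0 would mean zero overhead.  A tiny sketch with made-up numbers:

    def pue(total_facility_kw, it_equipment_kw):
        # PUE = total facility power / IT equipment power; 1.0 is ideal.
        return total_facility_kw / it_equipment_kw

    # Made-up numbers: 1120 kW drawn from the grid to deliver 1000 kW of IT
    # load gives a PUE of 1.12, within the range cited above for Google.
    print(pue(1120.0, 1000.0))  # 1.12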

Networking is becoming a bottleneck in these systems.  Storage is getting faster, but network connectivity cannot keep up well enough to disaggregate resources across the datacenter.  Disks are slow enough that today's networks can keep pace, but with the emergence of flash technology, networking needs to improve.
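
A rough order-of-magnitude comparison makes the point; the latency figures below are assumptions for illustration, not measurements:

    # Assumed order-of-magnitude latencies in microseconds (illustrative only).
    DISK_SEEK_US = 10000   # ~10 ms for a random disk access
    FLASH_READ_US = 100    # ~100 us for a flash random read
    NETWORK_RTT_US = 300   # ~hundreds of us for an in-building round trip

    # A remote disk read is dominated by the disk itself, but a remote flash
    # read is dominated by the network, so flash exposes it as the bottleneck.
    print("remote disk read:  %d us" % (DISK_SEEK_US + NETWORK_RTT_US))
    print("remote flash read: %d us" % (FLASH_READ_US + NETWORK_RTT_US))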

Much more research needs to be conducted to reduce latencies.  I/O is actually getting faster, but the software stack is still largely the same as in the past.  This would be fine if CPUs kept getting faster, but clock speeds are leveling off.  At Google, even a smooth query stream still creates a very bursty request pattern at the leaf nodes.  This is because most software and research has concentrated on throughput (batching, Nagle's algorithm) instead of latency.  There are therefore plenty of research opportunities in minimizing latency.
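
As one concrete example of a throughput-over-latency default, Nagle's algorithm delays small TCP writes so they can be batched; latency-sensitive services typically turn it off.  A minimal sketch in Python (the endpoint is hypothetical):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm so small writes go out immediately instead
    # of being held back for batching.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.connect(("leaf-node.example.com", 9000))  # hypothetical endpoint
    sock.sendall(b"small latency-critical request")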
