Tuesday, September 6, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (chapters 3, 4, 7)

chapter 3

Clusters of low-end servers are preferred over clusters of high-end servers for warehouse-scale computing because of economies of scale.  Low-end servers are built from high-volume personal computing hardware, which drives prices down.  However, in any large cluster with many cores (whether built from low-end or high-end machines), communication costs greatly affect performance.  For workloads with heavy communication, a single large SMP machine may outperform a cluster by 10x.  If a problem is too large even for a single SMP machine, a cluster can be built from high-end machines instead.  Experiments show that as the cluster grows, the performance advantage of the high-end cluster diminishes, until a cluster of low-end machines performs within 5% of a cluster of high-end machines.  This shows that the price premium for large high-end machines is not worth it for large clusters.

Some argue that this model can be pushed further by using even more, slower, single-core CPUs to cut hardware costs.  However, there is a point of diminishing returns.  The hardware may be cheaper, but finishing a task in the same amount of time as with multi-core CPUs takes more parallel tasks, more network coordination, and more optimization.  If there is global state, smaller/slower cores will require much more communication, or the use of more conservative (and less efficient) heuristics.

chapter 4

Datacenters are very large systems which consume a lot of power and generate a lot of heat.  Most of the cost of a datacenter goes to power distribution and cooling.  Datacenters are classified into tiers according to the reliability and redundancy of their power and cooling paths.

Utility power enters the datacenter and feeds into the uninterruptible power supply (UPS).  The UPS also has an input from generator power.  The UPS contains a transfer switch that detects a utility power failure and switches its input over to the generator.  Between the utility failure and the generator coming online, the UPS uses its battery or flywheel to provide power.

CRAC (computer room air conditioning) units take in the hot air from the server room, cool it, and push the cold air out through the raised floor.  The cold air comes up through the floor in front of the racks and flows through the servers, the hot air is expelled, and the cycle continues.  The cooling units themselves use chilled water to cool the air.  Some datacenters use free cooling, which cools the water more efficiently than a chiller does.  Cooling towers can cool the warm water by evaporation, a method that works well in dry climates.

The airflow through each server and rack determines how much air the floor must push out, and that air is supplied by the fans of the CRAC units.  More servers mean more airflow must be supported, and eventually there may be limits where it becomes impractical to increase the airflow any further.  In-rack cooling brings chilled water to each rack, so that the warm air is cooled at the rack itself.  Container-based datacenters usually use methods similar to CRAC cooling but in more modular sizes.

chapter 7

In large clusters, failures are much more probable than on a single machine, so it is rare for every machine in the cluster to be functional at the same time.  This means software has to be fault tolerant.  When the software is fault tolerant, the hardware does not have to be as reliable, and there are also more options for dealing with hardware upgrades.  Usually, the minimum requirement for the hardware is to be able to detect failures.  Surveys of internet services and Google services have shown that hardware failures cause relatively few outages compared to configuration problems and software bugs.
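To make the first point concrete, here is a rough sketch of how quickly the chance of "every machine is up" shrinks as a cluster grows.  The 99.9% per-machine availability and the assumption that machines fail independently are mine, chosen only for illustration; neither number comes from the book.

```python
def prob_all_up(n_machines: int, availability: float) -> float:
    """Probability that all machines are up at the same moment.

    Assumes each machine is up with the same probability and that
    failures are independent (an illustrative simplification).
    """
    return availability ** n_machines


if __name__ == "__main__":
    ASSUMED_AVAILABILITY = 0.999  # hypothetical per-machine availability
    for n in (1, 10, 100, 1000, 10000):
        p = prob_all_up(n, ASSUMED_AVAILABILITY)
        print(f"{n:>6} machines: P(all up) = {p:.4f}")
```

Under these assumptions, a 1,000-machine cluster has all machines up only about 37% of the time, which is why the software cannot count on a fully healthy cluster.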

According to studies of machine reboots at Google, a service that uses 2,000 servers would see a reboot about every 2.5 hours, or about 10 reboots a day.  Since most reboots complete within 2 minutes, about 20 additional spare machines will have to be provisioned to safely keep the service available.  There is significant indirect evidence that software-induced machine crashes are more common than those induced by hardware.  Among hardware causes, the most common are DRAM errors and disk errors.
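The reboot arithmetic can be checked with a few lines.  This sketch uses only the two numbers quoted above (2,000 servers, one reboot every 2.5 hours cluster-wide); the function names are mine, and the per-machine figure assumes reboots are spread evenly across machines.

```python
def reboots_per_day(hours_between_reboots: float) -> float:
    """Cluster-wide reboots per day, given the gap between reboots."""
    return 24.0 / hours_between_reboots


def per_machine_interval_days(servers: int, hours_between_reboots: float) -> float:
    """Average days between reboots for one machine.

    Assumes reboots are spread evenly across all servers.
    """
    return servers * hours_between_reboots / 24.0


if __name__ == "__main__":
    servers = 2000
    gap_hours = 2.5
    print(f"reboots per day: {reboots_per_day(gap_hours):.1f}")
    print(f"days between reboots per machine: "
          f"{per_machine_interval_days(servers, gap_hours):.0f}")
```

This gives 9.6 reboots a day, matching the "about 10" figure, and implies that any individual machine reboots roughly once every 208 days.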

An efficient repair process is very important for clusters at warehouse scale.  However, some features of WSCs can help the repair system.  Since there is usually a lot of spare capacity in the cluster, a technician can repair machines more efficiently in batches on a schedule, instead of repairing every failure immediately.  Also, with so many failures, the failure data can be collected and analyzed.  Google uses machine learning to determine the best course of action for a repair and to detect faulty hardware or firmware.

