Tuesday, September 27, 2011

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad was developed at Microsoft to make distributed computation easier to write.  Large scale services need large scale computation, and Dryad provides a simple programming model along with a runtime system that delivers reliability, efficiency, and scalability.  Dryad forces the application developer to think about data parallelism, but gives them flexibility in designing the computation: developers can define an arbitrary DAG as well as the communication mechanisms for its edges.  Because Dryad allows arbitrary DAGs, communication mechanisms, and input sizes and types, simpler models can be implemented on top of it.

Dryad tasks are coordinated by a job manager, which talks to the cluster nodes.  The job manager holds the DAG of the computation and sends vertices to the appropriate nodes.  Vertices communicate with each other over the data plane using the developer-specified communication mechanism.  A Dryad DAG is defined through a C++ library: a new vertex is created by a static "factory", and edges are created through simple composition rules and graph merges.  Graphs distinguish between input and output nodes to ensure the result remains acyclic.  Communication mechanisms include files, shared memory, and TCP.  Encapsulation groups several vertices into one process so that they run on the same node.  When all the inputs to a vertex are complete, the vertex is added to the greedy scheduler's queue.  As in Google's MapReduce, tasks are re-executed on failure, and some are duplicated to deal with stragglers.  Runtime local aggregation and partial aggregation optimizations reduce network traffic and improve performance.  Experiments show that for a simple SQL query, Dryad outperformed SQLServer even on one machine.  The Dryad versions showed nearly linear speed-up as the cluster size increased, with the in-memory version about twice as fast as the NTFS version.  For data mining workloads, the communication topology had to be optimized to improve scalability.
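
To make the composition style concrete, here is a minimal Python sketch of how a DAG might be built up from replicated vertices and edge-composition rules.  The names (Graph, replicate, pointwise, crossproduct) are invented for illustration; the real Dryad API is a C++ library that expresses these compositions with operator overloading.

```python
# Hypothetical sketch of Dryad-style graph composition; all names are made up.
class Graph:
    def __init__(self, vertices, inputs, outputs, edges=None):
        self.vertices = vertices        # list of vertex names
        self.inputs = inputs            # vertices that may receive new in-edges
        self.outputs = outputs          # vertices that may receive new out-edges
        self.edges = edges or []        # (src, dst) pairs

def replicate(vertex_name, n):
    """A 'factory' that clones one vertex program n ways."""
    names = [f"{vertex_name}:{i}" for i in range(n)]
    return Graph(names, inputs=names, outputs=names)

def pointwise(a, b):
    """Connect a's outputs to b's inputs one-to-one (round-robin if uneven)."""
    edges = [(src, b.inputs[i % len(b.inputs)]) for i, src in enumerate(a.outputs)]
    return Graph(a.vertices + b.vertices, a.inputs, b.outputs,
                 a.edges + b.edges + edges)

def crossproduct(a, b):
    """Connect every output of a to every input of b (complete bipartite)."""
    edges = [(src, dst) for src in a.outputs for dst in b.inputs]
    return Graph(a.vertices + b.vertices, a.inputs, b.outputs,
                 a.edges + b.edges + edges)

# e.g. 4 readers feeding 4 sorters pointwise, then a full shuffle into 2 mergers
job = crossproduct(pointwise(replicate("read", 4), replicate("sort", 4)),
                   replicate("merge", 2))
print(len(job.edges))  # 4 + 8 = 12 edges
```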

Dryad provides more flexibility than Google's MapReduce, but statically defining a communication topology is difficult and burdensome.  It is nice to be able to define a DAG of operations instead of a single map and reduce phase, but a more declarative interface will be necessary as computations get more complicated.  If a more declarative approach is available, there will be room for more interesting optimizations, similar to query optimization in databases.  Future distributed programming models should make it easy to define the UDFs while offering a simpler way to define the DAG of data flow.  The runtime system will take care of resource allocation in the cluster and optimize the DAG with respect to the cluster state and configuration.

MapReduce: Simplified Data Processing on Large Clusters

Various groups within Google have implemented hundreds of special purpose computations that process large amounts of data.  These are usually parallelized on a large cluster of machines, but running a distributed algorithm on many servers raises many issues, such as how to parallelize the computation and how to deal with failures.  MapReduce was developed by Google to solve these problems and hide the complexity from users.  MapReduce is a programming model and system implementation for processing large amounts of data.  Users just need to define a map() function and a reduce() function, and the system efficiently takes care of the execution on a large cluster of commodity machines.  The runtime system handles the partitioning, scheduling, and failures so that users can easily utilize large clusters.  The MapReduce abstraction is inspired by the Lisp primitives map and reduce, and the functional model allows effective parallelization and re-execution for fault tolerance.  Most existing computations already had a map phase, where data was transformed into intermediate key/value pairs, and a reduce phase, where a final operation was applied to all the values with the same key.

The MapReduce programming model is very simple to use.  Only a map() function and a reduce() function need to be defined, and the runtime system handles the rest.  MapReduce passes strings to and from the user defined functions, so the user must convert them to the appropriate types.  Many examples fit this model well: distributed grep, counts of URL access frequency, reverse web-link graphs, and distributed sort.  Execution starts from the MapReduce library in the user code.  The input files are split into M pieces for the M map tasks.  A master is also forked, and it assigns work to mappers and reducers.  A mapper reads key/value pairs from its input split and writes new intermediate key/value pairs to local disk.  These local locations are passed to the master, which then notifies the reducers.  When the R reducers have read all the data, they sort it by key, pass each key and its values to the UDF, and produce the final R output files.  To tolerate failures, the master periodically pings the workers to determine whether they are still running; on failure, the work is simply restarted on a different machine.  Most UDFs are deterministic, so the programs remain correct in a distributed environment.  Locality of the GFS data is considered when assigning map tasks, to conserve network bandwidth.  Some tasks can be slow to finish (stragglers), so when the job is near completion, the master launches backup copies of the remaining in-progress tasks and takes whichever finishes first.  Another optimization is the combiner function, which runs on the map worker to pre-aggregate data before the reducers fetch it, reducing the amount of data transferred over the network.  In experiments on a cluster of about 1800 machines, grep and sort workloads performed well.  Backup tasks for stragglers reduced the run time of the sort jobs, and killing 200 workers mid-run caused their tasks to be automatically restarted, with the job completing with just a 5% increase in time.  The MapReduce framework is widely used inside Google today, and it makes writing, running, and maintaining distributed computations much easier.
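
The map() and reduce() UDFs below follow the paper's canonical word count example; the surrounding driver is only a toy, single-process stand-in for the real partitioning and shuffle machinery, included to show how intermediate key/value pairs flow from the map tasks into R reduce partitions.

```python
# Word count: map() and reduce() mirror the paper's example UDFs; the
# "runtime" below is a toy in-process stand-in for the real system.
from collections import defaultdict

def map_fn(doc_name, contents):
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn, R=2):
    # Map phase: each input split produces intermediate key/value pairs,
    # partitioned into R buckets by hash(key) % R.
    partitions = [defaultdict(list) for _ in range(R)]
    for name, contents in inputs:
        for k, v in map_fn(name, contents):
            partitions[hash(k) % R][k].append(v)
    # Reduce phase: each of the R reducers sorts its keys and applies the UDF.
    output = []
    for part in partitions:
        for k in sorted(part):
            output.extend(reduce_fn(k, part[k]))
    return output

print(run_mapreduce([("d1", "the quick fox"), ("d2", "the lazy dog")],
                    map_fn, reduce_fn))
```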

The MapReduce model is very popular today, and there are even open source versions like Hadoop.  I think the future will not be as dependent on MapReduce, but new systems and programming models will certainly retain a lot of its runtime features.  The important runtime features are still fault tolerance, locality, and efficient shuffling.  I think MapReduce will still be used for certain applications, but new programming models will emerge to handle different types of parallel computation.  A key point for parallel computation on large data is that the data should not be moved much; the computation should go to the data.  However, networking is also improving, which would mitigate the transfer costs.  I think there will be an interesting tradeoff between co-locating computation with the data, and incurring transfer costs but amortizing them over several uses.

Sunday, September 25, 2011

The Google File System

GFS is a large, reliable, distributed file system running on many commodity servers and commodity disks.  Google developed GFS to provide high aggregate throughput for data intensive applications.  GFS also addresses the issues of running a file system on commodity servers, which often fail.  The developers examined application workloads and technology trends, and co-designed the applications and the file system in order to achieve good performance.  GFS is built around changed assumptions about data access: frequent component failures, huge files, and mostly append rather than overwrite workloads.

GFS runs on many commodity servers, divided into a master and chunkservers.  The master maintains the file system metadata and manages chunks, while chunkservers store the chunk data.  Clients must first contact the master to learn which chunkservers hold a specific chunk, and this metadata is cached on the client for some time.  No chunk data is cached by the client, since cache coherence is difficult in a distributed environment and applications mostly scan through the data once.  Chunks are 64MB and are lazily allocated to reduce internal fragmentation on the chunkservers.  Metadata is stored in memory to allow fast access; in practice, it is only on the order of 100MB in size.  Data flow and control flow are separated to make full use of bandwidth when transferring data: data is pipelined through a chain of chunkservers, while mutations are ordered by a primary replica.  Lazy garbage collection cleans up deleted and orphaned chunks.  GFS provides high availability through fast recovery and replication of data.  In practice, most writes were appends rather than overwrites, as expected.  Write throughput was lower than read throughput because writes must go to three replicas, which increases the chance of contention.  GFS works well in practice because the designers considered the new assumptions about data access patterns and developed the applications and the file system together.
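
A rough sketch of the read path described above, with stand-in Master and ChunkServer classes whose method names are assumptions rather than the real GFS RPC interface: the client converts a byte offset into a chunk index, asks the master once for the chunk handle and replica locations, caches that metadata, and then reads the data directly from a chunkserver.

```python
# Hypothetical sketch of the GFS client read path; not the real API.
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, lazily allocated on chunkservers

class Master:
    """Holds metadata only: file name -> list of (chunk handle, replica ids)."""
    def __init__(self, file_table):
        self.file_table = file_table
    def lookup(self, path, chunk_index):
        return self.file_table[path][chunk_index]

class ChunkServer:
    """Stores chunk data keyed by chunk handle."""
    def __init__(self, chunks):
        self.chunks = chunks
    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers
        self.location_cache = {}            # (path, chunk index) -> (handle, replicas)
    def read(self, path, offset, length):
        chunk_index = offset // CHUNK_SIZE
        key = (path, chunk_index)
        if key not in self.location_cache:  # one metadata RPC, then cached
            self.location_cache[key] = self.master.lookup(path, chunk_index)
        handle, replicas = self.location_cache[key]
        server = self.chunkservers[replicas[0]]   # any replica serves reads
        return server.read_chunk(handle, offset % CHUNK_SIZE, length)

master = Master({"/logs/web": [("h0", ["cs1", "cs2", "cs3"])]})
servers = {"cs1": ChunkServer({"h0": b"GET /index.html ..."})}
client = Client(master, servers)
print(client.read("/logs/web", 0, 3))       # b'GET'
```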

The future holds new and changing assumptions which will drive different design decisions for a large distributed file system.  More memory is becoming commonplace in commodity servers, so a future file system should take advantage of it.  In addition, SSDs are becoming popular and may replace disks in servers.  GFS optimizes for aggregate throughput and append workloads, but with these new hardware trends, lower latencies can be achieved.  Lower latencies will be important for interactive workloads, so I think it will be interesting to see whether the new hardware will enable interactive and overwrite workloads such as transactional database workloads.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services


Megastore is a system that provides scalability, availability, and consistency over the wide area network.  Many systems focus on scalability and performance, but consistent semantics are cleaner and simpler to reason about.  Megastore also addresses the risk of data loss by replicating data across geographically diverse regions.  Unlike most uses of Paxos, Megastore uses it for synchronous replication of a log.  Megastore provides fully serializable ACID semantics within fine-grained partitions of the data.

Megastore blends the scalability of a NoSQL system with the semantic convenience of a traditional RDBMS.  It provides availability through synchronous, fault tolerant log replication, and scalability by partitioning data into entity groups.  Paxos is used to synchronously replicate the write-ahead log to all the replicas of each entity group, which provides ACID semantics within the group; across groups, 2PC is used to coordinate transactions.  At Google, there is usually a natural way to partition the data.  Several Paxos optimizations improve write transactions.  A leader is chosen to allow single round trip writes without requiring a dedicated master: the first writer to reach the leader commits in one round trip, while other writers must use two phases.  Witness and read-only replicas further improve write and read performance.  The commit point in Megastore is the durability point, with visibility coming afterwards.  Most of the write scalability comes from partitioning the data and reducing write conflicts.  At Google, most applications see five 9's of read and write availability.  Reads usually took tens of milliseconds, while writes took around 100-400ms because of the WAN communication.  As expected, most application reads were serviced by the local replica.
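
The following toy sketch (with invented names, and with Paxos replication and cross-group 2PC deliberately elided) illustrates why entity-group partitioning limits write contention: each group has its own write-ahead log, a transaction commits by claiming the group's next log position, and writers to different groups never conflict.

```python
# Toy illustration of per-entity-group write-ahead logs; names are illustrative.
class EntityGroup:
    def __init__(self, name):
        self.name = name
        self.log = []                   # would be synchronously replicated via Paxos

    def commit(self, expected_position, mutations):
        # Optimistic concurrency at the log level: fail if another writer
        # already appended at the position this transaction read.
        if len(self.log) != expected_position:
            return False                # caller retries after re-reading
        self.log.append(mutations)
        return True

groups = {"user:alice": EntityGroup("user:alice"),
          "user:bob": EntityGroup("user:bob")}

g = groups["user:alice"]
pos = len(g.log)                         # read the group's current log position
ok = g.commit(pos, [("photos/1", "caption", "hi")])
print(ok, len(groups["user:bob"].log))   # True 0 -- bob's group is untouched
```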

I think future data stores will be increasingly geographically distributed, because regional diversity is crucial for fault tolerance and availability.  Users are also more geographically distributed, which necessitates distributed data stores.  Eventually consistent data stores have become popular because of the scalability they afford, but I think higher consistency models will be required and developed in the future.  Higher consistency is easier to reason about and develop against, and future systems will provide it.  Megastore shows that a consistent system can be built with decent performance, and the future will bring more multi-datacenter systems with stronger consistency models.  Newer consistency models will also bring new programming models to handle the nuances of different consistency and failure modes.

Tuesday, September 20, 2011

The Chubby lock service for loosely-coupled distributed systems

The Chubby lock service is a system built by Google to provide coarse grained locking and storage for small amounts of data.  The primary goals are availability and reliability, not throughput or latency.  Because of these performance characteristics, it is used only for coarse-grained locking, not fine-grained locking, and Google's experience shows this is not a problem in practice.  Most usage of Chubby has been for naming and for metadata/configuration storage.  Before Chubby, Google services used ad hoc methods which could duplicate work or require human intervention.  With Chubby, human intervention is no longer needed, which eases the maintenance of large distributed systems.  At its core, Chubby uses Paxos for distributed consensus.

A Chubby cell usually consists of about five machines running Paxos.  Reads are serviced directly by the master, while writes are propagated to all machines through the Paxos protocol.  A simple filesystem-like interface makes the data easily browsable.  Any directory or file can carry an advisory lock, and clients can hold locks for long periods of time.  Timeouts ensure the system makes progress when clients fail.  To improve performance, clients can cache data, but they must stay connected to Chubby so that the master can invalidate their caches when necessary.  Writes block until cache invalidations complete, so the data stays consistent.  Caching greatly reduces the amount of communication between clients and the master, which improves scalability; the main factor in scaling Chubby is reducing communication, not server performance.
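
A sketch of the canonical Chubby use case, primary election via an advisory lock on a well-known file: whichever replica acquires the lock writes its identity into the file, and everyone else reads the file to find the primary.  The FakeChubbyFile class below is an in-process stand-in, not the real client library, and sessions, leases, and cache invalidation are omitted.

```python
# Hypothetical stand-in for a Chubby file handle; the real client is an RPC library.
class FakeChubbyFile:
    def __init__(self):
        self.holder = None
        self.contents = b""
    def try_acquire(self, who):          # advisory, coarse-grained lock
        if self.holder is None:
            self.holder = who
            return True
        return False
    def set_contents(self, data):
        self.contents = data

def elect_primary(lock_file, my_address):
    """Become primary iff we win the lock; losers read the file to find the primary."""
    if lock_file.try_acquire(my_address):
        lock_file.set_contents(my_address.encode())
        return True
    return False

leader_file = FakeChubbyFile()            # e.g. a file like "/ls/cell/myservice/leader"
print(elect_primary(leader_file, "10.0.0.1:8080"))   # True  -> this replica is primary
print(elect_primary(leader_file, "10.0.0.2:8080"))   # False -> follows the primary
print(leader_file.contents)                           # b'10.0.0.1:8080'
```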

The Chubby model may become more popular for distributed systems.  Instead of ad hoc algorithms for metadata or configuration, Google created a centralized service that provides those features.  Even in massively distributed systems, some things are best done in a centralized fashion, or there is global data that must be shared by all nodes, so there will always be a need for some centralized service.  A centralized service could be a single point of failure, but that is not the case for Chubby: it uses Paxos to agree on values among several machines, which tolerates some failures and partitions and lets the centralized service provide greater reliability and availability.  In the future, there will probably have to be more consideration of network latencies and their non-uniformity.  I think distributed systems will become more global, not confined to a single datacenter, so a centralized service will need to handle varying latencies and an increased probability of partitions.  Chubby's algorithms will have to change to account for longer latencies and cross-world network traffic.

Paxos Made Simple and Paxos Made Practical

Paxos is a distributed consensus algorithm formalized by Leslie Lamport.  The consensus problem is simple: a group of processes must agree on a single proposed value, which can be learned after agreement.  For safety, only proposed values can be chosen, only a single value can be chosen, and a learner cannot learn a value unless it was actually chosen.  The basic Paxos algorithm assumes the set of processes does not change during execution.  The protocol can be described as a two phase protocol.  It differs from two phase commit in that 2PC requires all processes to respond, whereas Paxos is based on majorities, which improves availability and response times.  In the first phase, proposers send PREPARE(n) messages to acceptors, and acceptors return promises not to accept anything numbered less than n; acceptors also return the value of the highest-numbered proposal they have accepted with number less than n.  If the proposer gets promises from a majority of acceptors, it sends ACCEPT(n, v) messages, where v is the value of the highest-numbered accepted proposal reported by the acceptors, or a new value if no proposal has been accepted yet.  This protocol ensures the safety of consensus, and with a distinguished leader it can also ensure liveness.  Paxos can implement the replicated state machine model efficiently, because the leader can run phase 1 for an indefinite number of instances at once and then needs only phase 2 messages to commit values.
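
To make the two phases concrete, here is a minimal single-decree sketch in Python.  Messages are direct method calls and no failures are simulated, so this only illustrates the proposer/acceptor logic, not a deployable implementation.

```python
# Minimal single-decree Paxos sketch: one Acceptor class, one propose() round.
class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest PREPARE number promised
        self.accepted_n = -1       # highest proposal number accepted
        self.accepted_v = None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            # Promise not to accept anything numbered below n, and report
            # the highest-numbered proposal accepted so far (if any).
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, v):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return True
        return False

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1
    # Phase 1: gather promises from a majority.
    promises = [p for p in (a.prepare(n) for a in acceptors) if p[0]]
    if len(promises) < majority:
        return None                              # retry with a higher n
    # Use the value of the highest-numbered accepted proposal, else our own.
    prior = max(promises, key=lambda p: p[1])
    v = prior[2] if prior[1] >= 0 else value
    # Phase 2: the value is chosen once a majority accepts it.
    acks = sum(a.accept(n, v) for a in acceptors)
    return v if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="A"))        # 'A' is chosen
print(propose(acceptors, n=2, value="B"))        # still 'A'
```

Note how the second proposer, even though it arrives with a higher number and a different value, is forced to re-propose the already-accepted value, which is exactly the safety property described above.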

"Paxos made practical" does not assume the acceptors are fixed and that they can fail or more can be added.  The set of cohorts in the group is maintained by paxos as well.  The view is the configuration of the cohorts, and a new view is agreed upon by using a paxos instance for a new view.  Using this protocol, new cohort configurations can be reliably modified to adapt to the changing environment.

Paxos is the standard distributed consensus algorithm for distributed systems today.  It is used mostly for small amounts of metadata or configuration information, but lately more systems have been using Paxos in datastores to achieve consistency across nodes.  I think in the future Paxos will be used more often to achieve consistency across nodes, as well as to replicate data across nodes.  Paxos will be used to achieve different semantics along the consistency spectrum.  Since Paxos is a quorum based protocol, unlike 2PC, it can provide better availability.  By using Paxos or some variation, systems can reach different points in the CAP tradeoff space, and these different semantics will be useful for different system requirements.  There may even be asymmetric protocols that take into account the different latencies between nodes in order to improve performance or provide certain guarantees.

Monday, September 19, 2011

Cluster-Based Scalable Network Services

The popularity of internet services has been growing because they eliminate software distribution and make customer support easier.  There are several reasons to build network services, but there are also several challenges in deploying them, including scalability, availability, and cost effectiveness.  The authors propose using a cluster of commodity machines with a high speed interconnect, organized in a three layered architecture.

A key tradeoff in the design of these systems comes from the observation that many internet services do not need the full consistency or durability of ACID semantics.  BASE (basically available, soft state, eventually consistent) semantics are usually good enough, and this allows for easier deployment and better performance.  With the relaxation of strict consistency, fewer messages are required during partial failures, which allows for better availability and performance.  BASE semantics also simplify cluster maintenance because they allow stateless worker nodes, which provide easier fault tolerance and better scalability.

The layered approach provides a clean separation of tasks and modularity, which has many benefits.  By splitting workers from the front end, the workers can be stateless and scale easily, and overflow workers can be allocated to handle bursty traffic efficiently.  Also, in the TACC layer, modular components can be composed together to implement a wide range of services, simplifying implementation.  TranSend and HotBot are two real world internet services that implement the layered design on a cluster of commodity machines with success.  Experiments on TranSend show that the BASE and layered design decisions achieve linear scalability along with other benefits such as handling bursty requests and load balancing.
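
A tiny sketch of the composition idea, with made-up worker functions standing in for TACC modules: each worker is a stateless transformation, and the front end assembles a service by chaining them per request.

```python
# Sketch of composing stateless workers into a service pipeline; the worker
# functions are invented placeholders, not TACC's actual modules.
from functools import reduce

def compose(*workers):
    """Chain stateless workers into one request-handling pipeline."""
    return lambda request: reduce(lambda data, w: w(data), workers, request)

# Hypothetical TranSend-like distillation path: fetch, shrink, then annotate.
def fetch(url):        return f"<original content of {url}>"
def distill(content):  return content.replace("original", "distilled")
def annotate(content): return content + " [served by worker pool]"

transend_like = compose(fetch, distill, annotate)
print(transend_like("http://example.com/cat.jpg"))
```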

The future will have even more internet services, or cloud services.  Most companies are now developing cloud services to provide highly available, scalable products, which means cluster based services will be even more prevalent.  I think the architecture will probably still look much like today's, but with more public clouds, like Amazon's EC2.  Not all companies can afford to build large clusters, so more internet services will be built on top of public clouds, as Netflix does.  This means layers and modularity will become even more important.  There will probably be a growing library of modules that service developers can use; these modules will implement commonly used components of services, such as load balancers and monitoring tools, and may be offered by the cloud provider or as open source software.  I think the clean layered structure may disappear, but only because as services do more and become more complicated, a simple layered approach will not satisfy all cases; services will probably look like a DAG of internal services with defined APIs.  There will also be more emphasis on data storage and on different consistency and availability models, as data becomes more and more important.  New models will be necessary because internet services will have to serve the entire world, not a single datacenter, and the global setting has different performance and availability characteristics which will drive different semantics for the data.

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

The CAP theorem states that it is impossible for an internet service to provide all three of consistency, availability, and partition tolerance at the same time; only two of the three can be provided simultaneously.  Consistency is defined as operations on an object appearing as if they were completed on a single instance of the object.  Availability requires that every request to a non-failing node result in a response.  Partition tolerance means the system must tolerate the network losing arbitrarily many messages between nodes.  The authors prove that it is impossible to provide an object with both atomic consistency and availability in an asynchronous network model, since it is impossible to distinguish a lost message from an arbitrarily delayed one.  In a partially synchronous model, weaker consistency can be achieved, and the window of inconsistency after the network heals can be bounded; this is done by using timeouts and basing the bounds on them.  These results are very important and relevant to internet services today.  Most services are distributed over many nodes, and even across several datacenters.  The network is not 100% reliable, so partitions are always possible; therefore developers must take the tradeoff between availability and consistency into account when designing a system.

I think the CAP result will be increasingly important in the future because internet services will be increasingly global.  Data and services will span datacenters and regions; the datacenter will be the new computer, and the world will be the new datacenter.  This means the networks between datacenters will be even more unreliable and unpredictable, the probability of partitions will be greater, and the CAP theorem will be relevant more frequently.  Future services and applications will have to tolerate more partitions and will have to be developed with partitions and high latencies in mind.  This is a departure from today's programming model, where programs for single machines or single datacenters are far more predictable.

Tuesday, September 13, 2011

The Datacenter Needs an Operating System

Clusters of commodity servers are becoming very popular, and it is important to be able to manage resources effectively.  The datacenter is becoming the new computer, and so an "operating system" is necessary to provide certain abstractions.  The main components for managing the datacenter are resource sharing, data sharing, programming abstractions, and debugging facilities.

Currently, resource sharing is available only at a coarse granularity, which is simple but limits sharing between different types of applications.  Mesos is a step in the right direction, providing fine-grained resource allocation.  Data sharing is most commonly done with distributed filesystems, but these may limit performance; RDDs provide read-only partitions of data with known transformations to improve performance.  There are many programming frameworks, like MapReduce and Pregel, to help application developers, but there is a need for more specialized APIs for systems programming on a large cluster.  Finally, there are currently few good tools for debugging large distributed programs.

The datacenter will absolutely need an "operating system," or at least some management of its resources.  Clusters may have thousands of machines, so there are many more options for resource allocation and sharing decisions.  However, I think simple methods for resource sharing can work for most applications.  Clusters are usually over-provisioned, so utilization is usually far from 100%.  Since there are usually plenty of resources available, the scheduler only needs to avoid the worst options; optimal sharing is not required.  Also, adding resources is simple: just add machines, replace slower servers with faster ones, or add more datacenters.  Sharing will not be as much of an issue, but colocation will be: great performance benefits can be gained by colocating inter-dependent programs, because of the longer network communication latencies.

Something that will be very important is the debugging and monitoring.  Debugging is already hard for single machine programs, but debugging over many machines becomes exponentially more difficult.  Good monitoring/alerting tools must be developed in order to reduce manual management of the cluster.  Clusters will be getting bigger, running increasing number of applications and frameworks.  Good monitoring tools will allow fewer people to manually interact with a growing number of machines.

Monday, September 12, 2011

Amdahl's Law in the Multicore Era

Amdahl's law must be slightly modified to describe today's multicore CPUs.  The classic speedup formula was adapted for different chip configurations: symmetric, asymmetric, and dynamic multicore chips.  Plotting the results shows that asymmetric chips can provide better speedups than symmetric multicore chips, and dynamic chips can provide the best speedups of all.  However, asymmetric and dynamic chips require dealing with complicated scheduling and other overheads.
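
The modified formulas can be written down directly.  The sketch below uses the paper's illustrative assumption that a core built from r base core equivalents (BCEs) out of a budget of n delivers perf(r) = sqrt(r), with f the parallelizable fraction.

```python
# Hill & Marty multicore speedup formulas under the perf(r) = sqrt(r) assumption.
from math import sqrt

def perf(r):
    return sqrt(r)

def speedup_symmetric(f, n, r):      # n/r identical cores, each built from r BCEs
    return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def speedup_asymmetric(f, n, r):     # one big core of r BCEs plus n - r small cores
    return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def speedup_dynamic(f, n, r):        # r BCEs fused for serial work, all n for parallel
    return 1 / ((1 - f) / perf(r) + f / n)

f, n = 0.975, 256
for r in (1, 16, 64, 256):
    print(r, round(speedup_symmetric(f, n, r), 1),
             round(speedup_asymmetric(f, n, r), 1),
             round(speedup_dynamic(f, n, r), 1))
```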

Something this paper did not address is the cost of the different types of CPUs, which will be an important factor in how the future turns out.  Right now, most clusters are built using symmetric multicore chips, because they are less expensive and easier to use than specialized chips.  This may lead to wasted resources and sub-optimal speedups, but the cost effectiveness and ease of deployment make it worthwhile.  Since most chip manufacturers build symmetric chips for most consumers, their cost will stay fairly low, which will keep them attractive when designing a datacenter or cluster.

Another important factor is the software or firmware required to take advantage of asymmetric and dynamic chips.  I do not know exactly what software is involved in using asymmetric or dynamic multicore chips, but I imagine the required changes are not trivial.  There will probably have to be new and better compilers to exploit the speedups offered by specialized chips, and new programming languages or constructs may need to be developed.  All of this either does not exist or is in its infancy, so there is a lot of work to be done to achieve the theoretical speedups of newer hardware.  These challenges may make the specialized CPUs not worth using and force large warehouse scale clusters to continue using simple symmetric chips.

Sunday, September 11, 2011

Performance Modeling and Analysis of Flash-based Storage Devices

SSDs are becoming more prevalent in computers today and will become a standard storage option for datacenters and clusters.  The authors developed performance models for SSDs because their architecture is vastly different from spinning hard disks: SSDs provide low latency but suffer from slow updates and expensive block-level erasures.  The authors used workload characteristics such as read/write ratio, request size, and queue depth to model latency, bandwidth, and IO throughput.  Their linear regression models predicted real world performance well for two OLTP and two web search workloads.
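
A sketch of the general approach, fitting a least-squares linear model from workload features to a performance metric.  The feature choices follow the paper's description, but the numbers below are synthetic placeholders rather than data from the paper.

```python
# Fit a linear model: latency ~ read_ratio + request_size + queue_depth.
import numpy as np

# Each row: [read_ratio, request_size_kb, queue_depth]; y: observed latency (ms).
# These values are synthetic placeholders for illustration only.
X = np.array([[0.9, 4, 1], [0.7, 8, 4], [0.5, 16, 8],
              [0.3, 32, 16], [0.1, 64, 32]], dtype=float)
y = np.array([0.2, 0.5, 1.1, 2.4, 5.0])

# Ordinary least squares with an intercept term.
A = np.hstack([X, np.ones((len(X), 1))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

predicted = A @ coeffs
print("coefficients:", np.round(coeffs, 3))
print("predicted latencies:", np.round(predicted, 2))
```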

In the next 5 to 10 years, SSDs will play a bigger part in computing infrastructure.  Currently, SSD technology is not cost effective enough to store all the necessary data; hard disks still have a huge advantage in cost per bit.  However, SSDs are steadily improving their cost per bit and could become the main storage in warehouse scale clusters, with disks becoming the "archival tape" of the future.  As the paper shows, SSDs have very different performance characteristics, and different access patterns favor either SSDs or hard drives.  When the main storage for clusters is replaced by SSDs, the software stack will have to be modified to take advantage of the different characteristics.  Most software assumes hard disk performance for secondary storage when optimizing, so it will be very important for programmers to adapt to the changing performance characteristics in order to take advantage of SSDs.

Since the performance of SSDs is different from hard disks, the architecture of distributed storage systems may also have to change.  Distributed file systems such as GFS or HDFS assume hard disks are the storage medium, so they adopt an append-optimized model, since hard disks are very good at sequential writes.  SSD write performance, especially for random updates, suffers from the block-level erasure requirement, but their random read performance is excellent.  These differences will push distributed file systems to change.

Wednesday, September 7, 2011

Warehouse-Scale Computing: Entering the Teenage Decade

Warehouse scale computing (WSC) seems to be a recent phenomenon, but it is not exactly revolutionary.  In 1993, David Patterson gave a talk on datacenter computing where the building was the computer.  The difference with recent warehouse scale computing is that the scale is much larger and it runs many more internet services and applications.  An older datacenter might boast petabytes of storage, whereas a WSC goes on alert if only petabytes of storage are free.  Bigger data and bigger problems now require larger scale computing.

WSC is now beginning its teenage years.  About 10 years ago, Google released a paper describing its software and hardware, and the paper was only 7 pages long.  As the internet grew in popularity, the need to farm work and services out to compute clusters became more apparent and gave rise to more innovations in WSC.  The rise of new internet applications was the driving force behind creating and managing larger and larger clusters, in order to provide the interactivity those applications required.  WSC is now past its infancy, and as it becomes more sophisticated, several challenges remain.

Warehouse scale computing is usually more cost effective with a larger number of weaker CPUs, since aggregate CPU performance matters most.  Recently, however, faster CPUs have made a comeback because embarrassingly parallel solutions are easy to implement and reason about.  Google mostly relies on request level parallelism, and with weaker CPUs each request takes longer to complete; speeding requests up would require parallelizing within each request, which takes more engineering effort.

WSC now has more options for storage.  Flash drives are becoming prevalent and provide far better random read performance.  However, write performance is sometimes worse than disk, and the latency tail is very long.  This can cause problems for applications at Google's scale, because the long latency tail translates into many more delayed requests.

WSC power efficiency is an important metric.  There have been large improvements in the PUE (power usage effectiveness) of datacenters, and Google has reached a PUE of around 1.09 - 1.16.  Machines have also gotten better in terms of power efficiency relative to peak output.  However, it is difficult for an entire building to use close to the full available power, since there will always be fluctuations, and these could cause an overload.  One solution is for each machine or rack to have a UPS, which has many advantages: it can help survive power outages, allow machines to ride out power load peaks, and enable reliable use of fickle renewable energy sources.

Networking is becoming a bottleneck in these systems.  Storage is getting faster, but network connectivity cannot keep up well enough to disaggregate resources in the datacenter.  Disks are slow enough to tolerate today's networks, but with the emergence of flash technology, networking needs to improve.

A lot more research has to be conducted to reduce latencies.  IO devices are actually getting faster, but the software stack is still the same as in the past.  This would be fine if CPUs continued to get faster, but clock speeds are leveling off.  At Google, any query pattern still creates a very bursty request pattern at the leaf nodes.  This is because most software and research has concentrated on throughput (batching, Nagle's algorithm) instead of latency.  Therefore, there are plenty of research opportunities in minimizing latencies.

Tuesday, September 6, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (chapters 3, 4, 7)

chapter 3

Clusters of low-end servers are preferred over clusters of high-end servers for warehouse scale computing because of economies of scale.  Low-end servers can take advantage of high-volume personal computing hardware, which drives prices down.  However, in a large cluster with many cores (whether low-end or high-end machines), communication costs greatly affect performance.  For workloads with heavy communication, a single large SMP machine may outperform a cluster by 10x.  But if a problem is too large even for a single SMP machine, a cluster of high-end machines can be formed.  Experiments show that as the cluster grows, the performance advantage of the high-end cluster diminishes: a cluster of low-end machines can perform within 5% of a cluster of high-end machines.  This shows that the price premium for large high-end machines is not worth paying for large clusters.

There have been arguments that this model can be pushed further, using more but slower single-core CPUs to cut hardware costs.  However, there is a point of diminishing returns.  The hardware may be cheaper, but it takes more parallel tasks and more network coordination and optimization to finish a job in the same amount of time as with multi-core CPUs.  If there is global state, using smaller and slower cores requires much more communication or the use of more conservative (and less efficient) heuristics.

chapter 4

Datacenters are very large systems that consume a lot of power and generate a lot of heat.  Much of the cost of building a datacenter goes to power distribution and cooling.  Datacenters are classified into tiers according to the redundancy and reliability of their power and cooling paths.

Utility power enters the datacenter and feeds into the uninterruptible power supply (UPS).  The UPS also has an input from generator power.  The UPS contains a transfer switch to detect utility power failure, and switch over to the generator power for input.  During the outage time, the UPS uses its battery or flywheel to provide power long enough for the generator to kick in.  

CRAC (computer room air conditioning) units take in the hot air from the server room, cool it, and push cold air out through the raised floor.  The cold air flows up into the racks and through the servers, the hot air is expelled, and the cycle continues.  The cooling units themselves use chilled water to cool the air.  Some datacenters use free cooling, which cools the water more efficiently than a chiller; for example, cooling towers can cool the warm water by evaporation, a method that works well in dry climates.

The airflow through a server and through a rack of servers determines how much air the floor must push out, which in turn is determined by the fans of the CRAC units.  More servers mean more airflow must be supported, and eventually there may be limits where it becomes impractical to increase the airflow any further.  In-rack cooling brings chilled water to each rack and performs the warm air exchange there.  Container based datacenters usually use methods similar to CRAC cooling, but in more modular units.

chapter 7

In large clusters, failures are much more probable than on a single machine, so all of a cluster's machines will rarely be functional at once.  This means the software has to be fault tolerant.  With fault tolerant software, the hardware does not have to be as reliable, and there are more options for handling hardware upgrades.  Usually, the minimum requirement for the hardware is to be able to detect failures.  Surveys of internet services and Google services have shown that actual hardware failures cause relatively few outages compared to configuration problems and software bugs.

Based on studies of machine reboots at Google, a service using 2,000 servers would see a reboot about every 2.5 hours, or roughly 10 reboots a day.  Since most reboots complete within 2 minutes, about 20 additional spare machines would have to be provisioned to keep the service safely available.  There is significant indirect evidence that software-induced machine crashes are more common than hardware-induced ones; the most common hardware causes are DRAM errors and disk errors.

An efficient repair process is very important for warehouse scale clusters.  Some features of WSCs actually help the repair process: since there is usually a lot of spare capacity in the cluster, a technician can repair machines on a schedule rather than responding to every failure immediately.  Also, with so many failures occurring, the data about them can be collected and analyzed; Google uses machine learning to determine the best course of action for a repair and to detect faulty hardware or firmware.