Monday, November 14, 2011

VL2: A Scalable and Flexible Data Center Network and PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric and c-Through: Part-time Optics in Data Centers

VL2 is a datacenter network architecture designed to be cost effective and agile, for better utilization and performance of the cluster.  Current datacenter networks usually use high-cost hardware in a tree-like structure and cannot provide the agility required for today's services.  The high costs limit the available capacity between any two nodes, the network does little to isolate traffic floods, and the fragmented address space is a nightmare for configuration.  VL2 solves these problems by providing uniform high capacity, performance isolation, and layer-2 semantics.  VL2 uses low-cost switch ASICs in a Clos topology and uses Valiant load balancing for decentralized coordination.  VL2 chooses paths for flows rather than for individual packets, for better TCP performance.  Measurements of network traffic patterns show that the traffic is highly variable, and that the hierarchical topology is inherently unreliable near the top.  VL2 scales out with many low-cost switches instead of scaling up to expensive hardware, which provides high aggregate and bisection bandwidth in the network.  An all-to-all data shuffle showed that VL2 achieved 94% efficiency and that a conventional hierarchical topology would have taken 11 times longer.  Experiments also showed that VL2 provided fairness and performance isolation.
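
To make the flow-level load balancing concrete, here is a rough Python sketch (my own illustration, not code from the paper) of how a flow's 5-tuple could be hashed to pick one intermediate switch, so every packet of a TCP flow follows the same random up/down path:

    import hashlib

    INTERMEDIATE_SWITCHES = ["int-1", "int-2", "int-3", "int-4"]  # hypothetical Clos top tier

    def pick_intermediate(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
        # Valiant load balancing at flow granularity: hash the 5-tuple so all
        # packets of one flow bounce off the same intermediate switch.
        flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
        digest = int(hashlib.md5(flow).hexdigest(), 16)
        return INTERMEDIATE_SWITCHES[digest % len(INTERMEDIATE_SWITCHES)]

    # Every packet of this flow takes the same path, which keeps TCP ordering intact.
    print(pick_intermediate("10.0.1.5", "10.0.9.7", 51234, 80))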

PortLand is a scalable, easy-to-maintain, reliable network fabric for datacenters.  The key insight is that the baseline topology and the growth model of the network are known, so they can be leveraged.  Current network protocols incur lots of overhead to route and manage the hundreds of thousands of machines in a datacenter.  PortLand uses a centralized fabric manager which holds the soft state of the topology, and uses pseudo MAC addresses (PMACs) to separate host location from host identity and to reduce forwarding table sizes.  The fabric manager reduces broadcast overhead in the common case because it can respond to ARP requests from its PMAC mappings.  Switches use a distributed location discovery protocol to determine where in the topology they are, which allows for efficient routing with no manual configuration.  PortLand uses simple loop-free routing by following an up/down protocol: packets are routed up through aggregation switches, then down to the destination node.  Faults are detected during the location discovery protocol and forwarded to the fabric manager, which then notifies the rest of the network.  PortLand uses only O(N) instead of O(N^2) messages to resolve a failure.  Experiments show that PortLand is scalable, fault-tolerant, and easy to manage.
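
Here is a tiny Python sketch (hypothetical, just to illustrate the idea) of how a fabric manager could answer ARP requests from its soft-state IP-to-PMAC table instead of letting them broadcast:

    class FabricManager:
        # Soft state: IP address -> pseudo MAC (PMAC) reported by edge switches.
        def __init__(self):
            self.ip_to_pmac = {}

        def register(self, ip, pmac):
            # Edge switches report hosts as they are discovered.
            self.ip_to_pmac[ip] = pmac

        def resolve_arp(self, ip):
            # Answer the ARP query from soft state; no broadcast in the common case.
            return self.ip_to_pmac.get(ip)

    fm = FabricManager()
    fm.register("10.2.4.5", "00:01:02:00:00:01")   # illustrative PMAC encoding pod/position/port/vmid
    print(fm.resolve_arp("10.2.4.5"))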

c-Through is a new system which combines optical circuit switching with traditional packet switching in a hybrid network.  Traditional hierarchical topologies have cost and other limitations, and fat-tree topologies require many more switches and links, which makes wiring more difficult to manage.  Optical fibers can support greater capacities, but they work at the granularity of coarse-grained circuits instead of packet switching.  HyPaC is a hybrid architecture with a traditional packet-switched network plus a second high-speed, rack-to-rack, circuit-switched optical network.  The optical network reconfigures much more slowly, so the high bandwidth between a pair of racks is provided on demand.  c-Through implements the HyPaC architecture using increased buffering to estimate rack-to-rack traffic demand and to keep optical links fully utilized, and it uses the estimates to compute a perfect matching between racks.  This can be done within a few hundred milliseconds for 1000 racks.  The HyPaC architecture is emulated, and experiments show that the emulation behaves as expected and that the large buffers do not result in large packet delays.  Several experiments with over-subscribed networks showed that c-Through performed similarly to a network with full bisection bandwidth.

Future networks will move towards lots of lower-cost network hardware, just like how clusters have moved towards many low-cost commodity machines.  However, providing large capacity with many commodity switches will be the challenge.  Clos/fat-tree topologies like the one in VL2 will be popular since they use lots of commodity switches.  Large companies like Facebook, Google, and Microsoft can easily purchase many switches and cables in bulk for their datacenters to be very cost effective, so the extra hardware and wiring will not be a huge problem.  Easier maintenance will be crucial for the network to be sustainable, so techniques similar to the ones in PortLand will be required.  Having easy mechanisms for fault tolerance, fairness, and isolation will be important; these mechanisms will be required for a resource manager like Mesos to be able to allocate network resources to applications.  I don't think special optical switches will make it into large datacenters, because they would add the management burden of new hardware; rather, more switches and wires could make it into the cluster.

Saturday, November 12, 2011

An Operating System for Multicore and Clouds: Mechanisms and Implementation and Improving Per-Node Efficiency in the Datacenter with New OS Abstractions

fos (factored operating system) is a system from MIT which tries to tackle the problem of running applications on multicore machines or clusters of machines.  Current cloud systems force the user to manage the individual virtual machines, as well as the communication between machines and between cores.  fos reduces this complexity by providing a single OS image for the whole system and abstracting the hardware away.  This lets the developer see a single OS image whether it runs on a multi-core system or on a cluster of machines.  Each sub-system in fos is factored, parallelized, and distributed over cores or machines.  The single system image provides advantages such as ease of administration, transparent sharing, better optimizations, a consistent view of the system, and fault tolerance.  fos prioritizes space multiplexing over time multiplexing, factors system services into parallel distributed servers, adapts to resource needs, and provides fault tolerance.  The fos server model uses RPCs to remote servers, but they are hidden from the developer, so code can be written without explicit RPCs.  fos intercepts file system calls and forwards them to the file server, and can also spawn new VMs on any machine seamlessly.  Microbenchmarks of syscalls, file system performance, and spawning showed that fos had high latency and variability, but a simple HTTP server performed very similarly to a Linux implementation.

Akaros is an OS designed to improve the per-node efficiency, performance, and predictability of nodes in a cluster, and thereby the performance of internet services.  This can also improve fault tolerance, since fewer nodes are required to do the same amount of work.  Traditional OSes are not well suited to large-scale SMP machines, so Akaros proposes four main ideas.  The underlying principle of Akaros is to expose as much information to the application as possible so it can make better decisions.  Akaros exposes cores instead of threads, so an application can manage the cores it is allocated, and cores can be provisioned with different time quanta for different latency needs.  Akaros also uses a block abstraction for moving data, which is more efficient than transfers between arbitrary memory addresses, and it separates provisioning resources from allocating resources.  Abstractions can be very useful, but they can reduce performance, so Akaros tries to be as transparent as possible for applications while providing useful defaults for simplicity.  Akaros uses the many-core process abstraction so that processes can do their own scheduling on their cores; the cores are gang scheduled and the kernel does not interrupt them.  All system calls are issued through an asynchronous API, to decouple I/O from parallelism.  Applications can pin pages in memory, and Akaros also allows application-level memory management for low latency.  In addition to block-level transfers, disk-to-network, network-to-network, and network-to-disk transfers are optimized.  By separating resource provisioning from allocation, the hardware can be better utilized.

Both projects focus on improving performance and utilization for a cluster by creating a new OS.  fos tries to create a global OS, whereas Akaros focuses on the single-node OS.  I think the future of clusters will need efforts in both areas.  Having a single OS abstraction is nice, but I think fos goes too far in its abstractions.  I think developers are just fine with explicitly writing RPCs, or explicitly sharing memory for local threads.  Having a VM for each thread is overkill, and I think the future global OS will be more lightweight and provide better global resource scheduling for sharing the cluster among many applications.  The future OS for the single node should be different from today's stock OS, since the stock OS was built for different purposes.  However, I don't know if rebuilding the entire OS is the way to go.  Mostly, starting from a Linux base will be sufficient, and maybe not all the features of Akaros will be necessary.  I think one of the most important features will be the different types of block transfers of data.  Since data sizes are growing and more analysis is being done, more data is being transferred, so efficient large data transfers like those in Akaros should be implemented in Linux systems.

Sunday, November 6, 2011

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center and Hadoop NextGen

Mesos is a platform for resource sharing in a cluster of commodity servers.  Sharing between different frameworks like Hadoop or MPI is beneficial because it can increase utilization and data locality, which reduces costs and improves performance.  Mesos uses distributed two-level scheduling, where Mesos offers resources to the frameworks and each framework chooses which offers to accept.  The common ways to share a cluster now are to statically partition it or to use VMs, but neither method achieves high utilization or efficient data sharing.  Scheduling and resource sharing could be done with a centralized global scheduler, but that can get very complicated.  Mesos instead uses a distributed model with resource offers, where frameworks accept what they need and use the resources however they wish.  This allows frameworks to implement their own scheduling, which many of them already do.  Pushing more control to the frameworks is beneficial because frameworks can implement more complicated policies, and it allows Mesos to be simpler and more stable.  Mesos does not require frameworks to specify their requirements, but allows them to reject offers; this keeps Mesos simple while letting frameworks implement complicated resource constraints.  Isolation between frameworks is achieved with pluggable isolation modules, which include Linux containers.  To scale resource offers, frameworks can provide filters for resources, Mesos counts outstanding offers as part of a framework's allocation, and it rescinds offers that remain outstanding too long.  The Mesos master maintains only soft state and can be reconstructed from just the slaves and the framework schedulers.  The decentralized scheduling model incentivizes frameworks to use short, elastic tasks, and to accept only resources they will actually use.  Experiments with four different frameworks showed that Mesos scheduling had better utilization and improved task completion times over static allocations.  The overhead of Mesos is about 4%, and it could scale up to 50,000 nodes with sub-second task launching overhead.
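
As a concrete (and simplified, hypothetical) illustration of the two-level scheduling idea, here is a Python sketch of a framework scheduler that receives resource offers and accepts only what it can actually use:

    # Offers come from the Mesos allocator; the framework decides what to launch.
    offers = [
        {"slave": "s1", "cpus": 4, "mem_gb": 8},
        {"slave": "s2", "cpus": 2, "mem_gb": 4},
    ]

    class FrameworkScheduler:
        def __init__(self, cpus_per_task, mem_per_task):
            self.cpus, self.mem = cpus_per_task, mem_per_task

        def resource_offers(self, offers):
            # Accept only what we will actually use; the rest is implicitly declined
            # and goes back to the allocator for other frameworks.
            tasks = []
            for offer in offers:
                n = min(offer["cpus"] // self.cpus, offer["mem_gb"] // self.mem)
                if n > 0:
                    tasks.append({"slave": offer["slave"], "num_tasks": n})
            return tasks

    print(FrameworkScheduler(cpus_per_task=2, mem_per_task=2).resource_offers(offers))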


The current Hadoop framework can support only up to about 4000 nodes, which limits the size of the clusters that can be used.  The current Hadoop architecture hits scalability limits, and it also forces a system-wide upgrade for any bug fix or feature.  This makes it very difficult to maintain a large cluster while still providing good uptime.  In next-generation Hadoop, the JobTracker's functionality is split into two separate components, the ResourceManager and the ApplicationMaster.  The ResourceManager only manages the cluster's resources, and it offers fine-grained resources, not slots like the current system, which allows for better utilization.  The ApplicationMaster manages the scheduling and monitoring of the tasks.  Offloading the management of the task lifecycle to the applications themselves allows the ResourceManager to be much more scalable.  It also allows the actual map-reduce logic to be essentially part of the user library code, which makes upgrades easy, since the entire system does not have to be upgraded together.  The fine-grained resources are fungible, so utilization is improved.  The resource management is similar to Mesos in that it has a two-level scheduling component: application frameworks can accept or reject resources given by the ResourceManager.

The two-level scheduling of both Mesos and next-generation Hadoop is an efficient and scalable way to manage resources in a large cluster, so it will probably become more popular in the future.  Having a soft-state resource manager will be important for scalability and performance, because allocations can be done in memory and failures are easy to recover from.  However, I think there will have to be some sort of feedback mechanism on what individual frameworks require to run.  Also, some applications will need to define certain kinds of constraints, such as requiring a server to run on the same rack as database shard 4.  These kinds of constraints are important for both performance and reliability.  There will also have to be a way to isolate disk I/O and throughput.  Being able to share and isolate disk I/O could be useful for data-intensive applications, and it adds complexity beyond CPU and RAM resources.

Monday, October 31, 2011

Piccolo: Building Fast, Distributed Programs with Partitioned Tables and Spark

Piccolo is an in-memory programming model for large applications in data centers.  It allows programs to access mutable, distributed, in-memory state through a key-value data model.  Piccolo differs from existing solutions based on MPI or on data flow models like MapReduce, which provide no global state.  With Piccolo, developers just define kernel functions which are distributed and access the distributed data.  Exposing global shared state makes many types of applications easier to write.  An application accesses state with simple get() and put() APIs, and can define accumulator functions to resolve conflicting updates.  The developer can define locality preferences so that a kernel runs near the data it mostly accesses, to reduce long network delays.  A master process coordinates the kernels running on the workers, and to achieve load balancing, workers can "steal" un-started kernels from other workers.  To handle faults, Chandy-Lamport checkpoints are taken online, because simply restarting kernels will not work since state could already have been changed.  Many different types of applications have been built with Piccolo, such as a distributed crawler, PageRank, k-means clustering, and matrix multiplication.  In experiments, scaling was decent when adding workers, and also when scaling the input size along with the workers on a local cluster; on EC2, the scaling was closer to linear.  Piccolo implementations of PageRank and k-means clustering were compared to Hadoop implementations and performed better, mostly because Hadoop had to sort lots of data, serialize keys, and read/write HDFS, while Piccolo avoids most of those costs.  Also, experiments with work stealing showed that performance can improve greatly when there is a load imbalance in the cluster.
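
Here is a single-process Python sketch (my own simplification, not Piccolo's actual API) of the table-with-accumulator idea: kernels read and update shared tables, and a user-defined accumulator merges conflicting updates to the same key:

    class Table:
        def __init__(self, accumulator):
            self.data = {}
            self.accumulate = accumulator

        def get(self, key, default=0.0):
            return self.data.get(key, default)

        def update(self, key, value):
            # Conflicting updates are merged with the accumulator instead of locking.
            self.data[key] = self.accumulate(self.data.get(key, 0.0), value)

    ranks = Table(accumulator=lambda old, new: old + new)   # sum accumulator

    def pagerank_kernel(page, rank, out_links):
        # Each kernel instance scatters contributions into the shared table.
        for target in out_links:
            ranks.update(target, 0.85 * rank / len(out_links))

    pagerank_kernel("a", 1.0, ["b", "c"])
    print(ranks.get("b"))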

Spark and resilient distributed datasets (RDDs) provide a distributed memory model for programmers to write in-memory applications on large clusters.  RDDs provide fault tolerance for in-memory data by only allowing coarse-grained transformations on the data.  RDDs are extremely useful for reusing data and computation and for iterative algorithms, since the data can be kept in memory.  Existing systems have to provide fault tolerance by copying data or logging updates, but RDDs provide efficient fault tolerance by storing the lineage of transformations and re-computing lost partitions.  Developers can control the partitioning of the RDDs and their persistence in memory.  The Spark programming interface for RDDs is similar to DryadLINQ and FlumeJava, defining new RDDs through transformations.  Since the transformations are coarse grained, the RDDs are immutable, and the lineage is remembered, RDDs behave very differently from distributed shared memory.  Representing an RDD requires five pieces of information: partitions(), preferredLocations(), dependencies(), iterator(), and partitioner().  There are two classes of dependencies: narrow, where each partition of the parent RDD is used by at most one child partition, and wide, where a parent partition is used by many child partitions.  Narrow dependencies can be pipelined and allow for more efficient recovery.  In experiments, Spark could perform iterative algorithms up to 20x faster than Hadoop, and analytics programs up to 40x faster.  Recovery was fast, because only the failed partitions of the RDDs had to be regenerated.  Also, a 1 TB dataset could be interactively queried within 5-7 seconds.
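
A rough Python sketch (not Spark's real API) of the lineage idea: each dataset remembers the coarse-grained transformation and parent that produced it, so a lost partition can be recomputed instead of replicated:

    class RDD:
        def __init__(self, compute, parent=None):
            self.compute = compute       # how to (re)build this dataset
            self.parent = parent         # lineage pointer

        def map(self, f):
            return RDD(lambda: [f(x) for x in self.collect()], parent=self)

        def filter(self, pred):
            return RDD(lambda: [x for x in self.collect() if pred(x)], parent=self)

        def collect(self):
            # Real Spark would read a cached partition if present; on a failure the
            # same lineage is simply replayed.
            return self.compute()

    base = RDD(lambda: list(range(10)))
    result = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(result.collect())   # recomputable purely from lineage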

These two distributed in-memory systems solve the problem in different ways: Piccolo focuses on mutable, fine-grained updates, whereas Spark handles coarse-grained transformations.  There are tradeoffs between the systems, but clearly the future will still need both types of models.  Certain applications, such as web applications, will need asynchronous, fine-grained updates, and others will need batch processing over lots of data.  The fine-grained approach is most similar to traditional RDBMSes, and the coarse-grained approach is most similar to OLAP systems.  In the future, the fine-grained approaches will have to incorporate more batch/performance techniques like those in Spark, and the coarse-grained approaches will try to incorporate more asynchronous and immediate updates and processing.

Relational Cloud: A Database-as-a-Service for the Cloud and Database Scalability, Elasticity, and Autonomy in the Cloud

Relational Cloud is an effort to build a database-as-a-service in the cloud, taking operational tasks off users' hands and thereby lowering their costs.  There are several database-as-a-service products already, such as Amazon RDS or Microsoft SQL Azure, but Relational Cloud additionally tries to handle efficient multi-tenancy, elastic scalability, and database privacy.  Running many database instances efficiently is advantageous because more databases can be consolidated onto fewer machines, thus reducing costs.  The basic design uses existing, unmodified RDBMSes as backend storage servers, with a coordination layer routing queries and communication among them.  Each tenant uses separate databases and tables, which are partitioned, and partitions can be migrated between servers.  CryptDB-enabled client libraries allow for server-side cryptography for privacy.  The partitioning strategy is workload aware: periodically, workload traces are analyzed to identify sets of tuples which are accessed together.  The Kairos component gathers performance statistics such as CPU usage and RAM.  The workload's working set is determined by slowly growing a probe table until the I/O starts increasing.  A model has been developed to predict how multiple databases and workloads will combine on a single server, and another placement model places partitions on servers in order to minimize the number of servers and balance load.  Adjustable security is achieved by identifying several layers of encryption techniques, of decreasing levels of security: more queries can be computed at the lower levels of encryption, and the system only decrypts the data minimally in order to execute each query.  In experiments with several tenants, Relational Cloud achieved consolidation ratios of 6:1 to 17:1, and performed much better than databases in VMs, because a single server can allocate resources more effectively and can share the same log and buffer pool.

When running databases in the cloud, scalability, elasticity, and autonomy are important features.  RDBMSes traditionally lack these features and do not scale well to the cloud.  Key-value stores, on the other hand, have become popular in cloud architectures because of their highly independent access patterns.  However, applications sometimes need more transactions and consistency, and several techniques can be used to achieve them: you can start with a key-value store and add more transactional features, or you can start from an RDBMS and add more key-value-store-like scalability.  Data fusion is a technique for key-value stores which groups keys into entities in order to provide transactions among them; G-Store employs this technique, along with dynamic partitioning of keys.  Data fission is a technique for DBMSes which partitions a database into shards and provides full database semantics within each shard independently, which allows for scalability similar to key-value stores.  Elasticity is important for starting up or powering down servers according to utilization.  Live migration is an important technique for achieving elasticity, for both shared-disk and shared-nothing systems: iterative copy was developed for shared-disk systems, and Zephyr for shared-nothing systems.  For autonomy, machine learning algorithms are used to model tenant behavior and placement, and to decide when to migrate, what to migrate, and where to migrate it.

Database systems in the cloud will be very important since the cloud model is becoming very popular.  There are several public clouds which can support multi-tenancy and other cloud features.  However, I don't think this model will be very useful for large companies or large data.  Multi-tenant databases on public clouds will be useful for small to medium databases, under the assumption that the workload or size will not change much.  If the data grows too large and the workload becomes too demanding, a dedicated system will perform much better and with better isolation, and multi-tenancy will not help.  In addition, companies may want to keep their own data in-house instead of in the public cloud, even if privacy and security are built in.  Many of the multi-tenancy concepts should be useful in an internal setting, so that companies can keep their data while allowing tenants to not worry about the details.  Maybe there will be products or open source solutions for reliable multi-tenancy on local clusters.

Saturday, October 22, 2011

BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud and Erlang - A survey of the language and its industrial applications

BOOM Analytics is an effort to build large distributed systems in a data-centric, declarative way.  The higher level of abstraction may simplify the code and speed up development and debugging.  Building large distributed systems for the cloud can be very difficult, because it requires the developer to be careful about concurrent programming and fault-tolerant communication.  BOOM attempts to make it easier with data-centric design and declarative programming.  Overlog is used as the declarative language for cloud systems.  Overlog is based on Datalog, a deductive query language where logical rules describe which tuples belong to which relations.  Overlog extends Datalog with notation for the location of data, SQL features, and a well-defined model for generating changes to tables.  Each Overlog timestep has three phases: inbound events are converted to tuple insertions and deletions, local rules are run until a fixed point is reached, and then outbound events are emitted.  HDFS and the Hadoop scheduler were reimplemented using BOOM, along with additional availability and scalability features, with far fewer lines of code and better assurance of correctness.  With BOOM it was easy to implement first-come-first-served and LATE scheduling policies for Hadoop, and the experimental results mirror the original LATE results.  Performance experiments show that the BOOM implementations were only slightly slower than the original implementations.

Erlang is a functional programming language for soft real-time systems.  Erlang was designed to meet requirements such as large programs, control systems that cannot be stopped, portability, high levels of concurrency, IPC, distribution, efficient garbage collection, incremental code loading, timeouts, and external interfaces.  Sequential programs are sets of recursive equations, and functions are primitive data types.  The spawn() primitive creates a new process for concurrent programming, and send() and receive() are used for message passing between processes.  Message sending is asynchronous, and messages between a pair of processes are delivered in order.  spawn() can also be used to create a process on a different node.  Several commercial products use Erlang, and some conclusions are that development time can be shorter, the performance is very good for real-time systems, and the real-time garbage collection works well.

Both Erlang and Overlog provide declarative, functional programming models for distributed cloud systems.  Both papers conclude that it takes less time and less code to implement a system, and that the result inspires higher confidence in its correctness.  However, I think most people do not think or program in a functional, logic-rule-based style.  Certain classes of problems are well suited to a declarative implementation, such as well-defined state machines like Paxos.  In the future, there will be new languages, or better language interoperability, that allow inserting segments of declarative, functional code for certain tasks such as communication logic or other protocols.

PNUTS: Yahoo!'s Hosted Data Serving Platform and Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

PNUTS is a scalable, geographically distributed data store for web applications.  Web services usually have scalability, response time, high availability, and relaxed consistency requirements, and PNUTS provides all of those features.  PNUTS provides features other systems usually don't, such as geographic distribution and a per-record timeline consistency model instead of eventual consistency.  Timeline consistency is achieved per record by designating one replica as the master for that record.  APIs exist to read the latest version of a record, or a sufficiently recent version, depending on the application's needs.  The data is partitioned into tablets, either by range or by hash, and each tablet is replicated.  Replication across regions is asynchronous through the reliable message broker, which is like a pub/sub system: it functions both as a durable write-ahead log and as the replication mechanism.  Messages are delivered in order, and this property is used to achieve per-record timeline consistency.  In experiments, the average latency for client requests was around 100 ms, latency increased as there were more writes in the system, and increasing the number of storage servers decreased latency almost linearly.
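
To illustrate the per-record timeline consistency, here is a hypothetical Python sketch of a record whose replicas apply updates in the order chosen by the record's master, with read calls that trade freshness for latency:

    class Record:
        def __init__(self):
            self.versions = []            # ordered history from the record master

        def apply(self, value):
            self.versions.append(value)   # replicas receive these updates in order

        def read_any(self):
            # Cheapest read: whatever this replica has, possibly stale.
            return self.versions[-1] if self.versions else None

        def read_critical(self, min_version):
            # Only answer if this replica is at least as new as min_version;
            # otherwise the caller must go to a fresher replica or the master.
            return self.versions[-1] if len(self.versions) >= min_version else None

    r = Record()
    r.apply("v1"); r.apply("v2")
    print(r.read_any(), r.read_critical(2))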

COPS is an ALPS (availability, low latency, partition tolerance, high scalability) key-value store which provides the causal+ consistency property.  The ALPS properties are desirable for internet services, but COPS also provides stronger consistency guarantees, which make it easier to develop against.  The causal property ensures that the system maintains the causal dependencies between operations, and the "+" adds convergent conflict handling, which ensures replicas resolve conflicts in the same way so that they do not diverge permanently.  The three sources of potential causality are ordering within an execution thread, the gets-from relation (when a get() reads the result of a put()), and transitivity.  COPS is the first system to provide the causal+ property in a scalable way, without a single master.  COPS relies on a linearizable key-value store within each cluster and asynchronously replicates across clusters using a replication queue.  Additional operations are available to ensure the causal+ property: get_by_version(), put_after(), and dep_check().  There is also metadata per record to store dependencies, and, for COPS-GT, older versions.  Garbage collection is periodically performed to clear out old versions and unnecessary dependencies; only about 5 seconds of metadata needs to be kept, since transactions are only allowed to run for 5 seconds.  In experiments, throughput increased as the put:get ratio decreased, because the number of dependencies decreased.  Also, COPS-GT usually has lower throughput than COPS because of its stronger guarantees, but as the value size increases, the inter-operation time increases and COPS-GT achieves similar performance.  Scalability for COPS is also much better than for single-master log-shipping systems.
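
Here is a simplified, hypothetical Python sketch of the client-side dependency tracking: reads and writes accumulate a context, and each put_after() carries the dependencies that a remote cluster would have to satisfy before applying the write:

    class LocalStore:
        # Stand-in for the linearizable per-cluster key-value store.
        def __init__(self):
            self.data = {}
            self.clock = 0

        def get(self, key):
            return self.data.get(key, (None, 0))

        def put_after(self, key, value, deps):
            # Locally the dependencies are already satisfied; replication to other
            # clusters would delay applying this write until `deps` are present there.
            self.clock += 1
            self.data[key] = (value, self.clock)
            return self.clock

    class CopsClient:
        def __init__(self, store):
            self.store = store
            self.context = {}             # key -> version this client depends on

        def get(self, key):
            value, version = self.store.get(key)
            self.context[key] = version
            return value

        def put(self, key, value):
            deps = dict(self.context)     # nearest dependencies for this write
            version = self.store.put_after(key, value, deps)
            self.context = {key: version} # later ops depend (transitively) on this write
            return version

    c = CopsClient(LocalStore())
    c.put("photo", "beach.jpg")
    c.get("photo")
    c.put("album", "vacation")            # causally ordered after the photo write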

Both of these systems provide stronger consistency across geographically distributed systems.  The stronger consistency is useful, but future systems will have to deal better with fault tolerance and data loss.  Both of these systems use a message bus to deliver asynchronous replication messages, but if a datacenter fails, all of those messages, and thus data, will be lost.  Datacenters can fail, as the Amazon outage showed, and modular datacenters, which have less reliability built into them, are becoming more popular.  Future systems will have to try to prevent data loss in the face of datacenter failures.

Wednesday, October 12, 2011

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language and FlumeJava: Easy, Efficient Data-Parallel Pipelines

DryadLINQ is a new programming model for large-scale distributed programming, providing imperative and declarative operations on datasets within a traditional high-level language.  Programs are written as sequential transformations on datasets in the .NET framework, and the runtime system runs the program in parallel on the Dryad platform.  DryadLINQ makes programming large-scale distributed programs very simple, since the programmer can just write ordinary sequential programs and the system compiles them to run reliably on many machines.  By combining imperative and declarative programming, DryadLINQ provides more flexibility than SQL and more optimization opportunities than imperative languages.  DryadLINQ exposes datasets as enumerable objects, uses lazy evaluation of the datasets, and provides strongly, statically typed constructs.  DryadLINQ adds the new operators Apply() and Fork() to LINQ to make more parallelizable operators available in the language.  Programs are translated to an execution plan graph, which is a DAG, and both static optimizations and runtime dynamic optimizations are performed.  The static optimizations are pipelining, removing redundancy, eager aggregation, and I/O reduction; at runtime, the system can dynamically re-partition the data to optimize sorting.  DryadLINQ can leverage PLINQ, which parallelizes programs on multicore machines, and can also leverage the LINQ-to-SQL system to interoperate with SQL databases.  In experiments, the Terasort benchmark ran with almost constant scale-up, and the overhead of DryadLINQ over hand-written Dryad programs is about 10%.

FlumeJava is a similar system developed by Google.  Writing large-scale parallel programs over massive amounts of data can be very hard, and FlumeJava addresses that problem.  Many problems need pipelines of map-reduce jobs, so FlumeJava introduces immutable parallel collections to ease the programming and to allow automatic compilation and optimization into map-reduce jobs.  The two central classes are PCollection and PTable, and the main operators are parallelDo(), groupByKey(), combineValues(), and flatten().  Other more complex operations, like count(), join(), and top(), can be implemented with those basic operators.  The sequence of operators is not executed immediately but is deferred, so the resulting program forms a DAG of operators.  Optimizations on the graph include parallelDo fusion, map-shuffle-combine-reduce (MSCR) fusion, sinking flattens, and lifting combineValues.  At runtime, the system decides whether to run each MSCR operation locally or remotely in parallel, with the associated startup latency.  Within Google, FlumeJava is becoming the primary Java-based API for data-parallel computation.  The optimizations usually reduce the number of stages by about 5x, and in some cases by 30x.  FlumeJava was compared with MapReduce and Sawzall, and for four applications the FlumeJava code was more concise.  After optimization, FlumeJava reduced the number of map-reduce stages down to that of hand-optimized MapReduce code, and its running time is very close to the optimal runtime of the hand-optimized MapReduce code.
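
The deferred-evaluation idea is easy to sketch.  This is a Python toy (FlumeJava itself is a Java library, and these names are my own) where operators only build a DAG, which an optimizer could fuse before anything actually runs:

    class PCollectionSketch:
        def __init__(self, op, parents=(), fn=None):
            self.op, self.parents, self.fn = op, list(parents), fn

        def parallel_do(self, fn):
            return PCollectionSketch("parallel_do", [self], fn)

        def group_by_key(self):
            return PCollectionSketch("group_by_key", [self])

        def explain(self, depth=0):
            # Walk the deferred DAG; nothing here executes the user functions.
            print("  " * depth + self.op)
            for p in self.parents:
                p.explain(depth + 1)

    lines = PCollectionSketch("read:/logs/clicks")     # illustrative source name
    counts = (lines.parallel_do(lambda l: (l.split()[0], 1))
                   .group_by_key()
                   .parallel_do(lambda kv: (kv[0], sum(kv[1]))))
    counts.explain()    # still unevaluated; an optimizer would fuse these stages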

The future language for data-intensive computation will look similar to DryadLINQ, FlumeJava, and Pig, combining declarative and imperative programming.  The declarative nature of SQL is very well defined, but at times it can be too restrictive and make complicated algorithms difficult to express.  Programmers are already used to writing imperative programs, so it makes sense to let developers write large-scale data computations the same way.  The functional aspects keep the semantics cleaner through the concept of immutable datasets.  This will allow more developers to easily create larger and more involved computations, which will be a huge advantage.  Optimizations will improve and include more cost-based features from traditional databases.  Also, compiling down to map-reduce does not always make sense.  Dryad supports any DAG of dataflow, and there will probably be other lower-level primitives, similar to the map-reduce paradigm, which higher-level programs can utilize.

Monday, October 10, 2011

Dynamo: Amazon's Highly Available Key-Value Store

Dynamo was developed at Amazon to provide a highly available, scalable, distributed data store for applications that need high reliability.  Traditionally, DBMSes are used as data stores, but Amazon determined that most of its applications only need simple key-value access; DBMSes provide a lot of extra querying capability and favor consistency over availability and scalability.  Dynamo provides an "always on" experience because even the shortest unavailability can have significant financial impact.  It is always writable, with conflict resolution performed during reads rather than writes.  At its base, Dynamo uses consistent hashing with virtual nodes: data is stored at the first node its key hashes to and replicated to the next N-1 distinct successors.  Each update creates a new version of the data, and vector clocks are used to detect conflicting versions and resolve them on reads.  Application developers can configure an instance with different values of N, the read quorum R, and the write quorum W, in order to achieve different levels of consistency and availability.  The hinted handoff protocol handles the case where a node receives updates for data it is not responsible for, by later trying to hand the updates back to the intended nodes.  Operational experience shows that the 99.9th percentile latency is around 200 ms, though there is a lot of variability.  Buffering helps reduce the variability, at the cost of durability.  Experiments also show that inconsistencies do not happen very frequently.
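
Here is a minimal Python sketch (parameters and names are illustrative, not Amazon's) of consistent hashing with virtual nodes and an N-node preference list:

    import bisect
    import hashlib

    class Ring:
        def __init__(self, nodes, vnodes_per_node=8, n_replicas=3):
            self.n = n_replicas
            # Each physical node owns several positions ("virtual nodes") on the ring.
            self.ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes for i in range(vnodes_per_node)
            )
            self.keys = [h for h, _ in self.ring]

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def preference_list(self, key):
            # First N distinct physical nodes walking clockwise from the key's hash.
            idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
            nodes = []
            while len(nodes) < self.n:
                node = self.ring[idx % len(self.ring)][1]
                if node not in nodes:
                    nodes.append(node)
                idx += 1
            return nodes

    ring = Ring(["A", "B", "C", "D"])
    print(ring.preference_list("user:1234"))   # coordinator plus N-1 successors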

The future will rely more on distributing data across many different datacenters.  Datacenters are becoming more prevalent and will be the "computers" of the future.  How will the Dynamo model map to many geo-diverse datacenters?  Low, predictable latencies will still be important, but they will be harder to achieve when replicas could be all over the world.  Highly scalable and distributed data stores will have to run across many datacenters.  Most likely, they will still replicate asynchronously to remote sites to preserve latency, but new techniques will be developed to provide consistency for the applications which need it.  Also, Dynamo is only a key-value store; future data stores will need to provide richer APIs in order to support other use cases, such as analytics.

Dremel: Interactive Analysis of Web-Scale Datasets

Dremel is a system from Google for highly scalable, interactive, ad hoc analysis of read-only nested data.  It uses execution trees and columnar storage, and can run aggregations over trillion-row tables in only a few seconds.  It scales to thousands of machines and supports petabytes of data.  Interactive response times are important for data exploration and debugging; large data is too slow to analyze interactively with map reduce, so Dremel aims for low response times.  Dremel's sweet spot is aggregation over large-scale data; it is not a replacement for map reduce.  Using a serving tree, partial aggregation is performed on intermediate results as data streams from the leaf nodes up to the root.  The data is stored column-wise, which compresses well and is advantageous when queries only touch a small subset of the columns.  Dremel works on data in situ, so there is no loading time and other frameworks can access the same data.  Experiments show that with columnar storage, accessing only a few columns can give orders-of-magnitude speedups, and map reduce can also benefit from the columnar format.  For large aggregations, having more levels in the serving tree results in speedups, since more partial aggregation of intermediate results is possible.  Stragglers can be a problem for interactivity, but the system can be configured to return an answer after reading less than 100% of the data.
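
A toy Python sketch of the serving-tree idea (my own illustration): each level partially aggregates its children's results, so the root only merges a handful of partial sums:

    def leaf_scan(partition):
        # Each leaf scans only the referenced column of its tablet.
        return sum(row["clicks"] for row in partition)

    def intermediate(children_results):
        # Intermediate servers combine partial aggregates as they stream upward.
        return sum(children_results)

    def root(intermediate_results):
        return sum(intermediate_results)

    partitions = [[{"clicks": i} for i in range(1000)] for _ in range(8)]
    leaves = [leaf_scan(p) for p in partitions]
    result = root([intermediate(leaves[i:i + 4]) for i in range(0, len(leaves), 4)])
    print(result)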

Dremel is a successful system partly because Google can afford to dedicate thousands of machines to it; the high scalability and low latency come from using many machines to process the data.  Most other companies will not be able to afford so many machines for their ad hoc processing systems.  Future systems will have to use fewer machines more efficiently while still providing interactivity.  This may mean the hardware will have to use SSDs, or different compression or scanning techniques.  Also, future systems will have to handle other use cases, such as streaming data and modifiable data.

Tuesday, October 4, 2011

Bigtable: A Distributed Storage System for Structured Data

Bigtable is a flexible, high-performance, scalable, distributed storage system for structured data built by Google.  Bigtable has a richer data model than a key-value store, but does not provide the full relational model.  Bigtable can scale to petabytes of data and thousands of machines, supports a wide range of applications and configurations, and lets clients control many aspects such as data layout, data format, and data locality.  Bigtable is essentially a distributed, sorted map, which maps keys (row string, column string, timestamp) to string values.  The keys are lexicographically ordered, and columns are grouped into column families.  The architecture has a master, which coordinates all the tablets, and tablet servers, which serve tablets; clients usually communicate directly with the tablet servers.  SSTables stored in GFS are used for the durable data; they are ordered, immutable maps.  Writes go to a redo log in GFS, and recent writes are stored in a memtable.  When the memtable fills up, it is flushed to a new SSTable, and occasionally a major compaction merges several SSTables into one.  Reads are served from a merged view of the memtable and the SSTables.  Locality groups are groups of column families which map to separate SSTables; this provides vertical partitioning to keep related columns together.  A two-pass compression scheme is used to achieve 10:1 compression ratios, better than gzip.  Experiments show that Bigtable scales up well, but not linearly, because of load imbalance and saturation of the network between GFS and the tablet servers.
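
Here is a toy Python sketch (not Bigtable's code) of the read/write path: recent writes sit in a memtable, flushes create immutable SSTables, and a read merges them newest-first:

    class TabletSketch:
        def __init__(self):
            self.memtable = {}        # recent writes
            self.sstables = []        # immutable snapshots, newest first

        def write(self, key, value, flush_threshold=2):
            self.memtable[key] = value
            if len(self.memtable) >= flush_threshold:
                # Minor compaction: freeze the memtable into a new SSTable.
                self.sstables.insert(0, dict(self.memtable))
                self.memtable = {}

        def read(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for sstable in self.sstables:           # newest first
                if key in sstable:
                    return sstable[key]
            return None

    t = TabletSketch()
    t.write("com.example/index.html:contents", "<html>v1</html>")
    t.write("com.example/about.html:contents", "<html>about</html>")   # triggers a flush
    t.write("com.example/index.html:contents", "<html>v2</html>")
    print(t.read("com.example/index.html:contents"))    # returns the newest version, v2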

Bigtable has inspired many other distributed data stores recently.  The APIs of these systems are usually limited to simple key-value access.  However, not all applications fit the model of single-row access and transactions.  I think the future of these systems will be to add more database concepts without significantly harming scalability and availability.  Some of these features include transactions, higher-level languages, and different levels of consistency.  Bigtable-like systems will look more and more like distributed parallel databases.

Monday, October 3, 2011

SCADS: Scale-independent storage for social computing applications

Internet services need to be able to adapt to the load they receive from users.  This load can vary wildly as services gain popularity, or drop to natural lows at certain times of day.  In addition to adapting to the query load, services should also handle growing data while still providing good response times for users.  SCADS tries to solve these issues with data scale independence, effective scaling up and down, and machine learning to predict performance and resource requirements.  Data scale independence is a useful property for services because it allows the data to grow without changing the application.  SCADS also provides a performance-safe query language that statically analyzes queries and prohibits those which may not do a constant amount of work per user.  This means all queries must use lookups whose cost does not depend on the size of the data, which can be achieved by creating indexes.

SCADS tries to provide additional features for large-scale data storage systems.  Key additions are the performance-safe query language and the use of machine learning to automatically scale the cluster up and down.  In the future, automatic scaling will be very important for larger systems.  Maintenance costs can grow if scaling remains a manual process, so effective estimation of resource consumption and automatic scaling will be crucial.  Data-scale-independent queries are a useful feature, but at times they may be too restrictive, especially if larger analysis is required.  SCADS was built for interactive queries and not large analyses, but future systems will bridge the gap between OLTP and OLAP workloads.


HIVE: Data Warehousing & Analytics on Hadoop -and- Pig Latin: A Not-So-Foreign Language for Data Processing

Lots of data is being generated and stored by companies today, and there is great interest in gaining insight by analyzing all of it.  Facebook generates over 2 TB of new data per day, and storing and analyzing it is too difficult for traditional DBMSes.  The map reduce framework has better scalability and availability than DBMSes, and these properties are usually more important than the ACID properties of DBMSes.  However, writing map reduce jobs is difficult.  Hive is a system built by Facebook which makes querying and managing structured data on Hadoop easier.  It supports many different data formats, structured data, common SQL features, and metadata management.  Many different kinds of applications use Hive, such as summarization, ad hoc analysis, data mining, spam detection, and more.  The data is logically partitioned, then hash partitioned, to provide high scalability.  All the metadata is stored in a SQL backend which holds the schemas, the serialization and deserialization libraries, data locations, partition info, and more.  The SQL-like interface also allows custom map and reduce scripts, so map reduce style jobs can still be expressed.

Pig Latin is similar to Hive in the sense that it also tries to address the difficulty of writing map reduce jobs.  Map reduce is considered too low level and too restrictive.  Pig Latin tries to find the sweet spot between declarative and procedural querying of data, providing ad hoc, read-only querying of large-scale data.  Parallel databases are usually too expensive, and SQL is sometimes unnatural; the map reduce data flow is sometimes too restrictive, and the opaque nature of the map and reduce functions limits optimizations.  A Pig Latin program specifies a sequence of transformations on the data: the programmer defines the transformations procedurally, but the operators are high-level concepts like filtering, grouping, and aggregation, so there is more opportunity for optimization than with raw map reduce.  Pig Latin uses a nested data model which is closer to how programmers think, and many use cases already store data in a nested form.  Java UDFs are treated as first-class citizens in the system, and only primitives which can be parallelized are allowed, which allows for good performance over large amounts of data.  Some primitives are filter, cogroup, join, foreach, flatten, and store.  The system uses lazy evaluation, so execution only happens when it is needed to view or store data, and optimizations can be performed on the logical plan at execution time.  Execution compiles down to Hadoop map reduce jobs, with the cogroup operations forming the boundaries between map reduce jobs.  Pig Pen is a debugging framework which generates a sandbox data set so the developer can see example transformations and debug programs.

Hive and Pig Latin are only two of the programming frameworks for large-scale distributed programs.  The common theme is that map reduce is too low level, so higher-level languages need to be developed.  Unlike how SQL became a standard for relational querying, I don't think there will be a single standard for large-scale data processing.  More and more people are dealing with data, and they have varying levels of technical skill.  Some people are comfortable with very high-level languages like SQL, which is sometimes limited to certain types of data sets and analyses; others may prefer lower levels such as map reduce.  Hive and Pig Latin borrow from declarative query languages to provide a higher level above map reduce, and they both offer a way to define multiple map reduce phases for a task.  Higher-level languages are definitely necessary, but right now it is not clear which model is "better".  Also, they both compile down to map reduce jobs, but that isn't necessarily the answer.  Maybe if the higher-level languages used a more general underlying framework, they could provide more optimizations, since there would be more lower-level primitives to work with.  Forcing tasks to follow the map reduce model may limit performance for some use cases.

Tuesday, September 27, 2011

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad was developed at Microsoft to make distributed computation easier to develop.  Large-scale services need large-scale computation, and Dryad provides a simple programming model as well as a runtime system that provides reliability, efficiency, and scalability.  Dryad forces the application developer to think about the data parallelism, and provides flexibility in designing the computation.  Developers can define an arbitrary DAG and the communication mechanisms for its edges.  Because Dryad allows arbitrary DAGs, communication mechanisms, and input sizes and types, simpler models can be implemented on top of it.

Dryad tasks are coordinated by a job manager, which talks to the cluster nodes.  The job manager has the DAG of functions and sends the vertices to the appropriate nodes.  The vertices communicate with each other through the data plane using the developer-specified communication mechanism.  A Dryad DAG is defined through a C++ library: a new vertex is created by a static "factory", and edges are created through simple composition rules and graph merges.  Graphs distinguish input and output nodes to ensure the graph stays acyclic.  Communication mechanisms include files, shared memory, and TCP.  Encapsulation is a way to group several vertices into one process so they run on the same node.  When all the inputs to a vertex are complete, the vertex is added to the greedy scheduler's queue.  Similar to Google's MapReduce, tasks are re-executed on failures, and some are duplicated to take care of stragglers.  Runtime local aggregation and partial aggregation optimizations are performed to reduce network traffic and improve performance.  Experiments show that for a simple SQL query, Dryad improved on SQLServer even on one machine.  The Dryad versions had nearly linear speed-up as the cluster size increased, and the in-memory version was about twice as fast as the NTFS version.  For data mining workloads, the communication topology had to be optimized to improve scalability.
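
As a rough illustration (in Python rather than Dryad's C++ library, with made-up names), here is how a DAG might be built by creating stages from factories and connecting them with pointwise or all-to-all edges:

    class Graph:
        def __init__(self):
            self.vertices, self.edges = [], []

        def add_stage(self, name, copies):
            # "Factory" creating several copies of the same vertex program.
            stage = [f"{name}[{i}]" for i in range(copies)]
            self.vertices.extend(stage)
            return stage

        def connect_pointwise(self, src, dst, channel="file"):
            # One-to-one edges between equal-sized stages.
            for s, d in zip(src, dst):
                self.edges.append((s, d, channel))

        def connect_all(self, src, dst, channel="file"):
            # Complete bipartite edges, like a shuffle between stages.
            self.edges.extend((s, d, channel) for s in src for d in dst)

    g = Graph()
    readers = g.add_stage("read", 4)
    sorters = g.add_stage("sort", 4)
    mergers = g.add_stage("merge", 1)
    g.connect_pointwise(readers, sorters, channel="shared-memory")
    g.connect_all(sorters, mergers, channel="tcp")
    print(len(g.vertices), len(g.edges))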

Dryad provides more flexibility than Google's MapReduce, but statically defining a communication topology is difficult and burdensome.  It is nice to be able to define a DAG of operations instead of just a map and reduce phase, but a more declarative interface will be necessary as computations get more complicated.  With a more declarative approach, there would be room for more interesting optimizations, similar to query optimization in databases.  The future of distributed programming models should have an easy way to define the UDFs, but a simpler way to define the DAG of data flow.  The runtime system will take care of resource allocation in the cluster and of optimizing the DAG with respect to the cluster's state and configuration.

MapReduce: Simplified Data Processing on Large Clusters

Various groups within Google have implemented hundreds of special-purpose computations that process large amounts of data.  These are usually parallelized across a large cluster of machines, but there are many issues in running a distributed algorithm on many servers, such as how to parallelize the computation and how to deal with failures.  MapReduce was developed by Google to solve these problems and hide the complexity from users.  MapReduce is a programming model and system implementation for processing large amounts of data.  Users just define a map() function and a reduce() function, and the system efficiently takes care of the execution on a large cluster of commodity machines.  The runtime system handles the partitioning, scheduling, and failures so that users can easily utilize large clusters.  The MapReduce abstraction is inspired by the Lisp primitives map and reduce, and the functional model allows effective parallelization and re-execution for fault tolerance.  Most existing computations already had a map phase, where data was transformed into intermediate key/value pairs, and a reduce phase, where a final operation was applied to all the values of the same key.

The MapReduce programming model is very simple to use.  Only a map() function and a reduce() function need to be defined, and the runtime system handles the rest.  MapReduce passes strings to and from the user-defined functions, so the user must convert to and from the appropriate types.  Many examples fit this model well: distributed grep, counting URL access frequency, building a reverse web-link graph, and distributed sort.  Execution starts from the MapReduce library in the user code.  The input files are split into M pieces for the M mappers.  A master is also forked, and it assigns work to mappers and reducers.  A mapper reads key/value pairs from its input and writes the new intermediate key/value pairs to local disk.  The locations of these files are passed to the master, which then notifies the reducers.  When the R reducers have read all the data, they sort it by key and pass each key and its values to the reduce UDF, producing the final R output files.  To tolerate failures, the master periodically pings the workers to determine whether they are still running; on a failure, the work is simply restarted on a different machine.  Most UDFs are deterministic, so the programs remain correct in a distributed environment.  Locality of the GFS data is considered when assigning mappers, to conserve network bandwidth.  Some tasks can be slow to finish (stragglers), so when the job is near completion, the master starts backup copies of the remaining in-progress tasks and takes whichever copy finishes first.  Another optimization is the combiner function, which runs on the map worker to reduce the data before the real reducers read it, cutting the amount of data transferred over the network.  In experiments on a cluster of 1800 machines, grep and sort workloads performed well.  Backup tasks for stragglers reduced the run time of the sort job, and killing 200 workers caused the lost work to be rescheduled automatically, with the job completing with just a 5% increase in time.  The MapReduce framework is widely used within Google today, and it makes writing, running, and maintaining distributed computations much easier.
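
The canonical word-count example shows how little the user has to write.  This is a hypothetical single-machine Python sketch of the two UDFs plus a tiny driver that stands in for the distributed shuffle:

    from collections import defaultdict

    def map_fn(_key, document):
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        yield word, sum(counts)

    def run_local(inputs):
        intermediate = defaultdict(list)
        for key, value in inputs:
            for k, v in map_fn(key, value):      # map phase
                intermediate[k].append(v)        # stand-in for the shuffle
        output = {}
        for k in sorted(intermediate):           # reducers see keys in sorted order
            for word, total in reduce_fn(k, intermediate[k]):
                output[word] = total
        return output

    print(run_local([("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]))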

The MapReduce model is very popular today, and there are even open source versions like Hadoop.  I think the future will not be as dependent on MapReduce itself, but new systems and programming models will certainly retain a lot of its runtime features.  The important runtime features are still fault tolerance, locality, and efficient shuffling.  I think MapReduce will still be used for certain applications, but new programming models will emerge to handle different types of parallel computation.  Something key to parallel computation on large data is that the data should not be moved much; the computation should go to the data.  However, networking is also improving, which mitigates transfer costs.  I think there will be an interesting tradeoff between co-locating computation with the data, and incurring transfer costs but amortizing them over several uses.

Sunday, September 25, 2011

The Google File System

GFS is a large, reliable, distributed file system running on many commodity servers and commodity disks.  Google developed GFS to provide high aggregate throughput for data-intensive applications.  GFS also addresses the issues of running a file system on commodity servers, which fail often.  The developers examined application workloads and technology trends, and co-designed the applications and the file system to achieve good performance.  GFS is built around changed assumptions about data access: frequent component failures, huge files, and workloads that mostly append to files instead of overwriting them.

GFS runs on many commodity servers, with a single master and many chunk servers.  The master maintains the file system metadata and manages chunks, and the chunk servers store the chunk data.  Clients first communicate with the master to find the chunk servers for specific chunks; this metadata is cached on the client for some time.  No chunk data is cached by the client, since cache coherence is difficult in a distributed environment and applications mostly scan through the data once.  Chunks are 64 MB and are lazily allocated to reduce internal fragmentation on the chunk servers.  Metadata is stored in memory to allow fast access, and in practice the metadata is only about 100 MB in size.  Data flow and control flow are separated to allow full use of bandwidth when transferring data: data is pipelined through a chain of chunk servers, while the primary chunk server orders the mutations.  Lazy garbage collection is used to clean up deleted and orphaned chunks.  GFS provides high availability by designing for fast recovery and by replicating data.  In practice, most writes were appends rather than overwrites, as expected.  The write throughput was lower than the read throughput because writes have to go to three replicas, which increases the probability of collisions.  GFS works well in practice because the designers considered the new assumptions about data access patterns and developed the applications and file system together.
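
Here is a hypothetical Python sketch of the client read path: one metadata lookup at the master (then cached), followed by reading the data directly from a chunk server:

    CHUNK_SIZE = 64 * 1024 * 1024

    class Client:
        def __init__(self, master, chunkservers):
            self.master = master                 # (file, chunk_index) -> (handle, replicas)
            self.chunkservers = chunkservers     # server -> {chunk_handle: bytes}
            self.location_cache = {}

        def read(self, filename, offset, length):
            chunk_index = offset // CHUNK_SIZE
            key = (filename, chunk_index)
            if key not in self.location_cache:
                self.location_cache[key] = self.master[key]    # one metadata RPC, then cached
            handle, replicas = self.location_cache[key]
            data = self.chunkservers[replicas[0]][handle]      # data flows from a chunk server
            start = offset % CHUNK_SIZE
            return data[start:start + length]

    master = {("/logs/web.0", 0): ("chunk-42", ["cs1", "cs2", "cs3"])}
    chunkservers = {"cs1": {"chunk-42": b"GET /index.html 200 ..."}}
    print(Client(master, chunkservers).read("/logs/web.0", 0, 3))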

The future holds new and changing assumptions which will drive different design decisions for a large distributed file system.  More memory is becoming commonplace in commodity servers, so a future file system should take advantage of it.  In addition to memory, SSDs are also becoming popular and may replace disks in servers.  GFS optimizes for aggregate throughput and append workloads, but with these new hardware trends, lower latencies can be achieved.  Lower latencies will be important for interactive workloads, so it will be interesting to see whether the new hardware allows for interactive and overwrite workloads, such as transactional database workloads.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services


Megastore is a system which provides scalability, availability, and consistency over the wide area network.  Many systems focus on scalability and performance, but consistent semantics are cleaner and simpler to reason about.  Megastore also tries to address data loss by replicating data across geographically diverse regions.  It uses Paxos to synchronously replicate a log, which is unlike most uses of Paxos.  Megastore provides fully serializable ACID semantics within fine-grained partitions of the data.

Megastore blends the scalability of a NoSQL system with the semantic convenience of a traditional RDBMS.  Megastore provides availability by using synchronous, fault-tolerant log replication, and provides scalability by partitioning data into entity groups.  Paxos is used to synchronously replicate the write-ahead log to all the replicas of each entity group, which provides ACID semantics within the group.  Across groups, 2PC is used to coordinate transactions.  At Google, there is usually some natural way to partition the data.  Several optimizations to Paxos were developed to improve write transactions.  A leader is chosen to allow one-round-trip writes, but no dedicated master is required: the first writer to reach the leader gets to commit with one round trip, while other writers have to use two phases.  Witness and read-only replicas improve performance for writes and reads, respectively.  The commit point of Megastore is the durability point; visibility comes afterwards.  Most of the write scalability is achieved by partitioning the data and reducing write conflicts.  At Google, most applications see five 9s of read and write availability.  Reads were usually around tens of milliseconds, and writes were around 100-400ms because of the WAN communication.  Most reads by applications were serviced by the local replica, as expected.
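
The per-entity-group write path can be sketched roughly as follows.  This is only my interpretation of the fast-path leader optimization, with hypothetical method names; it is not Megastore's actual interface.

def commit(entity_group, txn):
    # Each entity group has its own replicated write-ahead log; a commit is
    # one Paxos instance at the next log position.
    position = entity_group.next_log_position()
    leader = entity_group.leader_for(position)
    if leader.grant_fast_path(txn):
        # First writer to reach the leader: skip prepare, one round trip.
        accepted = entity_group.paxos_accept(position, txn)
    else:
        # Everyone else falls back to the full two-phase protocol.
        entity_group.paxos_prepare(position)
        accepted = entity_group.paxos_accept(position, txn)
    if accepted:
        # Durability (commit) point reached: a majority logged the entry.
        entity_group.apply(position, txn)  # visibility happens here, afterwards
    return accepted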

I think future data stores will be increasingly geographically distributed, because region diversity is crucial for fault tolerance and availability.  Also, users are more geographically distributed, which necessitates distributed data stores.  Eventually consistent data stores have become increasingly popular because of the high scalability they afford, but I think in the future stronger consistency models will be required and developed.  Stronger consistency is easier to reason about and develop for, so future systems will provide it.  Megastore shows that a consistent system can be built with decent performance, and the future will bring more multi-datacenter systems with stronger consistency models.  Newer consistency models will also bring new programming models to handle the nuances of different consistency and failure modes.

Tuesday, September 20, 2011

The Chubby lock service for loosely-coupled distributed systems

The Chubby lock service is a system built by Google to provide coarse-grained locking and storage for low-volume data.  The primary goals are availability and reliability, not throughput or latency.  Because of these performance characteristics, it is only used for coarse-grained locking and not fine-grained locking; Google has shown that this is not a problem.  Most usage of Chubby has been for naming and metadata/configuration storage.  Before Chubby, Google services used ad hoc methods which duplicated work or required human intervention.  With Chubby, there is no longer a need for human intervention, which eases the maintenance of large distributed systems.  The core of Chubby uses Paxos for distributed consensus.

The Chubby service usually runs on about 5 machines running Paxos.  Reads are serviced directly by the master, and writes are propagated to all machines with the Paxos protocol.  A simple filesystem-like interface is provided so that it is easily browsable.  All directories and files can have advisory locks, and clients can hold the locks for long periods of time.  There are timeouts to ensure the system makes progress when clients fail.  To improve performance, clients can cache data, but they must stay connected to Chubby so that the master can invalidate cached information when necessary.  Writes block on cache invalidation so that the data remains consistent.  Caching greatly reduces the amount of communication between the clients and the master, which improves scalability.  The main factor in scaling Chubby is reducing communication, not server performance.
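
As an illustration of how coarse-grained locks plus small files get used in practice, here is a sketch of leader election against a Chubby-like interface.  The method names and path are hypothetical; the real Chubby API differs.

def elect_leader(chubby, my_address):
    f = chubby.open("/ls/my-cell/my-service/leader")
    # Advisory lock: every candidate tries to acquire it, one succeeds and
    # holds it for a long time, renewing its session via keep-alives.
    if f.try_acquire():
        f.set_contents(my_address)  # low-volume data: just the leader's name
        return my_address           # we are the leader
    # Everyone else reads the (client-cached) contents to find the leader;
    # the master invalidates that cache entry whenever the file changes.
    return f.get_contents()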

The Chubby model is something that may become more popular for distributed systems.  Instead of some ad hoc algorithm for metadata or configuration, Google created a centralized service which provides those features.  Even for massively distributed systems, some things are best done in a centralized fashion, or there is some global data which must be shared by all nodes, so there is always a need for some centralized service.  However, a centralized service could be a single point of failure; that is not the case for Chubby.  Chubby uses Paxos to agree upon values among several machines, which tolerates some failures and partitions, so the centralized service can provide greater reliability and availability.  In the future, there will probably have to be more consideration of network latencies and their non-uniformity.  I think distributed systems will become more global, not confined to a single datacenter, so a centralized service will need to handle varying latencies and an increased probability of partitions.  Changes will have to be made to the Chubby algorithms to take into account the longer latencies and cross-world network traffic.

Paxos Made Simple and Paxos Made Practical

Paxos is a distributed consensus algorithm formalized by Leslie Lamport.  The consensus problem is simple: a group of processes must agree on a single proposed value, which can be learned after the agreement.  For safety, only proposed values can be chosen, only a single value can be chosen, and a learner cannot learn a value unless it was actually chosen.  The basic Paxos algorithm by Lamport assumes that the set of processes does not change during execution.  The protocol can be described as a two-phase protocol.  It is different from two-phase commit because 2PC requires all processes to respond, while Paxos is based on majorities, which improves availability and response times.  In the first phase, proposers send PREPARE(n) messages to acceptors, and acceptors return promises not to accept any proposal numbered less than n.  Acceptors also return the value of the highest-numbered proposal they have accepted with number less than n.  If the proposer gets promises from a majority of acceptors, it then sends ACCEPT(n, v) messages to the acceptors, where v is the value of the highest-numbered proposal returned by the acceptors, or a new value if no proposal has been accepted yet.  This protocol ensures the safety of consensus, and with a distinguished leader it can also ensure liveness.  Paxos can be used to implement the replicated state machine model efficiently, because the leader can run phase 1 for an indefinite number of instances at once, so it only needs phase 2 messages to commit values.
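
The acceptor side of the two phases can be captured in a few lines.  This is a bare-bones sketch of a single Paxos instance, with no networking, persistence, or learner logic.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest prepare number promised so far
        self.accepted_n = -1        # number of the highest accepted proposal
        self.accepted_value = None

    def on_prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            # Promise not to accept anything numbered below n, and report the
            # highest proposal accepted so far (if any) back to the proposer.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n)

    def on_accept(self, n, value):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject", self.promised_n)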

"Paxos made practical" does not assume the acceptors are fixed and that they can fail or more can be added.  The set of cohorts in the group is maintained by paxos as well.  The view is the configuration of the cohorts, and a new view is agreed upon by using a paxos instance for a new view.  Using this protocol, new cohort configurations can be reliably modified to adapt to the changing environment.

Paxos is the standard distributed consensus algorithm for distributed systems today.  Paxos is used mostly for small amounts of metadata or configuration information, but lately more systems have been using Paxos in data stores to achieve consistency across several different nodes.  I think in the future, Paxos will be used more often to achieve consistency across nodes, as well as a way to replicate data across nodes.  Paxos will be used to achieve different semantics along the consistency spectrum.  Since Paxos is a quorum-based protocol, unlike 2PC, the protocol can provide better availability.  By using Paxos or some variation, systems can target different points in the CAP tradeoff space, and these different semantics will be useful for different system requirements.  There may even be asymmetric protocols where the different latencies between nodes are taken into account, in order to improve performance or provide certain guarantees.

Monday, September 19, 2011

Cluster-Based Scalable Network Services

The popularity of internet services has been growing because they eliminate software distribution and make customer support easier.  There are several reasons to build network services, but there are also several challenges in deploying them, including scalability, availability, and cost effectiveness.  The authors of this paper propose using a cluster of commodity machines with a high-speed interconnect, organized in a three-layered architecture.

A key tradeoff in the design of these systems is the observation that many internet services do not need the full consistency or durability of ACID semantics.  BASE (basically available, soft state, eventually consistent) semantics is usually good enough, and this allows for easier deployment and better performance.  With the relaxation of strict consistency, fewer messages are required during partial failures, which allows for better availability and performance.  BASE semantics also improves maintenance of the cluster because it allows stateless worker nodes, which provide easier fault tolerance and better scalability.

The layered approach provides a clean separation of tasks and modularity, which has many benefits.  By splitting workers from the front end, the workers can be stateless and can scale with ease; overflow workers can be allocated to handle bursty traffic efficiently.  Also, in the TACC layer, the modularized components can be composed together to implement a wide range of services, simplifying implementation.  TranSend and HotBot are two real-world internet services which implement the layered design on a cluster of commodity machines with success.  Several experiments on TranSend show that the design decisions around BASE and the layered approach can achieve linear scalability along with other benefits such as handling of bursty requests and load balancing.
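
The reason stateless workers scale so easily can be shown in a few lines.  This is purely illustrative and not code from the paper: because no worker holds hard state, the front end can send any request anywhere and add overflow workers without coordination.

import random

class FrontEnd:
    def __init__(self, workers):
        self.workers = list(workers)      # stateless TACC-style workers

    def add_overflow_workers(self, extra):
        self.workers.extend(extra)        # bursty traffic: just add capacity

    def handle(self, request):
        # Any worker will do, since state is soft and lives behind the workers.
        return random.choice(self.workers).process(request)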

The future will have even more internet services, or cloud services.  Most companies are now developing cloud services to provide highly available, scalable products, which means cluster-based services will be even more prevalent.  I think the architecture will probably still look much the same as today, but with more public clouds, like Amazon's EC2.  Not all companies can afford to build large clusters, so more internet services will be built on top of public clouds, like Netflix.  This means layers and modularity will become even more important.  There will probably be a growing library of modules which service developers can use.  These modules will be implementations of commonly used components of services, such as load balancers, monitoring tools, and others, and may be available through the cloud provider or as open source software.  I think the clean layered concept may disappear, but only because as services do more and become more complicated, a simple layered approach will not satisfy all cases.  Services will probably look like a DAG of internal services with defined APIs.  There will also be more emphasis on data storage, and different consistency and availability models, as data becomes more and more important.  The new models will be necessary because internet services will have to serve the entire world, not a single data center.  The global model will have different performance and availability characteristics which will drive different semantics for the data.

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

The CAP theorem states that it is impossible for an internet service to provide all three of consistency, availability, and partition tolerance at the same time; only two of the three can be provided simultaneously.  Consistency means operations on an object appear as if they were completed on a single instance of the object.  Availability means every request to a non-failing node must result in a response.  Partition tolerance means the system must tolerate the network losing arbitrarily many messages from one node to another.  The authors prove that it is impossible to provide an object with both atomic consistency and availability in an asynchronous network model, since it is impossible to determine whether a message was lost or arbitrarily delayed.  In a synchronous network model, weaker consistency can be achieved, and there can be bounds on the length of inconsistency after the network heals; this is possible by using timeouts and basing the bounds on them.  These results are very important and relevant to internet services today.  Most services are distributed over many nodes, and even across several datacenters.  The network is not 100% reliable, so partitions are always possible.  Therefore, all developers must consider the tradeoffs between availability and consistency when designing a system.
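
The core of the argument can be seen in a toy read path: when a replica cannot hear from its peer, it cannot tell a lost message from a slow one, so it must either answer with possibly stale local data or keep waiting.  This is only an illustration of the tradeoff, not code from the paper, and the replica object is hypothetical.

def read(replica, prefer_availability, timeout_s=1.0):
    try:
        # In an asynchronous network this may never return, and the replica
        # has no way to distinguish "lost" from "still in flight".
        return replica.fetch_latest_from_peer(timeout=timeout_s)
    except TimeoutError:
        if prefer_availability:
            return replica.local_value   # respond now, possibly inconsistent
        raise                            # stay consistent, but give up availability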

I think the CAP result will be increasingly important in the future because internet services will be increasingly global.  Data and services will span datacenters and regions.  The datacenter will be the new computer, and the world will be the new datacenter.  This means the networks between datacenters will be even more unreliable and unpredictable.  The probability of partitions will be greater, so the CAP theorem will be relevant more frequently.  Future services and applications will have to tolerate more partitions and will have to be developed with partitions and high latencies in mind.  This is a departure from the programming model of today, where programs for single machines or single datacenters are far more predictable.

Tuesday, September 13, 2011

The Datacenter Needs an Operating System

Clusters of commodity servers are becoming very popular, and it is important to be able to manage resources effectively.  The datacenter is becoming the new computer, and so an "operating system" is necessary to provide certain abstractions.  The main components for managing the datacenter are resource sharing, data sharing, programming abstractions, and debugging facilities.

Currently, resource sharing is available only at a coarse granularity, which is simple but limits sharing across different types of applications.  Mesos is one step in the right direction, providing fine-grained resource allocation.  Data sharing is most commonly done with distributed file systems, but this may limit performance; RDDs provide read-only partitions of data with known transformations to improve performance.  There are lots of programming frameworks like MapReduce and Pregel to help application developers, but there is a need for more specialized APIs for systems programming on a large cluster.  Finally, there are currently not many good tools for debugging large distributed programs.

The datacenter will absolutely need an "operating system," or some sort of management of its resources.  Clusters may have thousands of machines, so there are many more options for resource allocation and sharing decisions.  However, I think simple methods for resource sharing can work for most applications.  Clusters are usually over-provisioned, so utilization is usually far from 100%.  Since there are usually plenty of resources available, the scheduler only needs to avoid the worst options; optimal sharing is not required.  Also, adding resources is simple: just add machines, replace slower servers with faster ones, or add more datacenters.  Sharing will not be as much of an issue, but colocation will be: great performance benefits can be gained by colocating inter-dependent programs, because network communication latencies are long.

Something that will be very important is debugging and monitoring.  Debugging is already hard for single-machine programs, and debugging over many machines becomes exponentially more difficult.  Good monitoring and alerting tools must be developed in order to reduce manual management of the cluster.  Clusters will keep getting bigger, running an increasing number of applications and frameworks, and good monitoring tools will allow fewer people to manage a growing number of machines.

Monday, September 12, 2011

Amdahl's Law in the Multicore Era

Amdahl's law has to be slightly modified to describe the multicore CPUs of today.  The classic speedup formula was adapted for different chip configurations: symmetric, asymmetric, and dynamic multicore chips.  Plotting the resulting curves shows that asymmetric chips can provide better speedups than symmetric multicore chips, and dynamic chips can provide the best speedups of all.  However, asymmetric and dynamic chips require dealing with complicated scheduling and other overheads.
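
For reference, the three speedup formulas behind those plots, as I understand them from the paper, can be written down directly.  Here n is the chip's budget in base core equivalents (BCEs), r is the BCEs spent on one big core, perf(r) = sqrt(r) is the paper's assumed sequential performance of an r-BCE core, and f is the parallel fraction.

from math import sqrt

def perf(r):
    return sqrt(r)   # the paper's modeling assumption for a core built from r BCEs

def symmetric_speedup(f, n, r):
    # n/r identical cores, each with performance perf(r).
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric_speedup(f, n, r):
    # One big r-BCE core plus (n - r) single-BCE cores.
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic_speedup(f, n, r):
    # Fuses into one big core for sequential code, splits into n cores otherwise.
    return 1.0 / ((1 - f) / perf(r) + f / n)

if __name__ == "__main__":
    f, n, r = 0.975, 256, 16
    print(symmetric_speedup(f, n, r), asymmetric_speedup(f, n, r), dynamic_speedup(f, n, r))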

Something this paper did not address was the cost of the different types of CPUs, which will be an important factor in how the future turns out.  Right now, most clusters are built using symmetric multicore chips, because they are less expensive and easier to use than specialized chips.  This may lead to wasted resources and non-optimal speedups, but the cost effectiveness and ease of deployment make it worthwhile.  Since most chip manufacturers build symmetric chips for most consumers, their costs will stay fairly low, which will keep them attractive when designing a datacenter or cluster.

Another important factor is the software or firmware required to take advantage of asymmetric and dynamic chips.  I do not know exactly what software is involved in using asymmetric or dynamic multicore chips, but I imagine the required changes are not trivial.  There will probably have to be new and better compilers to exploit the speedups offered by specialized chips, and new programming languages or constructs may need to be developed.  All of this either does not exist or is in its infancy, so there is a lot of work to be done to achieve the theoretical speedups of the newer hardware.  These challenges may make the specialized CPUs not worthwhile and force large warehouse-scale clusters to continue using simple symmetric chips.

Sunday, September 11, 2011

Performance Modeling and Analysis of Flash-based Storage Devices

SSDs are becoming more prevalent in computers today, and will become the standard storage solution for datacenters and clusters.  The authors of this paper developed performance models for SSDs because their architecture is vastly different from spinning hard disks.  SSD technology provides low latency, but also suffers from slow updates and expensive block-level erasures.  The authors used several workload characteristics such as read/write ratio, request size, and queue depth to model latency, bandwidth, and IO throughput.  Their linear regression models predicted real-world performance well for two OLTP and two web search workloads.
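
The modeling approach itself is straightforward to sketch: collect workload feature vectors and measured latencies, then fit a linear model by least squares.  The helper functions below are my own shorthand for that idea, not the paper's code.

import numpy as np

def fit_latency_model(features, latencies):
    # features: (samples, k) array of workload characteristics such as
    # read/write ratio, request size, and queue depth; latencies: measurements.
    A = np.hstack([features, np.ones((features.shape[0], 1))])  # add intercept
    coeffs, _, _, _ = np.linalg.lstsq(A, latencies, rcond=None)
    return coeffs

def predict_latency(coeffs, feature_row):
    # Linear prediction: dot product of the features (plus intercept) and weights.
    return float(np.dot(coeffs, np.append(feature_row, 1.0)))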

In the next 5 to 10 years, SSDs will play a bigger part in computing infrastructure.  Currently, SSD technology is not cost effective enough to store all the necessary data; hard disks still have a huge advantage in cost per bit.  However, SSDs are slowly improving their cost/capacity ratio, and could become the main storage in warehouse-scale clusters, with disks becoming the "archival tape" of the future.  SSDs have very different performance characteristics, as the paper shows, and different access patterns favor SSDs or hard drives.  When the main storage for clusters is replaced by SSDs, the software stack will have to be modified to take advantage of the different characteristics.  Most software assumes hard-disk performance for secondary storage when optimizing, so it will be very important for programmers to adapt to the changing performance characteristics in order to take advantage of SSDs.

Since the performance of SSDs is different from hard disks, the architecture of distributed storage systems may also have to change.  Distributed file systems such as GFS or HDFS assume hard disks are the storage medium, so they adopt an append-only model, since hard disks are very good at sequential writes.  SSD write performance is often not as good as hard disks' sequential write performance because of the block-level erasure requirement, but SSDs have great random read performance.  These differences will cause distributed file systems to change.

Wednesday, September 7, 2011

Warehouse-Scale Computing: Entering the Teenage Decade

Warehouse-scale computing (WSC) seems like a recent phenomenon, but it is not exactly revolutionary.  In 1993, David Patterson gave a talk on datacenter computing where the building was the computer.  The difference with recent warehouse-scale computing is that the scale is much larger and it runs many more internet services and applications.  Datacenter computing may have petabytes of storage, but a WSC is on alert if there are only petabytes of storage free.  Bigger data and bigger problems now require larger-scale computing.

WSC is now beginning its teenage years.  About 10 years ago, Google released a paper describing their software and hardware, and the paper was only 7 pages long.  As the internet grew in popularity, the need for farming out work and services to compute clusters became more apparent and gave rise to more innovations in WSC.  The increase in new internet applications was the driving force behind creating and managing larger and larger clusters, in order to provide the interactivity that the new applications required.  WSC is now past its infancy, and as it becomes more sophisticated, several challenges remain.

Warehouse-scale computing is usually more cost effective when more, but weaker, CPUs are used, since aggregate CPU performance matters more.  Recently, however, faster CPUs have made a comeback because of embarrassingly parallel workloads: simple parallelism is very easy to implement and reason about.  Google mostly uses request-level parallelism, and if weaker CPUs are used, each request takes longer to complete.  Speeding requests up would require parallelizing within each request, which takes more engineering effort.

WSCs also have more options for storage now.  Flash drives are becoming prevalent and provide far better random read performance.  However, write performance is sometimes worse than disk, and the latency tail is very long.  This can cause problems for applications at Google's scale, because the latency tail introduces many more delays.

WSC power efficiency is an important metric.  There have been large improvements in the PUE (power usage effectiveness) of datacenters, and Google has reached a PUE of around 1.09-1.16.  Machines have also gotten better in terms of power efficiency relative to peak output.  However, it is difficult for an entire building to use close to the full available amount of power, since there will always be fluctuations which could cause an overload.  One solution is for each machine or rack to have a UPS, which has many advantages: it helps survive power outages, allows machines to ride through power load peaks, and allows reliable use of fickle renewable energy sources.
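
As a reminder of what those PUE numbers mean (my own arithmetic, not from the talk): PUE is total facility power divided by the power delivered to the IT equipment, so a PUE of 1.12 means roughly 11% of the incoming power is overhead.

def pue(total_facility_kw, it_equipment_kw):
    return total_facility_kw / it_equipment_kw

# Example: 1120 kW entering the building for 1000 kW of servers -> PUE of 1.12,
# i.e. 120 kW spent on cooling, power distribution losses, and other overhead.
print(pue(1120.0, 1000.0))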

Networking is becoming a bottleneck in these systems.  Storage is getting faster, but network connectivity cannot keep up well enough to disaggregate resources in the datacenter.  Disks are slow enough to tolerate today's networks, but with the emergence of flash technology, the networking needs to improve.

A lot more research has to be conducted in order to reduce latencies.  IOs are actually getting faster, but the software stack is still the same as in the past.  This would be fine if CPU clock speeds continued to grow, but they are leveling off.  At Google, incoming query traffic still creates a very bursty request pattern at the leaf nodes.  This is because most software and research has concentrated on throughput (batching, Nagle's algorithm) instead of latency.  Therefore, there are plenty of research opportunities in minimizing latencies.

Tuesday, September 6, 2011

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (chapters 3, 4, 7)

chapter 3

Clusters of low-end servers are preferred over clusters of high-end servers for warehouse-scale computing because of economies of scale.  Low-end servers can take advantage of high-volume personal computing hardware, which drives prices down.  However, in a large cluster with many cores (whether low-end or high-end machines), communication costs greatly affect performance.  With a communication-heavy workload, a single large SMP machine may outperform a cluster by 10x.  However, if a problem is too large even for a single SMP machine, a cluster can be formed from high-end machines.  Experiments show that as the cluster grows, the performance advantage of the high-end cluster diminishes: a cluster of low-end machines can perform within 5% of a cluster of high-end machines.  This shows that the price premium for large high-end machines is not worth it for large clusters.

There have been arguments that this model can be pushed further by using even more, slower, single-core CPUs to cut hardware costs.  However, there is a point of diminishing returns.  The hardware may be cheaper, but it will take more parallel tasks and more network coordination and optimization to finish a task in the same amount of time as with multi-core CPUs.  If there is global state, using smaller, slower cores will require much more communication or the use of more conservative (and less efficient) heuristics.

chapter 4

Datacenters are very large systems which consume a lot of power and generate a lot of heat.  Much of the cost of building a datacenter goes to power distribution and cooling.  There are different tiers of datacenters, according to the amount of reliability and redundancy in the power and cooling paths.

Utility power enters the datacenter and feeds into the uninterruptible power supply (UPS).  The UPS also has an input from generator power.  The UPS contains a transfer switch to detect utility power failure and switch over to generator power.  During the gap before the generator kicks in, the UPS uses its battery or flywheel to provide power.

CRAC (computer room air conditioning) units take in the hot air from the server room, cool it down, and push the cold air out through the floor.  The cold air comes up through the floor into the racks and flows through the servers, the hot air is expelled, and the cycle continues.  The cooling units themselves use chilled water to cool the air.  Some datacenters use free cooling, which cools the water more efficiently than a chiller; for example, cooling towers can cool the warm water by evaporation, a method that works well in dry climates.

The airflow through a server and a rack of servers determines how much airflow the floor must push out, which is in turn determined by the fans of the CRAC units.  More servers mean more airflow must be supported, and eventually there may be limits where it becomes impractical to increase the airflow any further.  In-rack cooling involves bringing chilled water to each rack and doing the warm-air heat exchange at each rack.  Container-based datacenters usually use methods similar to CRAC cooling, but in more modular sizes.

chapter 7

In large clusters, failures are much more probable than on a single machine, so it is rare for all machines in the cluster to be functional at once.  This means software has to be fault tolerant.  With fault-tolerant software, the hardware does not have to be as reliable, and there are more options for handling hardware upgrades.  Usually, the minimum requirement for the hardware is to be able to detect failures.  Surveys of internet services and Google services have shown that hardware failures cause relatively few outages compared to configuration problems and software bugs.

Based on studies of machine reboots at Google, a service using 2,000 servers would see a reboot about every 2.5 hours, or about 10 reboots a day.  Since most reboots complete within 2 minutes, about 20 additional spare machines would have to be provisioned to safely keep the service available.  There is significant indirect evidence that software-induced machine crashes are more common than those induced by hardware; the most common hardware causes are DRAM errors and disk errors.

An efficient repair process is very important for such large-scale clusters.  However, some features of WSCs can help the repair system.  Since there is usually a lot of spare capacity in the cluster, a technician can more efficiently repair machines on a schedule instead of repairing every failure immediately.  Also, with so many failures, the data can be collected and analyzed: Google uses machine learning to determine the best course of action for a repair and to detect faulty hardware or firmware.