Monday, November 14, 2011

VL2: A Scalable and Flexible Data Center Network and PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric and c-Through: Part-time Optics in Data Centers

VL2 is a datacenter network designed to be cost effective and agile, for better utilization and performance of the cluster.  Current datacenter networks usually use expensive hardware in a tree-like structure and cannot provide the agility required for today's services.  The high costs limit the available capacity between any two nodes, the network does little to isolate traffic floods, and the fragmented address space is a nightmare to configure.  VL2 solves these problems by providing uniform high capacity, performance isolation, and layer-2 semantics.  VL2 uses low-cost switch ASICs in a Clos topology and Valiant load balancing for decentralized coordination.  VL2 chooses paths for flows, rather than for individual packets, for better TCP performance.  Measurements of datacenter traffic show that it is highly variable, and that the hierarchical topology is inherently unreliable near the top.  VL2 scales out with many low-cost switches instead of scaling up to expensive hardware, which provides high aggregate and bisection bandwidth.  An all-to-all data shuffle showed that VL2 achieved 94% efficiency and that a conventional hierarchical topology would have taken 11 times longer.  Experiments also showed that VL2 provides fairness and performance isolation.
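
As a rough sketch of the flow-level Valiant load balancing idea (my own illustration, not VL2's actual implementation): a switch can hash a flow's 5-tuple to pick a random intermediate switch, so every packet of a flow takes the same path while different flows spread across the Clos fabric.

```python
import hashlib

# Hypothetical list of intermediate (spine) switches in a Clos topology.
INTERMEDIATE_SWITCHES = ["int-1", "int-2", "int-3", "int-4"]

def pick_intermediate(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Valiant load balancing at flow granularity: hash the 5-tuple so every
    packet of a flow bounces off the same randomly chosen intermediate switch,
    preserving in-order delivery for TCP while spreading flows across paths."""
    flow_key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.md5(flow_key).digest()
    index = int.from_bytes(digest[:4], "big") % len(INTERMEDIATE_SWITCHES)
    return INTERMEDIATE_SWITCHES[index]

if __name__ == "__main__":
    # Two flows between the same hosts can land on different intermediates,
    # but each flow always hashes to the same one.
    print(pick_intermediate("10.0.1.5", "10.0.9.7", 51000, 80))
    print(pick_intermediate("10.0.1.5", "10.0.9.7", 51001, 80))
```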

PortLand is a scalable, easy-to-maintain, fault-tolerant layer-2 network fabric for datacenters.  The key insight is that the baseline topology and the growth model of the network are known, so they can be leveraged.  Current network protocols incur large overheads to route and manage the 100,000s of machines in a datacenter.  PortLand uses a centralized fabric manager, which holds the soft state of the topology, and pseudo MAC (PMAC) addresses, which separate host location from host identity and reduce forwarding table sizes.  The fabric manager eliminates most broadcasts in the common case because it can answer ARP requests from its PMAC mappings.  Switches use a distributed location discovery protocol to determine where in the topology they sit, which enables efficient routing with no manual configuration.  PortLand keeps routing loop-free with a simple up/down protocol: packets are routed up through aggregation switches, then down to the destination node.  Faults are detected during location discovery and forwarded to the fabric manager, which notifies the rest of the network; this takes only O(N) messages instead of the O(N^2) required by broadcast-based protocols.  Experiments show that PortLand is scalable, fault-tolerant, and easy to manage.
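
A small sketch of the PMAC idea: pack a host's topology coordinates into a 48-bit pseudo MAC so switches can forward by prefix. PortLand encodes pod, position, port, and VM id; the exact field widths below are just one plausible packing for illustration.

```python
def make_pmac(pod, position, port, vmid):
    """Pack a host's topology coordinates into a 48-bit pseudo MAC (PMAC).
    PortLand encodes pod, position, port, and VM id into the PMAC so switches
    can forward by prefix match; the field widths here (16/8/8/16) are
    illustrative, not taken from the paper verbatim."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

# The fabric manager keeps IP/actual-MAC <-> PMAC mappings and answers ARP
# requests from them, so broadcasts are avoided in the common case.
arp_table = {"10.2.4.5": make_pmac(pod=2, position=4, port=1, vmid=1)}
print(arp_table["10.2.4.5"])   # 00:02:04:01:00:01
```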

c-Through is a new system that augments packet switching with optical circuit switching in a hybrid network.  Traditional hierarchical topologies have cost and capacity limitations, and fat-tree topologies require many more switches and links, which makes wiring harder to manage.  Optical fibers can support much greater capacities, but they work best with coarse-grained flows rather than packet switching.  HyPaC is a hybrid architecture that keeps the traditional packet-switched network but adds a second high-speed rack-to-rack optical circuit-switched network.  The optical network reconfigures much more slowly, so high bandwidth between two racks is provided on demand.  c-Through implements the HyPaC architecture by using increased buffers to estimate rack-to-rack traffic and keep the optical links fully utilized, and it uses those estimates to compute an optimal matching between racks.  This can be done within a few hundred milliseconds for 1000 racks.  The HyPaC architecture is emulated, and experiments show that the emulation behaves as expected and that the large buffers do not result in large packet delays.  Several experiments with over-subscribed networks showed that c-Through performed similarly to a network with full bisection bandwidth.
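
Since the optical switch can connect each rack to at most one other rack at a time, circuit scheduling reduces to a matching problem over the estimated rack-to-rack demand. A toy illustration (using networkx's general maximum-weight matching as a stand-in for the paper's matching algorithm; the demand numbers are made up):

```python
import networkx as nx

# Hypothetical estimated rack-to-rack traffic demand (e.g. bytes queued).
demand = {
    ("rack1", "rack2"): 40,
    ("rack1", "rack3"): 90,
    ("rack2", "rack4"): 75,
    ("rack3", "rack4"): 10,
}

# The optical switch can connect each rack to at most one other rack, so
# c-Through-style circuit scheduling picks the set of rack pairs that carries
# the most traffic, i.e. a maximum-weight matching over the demand graph.
G = nx.Graph()
for (a, b), traffic in demand.items():
    G.add_edge(a, b, weight=traffic)

circuits = nx.max_weight_matching(G, maxcardinality=True)
print(circuits)  # e.g. {('rack1', 'rack3'), ('rack2', 'rack4')}
```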

Future networks will move towards lots of lower-cost network hardware, just as clusters have moved towards many low-cost commodity machines.  However, providing large capacity with many commodity switches will be the challenge.  Fat-tree/Clos topologies like the one in VL2 will be popular since they use lots of commodity switches.  Large companies like Facebook, Google, and Microsoft can easily purchase switches and cables in bulk for their datacenters, so the extra hardware and wiring will not be a huge problem.  Easier maintenance will be crucial for the network to be sustainable, so techniques similar to the ones in PortLand will be required.  Having easy mechanisms for fault tolerance, fairness, and isolation will be important; these mechanisms will be required for a resource manager like Mesos to be able to allocate network resources to applications.  I don't think special optical switches will make it into large datacenters, because they would add management of new hardware; rather, more switches and wires could make it into the cluster.

Saturday, November 12, 2011

An Operating System for Multicore and Clouds: Mechanisms and Implementation and Improving Per-Node Efficiency in the Datacenter with New OS Abstractions

fos (factored os) is a system from MIT that tackles the problem of running applications on multicore machines or clusters of machines.  Current cloud systems force the user to manage individual virtual machines, as well as the communication between machines and between cores.  fos reduces this complexity by providing a single OS image for the whole system and abstracting the hardware away, so the developer sees one OS image whether it runs on a multi-core system or on a cluster of machines.  Each sub-system in fos is factored, parallelized, and distributed over cores or machines.  The single system image provides advantages such as ease of administration, transparent sharing, better optimizations, a consistent view of the system, and fault tolerance.  fos prioritizes space multiplexing over time multiplexing, factors system services into parallel distributed servers, adapts to resource needs, and provides fault tolerance.  The fos server model uses RPCs to remote servers but hides them from the developer, so code can be written without explicit RPCs.  fos intercepts file system calls and forwards them to the file server, and can also spawn new VMs on any machine seamlessly.  Experiments with syscalls, file system performance, and spawning showed that fos had high latency and variability, but a simple HTTP server performed very similarly to a Linux implementation.
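
A toy illustration of the "hidden RPC" idea (not the fos API; the names are made up): a proxy object marshals ordinary method calls into messages for a server, so application code never writes explicit RPCs.

```python
import pickle

class FileServerStub:
    """Toy in-process stand-in for a fos-style file server."""
    def read(self, path):
        return f"<contents of {path}>"

class RemoteProxy:
    """Illustrative sketch only: method calls on the proxy are serialized and
    shipped to a server, so application code looks like it is making ordinary
    local calls even though a message crosses cores or machines."""
    def __init__(self, server):
        self._server = server
    def __getattr__(self, name):
        def call(*args, **kwargs):
            request = pickle.dumps((name, args, kwargs))   # marshal the call
            op, a, kw = pickle.loads(request)              # "server side"
            return getattr(self._server, op)(*a, **kw)
        return call

fs = RemoteProxy(FileServerStub())
print(fs.read("/etc/hostname"))   # looks like a local call; really a message
```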

Akaros is an OS that improves the per-node efficiency, performance, and predictability of nodes in a cluster in order to improve the performance of internet services.  This can also improve fault tolerance, since fewer nodes are required to do the same amount of work.  Traditional OSes are not well suited to large-scale SMP machines, so Akaros proposes four main ideas.  The underlying principle is to expose as much information to the application as possible so it can make better decisions.  Akaros exposes cores instead of threads, so an application manages the cores it is allocated, and different cores can be given different time quanta for different latency needs.  Akaros also uses a block abstraction for moving data, which is more efficient than transfers between arbitrary memory addresses, and it separates provisioning resources from allocating resources.  Abstractions can be very useful, but they can reduce performance, so Akaros tries to be as transparent as possible for applications while providing useful defaults for simplicity.  The many-core process abstraction lets processes perform their own scheduling on their cores; the cores are gang-scheduled and the kernel does not interrupt them.  All system calls are issued through an asynchronous API, to decouple IO from parallelism.  Applications can pin pages in memory, and Akaros also allows application-level memory management for low latency.  In addition to block-level transfers, disk-to-network, network-to-network, and network-to-disk transfers are optimized.  By separating resource provisioning from allocation, the hardware can be better utilized.
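
A rough analogy for the asynchronous system call interface (plain Python, not the Akaros API): the application issues the call, keeps its core busy with useful work, and collects the completion later.

```python
from concurrent.futures import ThreadPoolExecutor

# A thread pool stands in for the kernel completing system calls in the
# background; the "process" keeps its core busy instead of blocking.
kernel = ThreadPoolExecutor(max_workers=4)

def sys_read(path):
    with open(path, "rb") as f:
        return f.read()

# Issue the call asynchronously, then go do useful work on this core.
pending = kernel.submit(sys_read, __file__)
useful_work = sum(i * i for i in range(10_000))

# Later, harvest the completion.
data = pending.result()
print(useful_work, len(data))
```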

Both projects focus on improving performance and utilization for a cluster, but creating a new os.  fos tries to create a global os, where Akaros focuses on the single node os.  I think the future of clusters will need efforts in both areas.  Having a single os abstraction is nice, but I think fos goes too far in the abstractions. I think developers are just fine with explicitly writing rpcs, or explicitly sharing memory for local threads. Having a VM for each thread is overkill, and I think the future global os will be more lightweight and provide better global resource scheduling for sharing the cluster with many applications.  The future os for the single node should be different than todays stock os, since the stock os was built for different purposes.  However, I don't know if rebuilding the entire os is the way to go.  Mostly, starting from a linux base will be sufficient, and maybe all the features of Akaros will not be necessary.  I think one of the most important features will be the different types of block transfers of data.  Since data sizes are growing and more analysis is being done, more data is being transfered, so efficient large data transfers of Akaros should be implemented in linux systems.

Sunday, November 6, 2011

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center and Hadoop NextGen

Mesos is a platform for resource sharing in a cluster of commodity servers.  Sharing a cluster between different frameworks like Hadoop or MPI is beneficial because it can increase utilization and data locality, which reduces costs and improves performance.  Mesos uses distributed two-level scheduling: Mesos offers resources to the frameworks, and each framework chooses which offers to accept.  The common ways to share a cluster today are to statically partition it or to use VMs, but neither achieves high utilization or efficient data sharing.  Scheduling and resource sharing could instead be done with a centralized global scheduler, but that gets very complicated.  Mesos uses a distributed model in which frameworks accept the resources they need and use them how they wish.  This allows frameworks to implement their own scheduling, which many of them already do.  Pushing more control to the frameworks is beneficial because frameworks can implement more complicated policies, while Mesos itself stays simpler and more stable.  Mesos does not require frameworks to specify their requirements, but allows them to reject offers; this keeps Mesos simple while letting frameworks implement complicated resource constraints.  Isolation between frameworks is achieved with pluggable isolation modules, including Linux containers.  To scale resource offers, frameworks can provide filters for resources, Mesos counts outstanding offers as part of a framework's allocation, and it rescinds offers that stay outstanding too long.  The Mesos master maintains only soft state, which can be reconstructed from the slaves and framework schedulers.  The decentralized scheduling model incentivizes frameworks to use short, elastic tasks and to accept only the resources they will actually use.  Experiments with 4 different frameworks showed that Mesos scheduling had better utilization and improved task completion times over static partitioning.  The overhead of Mesos is about 4%, and it could scale up to 50,000 nodes with sub-second task launching overhead.
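
A minimal sketch of the resource-offer model (the names are invented, not the Mesos API): the master offers whole resource bundles, and the framework's own scheduler takes only what it can use.

```python
# Illustrative sketch of Mesos-style two-level scheduling: the master hands
# out offers, and the framework scheduler decides which parts to accept.

offers = [
    {"slave": "s1", "cpus": 4, "mem_gb": 8},
    {"slave": "s2", "cpus": 2, "mem_gb": 4},
]

class WordCountScheduler:
    task_cpus, task_mem_gb = 1, 2

    def resource_offers(self, offers):
        """Accept only what we will actually use; the rest is declined and
        can be offered to another framework."""
        launched = []
        for offer in offers:
            while (offer["cpus"] >= self.task_cpus
                   and offer["mem_gb"] >= self.task_mem_gb):
                offer["cpus"] -= self.task_cpus
                offer["mem_gb"] -= self.task_mem_gb
                launched.append({"slave": offer["slave"],
                                 "cpus": self.task_cpus,
                                 "mem_gb": self.task_mem_gb})
        return launched

print(WordCountScheduler().resource_offers(offers))
```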


The current Hadoop framework can support only about 4,000 nodes, which limits the size of the clusters that can be used.  The current architecture hits scalability limits and also forces a system-wide upgrade for any bug fix or new feature, which makes it very difficult to maintain a large cluster while still providing good uptime.  In next-generation Hadoop, the JobTracker's functionality is split into two separate components, the ResourceManager and the ApplicationMaster.  The ResourceManager only manages the cluster's resources, and it offers fine-grained resources rather than the fixed slots of the current system, which allows for better utilization.  The ApplicationMaster manages the scheduling and monitoring of the tasks.  Offloading the management of task lifecycles to the applications themselves makes the ResourceManager much more scalable.  It also allows MapReduce itself to become essentially part of the user's library code, which makes upgrades easy because the entire system does not have to be upgraded together.  The fine-grained resources are fungible, so utilization is improved.  The resource management is similar to Mesos in that it has a two-level scheduling component: application frameworks can accept or reject resources given by the ResourceManager.

The two-level scheduling of both Mesos and next-generation Hadoop is an efficient and scalable way to manage resources in a large cluster, so it will probably become more popular in the future.  Having a soft-state resource manager will be important for scalability and performance because allocations can operate in memory and failures are easy to recover from.  However, I think there will have to be some sort of feedback mechanism on what individual frameworks require to run.  Also, some applications will need to express placement constraints, such as requiring a server to run on the same rack as database shard 4; these kinds of constraints are important for both performance and reliability.  Finally, there will have to be a way to isolate disk IO and throughput.  Being able to share and isolate disk IO would be useful for data-intensive applications, though it adds complexity beyond CPU and RAM resources.

Monday, October 31, 2011

Piccolo: Building Fast, Distributed Programs with Partitioned Tables and Spark

Piccolo is an in-memory programming model for large applications in data centers.  It allows programs to access mutable, distributed, in-memory state through a key-value data model.  Piccolo differs from existing solutions based on MPI, or on data flow models like MapReduce that provide no global state.  With Piccolo, developers just define kernel functions, which are distributed and access the distributed data.  Exposing global shared state makes many types of applications easier to write.  The application accesses state with simple get() and put() APIs, and can define accumulator functions to resolve conflicting updates.  The developer can also define locality preferences so that kernels run near the data they mostly access, reducing network delays.  A master process coordinates all the kernels running on the workers, and to achieve load balancing, workers can "steal" un-started kernels from other workers.  To handle faults, Chandy-Lamport checkpoints are taken online, because simply restarting kernels does not work when state may already have been mutated.  Many different types of applications have been built with Piccolo, such as a distributed crawler, PageRank, k-means clustering, and matrix multiplication.  In experiments, scaling was decent with more workers, and also when scaling the input size along with the workers on a local cluster; on EC2, scaling was closer to linear.  Piccolo implementations of PageRank and k-means clustering outperformed Hadoop implementations, mostly because Hadoop had to sort lots of data, serialize keys, and read/write HDFS, costs that Piccolo largely avoids.  Experiments with work stealing also showed that performance can improve greatly when load is imbalanced across the cluster.
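
A minimal single-process sketch of the table interface (method names loosely follow the paper's get/put/update-with-accumulator model; everything else is simplified): concurrent updates to the same key are merged by a user-supplied accumulator instead of requiring locks.

```python
class Table:
    """Toy partitioned key-value table in the spirit of Piccolo."""
    def __init__(self, num_partitions, accumulator):
        self.partitions = [dict() for _ in range(num_partitions)]
        self.accumulator = accumulator
    def _part(self, key):
        return self.partitions[hash(key) % len(self.partitions)]
    def get(self, key):
        return self._part(key).get(key)
    def put(self, key, value):
        self._part(key)[key] = value
    def update(self, key, value):
        part = self._part(key)
        part[key] = self.accumulator(part[key], value) if key in part else value

# PageRank-style kernels blindly update() shared state; the accumulator (sum)
# resolves the concurrent contributions.
ranks = Table(num_partitions=4, accumulator=lambda old, new: old + new)
for contribution in [0.5, 0.25, 0.25]:
    ranks.update("page-42", contribution)
print(ranks.get("page-42"))   # 1.0
```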

Spark and resilient distributed datasets (RDDs) provide a distributed memory model for programmers to write in-memory applications on large clusters.  RDDs provide fault tolerance for in-memory data by only allowing coarse-grained transformations on it.  RDDs are extremely useful for reusing data and computation and for iterative algorithms, since the data can be kept in memory across steps.  Existing systems provide fault tolerance by replicating data or logging updates, but RDDs provide efficient fault tolerance by storing the lineage of transformations and recomputing lost partitions.  Developers can control the partitioning of RDDs and their persistence in memory.  The Spark programming interface for RDDs is similar to DryadLINQ and FlumeJava, defining new RDDs with transformations.  Since the transformations are coarse-grained, the RDDs are immutable, and the lineage is remembered, RDDs behave very differently from distributed shared memory.  Representing an RDD requires five pieces of information: partitions(), preferredLocations(), dependencies(), iterator(), and partitioner().  There are two classes of dependencies: narrow, where each partition of the parent RDD is used by at most one partition of the child RDD, and wide, where a parent partition is used by many child partitions.  Narrow dependencies can be pipelined and allow for more efficient recovery.  In experiments, Spark ran iterative algorithms up to 20x faster than Hadoop, and analytics programs up to 40x faster.  Recovery was fast, because only the failed partitions of the RDDs had to be regenerated.  Also, a 1 TB dataset could be interactively queried within 5-7 seconds.
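
A conceptual sketch of lineage-based recovery (plain Python, not the Spark API): each dataset records the transformation and parent that produced it, so a lost partition can be recomputed on demand rather than replicated.

```python
class RDD:
    """Toy RDD: either materialized partitions, or lineage (parent + fn)."""
    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions      # materialized data, or None
        self.parent, self.fn = parent, fn  # lineage

    def map(self, f):
        # Coarse-grained transformation applied to every element of a partition.
        return RDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, i):
        if self._partitions is not None:
            return self._partitions[i]
        # Narrow dependency: rebuild this partition from the parent's partition.
        return self.fn(self.parent.compute(i))

base = RDD(partitions=[[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)

print(squares.compute(0))   # [1, 4]
# Pretend partition 1 was lost from cache: lineage recomputes it on demand.
print(squares.compute(1))   # [9, 16]
```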

These two distributed in-memory systems solve the problem in different ways: Piccolo focuses on mutable, fine-grained updates, whereas Spark handles coarse-grained transformations.  There are tradeoffs between the systems, but the future will clearly need both types of models.  Certain applications, such as web applications, will need asynchronous, fine-grained updates, and others will need batch processing over lots of data.  The fine-grained approach is most similar to traditional RDBMSes, and the coarse-grained approach is most similar to OLAP systems.  In the future, fine-grained approaches will have to incorporate more batch and performance techniques like those in Spark, and coarse-grained approaches will try to incorporate more asynchronous and immediate updates and processing.

Relational Cloud: A Database-as-a-Service for the Cloud and Database Scalability, Elasticity, and Autonomy in the Cloud

Relational Cloud is an effort to develop a database-as-a-service for the cloud, taking over operational tasks from users and thereby lowering their costs.  There are several database-as-a-service products already, such as Amazon RDS and Microsoft SQL Azure, but Relational Cloud focuses on efficient multi-tenancy, elastic scalability, and database privacy.  Efficiently running many database instances is advantageous because more databases can be consolidated onto fewer machines, reducing costs.  The basic design uses existing, unmodified RDBMSes as backend storage servers, with clients coordinating queries and communication among them.  Each tenant uses separate databases and tables, which are partitioned, and partitions can be migrated between servers.  CryptDB-enabled client libraries allow for server-side cryptography for privacy.  The partitioning strategy is workload-aware: periodically, workload traces are analyzed to identify sets of tuples that are accessed together.  The Kairos component gathers performance statistics such as CPU usage and RAM, and determines a workload's working set by slowly growing a probe table until IO starts increasing.  A model has been developed to predict how multiple databases and workloads will combine on a single server, and a placement model assigns partitions to servers in order to minimize the number of servers while balancing load.  Adjustable security is achieved by layering several encryption techniques of decreasing strength; more queries can be computed at the lower levels of encryption, and the system decrypts data only as far as needed to execute the query.  In experiments with several tenants, Relational Cloud achieved consolidation ratios of 6:1 to 17:1 and performed much better than databases in VMs, because a single server can allocate resources more effectively and share the same log and buffer pool.
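
A toy model of the adjustable-encryption idea (layer names follow the CryptDB onion; everything else is simplified and no real cryptography is involved): the system strips layers only as far down as the query's operations require.

```python
# Outermost (most secure) to innermost (plaintext).
ONION = ["RND", "DET", "OPE", "PLAIN"]

REQUIRED_LAYER = {
    "select":   "RND",   # just returning the value needs no server-side compute
    "equality": "DET",   # WHERE col = ? needs deterministic encryption
    "range":    "OPE",   # WHERE col < ? needs order-preserving encryption
}

def layer_for(operations):
    """Decrypt only as far down the onion as the query's operations require."""
    needed = max(ONION.index(REQUIRED_LAYER[op]) for op in operations)
    return ONION[needed]

print(layer_for({"select"}))              # RND
print(layer_for({"equality", "range"}))   # OPE
```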

When running databases in the cloud, scalability, elasticity, and autonomy are important features.  RDBMSes are traditionally seen as lacking these features and unable to scale in the cloud, while key-value stores have become popular in cloud architectures because of their highly independent access patterns.  However, applications sometimes need transactions and stronger consistency, and several techniques can provide them: you can start with a key-value store and add transactional features, or start from an RDBMS and add key-value-store-like scalability.  Data fusion is a technique for key-value stores that groups keys into entities in order to provide transactions among them; G-Store employs this technique, along with dynamic partitioning of keys.  Data fission is a technique for DBMSes that partitions a database into shards and provides full database semantics within each shard independently, which allows for scalability similar to key-value stores.  Elasticity is important for starting up or powering down servers according to utilization, and live migration is a key technique for achieving it in both shared-disk and shared-nothing systems: iterative copy was developed for shared-disk systems, and Zephyr for shared-nothing systems.  For autonomy, machine learning algorithms are used to model tenant behavior and placement, and to decide when to migrate, what to migrate, and where to migrate it.

Database systems in the cloud will be very important since the cloud model is becoming very popular.  There are several public clouds that can support multi-tenancy and other cloud features.  However, I don't think this model will be very useful for large companies or large datasets.  Multi-tenant databases on public clouds will be useful for small to medium databases, under the assumption that the workload or size will not change much.  If the data grows too large and the workload becomes too demanding, a dedicated system will perform much better and with better isolation, and multi-tenancy will not help.  In addition, companies may want to keep their own data rather than put it in the public cloud, even with privacy and security built in.  Many of the multi-tenancy concepts should still be useful in an internal setting, letting companies keep their data while freeing tenants from the operational details.  Maybe there will be products or open-source solutions for reliable multi-tenancy on local clusters.

Saturday, October 22, 2011

BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud and Erlang - A survey of the language and its industrial applications

BOOM Analytics is an effort to build large distributed systems in a data-centric, declarative way.  The higher level of abstraction can simplify code and speed up development and debugging.  Building large distributed systems for the cloud is difficult because it requires the developer to be careful about concurrent programming and fault-tolerant communication; BOOM attempts to make it easier with data-centric design and declarative programming.  Overlog is used as the declarative language for these cloud systems.  Overlog is based on Datalog, a deductive query language in which logical rules describe which tuples belong to which relations.  Overlog extends Datalog with notation for the location of data, SQL features, and a model for generating changes to tables.  Each Overlog timestep has three phases: inbound events are converted to tuple insertions and deletions, local rules are run until a fixed point is reached, and then outbound events are emitted.  HDFS and the Hadoop scheduler were reimplemented using BOOM, along with additional availability and scalability features, in far fewer lines of code and with better assurance of correctness.  With BOOM it was easy to implement first-come-first-served and LATE scheduling policies for Hadoop, and the experimental results mirror the original LATE results.  Performance experiments show that the BOOM implementations were only slightly slower than the originals.
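
A tiny illustration of the "run local rules to a fixed point" phase, using a classic reachability rule evaluated bottom-up in plain Python (not Overlog syntax):

```python
# Rule: reach(X, Z) :- link(X, Y), reach(Y, Z).   Base: reach(X, Y) :- link(X, Y).
# Apply the rule repeatedly until no new tuples are derived (a fixed point),
# which is what each Overlog timestep does with its local rules.

link = {("a", "b"), ("b", "c"), ("c", "d")}
reach = set(link)

changed = True
while changed:
    changed = False
    for (x, y) in link:
        for (y2, z) in list(reach):
            if y == y2 and (x, z) not in reach:
                reach.add((x, z))
                changed = True

print(sorted(reach))   # includes ('a', 'c'), ('a', 'd'), ('b', 'd')
```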

Erlang is a functional programming language for soft real-time systems.  Erlang was designed for requirements such as large programs, control systems that cannot be stopped, portability, high levels of concurrency, IPC, distributed systems, efficient garbage collection, incremental code loading, timeouts, and external interfaces.  Sequential programs are sets of recursive equations, and functions are primitive data types.  The spawn() primitive creates a new process for concurrent programming, and send() and receive() are used for message passing between processes.  Message sending is asynchronous, and messages are delivered in order; spawn() can also create a process on a different node.  Several commercial products use Erlang, and the reported conclusions are that development time can be shorter, performance is very good for real-time systems, and the real-time garbage collection works well.
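
A rough Python analogy of those concurrency primitives (real Erlang code would use spawn, !, and receive): each process owns a mailbox, sends are asynchronous, and messages are consumed in arrival order.

```python
import threading, queue

def spawn(fn):
    """Start a 'process' with its own mailbox and return the mailbox."""
    mailbox = queue.Queue()
    threading.Thread(target=fn, args=(mailbox,), daemon=True).start()
    return mailbox

def send(mailbox, message):
    mailbox.put(message)        # asynchronous: the sender never blocks

def echo_process(mailbox):
    while True:
        sender, text = mailbox.get()   # receive, in arrival order
        send(sender, f"echo: {text}")

main_mailbox = queue.Queue()
echo = spawn(echo_process)
send(echo, (main_mailbox, "hello"))
print(main_mailbox.get())       # echo: hello
```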

Both Erlang and Overlog provide declarative or functional programming models for distributed cloud systems.  Both papers conclude that it takes less time and less code to implement a system this way, and that the result inspires higher confidence in its correctness.  However, I think most people do not think or program in a functional, logic-rule-based style.  Certain classes of problems are well suited to a declarative implementation, such as well-defined state machines like Paxos.  In the future, there will be new languages, or better language interoperability, that make it possible to embed segments of declarative or functional code for specific tasks such as communication logic or other protocols.

PNUTS: Yahoo!'s Hosted Data Serving Platform and Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

PNUTS is a scalable, geographically distributed data store for web applications.  Web services usually have scalability, response time, high availability, and relaxed consistency requirements, and PNUTS provides all of these.  PNUTS also provides features other systems usually don't, such as geographic distribution and a timeline consistency model, instead of eventual consistency.  Timeline consistency is achieved per record, with one master per record across all the replicas.  APIs exist to read the latest version of a record or to tolerate older versions, depending on the application's needs.  The data is partitioned into tablets, either by range or by hash, and each tablet is replicated.  Replication across regions is asynchronous via the reliable message broker, which works like a pub/sub system; it functions both as a durable WAL and as the replication mechanism.  Messages are delivered in order, and this property is used to achieve per-record timeline consistency.  In experiments, the average latency for client requests was around 100 ms, the latency increased as the proportion of writes increased, and increasing the number of storage servers decreased latency almost linearly.
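
A simplified, single-process sketch of per-record timeline consistency (the method names are my own, not the PNUTS API): the record's master orders all writes, replicas apply them in that order, and a read can either take whatever a local replica has or go to the master for the latest version.

```python
class Record:
    def __init__(self):
        self.master_versions = []     # ordered timeline at the record's master
        self.replica_versions = []    # possibly stale prefix at a remote replica

    def write(self, value):
        self.master_versions.append(value)          # master assigns the order

    def replicate_one(self):
        """Deliver the next pending update in order (the message broker's job)."""
        if len(self.replica_versions) < len(self.master_versions):
            self.replica_versions.append(
                self.master_versions[len(self.replica_versions)])

    def read_any(self):
        return self.replica_versions[-1] if self.replica_versions else None

    def read_latest(self):
        return self.master_versions[-1] if self.master_versions else None

rec = Record()
rec.write("v1"); rec.write("v2")
rec.replicate_one()
print(rec.read_any(), rec.read_latest())   # v1 v2
```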

COPS is an ALPS (availability, low latency, partition tolerance, high scalability) key-value store that provides the causal+ consistency property.  ALPS properties are desirable for internet services, but COPS also provides stronger consistency guarantees, which make it easier to develop against.  The causal property ensures that the system maintains the causal dependencies between operations, and the "+" adds convergent conflict handling, which ensures replicas resolve conflicts in the same way and do not diverge permanently.  The three sources of potential causality are ordering within an execution thread, a get() that returns the value of an earlier put(), and transitivity.  COPS is the first system to provide the causal+ property in a scalable way, without a single master.  COPS relies on a linearizable key-value store within each cluster and asynchronously replicates between clusters using a replication queue.  Additional operations support the causal+ property: get_by_version(), put_after(), and dep_check().  There is also metadata per record to store dependencies, and older versions for COPS-GT.  Garbage collection is periodically performed to clear out old versions and unnecessary dependencies; only about 5 seconds of metadata needs to be kept, since transactions are only allowed to run for 5 seconds.  In experiments, throughput increased as the put:get ratio decreased, because the number of dependencies decreased.  COPS-GT usually has lower throughput than COPS because of its stronger guarantees, but as the value size increases, the inter-operation time increases and COPS-GT achieves similar performance.  Scalability of COPS is also much better than that of single-master log-shipping systems.
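
A simplified sketch of the dependency tracking (single process, invented class names; dep_check is named after the operation mentioned above): a write carries the versions it causally depends on, and a remote cluster applies it only once those dependencies are already visible there.

```python
class Cluster:
    def __init__(self):
        self.store = {}          # key -> (version, value)

    def dep_check(self, deps):
        """True only if every dependency is already visible in this cluster."""
        return all(key in self.store and self.store[key][0] >= version
                   for key, version in deps)

    def apply(self, key, version, value, deps):
        if self.dep_check(deps):
            self.store[key] = (version, value)
            return True
        return False             # would be buffered until dependencies arrive

local, remote = Cluster(), Cluster()

# Client writes a photo, then an album entry that causally depends on it.
local.apply("photo", 1, "beach.jpg", deps=[])
local.apply("album", 1, "-> photo v1", deps=[("photo", 1)])

# Replication may arrive out of order; the album write waits for its dependency.
print(remote.apply("album", 1, "-> photo v1", deps=[("photo", 1)]))  # False
remote.apply("photo", 1, "beach.jpg", deps=[])
print(remote.apply("album", 1, "-> photo v1", deps=[("photo", 1)]))  # True
```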

Both of these systems provide stronger consistency across geographically distributed sites.  The stronger consistency is useful, but future systems will have to deal better with fault tolerance and data loss.  Both systems rely on asynchronous replication through queues or message brokers, so if a datacenter fails, the messages that have not yet been delivered are lost, and with them the data.  Datacenters can fail, as the Amazon outage showed, and modular datacenters, which are becoming more popular, have less reliability built into them.  Future systems will have to try to prevent data loss in the face of datacenter failures.