Monday, November 14, 2011

VL2: A Scalable and Flexible Data Center Network and PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric and c-Through: Part-time Optics in Data Centers

VL2 is a datacenter network designed to be cost effective and agile, for better utilization and performance of the cluster.  Current datacenter networks usually use high-cost hardware in a tree-like structure and cannot provide the agility required for today's services.  The high costs limit the available capacity between any two nodes, the network does little to isolate traffic floods, and the fragmented address space is a nightmare to configure.  VL2 solves these problems by providing uniform high capacity, performance isolation, and layer-2 semantics.  It uses low-cost switch ASICs in a Clos topology and Valiant Load Balancing for decentralized coordination, and it chooses paths for flows, rather than for individual packets, for better TCP performance.  Measurements of network traffic patterns show that traffic is highly variable, and that the hierarchical topology is inherently unreliable near the top.  VL2 scales out with many low-cost switches instead of scaling up to expensive hardware, which provides high aggregate and bisection bandwidth in the network.  An all-to-all data shuffle showed that VL2 achieved 94% efficiency and that a conventional hierarchical topology would have taken 11 times longer.  Experiments also showed that VL2 provides fairness and performance isolation.
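
To get a feel for the flow-based routing, here is a minimal sketch (my own toy code, not VL2's implementation) of picking an intermediate switch by hashing a flow's 5-tuple, so every packet of a TCP flow follows the same path while different flows spread across the Clos topology:

    import hashlib

    # Toy sketch: hash the flow's 5-tuple to pick one intermediate switch,
    # so a flow stays on one path while flows as a whole spread out.
    def pick_intermediate_switch(flow, intermediate_switches):
        key = "%s:%s:%s:%s:%s" % (flow["src_ip"], flow["dst_ip"],
                                  flow["src_port"], flow["dst_port"], flow["proto"])
        digest = hashlib.md5(key.encode()).hexdigest()
        return intermediate_switches[int(digest, 16) % len(intermediate_switches)]

    flow = {"src_ip": "10.0.1.5", "dst_ip": "10.0.9.7",
            "src_port": 51200, "dst_port": 80, "proto": "tcp"}
    print(pick_intermediate_switch(flow, ["int-1", "int-2", "int-3", "int-4"]))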

PortLand is a scalable, easy-to-maintain, reliable network fabric for datacenters.  The key insight is that the baseline topology and the growth model of the network are known, so they can be leveraged.  Current network protocols incur a lot of overhead to route and manage the 100,000s of machines in a datacenter.  PortLand uses a centralized fabric manager, which holds soft state about the topology, and pseudo MAC (PMAC) addresses, which separate host location from host identity and reduce forwarding table sizes.  The fabric manager can avoid broadcast overhead in the common case because it can answer ARP requests from its PMAC mappings.  Switches use a distributed location discovery protocol to determine where in the topology they are, which enables efficient routing with no manual configuration.  PortLand keeps routing loop-free with a simple up/down protocol: packets are routed up through aggregation switches, then down to the destination node.  Faults are detected during location discovery and reported to the fabric manager, which later notifies the rest of the network, so resolving a failure takes only O(N) messages instead of O(N^2).  Experiments show that PortLand is scalable, fault-tolerant, and easy to manage.
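
Here is a rough sketch of the PMAC idea (the exact bit split is my assumption, not necessarily the paper's): encode a host's location as pod.position.port.vmid, so switches can forward on a prefix of the address instead of keeping per-host state:

    # Rough sketch of a PortLand-style pseudo MAC (PMAC). The 16/8/8/16-bit
    # layout below is an assumption of this sketch.
    def make_pmac(pod, position, port, vmid):
        value = (pod << 32) | (position << 24) | (port << 16) | vmid
        raw = value.to_bytes(6, "big")
        return ":".join("%02x" % b for b in raw)

    # Example: host behind port 3 of the switch at position 1 in pod 2, VM id 5.
    print(make_pmac(pod=2, position=1, port=3, vmid=5))  # 00:02:01:03:00:05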

c-Through is a new system that uses optical switching in a hybrid packet- and circuit-switched network.  Traditional hierarchical topologies have cost and other limitations, and fat-tree topologies require many more switches and links, which makes the wiring harder to manage.  Optical links can support much greater capacity, but they require coarser-grained flows rather than packet switching.  HyPaC is a hybrid architecture that keeps a traditional packet-switched network but adds a second high-speed rack-to-rack circuit-switched optical network.  The optical network reconfigures much more slowly, so the high bandwidth between a pair of racks is provided on demand.  c-Through implements the HyPaC architecture, using enlarged buffers both to estimate rack-to-rack traffic and to keep the optical links fully utilized, and it uses those estimates to compute a perfect matching between racks; this can be done within a few hundred milliseconds for 1000 racks.  The HyPaC architecture was emulated, and experiments show that the emulation behaves as expected and that the large buffers do not result in large packet delays.  Several experiments with over-subscribed networks showed that c-Through performed similarly to a network with full bisection bandwidth.
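
The matching step can be illustrated with a small sketch; assuming the networkx library and made-up demand numbers, the idea is to pick the set of rack pairs that gets the most queued traffic onto the optical circuits:

    import networkx as nx

    # Hedged sketch: given estimated rack-to-rack demand (e.g., from buffer
    # occupancy), choose which rack pairs get optical circuits for the next
    # reconfiguration interval. The demand numbers here are made up.
    demand = {("rack1", "rack2"): 800, ("rack1", "rack3"): 200,
              ("rack2", "rack4"): 500, ("rack3", "rack4"): 900}

    g = nx.Graph()
    for (a, b), bytes_queued in demand.items():
        g.add_edge(a, b, weight=bytes_queued)

    # Maximum-weight matching: each rack is paired with at most one other rack.
    circuits = nx.max_weight_matching(g, maxcardinality=True)
    print(circuits)  # e.g., {('rack1', 'rack2'), ('rack3', 'rack4')}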

Future networks will move towards lots of lower-cost network hardware, just like clusters have moved towards many low-cost commodity machines.  However, providing large capacity with many commodity switches will be the challenge.  Clos/fat-tree topologies like the one in VL2 will be popular since they use lots of commodity switches.  Large companies like Facebook, Google, and Microsoft can easily purchase switches and cabling in bulk for their datacenters, so the extra hardware and wiring will not be a huge problem.  Easier maintenance will be crucial for the network to be sustainable, so techniques similar to the ones in PortLand will be required.  Having easy mechanisms for fault tolerance, fairness, and isolation will be important, since a resource manager like Mesos will need them to allocate network resources to applications.  I don't think special optical switches will make it into large datacenters, because they would add the burden of managing new hardware; instead, more switches and wires could make it into the cluster.

Saturday, November 12, 2011

An Operating System for Multicore and Clouds: Mechanisms and Implementation and Improving Per-Node Efficiency in the Datacenter with New OS Abstractions

fos (factored OS) is a system from MIT that tries to tackle the problem of running applications on a multicore machine or a cluster of machines.  Current cloud systems force the user to manage the individual virtual machines, as well as the communication between machines and between cores.  fos hides this complexity by providing a single OS image for the system and abstracting the hardware away, so the developer sees a single OS image whether it runs on a multicore system or on a cluster of machines.  Each subsystem in fos is factored, parallelized, and distributed over cores or machines.  The single system image provides advantages such as ease of administration, transparent sharing, better optimizations, a consistent view of the system, and fault tolerance.  fos prioritizes space multiplexing over time multiplexing, factors system services into parallel distributed servers, adapts to resource needs, and provides fault tolerance.  The fos server model uses RPCs to remote servers, but they are hidden from the developer, so code can be written without explicit RPCs.  fos intercepts file system calls and forwards them to the fileserver, and it can also spawn new VMs on any machine seamlessly.  Experiments with syscalls, file system performance, and spawning showed that fos had a lot of latency and variability, but a simple HTTP server performed very similarly to a Linux implementation.
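
As a toy illustration of the "hidden RPC" flavor of the server model (this is not fos's actual interface, and all names below are made up), a client-side proxy can marshal ordinary-looking calls into messages to a remote server:

    # Illustrative sketch only: calls on the proxy look local to the developer,
    # but each one is turned into a message sent to a remote service.
    class ServiceProxy:
        def __init__(self, send_message):
            self._send = send_message  # transport is injected; could be a socket

        def __getattr__(self, name):
            def remote_call(*args):
                # The developer writes fs.open(...); the proxy marshals it.
                return self._send({"op": name, "args": args})
            return remote_call

    # A fake fileserver endpoint, standing in for a real remote server.
    def fake_fileserver(msg):
        return "fileserver handled %s%s" % (msg["op"], msg["args"])

    fs = ServiceProxy(fake_fileserver)
    print(fs.open("/etc/hosts", "r"))   # looks local, actually a message send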

Akaros is an OS designed to improve the per-node efficiency, performance, and predictability of nodes in a cluster, and thereby the performance of internet services.  This can also improve fault tolerance, since fewer nodes are required to do the same amount of work.  Traditional OSes are not a good fit for large-scale SMP machines, so Akaros proposes four main ideas.  The underlying principle is to expose as much information to the application as possible so it can make better decisions.  Akaros exposes cores instead of threads, so an application can manage the cores it is allocated, and different cores can be exposed with different time quanta for different latency needs.  Akaros also uses a block abstraction for moving data, which allows more efficient transfers than copying through arbitrary memory addresses, and it separates provisioning resources from allocating resources.  Abstractions can be very useful, but they can reduce performance, so Akaros tries to be as transparent as possible to applications while providing useful defaults for simplicity.  Akaros uses the many-core process abstraction so that processes can do their own scheduling on their cores; the cores are gang scheduled and the kernel does not interrupt them.  All system calls use an asynchronous API, which decouples I/O from parallelism.  Applications can pin pages in memory, and Akaros also allows application-level memory management for low latency.  In addition to block-level transfers, disk-to-network, network-to-network, and network-to-disk transfers are optimized.  By separating resource provisioning from allocation, the hardware can be better utilized.
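
The asynchronous syscall idea can be sketched roughly like this (plain Python standing in for the Akaros interface, which looks nothing like this): issue the blocking operation, keep the core busy with useful work, and collect the completion later:

    from concurrent.futures import ThreadPoolExecutor

    # Toy illustration of decoupling I/O from parallelism: the "syscall" is
    # issued asynchronously instead of blocking the core.
    def slow_read(path):
        with open(path, "rb") as f:
            return f.read()

    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(slow_read, "/etc/hosts")   # issue the operation
        partial_sums = sum(i * i for i in range(10**6))     # keep the core busy
        data = pending.result()                             # collect completion
        print(len(data), partial_sums)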

Both projects focus on improving performance and utilization for a cluster by creating a new OS.  fos tries to create a global OS, whereas Akaros focuses on the single-node OS.  I think the future of clusters will need efforts in both areas.  Having a single OS abstraction is nice, but I think fos goes too far with its abstractions: developers are just fine with explicitly writing RPCs, or explicitly sharing memory for local threads.  Having a VM for each thread is overkill, and I think the future global OS will be more lightweight and provide better global resource scheduling for sharing the cluster among many applications.  The OS for a single node should be different from today's stock OS, since the stock OS was built for different purposes.  However, I don't know if rebuilding the entire OS is the way to go.  Mostly, starting from a Linux base will be sufficient, and maybe not all the features of Akaros will be necessary.  I think one of the most important features will be the different types of block transfers of data.  Since data sizes are growing and more analysis is being done, more data is being transferred, so the efficient large data transfers of Akaros should be implemented in Linux systems.

Sunday, November 6, 2011

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center and Hadoop NextGen

Mesos is a platform for resource sharing in a cluster of commodity servers.  Sharing among different frameworks like Hadoop or MPI is beneficial because it can increase utilization and data locality, which reduces costs and improves performance.  Mesos uses distributed two-level scheduling: Mesos offers resources to the frameworks, and the frameworks choose which offers to accept.  The common ways to share a cluster today are to statically partition it or to use VMs, but neither achieves high utilization or efficient data sharing.  Scheduling and resource sharing could be done with a centralized global scheduler, but that gets very complicated.  Mesos instead uses a distributed model in which frameworks accept the resources they need and use them however they wish.  This lets frameworks implement their own scheduling, which many of them already do.  Pushing more control to the frameworks is beneficial because frameworks can implement more complicated policies, and it allows Mesos to be simpler and more stable.  Mesos does not require frameworks to specify their requirements, but allows them to reject offers; this keeps Mesos simple while letting frameworks enforce complicated resource constraints.  Isolation between frameworks is achieved with pluggable isolation modules, including Linux containers.  To scale resource offers, frameworks can provide filters for resources, Mesos counts outstanding offers as part of a framework's allocation, and long-outstanding offers are rescinded.  The Mesos master maintains only soft state, which can be reconstructed from just the slaves and the framework schedulers.  The decentralized scheduling model incentivizes frameworks to use short, elastic tasks and to accept only the resources they will actually use.  Experiments with four different frameworks showed that Mesos had better utilization and improved task completion times over static partitioning.  The overhead of Mesos is about 4%, and it scaled to 50,000 emulated nodes with sub-second task launch overhead.
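
A toy sketch of the offer handling (simplified, not the real Mesos API; the per-task numbers are made up) shows how a framework scheduler accepts only the resources it can use and declines the rest:

    # Simplified two-level scheduling sketch: the master sends offers, and the
    # framework decides what to launch on them.
    TASK_CPUS, TASK_MEM = 1, 1024  # per-task needs; numbers are made up

    def resource_offers(offers, pending_tasks):
        launches, declines = [], []
        for offer in offers:
            cpus, mem = offer["cpus"], offer["mem"]
            accepted = []
            while pending_tasks and cpus >= TASK_CPUS and mem >= TASK_MEM:
                accepted.append(pending_tasks.pop())
                cpus, mem = cpus - TASK_CPUS, mem - TASK_MEM
            (launches if accepted else declines).append((offer["id"], accepted))
        return launches, declines

    offers = [{"id": "o1", "cpus": 4, "mem": 4096}, {"id": "o2", "cpus": 1, "mem": 512}]
    print(resource_offers(offers, ["t1", "t2", "t3"]))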


The current Hadoop framework can support only about 4,000 nodes, which limits the size of the clusters that can be used.  The current architecture hits scalability limits, and it also forces a system-wide upgrade for any bug fix or new feature, which makes it very difficult to maintain a large cluster while still providing good uptime.  In NextGen Hadoop, the JobTracker's functionality is split into two separate components: the ResourceManager and the ApplicationMaster.  The ResourceManager only manages the cluster's resources, and it offers fine-grained resources rather than fixed slots like the current system, which allows for better utilization.  The ApplicationMaster manages the scheduling and monitoring of tasks.  Offloading the lifecycle management of tasks to the applications themselves allows the ResourceManager to be much more scalable.  It also means the actual MapReduce implementation is essentially part of the user's library code, which makes upgrades easy: the entire system does not have to be upgraded together.  The fine-grained, fungible resources improve utilization.  The resource management is similar to Mesos in that it has a two-level scheduling component, and application frameworks can accept or reject resources given by the ResourceManager.
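
A hypothetical sketch of the split (all names made up, not the real YARN API): the ResourceManager hands out generic containers, while the per-application ApplicationMaster decides what runs in them, so the MapReduce-specific logic lives in user-level code:

    # Hypothetical sketch only: the ResourceManager tracks cluster resources,
    # and the ApplicationMaster owns framework-specific scheduling.
    class ResourceManager:
        def __init__(self, total_mem_mb):
            self.free_mem_mb = total_mem_mb

        def allocate(self, mem_mb, count):
            granted = min(count, self.free_mem_mb // mem_mb)
            self.free_mem_mb -= granted * mem_mb
            return ["container-%d" % i for i in range(granted)]

    class MapReduceApplicationMaster:
        def __init__(self, rm, map_tasks):
            self.rm, self.map_tasks = rm, map_tasks

        def run(self):
            containers = self.rm.allocate(mem_mb=1024, count=len(self.map_tasks))
            # Framework-specific scheduling/monitoring happens here, not in the RM.
            return list(zip(containers, self.map_tasks))

    rm = ResourceManager(total_mem_mb=4096)
    am = MapReduceApplicationMaster(rm, ["map-0", "map-1", "map-2"])
    print(am.run())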

The two-level scheduling of both Mesos and NextGen Hadoop is an efficient and scalable way to manage resources in a large cluster, so it will probably become more popular in the future.  Having a soft-state resource manager will be important for scalability and performance, because allocations can operate in memory and failures are easy to recover from.  However, I think there will have to be some sort of feedback mechanism for what individual frameworks require to run.  Also, some applications will need to express certain kinds of constraints, like a server that needs to run on the same rack as database shard 4; these constraints matter for both performance and reliability.  Finally, there will also have to be a way to isolate disk I/O and throughput.  Being able to share and isolate disk I/O could be useful for data-intensive applications, and it adds complexity beyond CPU and RAM resources.