Monday, October 31, 2011

Relational Cloud: A Database-as-a-Service for the Cloud and Database Scalability, Elasticity, and Autonomy in the Cloud

Relational Cloud is an effort to develop a database-as-a-service for the cloud, moving operational tasks such as provisioning, configuration, and tuning from the users to the service provider, which lowers costs for the users.  There are several database-as-a-service products already, such as Amazon RDS and Microsoft SQL Azure, but Relational Cloud focuses on three things they do not handle well: efficient multi-tenancy, elastic scalability, and database privacy.  Running many database instances efficiently is advantageous because more databases can be consolidated onto fewer machines, which reduces costs.  The basic design uses existing, unmodified RDBMSes as backend storage servers, and clients coordinate queries and communication among them.  Each tenant has its own databases and tables, which are partitioned, and partitions can be migrated between servers.  CryptDB-enabled client libraries encrypt the data so that the servers can execute queries over encrypted data, which preserves privacy.  The partitioning strategy is workload-aware: workload traces are analyzed periodically to identify sets of tuples that are accessed together.  The Kairos component gathers performance statistics, such as CPU and RAM usage.  The working set of a workload is determined by slowly growing a probe table until disk IO starts to increase.  A model has been developed to predict how multiple databases and workloads will combine on a single server.  A second placement model assigns partitions to servers so as to minimize the number of servers and balance load.  Adjustable security is achieved by nesting several layers of encryption, ordered from most to least secure.  Weaker layers allow more kinds of queries to be computed over the encrypted data, and the system removes only as many layers as a query needs in order to execute.  In experiments with several tenants, Relational Cloud achieved consolidation ratios of 6:1 to 17:1, and performed much better than running each database in its own VM, because a single database server can allocate resources more effectively and can share one log and one buffer pool across tenants.
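To make the placement idea concrete, here is a minimal sketch of packing tenant workloads onto as few servers as possible.  It is not the model from the paper: it swaps in a simple greedy first-fit-decreasing heuristic, and the names (`Server`, `place`), the two resources (CPU percent and RAM in GB), and the example numbers are all my own assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A hypothetical server with fixed CPU (percent) and RAM (GB) capacity."""
    cpu_capacity: int
    ram_capacity: int
    cpu_used: int = 0
    ram_used: int = 0
    tenants: list = field(default_factory=list)

    def fits(self, cpu, ram):
        return (self.cpu_used + cpu <= self.cpu_capacity
                and self.ram_used + ram <= self.ram_capacity)

    def add(self, name, cpu, ram):
        self.tenants.append(name)
        self.cpu_used += cpu
        self.ram_used += ram


def place(workloads, cpu_capacity, ram_capacity):
    """Greedy first-fit-decreasing placement (an assumed stand-in heuristic).

    workloads maps a tenant name to (cpu_percent, ram_gb), e.g. as measured
    by a monitoring component.  Returns the list of servers that were opened.
    """
    def dominant_share(item):
        cpu, ram = item[1]
        return max(cpu / cpu_capacity, ram / ram_capacity)

    servers = []
    # Place the most demanding workloads first; they are the hardest to fit.
    for name, (cpu, ram) in sorted(workloads.items(), key=dominant_share,
                                   reverse=True):
        if cpu > cpu_capacity or ram > ram_capacity:
            raise ValueError(f"workload {name!r} does not fit on any server")
        target = next((s for s in servers if s.fits(cpu, ram)), None)
        if target is None:
            # No existing server has room, so open a new one.
            target = Server(cpu_capacity, ram_capacity)
            servers.append(target)
        target.add(name, cpu, ram)
    return servers


if __name__ == "__main__":
    demo = {"a": (50, 4), "b": (50, 4), "c": (30, 2), "d": (60, 6)}
    for i, server in enumerate(place(demo, cpu_capacity=100, ram_capacity=8)):
        print(i, server.tenants, server.cpu_used, server.ram_used)
```

A greedy heuristic like this only minimizes the server count approximately and does nothing to balance load, so it should be read as an illustration of the problem rather than of the paper's solution.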

When running databases in the cloud, scalability, elasticity, and autonomy are important features.  RDBMSes are traditionally known for lacking these features and for not scaling well in the cloud.  Key-value stores, in contrast, have become popular in cloud architectures because each key can be accessed independently of the others.  However, some applications need stronger transactions and consistency, and there are two ways to get them.  You can start with a key-value store and add transactional features, or you can start with an RDBMS and add the scalability of a key-value store.  Data fusion is a technique for key-value stores that groups keys into entities in order to provide transactions across the keys in a group.  G-Store employs this technique, along with dynamic partitioning of keys.  Data fission is a technique for DBMSes that partitions a database into shards and provides full database semantics within each shard independently.  This allows for scalability similar to that of key-value stores.  Elasticity matters because servers should be started up or powered down according to utilization.  Live migration is the key technique for achieving elasticity, and it applies to both shared-disk and shared-nothing systems.  Iterative copy was developed for shared-disk systems, and Zephyr was developed for shared-nothing systems.  For autonomy, machine learning algorithms are used to build models of tenant behavior and placement, and to decide when to migrate, what to migrate, and where to migrate it.
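The data fission idea can be sketched in a few lines.  This is only an illustration under my own assumptions: each shard is an independent in-memory SQLite database, tenants are assigned to shards by a CRC32 hash of the tenant name, and the `ShardedStore` class and `accounts` table are hypothetical.  The point is that a transaction touches exactly one shard, so every shard keeps full transactional semantics without any cross-shard coordination.

```python
import sqlite3
import zlib


class ShardedStore:
    """Hypothetical sharded store: one independent database per shard."""

    def __init__(self, num_shards):
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for db in self.shards:
            db.execute(
                "CREATE TABLE accounts ("
                "tenant TEXT, name TEXT, balance INTEGER NOT NULL, "
                "PRIMARY KEY (tenant, name))"
            )

    def shard_for(self, tenant):
        # All rows of a tenant live on one shard (assumed hash partitioning).
        index = zlib.crc32(tenant.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def deposit(self, tenant, name, amount):
        db = self.shard_for(tenant)
        with db:  # commits on success, rolls back on an exception
            db.execute("INSERT OR IGNORE INTO accounts VALUES (?, ?, 0)",
                       (tenant, name))
            db.execute(
                "UPDATE accounts SET balance = balance + ? "
                "WHERE tenant = ? AND name = ?", (amount, tenant, name))

    def transfer(self, tenant, src, dst, amount):
        """A multi-row transaction that stays inside a single shard."""
        db = self.shard_for(tenant)
        with db:
            row = db.execute(
                "SELECT balance FROM accounts WHERE tenant = ? AND name = ?",
                (tenant, src)).fetchone()
            if row is None or row[0] < amount:
                raise ValueError("insufficient funds")
            db.execute("INSERT OR IGNORE INTO accounts VALUES (?, ?, 0)",
                       (tenant, dst))
            db.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE tenant = ? AND name = ?", (amount, tenant, src))
            db.execute(
                "UPDATE accounts SET balance = balance + ? "
                "WHERE tenant = ? AND name = ?", (amount, tenant, dst))

    def balance(self, tenant, name):
        row = self.shard_for(tenant).execute(
            "SELECT balance FROM accounts WHERE tenant = ? AND name = ?",
            (tenant, name)).fetchone()
        return None if row is None else row[0]
```

For example, after `store = ShardedStore(4)` and `store.deposit("t1", "alice", 100)`, a call to `store.transfer("t1", "alice", "bob", 30)` runs entirely on the shard that owns tenant `t1`.  Adding capacity then becomes a matter of moving whole shards between servers, which is where live migration comes in.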

Database systems in the cloud will be very important, since the cloud model is becoming very popular.  There are several public clouds that can support multi-tenancy and other cloud features.  However, I don't think this model will be very useful for large companies or very large data sets.  Multi-tenant databases on public clouds will be useful for small to medium databases, under the assumption that the workload and size will not change much.  If the data grows too large or the workload becomes too demanding, a dedicated system will perform much better and provide better isolation, and multi-tenancy will not help.  In addition, companies may want to keep their data in-house rather than put it in a public cloud, even if privacy and security are built in.  Many of the multi-tenancy concepts should still be useful in an internal setting, where a company can keep its own data while its internal tenants do not have to worry about the details.  Maybe there will be products or open source solutions for reliable multi-tenancy on local clusters.
