Saturday, October 22, 2011

BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud and Erlang - A survey of the language and its industrial applications

BOOM Analytics is an effort to build large distributed systems in a data-centric, declarative way.  The higher level of abstraction may simplify the code and speed up both development and debugging.  Building large distributed systems for the cloud is difficult because the developer must handle concurrent programming and fault-tolerant communication carefully.  BOOM attempts to make this easier through data-centric design and declarative programming.  Overlog is the declarative language used for these cloud systems.  Overlog is based on Datalog, a deductive query language in which logical rules describe which tuples belong to which relations.  Overlog extends Datalog with notation for the location of data, some SQL features, and a defined model for generating changes to tables.  Each Overlog timestep has three phases: inbound events are converted to tuple insertions and deletions, local rules are run until a fixed point is reached, and then outbound events are emitted.  HDFS and Hadoop scheduling were reimplemented using BOOM, along with additional availability and scalability features, in far fewer lines of code and with better assurance of correctness.  With BOOM it was easy to implement first-come-first-served and LATE scheduling policies for Hadoop, and the experimental results mirror the original LATE results.  Performance experiments show that the BOOM implementations performed slightly worse than the original implementations.
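To make the three-phase timestep more concrete, here is a minimal Python sketch of my own.  It is not Overlog and not code from the paper.  The link/path relations, the event format, and the function names derive_paths and timestep are all assumptions chosen for illustration.  It applies inbound events as insertions and deletions, runs two Datalog-style rules until no new tuples appear, and returns the derived tuples as the "outbound" result.

```python
def derive_paths(links):
    """Naive fixed-point evaluation of two Datalog-style rules (illustrative):
         path(X, Y) :- link(X, Y).
         path(X, Z) :- link(X, Y), path(Y, Z).
    """
    path = set(links)
    while True:
        # Join link(X, Y) with path(Y, Z) to derive candidate path(X, Z) tuples.
        derived = {(x, z) for (x, y) in links for (y2, z) in path if y == y2}
        new_tuples = derived - path
        if not new_tuples:
            # Fixed point: another pass would derive nothing new.
            return path
        path |= new_tuples


def timestep(links, inbound_events):
    """One simplified timestep over a mutable set of link tuples."""
    # Phase 1: convert inbound events into tuple insertions and deletions.
    for op, tup in inbound_events:
        if op == "insert":
            links.add(tup)
        elif op == "delete":
            links.discard(tup)
    # Phase 2: run the local rules until a fixed point is reached.
    paths = derive_paths(links)
    # Phase 3: emit outbound events (here, just return the derived tuples).
    return sorted(paths)


links = {("a", "b")}
first = timestep(links, [("insert", ("b", "c"))])
second = timestep(links, [("delete", ("a", "b"))])
print(first)
print(second)
```

The sketch recomputes every path from scratch on each timestep, which keeps it short.  A real engine would more likely update the derived tables incrementally.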

Erlang is a functional programming language for soft real-time systems.  It was designed to meet requirements such as support for large programs, control systems that cannot be stopped, portability, high levels of concurrency, inter-process communication, distribution, efficient garbage collection, incremental code loading, timeouts, and external interfaces.  Sequential programs are sets of recursive equations, and functions are primitive data types.  The spawn() primitive creates a new process for concurrent programming, and send() and receive() are used for message passing between processes.  Message sending is asynchronous, and messages are delivered in order.  spawn() can also create a process on a different node.  Several commercial products use Erlang, and the conclusions drawn from them are that development time can be shorter, performance is very good for real-time systems, and the real-time garbage collection works well.
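The message-passing model can be imitated in Python as a rough analogy.  This is my own sketch, not Erlang code and not how the Erlang runtime works.  The Process class, the spawn helper, and the collector behaviour are hypothetical names, and each "process" here is an OS thread with a FIFO mailbox rather than a lightweight Erlang process.

```python
import queue
import threading


class Process:
    """A thread with a private mailbox, loosely imitating an Erlang process."""

    def __init__(self, behaviour, *args):
        self.mailbox = queue.Queue()  # unbounded FIFO queue
        self._thread = threading.Thread(
            target=behaviour, args=(self, *args), daemon=True
        )
        self._thread.start()

    def send(self, message):
        # Asynchronous: the sender enqueues the message and continues at once.
        self.mailbox.put(message)

    def receive(self, timeout=None):
        # Blocks until a message arrives; raises queue.Empty on timeout.
        return self.mailbox.get(timeout=timeout)

    def join(self, timeout=None):
        self._thread.join(timeout)


def spawn(behaviour, *args):
    """Start a new process running behaviour(process, *args)."""
    return Process(behaviour, *args)


def collector(self, results):
    # Receive messages in arrival order until told to stop.
    while True:
        message = self.receive()
        if message == "stop":
            return
        results.append(message)


results = []
pid = spawn(collector, results)
for n in (1, 2, 3):
    pid.send(n)
pid.send("stop")
pid.join(timeout=5)
print(results)
```

Because a single sender puts messages on a FIFO queue, the receiver sees them in the order they were sent, which mirrors the in-order delivery described above.  Spawning on a different node has no equivalent in this sketch.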

Both Erlang and Overlog provide declarative, functional programming models for distributed cloud systems.  Both papers conclude that implementing a system takes less time and less code, and that the result inspires higher confidence in its correctness.  However, I think most people do not think or program in a functional or logic-rule-based style.  Certain classes of problems are well suited to a declarative implementation, such as well-defined state machines like Paxos.  In the future, I expect there will be new languages or better language interoperability that make it possible to insert segments of declarative, functional code for specific tasks such as communication logic or other protocols.
