Monday, October 3, 2011

HIVE: Data Warehousing & Analytics on Hadoop -and- Pig Latin: A Not-So-Foreign Language for Data Processing

Lots of data is being generated and stored by companies today, and there is great interest in gaining insight from analyzing all of it.  Facebook alone generates over 2 TB a day, and storing and analyzing that much data is too difficult for traditional DBMSes.  The MapReduce framework has better scalability and availability than DBMSes, and those properties are usually more important here than the ACID properties DBMSes provide.  However, writing MapReduce jobs is difficult.  Hive is a system built by Facebook that makes querying and managing structured data on top of Hadoop easier.  It supports many different data formats, structured data, common SQL features, and metadata management.  Many different kinds of applications make use of Hive, such as summarization, ad hoc analysis, data mining, spam detection, and more.  Data is logically partitioned (for example, by date), and partitions can be further hash partitioned into buckets for scalability.  All the metadata is stored in a SQL backend, which holds the schemas, the serialization and deserialization libraries, data locations, partition info, and more.  The SQL-like interface also lets users plug in custom map and reduce scripts, so MapReduce-style jobs can still be run.
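To make that concrete, here is a minimal HiveQL sketch (the status_updates table, its columns, the ds partition key, and the script name are hypothetical, not from the paper): it declares a date-partitioned table, runs an aggregate that Hive compiles into MapReduce jobs, and plugs a custom script in through TRANSFORM.

    -- Hypothetical table of status updates, partitioned by date (ds).
    CREATE TABLE status_updates (userid INT, status STRING)
    PARTITIONED BY (ds STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Count updates per day; the partition info in the metastore lets
    -- Hive prune partitions instead of scanning the whole table.
    SELECT ds, COUNT(1)
    FROM status_updates
    WHERE ds >= '2011-09-01'
    GROUP BY ds;

    -- Custom map-style scripts plug in through TRANSFORM.
    SELECT TRANSFORM (userid, status)
    USING 'python my_mapper.py'        -- my_mapper.py is a made-up script
    AS (userid STRING, tag STRING)
    FROM status_updates;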

Pig Latin is similar to Hive in that it also tries to ease the difficulty of writing MapReduce jobs.  MapReduce is considered too low level and too restrictive, and Pig Latin tries to find the sweet spot between declarative and procedural querying of data.  It provides ad hoc, read-only querying over large-scale data.  Parallel databases are usually too expensive, and SQL sometimes feels unnatural; the MapReduce data flow, in turn, is sometimes too restrictive, and the opaque nature of the map and reduce functions limits optimization.  A Pig Latin program specifies a sequence of transformations on the data: the programmer defines the sequence procedurally, but the operators are high-level concepts like filtering, grouping, and aggregation, which leaves more opportunity for optimization than raw MapReduce.  Pig Latin uses a nested data model, which is closer to how programmers think, and many use cases already store data in a nested form.  Java UDFs are treated as first-class citizens in the system, and only primitives that can be parallelized are allowed, which keeps performance good over large amounts of data.  Some of the primitives are FILTER, COGROUP, JOIN, FOREACH, FLATTEN, and STORE.  The system uses lazy evaluation, so execution happens only when it is needed to view or store the data, and optimizations can be performed on the logical plan at execution time.  Execution compiles down to Hadoop MapReduce jobs, with the COGROUPs becoming the boundaries between the map and reduce stages.  Pig Pen is a debugging framework that generates a sandbox data set so the developer can see example transformations while debugging a program.
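As a sketch of the style, here is a lightly adapted version of the paper's running example (the file names are made up, and the grouping key is referenced through Pig's built-in group field):

    -- Average pagerank of high-pagerank urls, per large category.
    urls = LOAD 'urls.txt' AS (url, category, pagerank:double);
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;    -- a (co)group marks a map/reduce boundary
    big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
    result = FOREACH big_groups GENERATE group, AVG(good_urls.pagerank);
    STORE result INTO 'top_categories';      -- lazy evaluation: nothing runs until here

Because evaluation is lazy, Pig can build and optimize the whole logical plan before launching any Hadoop jobs.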

Hive and Pig Latin are only two of the programming frameworks for large-scale distributed programs.  The common theme is that MapReduce is too low level, so higher-level languages need to be developed.  Unlike how SQL became the standard for relational data querying, I don't think there will be a single standard for large-scale data processing.  More and more people are dealing with this data, and they have varying levels of technical skill.  Some are comfortable with very high-level languages like SQL, which can be limited to certain kinds of data sets and analyses; others may prefer lower levels such as MapReduce.  Hive and Pig Latin borrow from declarative query languages to offer a higher level on top of MapReduce, and both provide a way to define multiple MapReduce phases for a single task.  Higher-level languages are definitely necessary, but right now it is not clear which model is "better".  Also, both compile down to MapReduce jobs, and that isn't necessarily the answer.  If these higher-level languages targeted a more general underlying framework with more low-level primitives, they might be able to perform more optimizations; forcing tasks to follow the MapReduce model may limit performance for some use cases.
