Functional Relational Programming with Cascalog
(Updated Feb. 6 with corrections, below.)
In 2006, Ben Moseley and Peter Marks published a paper, Out of the Tar Pit, in which they coined the term Functional Relational Programming. "Out of the Tar Pit" was influential on Clojure's design, particularly its emphasis on immutability and the separation of state from behavior. Moseley and Marks went further, however, in recommending that data be manipulated as relations. Relations are the abstract concept behind tables in a relational database or "facts" in some logic programming systems. Clojure does not enforce a relational model, but Clojure can be used for relational programming. For example, the clojure.set namespace defines relational algebra operations such as select, project, and join.
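As a quick sketch of what that looks like in practice: clojure.set treats a set of maps as a relation, so the standard relational operators apply directly. (The employees/depts data here are made up for illustration.)

```clojure
;; Relational algebra with clojure.set: a relation is a set of maps,
;; each map being one "row" with named "columns."
(require '[clojure.set :as set])

(def employees #{{:name "Alice" :dept 1}
                 {:name "Bob"   :dept 2}})
(def depts     #{{:dept 1 :dept-name "Engineering"}
                 {:dept 2 :dept-name "Sales"}})

;; Natural join on the shared :dept key
(set/join employees depts)
;; => #{{:name "Alice" :dept 1 :dept-name "Engineering"}
;;      {:name "Bob"   :dept 2 :dept-name "Sales"}}

;; Keep only the :name column
(set/project employees [:name])
;; => #{{:name "Alice"} {:name "Bob"}}

;; Select rows matching a predicate
(set/select #(= 1 (:dept %)) employees)
;; => #{{:name "Alice" :dept 1}}
```

Because relations are just sets of maps, these operators compose with the rest of the sequence library.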
In the early aughts, Jeffrey Dean and Sanjay Ghemawat developed the MapReduce programming model at Google to optimize the process of ranking web pages. MapReduce works well for I/O-bound problems where the computation on each record is small but the number of records is large. It specifically addresses the performance characteristics of modern commodity hardware, especially "disk is the new tape."
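To make the model concrete, here is a toy word count in plain Clojure that mimics the map, shuffle, and reduce phases in a single process. This only illustrates the programming model; it is not Hadoop's API, and the function names are my own.

```clojure
(require '[clojure.string :as str])

;; Map phase: each input record (a line) emits [key value] pairs.
(defn map-phase [line]
  (for [word (str/split line #"\s+")]
    [word 1]))

;; Reduce phase: fold all the values emitted for one key.
(defn reduce-phase [word counts]
  [word (reduce + counts)])

(defn word-count [lines]
  (->> lines
       (mapcat map-phase)          ; map over every record
       (group-by first)            ; "shuffle": group pairs by key
       (map (fn [[word pairs]]
              (reduce-phase word (map second pairs))))
       (into {})))

(word-count ["the quick fox" "the lazy dog"])
;; => {"the" 2, "quick" 1, "fox" 1, "lazy" 1, "dog" 1}
```

On Hadoop, the map and reduce functions run on many machines in parallel, and the framework handles the grouping, sorting, and data movement between them.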
Hadoop began life at Yahoo! and is now an Apache open-source project. It consists of two parts: a distributed filesystem (HDFS) and a MapReduce implementation. Together, they form a powerful framework for running computations on large data sets (terabytes to petabytes) across hundreds or thousands of machines.
Hadoop is written in Java, but it is possible to write MapReduce jobs in Clojure. (I released a helper library a couple of years ago.) Writing "raw" MapReduce in any language is difficult, however: the conceptual model is primitive, and complex computations often require several MapReduce jobs chained together. This is where data-processing tools built on top of Hadoop come in handy. Cascading is one such tool; others include Hive and Pig.
Cascading provides a dataflow-style API on top of Hadoop using a "pipe" metaphor, but computation steps within the pipes are still written in Java. Nathan Marz at BackType (now part of Twitter) added another layer by implementing Datalog on top of Cascading. He called the result Cascalog.
Datalog is a logic programming language that dates back to the 1970s and has occasionally been used as a database query language. Strictly speaking, Datalog is a subset of Prolog, but it has interesting properties that differentiate it:
- All Datalog programs are guaranteed to terminate
- Order of statements in Datalog doesn't matter
As a result, Datalog programs are often easier to write than programs in Prolog-style logic systems, such as clojure.core.logic. Clojure-contrib once had an in-memory Datalog implementation written by Jeffrey Straszheim, but it is not actively maintained.
Cascalog compiles a Datalog-like language into Cascading workflows that can be run on Hadoop MapReduce. For example, the following query (from the Cascalog documentation) joins two data sets named "age" and "gender" to find men over twenty-five:
(<- [?name ?age] (age ?name ?age) (gender ?name "male") (> ?age 25))
This program can be tested locally, on a single machine, then run on a Hadoop cluster of hundreds or thousands of machines. Joins across data sets are implicit, and the order of expressions doesn't matter; Cascalog and Cascading handle the translation into MapReduce jobs.
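As a sketch of what local testing might look like (assuming Cascalog's cascalog.api is on the classpath; the age and gender vectors here are hypothetical stand-ins for the named data sets): plain Clojure vectors of tuples can serve as generators, and ??<- runs a query in local mode and returns the result tuples.

```clojure
(use 'cascalog.api)

;; In-memory vectors of tuples act as data sets for local testing.
(def age    [["alice" 28] ["bob" 33] ["chris" 40]])
(def gender [["alice" "female"] ["bob" "male"] ["chris" "male"]])

;; ??<- executes the query locally and returns the matching tuples.
(??<- [?name ?age]
      (age ?name ?age)
      (gender ?name "male")
      (> ?age 25))
;; result tuples: ["bob" 33] and ["chris" 40] (order not guaranteed)
```

The same query, pointed at HDFS taps instead of in-memory vectors, runs unchanged on a cluster.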
I think Cascalog is a valid implementation of Functional Relational Programming as laid out by Moseley and Marks. It's not for everyone: MapReduce is only applicable to certain problem domains. But it achieves the goals of FRP: state is separate from behavior, and performance concerns are isolated. Naturally, there's a lot of complexity underneath it all, but BackType has demonstrated that Cascalog works on Twitter-size data sets.
CORRECTIONS February 6, 2012:
Hadoop began as part of the Nutch open-source search engine, with support from Yahoo!.
Cascalog is only a partial realization of Functional Relational Programming as described in "Out of the Tar Pit." In particular, it does not provide means for inserting data into the system (what Moseley/Marks called feeders) or reacting to that data (observers).