Dapper Dataflow Engine description
The Distributed and Parallel Program Execution Runtime
Dapper (Distributed and Parallel Program Execution Runtime) is a tool for taming the complexities of developing for large-scale cloud and grid computing, enabling the user to create distributed computations from the essentials -- the code that will execut
We live in interesting times, where breakthroughs in the sciences increasingly depend on the growing availability and abundance of commoditized, networked computational resources. With the help of the cloud or grid, computations that would otherwise run for days on a single desktop machine now have distributed and/or parallel formulations that can churn through, in a matter of hours, input sets ten times as large on a hundred machines. As alluring as the idea of strength in numbers may be, having just physical hardware is not enough -- a programmer has to craft the actual computation that will run on it. Consequently, the high value placed on human effort and creativity necessitates a programming environment that enables, and even encourages, succinct expression of distributed computations, and yet at the same time does not sacrifice generality.
Dapper is one such tool: it bridges the scientist/programmer's high-level specifications, which capture the essence of a program, with the low-level mechanisms that reflect the unsavory realities of distributed and parallel computing. Under its dataflow-oriented approach, Dapper enables users to code locally in Java and execute globally on the cloud or grid. The user first writes codelets, small snippets of code that perform simple tasks and do not, in themselves, constitute a complete program. The user then specifies how those codelets, seen as vertices in the dataflow, transmit data to each other via edge relations. The resulting directed acyclic dataflow graph is a complete program interpretable by the Dapper server, which, upon being contacted by long-lived worker clients, can coordinate a distributed execution.
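To make the codelet-and-edge idea concrete, here is a minimal, single-process sketch in plain Java. It is not Dapper's API: the names DataflowSketch, Codelet, addVertex, addEdge, and execute are hypothetical and chosen only for illustration. The sketch assumes that each codelet consumes its predecessors' outputs and produces one integer. It runs the vertices in dependency order, whereas a real Dapper server would hand them out to remote worker clients.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DataflowSketch {

    /** Hypothetical codelet type: maps the outputs of its predecessors to one output. */
    interface Codelet extends Function<List<Integer>, Integer> {}

    private final Map<String, Codelet> vertices = new LinkedHashMap<>();
    private final Map<String, List<String>> successors = new LinkedHashMap<>();
    private final Map<String, List<String>> predecessors = new LinkedHashMap<>();

    /** Registers a codelet as a vertex of the dataflow graph. */
    void addVertex(String name, Codelet codelet) {
        vertices.put(name, codelet);
        successors.put(name, new ArrayList<>());
        predecessors.put(name, new ArrayList<>());
    }

    /** Declares that the output of 'from' is an input of 'to'. */
    void addEdge(String from, String to) {
        successors.get(from).add(to);
        predecessors.get(to).add(from);
    }

    /** Runs each codelet once all of its inputs are available (topological order). */
    Map<String, Integer> execute() {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Deque<String> ready = new ArrayDeque<>();
        for (String v : vertices.keySet()) {
            int pending = predecessors.get(v).size();
            remaining.put(v, pending);
            if (pending == 0) {
                ready.add(v);
            }
        }
        Map<String, Integer> outputs = new LinkedHashMap<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            List<Integer> inputs = new ArrayList<>();
            for (String p : predecessors.get(v)) {
                inputs.add(outputs.get(p));
            }
            outputs.put(v, vertices.get(v).apply(inputs));
            // A successor becomes ready when its last pending input arrives.
            for (String s : successors.get(v)) {
                if (remaining.merge(s, -1, Integer::sum) == 0) {
                    ready.add(s);
                }
            }
        }
        if (outputs.size() != vertices.size()) {
            throw new IllegalStateException("dataflow graph contains a cycle");
        }
        return outputs;
    }

    /** Builds a diamond-shaped dataflow: source -> {double, square} -> sum. */
    public static Map<String, Integer> demo() {
        DataflowSketch flow = new DataflowSketch();
        flow.addVertex("source", in -> 6);
        flow.addVertex("double", in -> in.get(0) * 2);
        flow.addVertex("square", in -> in.get(0) * in.get(0));
        flow.addVertex("sum", in -> in.get(0) + in.get(1));
        flow.addEdge("source", "double");
        flow.addEdge("source", "square");
        flow.addEdge("double", "sum");
        flow.addEdge("square", "sum");
        return flow.execute();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {source=6, double=12, square=36, sum=48}
    }
}
```

In this sketch the "double" and "square" vertices do not depend on each other, which is the kind of independence a distributed runtime can exploit by running them on different machines.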
Under the Dapper model, the user no longer needs to worry about traditionally ad hoc aspects of managing the cloud or grid, which include handling data interconnects and dependencies, recovering from errors, distributing code, and starting jobs. Perhaps more importantly, Dapper provides an entire Java-based toolchain and runtime for framing nearly all coarse-grained distributed computations in a consistent format that allows for rapid deployment and easy conveyance to other researchers.
Here are some key features of "Dapper Dataflow Engine":
- A code distribution system that allows the Dapper server to transmit requisite program code over the network and have clients dynamically load it. A consequence of this is that, barring external executables, updates to Dapper programs need only happen on the server-side.
- A powerful subflow embedding method for dynamically modifying the dataflow graph at runtime.
- A runtime in vanilla Java, a language that many are no doubt familiar with. Aside from the requirement of a recent JVM and optionally Graphviz Dot, Dapper is self-contained.
- A robust control protocol. The Dapper server expects any number of clients to fail, at any time, and has customizable re-execution and timeout policies to cope. Consequently, one can start and stop (long-lived) clients without fear of putting the entire system into an inconsistent state.
- Flexible semantics that allow data transfers via files or TCP streams.
- Interoperability with firewalls. Since your local cloud or grid probably sits behind a firewall, we have devised special semantics for streaming data transfers.
- Liberal licensing terms. Dapper is released under the LGPL to prevent contamination of your codebase.
- Operation as an embedded application. A user manual describes the programming API that users can follow to run the Dapper server inside an application like Apache Tomcat.
- Operation as a standalone user interface. With it, one can run off-the-shelf demos and learn core concepts from visual examples. By following a minimal set of conventions, one can then bundle one's own Dapper programs as execution archives and get real-time dataflow status and debugging feedback.
Recent changes:
- The ServerLogic#closeIdleClients method has been changed to better match the user's intuitive notion of idleness.
- A user option for specifying the server's hostname has been added.
- Networking internals have been reworked to use new APIs.
- The build process has been updated to support both 32- and 64-bit Windows cross-compilation.
- The dapper.* hierarchy has been renamed to org.dapper.*.