Twister version 0.9 is now available.

The MapReduce programming model has simplified the implementation of many data-parallel applications. The simplicity of the model and the quality of service provided by its many implementations have attracted much enthusiasm in the parallel computing community. From years of experience applying the MapReduce programming model to various scientific applications, we identified a set of extensions to the programming model and improvements to its architecture that expand the applicability of MapReduce to more classes of applications. Twister is a lightweight MapReduce runtime we have developed by incorporating these enhancements.

Twister provides the following features to support MapReduce computations. (Twister is developed as part of Jaliya Ekanayake's Ph.D. research and is supported by the SALSA Team @ IU.)

Distinction between static and variable data
Configurable long running (cacheable) map/reduce tasks
Pub/sub messaging based communication/data transfers
Efficient support for iterative MapReduce computations (significantly faster than Hadoop or Dryad/DryadLINQ)
Combine phase to collect all reduce outputs
Data access via local disks
Lightweight (~5600 lines of Java code)
Support for typical MapReduce computations
Tools to manage data

New features in Twister v0.9:

Support for the new broker software ActiveMQ (see userguide)
Express Twister environment configuration (see userguide)
Automatic recovery from faults when FaultTolerance is enabled (see userguide)
Partition files can be created inside the client code (see userguide)

Twister Programming Model

Iterative MapReduce programming model using Twister


Static vs. Dynamic Data

Many of the iterative applications we analyzed share a common characteristic: they operate on two types of data products. Static data is used in each iteration and remains fixed throughout the computation, whereas variable data holds the results computed in each iteration, which are typically consumed in the next iteration, as in many expectation-maximization (EM) type algorithms.
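This distinction can be illustrated with a plain-Java sketch (illustrative only, not the Twister API): the input points are static data read in every iteration but never reloaded, while the current estimate is variable data produced by one iteration and consumed by the next.

```java
// Illustrative sketch (plain Java, not the Twister API) of an EM-style loop:
// staticPoints stays fixed across iterations; the estimate is the variable
// data, recomputed each iteration and fed into the next one.
public class StaticVsVariable {
    public static double iterate(double[] staticPoints, double estimate, int iterations) {
        for (int i = 0; i < iterations; i++) {
            double sum = 0;
            for (double p : staticPoints) sum += p;        // static data: reused, never reloaded
            double mean = sum / staticPoints.length;
            estimate = estimate + 0.5 * (mean - estimate); // variable data: consumed next iteration
        }
        return estimate;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 3.0, 6.0};            // static data, loaded once
        System.out.println(iterate(points, 0.0, 3));       // prints 2.625
    }
}
```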

Cacheable Mappers/Reducers

Although some typical MapReduce computations, such as distributed sorting and information retrieval, consume very large data sets, many of the iterative applications we encounter operate on moderately sized data sets that can fit into the distributed memory of the computation cluster. This observation led us to explore long-running map/reduce tasks, similar to the long-running parallel processes in many MPI applications, which last throughout the life of the computation. Long-running (cacheable) map/reduce tasks can be configured with static data once and reuse it in every iteration without reloading it. Current MapReduce implementations such as Hadoop and DryadLINQ do not support this behavior; they initiate new map/reduce tasks and load the static data in each iteration, incurring considerable performance overhead.
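The caching pattern can be sketched as follows (hypothetical method names, not the real Twister API): a long-running task object is configured with static data exactly once, and its map method is then invoked repeatedly across iterations without reloading that data.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a long-running ("cacheable") map task, using hypothetical names
// (not the real Twister API): configure() loads static data once; map() is
// then called in every iteration and reuses the cached data.
public class CacheableMapTask {
    private List<Double> staticData;  // cached across iterations
    private int loadCount = 0;        // how many times static data was loaded

    public void configure(List<Double> data) {  // called once per job
        this.staticData = new ArrayList<>(data);
        this.loadCount++;
    }

    public double map(double variableInput) {   // called in every iteration
        double sum = 0;
        for (double d : staticData) sum += d * variableInput;
        return sum;
    }

    public int getLoadCount() { return loadCount; }

    public static void main(String[] args) {
        CacheableMapTask task = new CacheableMapTask();
        task.configure(List.of(1.0, 2.0, 3.0));  // static data loaded once
        double out = 0;
        for (int iter = 0; iter < 5; iter++) {
            out = task.map(iter);                // cached data reused each iteration
        }
        System.out.println(out + " loads=" + task.getLoadCount()); // prints 24.0 loads=1
    }
}
```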

Supports "side-effect-free" Programming

Although Twister supports long-running map/reduce tasks, it does not encourage users to store state information in those tasks, which would violate the "side-effect-free" nature of map/reduce computations; instead, it achieves considerable performance gains by caching static data across map/reduce invocations. The framework does not guarantee that the same set of map/reduce task objects is used throughout the life of the iterative computation.

Combine step as a Further Reduction

Twister also introduces an optional phase named "combine", a further reduction that merges the outputs of the reduce phase into a single value. The user program and the combine operation run in a single process space, making the combine output directly accessible to the user program. This enables the user to check conditions, such as convergence, based on the output of the MapReduce computation.
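A minimal sketch of this idea (plain Java, illustrative names): per-partition reduce outputs are merged into one value that the driver program can inspect directly, for example to decide whether the iteration has converged.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the optional "combine" step (plain Java, illustrative names):
// all reduce outputs are merged into a single value, which the user program
// can then test, e.g. as a convergence condition between iterations.
public class CombineSketch {
    // Combine: merge all reduce outputs into one value (here, their sum).
    public static double combine(Map<String, Double> reduceOutputs) {
        double total = 0;
        for (double v : reduceOutputs.values()) total += v;
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> reduceOutputs = new HashMap<>();
        reduceOutputs.put("partition-0", 0.004);   // hypothetical per-partition residuals
        reduceOutputs.put("partition-1", 0.003);

        double residual = combine(reduceOutputs);
        boolean converged = residual < 0.01;       // driver-side check on the combined value
        System.out.println(converged);             // prints true
    }
}
```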

Uses Pub/Sub Messaging

Twister uses pub/sub messaging for all communication and data transfer, which eliminates the overhead of transferring data via file systems as in Hadoop or DryadLINQ. The <Key,Value> pairs produced during the map stage are transferred directly to the reduce stage, and the output of the reduce stage is transferred directly to the combine stage, via the pub/sub broker network. Currently Twister uses the publish/subscribe messaging capabilities of the NaradaBrokering messaging infrastructure, but the framework is extensible to other publish/subscribe infrastructures such as ActiveMQ.
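The delivery pattern can be sketched with a toy in-memory broker (illustrative only; Twister itself relies on a broker network such as NaradaBrokering or ActiveMQ): the map stage publishes <Key,Value> pairs to a topic, and the subscribed reduce stage receives them directly, with no intermediate files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Toy in-memory pub/sub broker (illustrative; not NaradaBrokering/ActiveMQ):
// map output <Key,Value> pairs are published to a topic and delivered
// directly to the subscribed reduce stage, bypassing the file system.
public class PubSubSketch {
    private final Map<String, List<BiConsumer<String, Integer>>> topics = new HashMap<>();

    public void subscribe(String topic, BiConsumer<String, Integer> handler) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    public void publish(String topic, String key, int value) {
        for (BiConsumer<String, Integer> h : topics.getOrDefault(topic, List.of())) {
            h.accept(key, value);  // direct in-memory delivery
        }
    }

    public static void main(String[] args) {
        PubSubSketch broker = new PubSubSketch();
        Map<String, Integer> reduceState = new HashMap<>();

        // The "reduce" stage subscribes to the map-output topic, summing per key.
        broker.subscribe("map-output", (k, v) -> reduceState.merge(k, v, Integer::sum));

        // The "map" stage publishes its <Key,Value> pairs.
        broker.publish("map-output", "a", 1);
        broker.publish("map-output", "a", 2);
        broker.publish("map-output", "b", 5);

        System.out.println(reduceState); // prints {a=3, b=5}
    }
}
```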

Data Access via Local Disks

We provide two mechanisms to access data in Twister: (i) from the local disks of the compute nodes, and (ii) directly from the pub/sub infrastructure. For simplicity of implementation, we provide a file-based data access mechanism for the map/reduce tasks. Unlike Hadoop, Twister does not come with a built-in file system; instead it provides a tool to manage data across these distributed disks. In addition, the use of streaming enables Twister to send input <Key,Value> pairs for the map stage directly from the user program and to configure map/reduce stages using data sent from the user program. The figure above shows the programming model of Twister and how iterative MapReduce computations are executed using it.

Fault Tolerance [Work in Progress]

Providing fault tolerance support for iterative computations with Twister is currently under development.


If you would like to read about the motivation behind our implementation or explore the Twister architecture, please follow the links below. You can also find the latest information on the SALSA team's activities here.

Papers and Presentations

Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, "Twister: A Runtime for Iterative MapReduce," The First International Workshop on MapReduce and its Applications (MAPREDUCE'10), HPDC 2010.
Jaliya Ekanayake (Advisor: Geoffrey Fox), "Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing," Doctoral Showcase, SuperComputing 2009. (Presentation)
Jaliya Ekanayake, Atilla Soner Balkir, Thilina Gunarathne, Geoffrey Fox, Christophe Poulain, Nelson Araujo, Roger Barga, "DryadLINQ for Scientific Analyses," Fifth IEEE International Conference on e-Science (eScience 2009), Oxford, UK.
Jaliya Ekanayake, Geoffrey Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," First International Conference on Cloud Computing (CloudComp09), Munich, Germany, 2009.
Geoffrey Fox, Seung-Hee Bae, Jaliya Ekanayake, Xiaohong Qiu, and Huapeng Yuan, "Parallel Data Mining from Multicore to Cloudy Grids," High Performance Computing and Grids workshop, 2008.
Jaliya Ekanayake, Shrideep Pallickara, and Geoffrey Fox, "MapReduce for Data Intensive Scientific Analysis," Fourth IEEE International Conference on eScience, 2008, pp. 277-284.

Twister Name and the Logo
We selected the name Twister because of the following characteristics of real twisters: (i) they are fast, (ii) they rotate (resembling iterations), and (iii) their funnel shape represents the reductions. The two colors in the logo represent the map and reduce phases of MapReduce computations.

Contacts
Jaliya Ekanayake, Judy Qiu, Hui Li, Bingjing Zhang, Geoffrey Fox



Check out the Userguide and Samples