| Sample Applications for Twister | |
|
|
Kmeans Clustering |
|
|
Matrix Multiplication |
|
|
Pagerank |
|
|
Graph Search |
| Word Count | |
| High Energy Physics | |
| Kmeans Clustering | |
|
Kmeans clustering is a well-known clustering algorithm aiming to
cluster a set of data points to a predefined number of clusters. A
MapReduce style parallel algorithm for Kmeans Clustering is shown in
the first figure. In that each map function gets a portion of the
data, and it needs to access this data split in each iteration.
These data items do not change over the iterations, and it is loaded
once for the entire set of iterations. The variable data is the
current cluster centers calculated during the previous iteration and
hence used as the input value for the map function.
All the map functions get this same input data (current cluster centers) at each iteration and computes a partial cluster centers by going through its data set. A reduce function computes the average of the partial cluster centers and produce the cluster centers for the next step. Main program, once it gets these new cluster centers, calculates the difference between the new cluster centers and the previous cluster centers and determine if it needs to execute another cycle of MapReduce computation. |
|
|
MapReduce for Kmeans Clustering Algorithm |
|
| In the following figure, we compare the performance of different parallel programming runtimes for this iterative computation. | |
|
Performance of different Parallel Programming Runtimes for Kmeans Clustering Computation |
|
| More details of this analysis can be found in this paper. [You can also download the MPI version of the Kmeans clustering program from this link]. | |
| To run word count sample, please
refer to the readme file at the following location. $TWISTER_HOME/samples/kmeans/README.txt |
|
| Matrix Multiplication | |
| Matrix multiplication is another
well-studied area of research. Among the many parallel
algorithms for matrix multiplication,
Fox's Algorithm and
Cannon's
Algorithm perform the best. However applying these algorithms in
MapReduce style is difficult due to the restricted communication
patterns available in MapReduce domain. Hence I derived a matrix
multiplication algorithm for MapReduce based on the column/row
decomposition strategy.
To clarify our algorithm, let's consider an example where two input matrices, A and B, produce matrix C, as the result of the multiplication process. We split the matrix B into a set of column blocks and the matrix A into a set of row blocks. In each iteration, all the map tasks process two inputs: (i) a column block of matrix B, and (ii) a row block of matrix A; collectively, they produce a row block of the resultant matrix C. The column block associated with a particular map task is fixed throughout the computation, while the row blocks are changed in each iteration. However, in Hadoop's programming model (a typical MapReduce model), there is no way to specify this behavior. Hence, it loads both the column block and the row block in each iteration of the computation. Twister supports the notion of long running map/reduce tasks where these tasks are allowed to retain static data in the memory across invocations, yielding better performance for "Iterative MapReduce" computations. The communication pattern of this application is shown in the following figure. |
|
|
MapReduce Algorithm for Full Matrix Multiplication |
|
| In the following figure we compare the parallel overhead of different parallel runtimes for matrix multiplication application. | |
|
Parallel overhead of Twister, Hadoop, and MPI in Matrix Multiplication |
|
| To run word count sample, please
refer to the readme file at the following location. $TWISTER_HOME/samples/matrix/README.txt |
|
| Pagerank | |
|
The search engine such as Google,Yahoo need to rank the query results from the Internet. The ranking factors of the web pages can be contend-based or link-based. PageRank is the well know link analysis algorithm. It calculates numerical value to each element of a hyperlinked set of web pages, which reflects the probability that the random surfer will access that page. The process of PageRank can be understood as a Markov chain which needs recursively calculation to converge. So it is within the class of applications where multiple iterations are necessary for the overall computation. In PageRank, the goal is to calculate the access probability for each web page. An iteration of the algorithm calculates the new access probability for each web page based on values calculated in the previous computation. The iteration will not stop until the difference value is less than a predefined threshold, where the difference value is the different between the access probabilities of web pages in Nth iteration and the those in (N+1)th iteration.
MapReduce Implementation of Pagerank Algorithm In Twister implementation of PageRank, we constructed web graphs with vertices where in-link degree of all pages comply with the power law distribution. These input data are partitioned into few parts and stored in the format of adjacency list. Each map function runs on one of the partitioned data, which are constant over the iterations. The variable data are the PageRank values calculated during previous iteration which in turns used as the input value for the next iteration. In each iteration, the MAP task updates the old PageRank values to new one by analyzing the local partial adjacency list file. The output of MAP task is partial of PageRank values. The reduce task receives all the partial output and produces the new PageRank values. |
|
| To run pagerank sample, please refer
to the readme file at the following location. $TWISTER_HOME/samples/pagerank/README.txt |
|
| Graph Search | |
This algorithm tries to use
Twister Framework to process breadth-first graph search problem
parallel. This algorithm is based on Cailin's Hadoop version of
breadth-first graph search[1]. The basic idea of this algorithm is
to exploring the nodes of the same level parallel, and then to go to
the nodes in next levels iteratively. Every step will be
sequentially described in the following chapters. Graph Data
Preparation [1]Cailin. 6 18, 2009. http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm. |
|
| To graph search sample, please refer
to the readme file at the following location. $TWISTER_HOME/samples/bfs/README.txt |
|
| Word Count | |
|
This is a simple MapReduce computation that is often use to
explain the MapReduce programming model. This application does not utilize
the iterative MapReduce capabilities of Twister. It is mainly used here as a
starting material. The goal of the word count application is to count the number of occurrences of the words in a collection of documents. Unlike typical MapReduce runtimes, Twister allows users to handle files at Map tasks and hence the word count map task can read a data partition (typically a file) and produce the count of words that it identifies. At the reduce tasks, these counts are summarized to produce a total word counts. In typical word count applications implemented using other MapReduce runtimes, the map task outputs (word,1) pairs for all the words it encounter. This approach is not optimized for performance rather simple to program. With small amount of more complexity we can simply convert this map task to produce a list of (word,count) pairs corresponding to the local data partion. This is the approach used in Twister word count application. |
|
|
MapReduce for Word Count |
|
| To run word count sample, please
refer to the readme file at the following location. $TWISTER_HOME/samples/wordcount/README.txt |
|
|
|
| High Energy Physics(HEP) Data Analysis | |
|
MapReduce for HEP Data Analysis
Performance of Twister (CGL-MapReduce) and Hadoop in HEP Data Analysis As part of an effort funded by the DoE we
are working with the High
Energy Physics group at Caltech with their particle physics data
analysis tools. The data analysis framework used by these tools is
ROOT, and the analysis functions are
written using an interpreted language of ROOT named
CINT. The goal of the analysis is to execute a set of analysis functions on a collection of data files produced by high-energy physics experiments. After processing each data file, the analysis produces a histogram of identified features. These histograms are then combined to produce the final result of the overall analysis. This data analysis task is both data and compute intensive and fits very well for MapReduce computation model. First figure shows the program flow of this analysis once it is converted to a MapReduce implementation and the second figure compares the performance of Twister and Hadoop for this data analysis. More details of this analysis can be found in this paper. |
|
|
|
|







