Note that non-local file systems require a scheme prefix, such as hdfs://.

Word Count

WordCount is the “Hello World” of Big Data processing systems. It computes the frequency of words in a text collection. The algorithm works in two steps: First, the texts are split into individual words. Second, the words are grouped and counted.

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.readTextFile("/path/to/file");

DataSet<Tuple2<String, Integer>> counts =
        // split up the lines in pairs (2-tuples) containing: (word,1)
        text.flatMap(new Tokenizer())
        // group by the tuple field "0" and sum up tuple field "1"
        .groupBy(0)
        .sum(1);

counts.writeAsCsv(outputPath, "\n", " ");

// User-defined functions
public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<String, Integer>(token, 1));
            }
        }
    }
}

The WordCount example implements the algorithm described above and takes the input parameters --input <path> --output <path>. Any text file will do as test data.
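To check what output to expect before running the job on real data, the two steps (split, then group and count) can be reproduced on a few in-memory lines. The following is a minimal plain-Java sketch, not part of the example program and not using any Flink API; the class name LocalWordCountSketch and the method countWords are hypothetical. It applies the same normalization as the Tokenizer above (lower-casing and splitting on "\\W+").

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; it does not use Flink.
final class LocalWordCountSketch {

    // Step 1: split each line into words. Step 2: group equal words and count them.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // same normalization as the Tokenizer: lower-case, split on non-word characters
            for (String token : line.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be, or not to be");
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords(lines));
    }
}
```

The sorted map is used only to make the printed result deterministic; the distributed job makes no ordering guarantee for its output.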

Page Rank

The PageRank algorithm computes the “importance” of pages in a graph defined by links, which point from one page to another. It is an iterative graph algorithm, which means that it repeatedly applies the same computation. In each iteration, each page distributes its current rank over all its neighbors and computes its new rank as a taxed sum of the ranks it received from its neighbors. The PageRank algorithm was popularized by the Google search engine, which uses the importance of webpages to rank the results of search queries.

In this simple example, PageRank is implemented with a bulk iteration and a fixed number of iterations.

For this simple implementation it is required that each page has at least one incoming and one outgoing link (a page can point to itself).
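The iteration described above can be sketched on a single machine as follows. This is a minimal plain-Java illustration under stated assumptions, not the example program itself and not Flink API: the class name PageRankSketch, the adjacency-array input, and the damping value of 0.85 are all assumptions. The "taxed sum" is taken here to mean damping * (sum of received ranks) + (1 - damping) / numPages, which is one common formulation.

```java
import java.util.Arrays;

// Hypothetical single-machine sketch; names and the damping value are assumptions.
final class PageRankSketch {

    // outLinks[p] lists the pages that page p links to.
    // Every page is assumed to have at least one outgoing link.
    static double[] pageRank(int[][] outLinks, int iterations, double damping) {
        int numPages = outLinks.length;
        double[] ranks = new double[numPages];
        Arrays.fill(ranks, 1.0 / numPages); // start with a uniform rank

        // fixed number of iterations, as in a bulk iteration
        for (int i = 0; i < iterations; i++) {
            double[] received = new double[numPages];
            // each page distributes its current rank evenly over its neighbors
            for (int page = 0; page < numPages; page++) {
                double share = ranks[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    received[target] += share;
                }
            }
            // new rank: taxed sum of the received ranks
            for (int page = 0; page < numPages; page++) {
                ranks[page] = damping * received[page] + (1 - damping) / numPages;
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        // page 0 links to 1 and 2; pages 1 and 2 both link back to 0
        int[][] outLinks = {{1, 2}, {0}, {0}};
        System.out.println(Arrays.toString(pageRank(outLinks, 20, 0.85)));
    }
}
```

Because every page passes on all of its rank, the ranks keep summing to 1 across iterations. A page without outgoing links would leak rank out of the graph, which is one reason for the link requirement stated above.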

Connected Components

The Connected Components algorithm identifies parts of a larger graph which are connected by assigning all vertices in the same connected part the same component ID. Similar to PageRank, Connected Components is an iterative algorithm. In each step, each vertex propagates its current component ID to all its neighbors. A vertex accepts the component ID from a neighbor if it is smaller than its own component ID.

This implementation uses a delta iteration: Vertices that have not changed their component ID do not participate in the next step. This yields much better performance, because the later iterations typically deal only with a few outlier vertices.

// read vertex and edge data
DataSet<Long> vertices = getVertexDataSet(env);

DataSet<Tuple2<Long, Long>> edges = getEdgeDataSet(env).flatMap(new UndirectEdge());

// assign the initial component IDs (equal to the vertex ID)
DataSet<Tuple2<Long, Long>> verticesWithInitialId = vertices.map(new DuplicateValue<Long>());

// open a delta iteration
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
        verticesWithInitialId.iterateDelta(verticesWithInitialId, maxIterations, 0);

// apply the step logic:
DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset()
        // join with the edges
        .join(edges).where(0).equalTo(0).with(new NeighborWithComponentIDJoin())
        // select the minimum neighbor component ID
        .groupBy(0).aggregate(Aggregations.MIN, 1)
        // update if the component ID of the candidate is smaller
        .join(iteration.getSolutionSet()).where(0).equalTo(0)
        .flatMap(new ComponentIdFilter());

// close the delta iteration (delta and new workset are identical)
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(changes, changes);

// emit result
result.writeAsCsv(outputPath, "\n", " ");

// User-defined functions

public static final class DuplicateValue<T> implements MapFunction<T, Tuple2<T, T>> {

    @Override
    public Tuple2<T, T> map(T vertex) {
        return new Tuple2<T, T>(vertex, vertex);
    }
}

public static final class UndirectEdge
        implements FlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {

    Tuple2<Long, Long> invertedEdge = new Tuple2<Long, Long>();

    @Override
    public void flatMap(Tuple2<Long, Long> edge, Collector<Tuple2<Long, Long>> out) {
        invertedEdge.f0 = edge.f1;
        invertedEdge.f1 = edge.f0;
        out.collect(edge);
        out.collect(invertedEdge);
    }
}

public static final class NeighborWithComponentIDJoin
        implements JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {

    @Override
    public Tuple2<Long, Long> join(Tuple2<Long, Long> vertexWithComponent, Tuple2<Long, Long> edge) {
        return new Tuple2<Long, Long>(edge.f1, vertexWithComponent.f1);
    }
}

public static final class ComponentIdFilter
        implements FlatMapFunction<Tuple2<Tuple2<Long, Long>, Tuple2<Long, Long>>, Tuple2<Long, Long>> {

    @Override
    public void flatMap(Tuple2<Tuple2<Long, Long>, Tuple2<Long, Long>> value,
            Collector<Tuple2<Long, Long>> out) {
        if (value.f0.f1 < value.f1.f1) {
            out.collect(value.f0);
        }
    }
}

The ConnectedComponents program implements the above example. It requires the following parameters to run: --vertices <path> --edges <path> --output <path> --iterations <n>.

// set up execution environment
val env = ExecutionEnvironment.getExecutionEnvironment

// read vertex and edge data
// assign the initial components (equal to the vertex id)
val vertices = getVerticesDataSet(env).map { id => (id, id) }

// undirected edges by emitting for each input edge the input edges itself and an inverted
// version
val edges = getEdgesDataSet(env).flatMap { edge => Seq(edge, (edge._2, edge._1)) }

// open a delta iteration
val verticesWithComponents = vertices.iterateDelta(vertices, maxIterations, Array(0)) {
  (s, ws) =>

    // apply the step logic: join with the edges
    val allNeighbors = ws.join(edges).where(0).equalTo(0) { (vertex, edge) =>
      (edge._2, vertex._2)
    }

    // select the minimum neighbor
    val minNeighbors = allNeighbors.groupBy(0).min(1)

    // update if the component of the candidate is smaller
    val updatedComponents = minNeighbors.join(s).where(0).equalTo(0) {
      (newVertex, oldVertex, out: Collector[(Long, Long)]) =>
        if (newVertex._2 < oldVertex._2) out.collect(newVertex)
    }

    // delta and new workset are identical
    (updatedComponents, updatedComponents)
}

verticesWithComponents.writeAsCsv(outputPath, "\n", " ")
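To see why the delta iteration touches fewer vertices in later steps, the same logic can be traced on a single machine. The following is a minimal plain-Java sketch, not the example program and not Flink API; the class name ConnectedComponentsSketch and its method signature are hypothetical. The map plays the role of the solution set, and the set of vertices changed in the previous step plays the role of the workset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical single-machine sketch of the workset idea; it does not use Flink.
final class ConnectedComponentsSketch {

    static Map<Long, Long> components(List<Long> vertices, List<long[]> edges, int maxIterations) {
        // build undirected adjacency lists (each edge is stored in both directions)
        Map<Long, List<Long>> neighbors = new HashMap<>();
        for (Long v : vertices) {
            neighbors.put(v, new ArrayList<>());
        }
        for (long[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }

        // solution set: the initial component ID equals the vertex ID
        Map<Long, Long> component = new TreeMap<>();
        for (Long v : neighbors.keySet()) {
            component.put(v, v);
        }

        // workset: initially every vertex
        Set<Long> workset = new HashSet<>(component.keySet());

        for (int i = 0; i < maxIterations && !workset.isEmpty(); i++) {
            // only workset vertices propagate; keep the minimum ID offered to each neighbor
            Map<Long, Long> candidates = new HashMap<>();
            for (Long v : workset) {
                long id = component.get(v);
                for (Long n : neighbors.get(v)) {
                    candidates.merge(n, id, Long::min);
                }
            }
            // update if the candidate ID is smaller; changed vertices form the next workset
            Set<Long> changed = new HashSet<>();
            for (Map.Entry<Long, Long> c : candidates.entrySet()) {
                if (c.getValue() < component.get(c.getKey())) {
                    component.put(c.getKey(), c.getValue());
                    changed.add(c.getKey());
                }
            }
            workset = changed;
        }
        return component;
    }

    public static void main(String[] args) {
        List<Long> vertices = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        List<long[]> edges = Arrays.asList(new long[] {1, 2}, new long[] {2, 3}, new long[] {4, 5});
        // prints {1=1, 2=1, 3=1, 4=4, 5=4}
        System.out.println(components(vertices, edges, 10));
    }
}
```

In this run the workset shrinks from five vertices to three, then to one, and is empty after the third step, so the loop stops well before the iteration limit.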

Input files are plain text files and must be formatted as follows:

Vertices are represented as IDs and separated by new-line characters.

For example "1\n2\n12\n42\n63\n" gives five vertices with (1), (2), (12), (42), and (63).

Edges are represented as pairs of vertex IDs which are separated by space characters. Edges are separated by new-line characters: