Milinda Pathirage
http://milinda.pathirage.org/
Bulk Loading Data to HBase

<p>We are evaluating storage products for storing and serving the HathiTrust digital library's corpus these days, and one of the candidates is HBase. We are considering HBase mainly because it comes with built-in Hadoop MapReduce and Spark support. The HathiTrust corpus consists of digitized (OCRed) books, journals and various other historical documents from libraries all over the world, which we call volumes. The full corpus contains about 14 million volumes, and we keep them in a shared Lustre cluster. Our plan is to move the corpus to our cluster to support large-scale analysis and direct downloads.</p>
<p>To understand the performance behavior of HBase, we loaded about 40,000 volumes from our corpus into HBase and performed some random read tests. We used a three-node cluster and <a href="http://hortonworks.com/products/data-center/hdp/">HDP</a> 2.5. Each node has a single Core i7 processor, 8GB of RAM and a 500GB HDD. A cluster like this is not suitable for a production HBase deployment, but we didn't have any other option because we hadn't received our hardware yet. Since we had to pack an HDFS cluster, a Zookeeper cluster, and an HBase cluster into three nodes, HBase performance was poor. We used HBase's bulk load feature, and I am going to discuss the MapReduce-based bulk loading process in the rest of this post.</p>
<p>Apache <a href="https://hbase.apache.org">HBase</a> is a non-relational database modeled after Google's <a href="http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">BigTable</a> and uses HDFS as its storage layer. One of the interesting properties of HBase is the ability to bulk load data. Since we already have our data and we will only see a small number of writes periodically, this is a handy feature for our use case. There are multiple ways to do this, and HBase provides several CLI tools, such as the TSV bulk loader, to facilitate the process. In addition to the built-in tools, you can use a MapReduce application to bulk load data as well. In this approach, MapReduce outputs <a href="http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/">HFile</a>s, which is the internal storage format of HBase, and you can use the <code class="highlighter-rouge">org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles</code> tool to load the generated HFiles into an HBase table.</p>
<p>We start by creating a table in HBase with a single split. If you know your row key distribution, you can <a href="http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/">pre-split</a> your table accordingly. I did the splitting manually via the HBase UI. We didn't load the raw data into HDFS for the bulk load; the mappers read data directly from the local filesystem. This was possible because our data is on a network filesystem: all we had to do was mount it on the nodes where the YARN node managers run and make it accessible to the Hadoop user.</p>
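<p>For reference, the table can also be created and pre-split programmatically through the HBase Admin API. The sketch below is only an illustration: the table name, column family names and split points are placeholders, not the ones we actually used.</p>

<div class="language-java highlighter-rouge"><pre class="highlight"><code>import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Hypothetical table and column family names, for illustration only.
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("volumes"));
            table.addFamily(new HColumnDescriptor("compressed"));
            table.addFamily(new HColumnDescriptor("pages"));
            // Split points should reflect your actual row key distribution;
            // these are placeholder boundaries.
            byte[][] splits = new byte[][] {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(table, splits);
        }
    }
}
</code></pre>
</div>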
<p>HBase provides reducers to use with a bulk load MapReduce application, and calling <code class="highlighter-rouge">HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)</code> in your MapReduce driver code will automatically configure the reducer for you. Your mapper can output two different types of values: <code class="highlighter-rouge">org.apache.hadoop.hbase.KeyValue</code> objects, or <code class="highlighter-rouge">org.apache.hadoop.hbase.client.Put</code> objects encapsulating a whole row. We use the latter. The mapper code for loading digitized volumes is shown below. Please note that digitized volumes are stored in the filesystem in a special directory hierarchy called a <a href="https://wiki.ucop.edu/display/Curation/PairTree">PairTree</a>.</p>
<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="kn">package</span> <span class="n">edu</span><span class="o">.</span><span class="na">indiana</span><span class="o">.</span><span class="na">d2i</span><span class="o">.</span><span class="na">htrc</span><span class="o">.</span><span class="na">corpusmgt</span><span class="o">.</span><span class="na">hbase</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.conf.Configuration</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.Cell</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.KeyValue</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.client.Put</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.io.ImmutableBytesWritable</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.io.LongWritable</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.io.Text</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.mapreduce.Mapper</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.ByteArrayOutputStream</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.File</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.IOException</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.InputStream</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.nio.file.Files</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.nio.file.Paths</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.Enumeration</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.Random</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.zip.ZipEntry</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.zip.ZipFile</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">BulkLoadMapper</span> <span class="kd">extends</span> <span class="n">Mapper</span><span class="o">&lt;</span><span class="n">LongWritable</span><span class="o">,</span> <span class="n">Text</span><span class="o">,</span> <span class="n">ImmutableBytesWritable</span><span class="o">,</span> <span class="n">Put</span><span class="o">&gt;</span> <span class="o">{</span>
<span class="kt">byte</span><span class="o">[]</span> <span class="n">buf</span> <span class="o">=</span> <span class="k">new</span> <span class="kt">byte</span><span class="o">[</span><span class="mi">1024</span><span class="o">];</span>
<span class="nd">@Override</span>
<span class="kd">protected</span> <span class="kt">void</span> <span class="nf">setup</span><span class="o">(</span><span class="n">Context</span> <span class="n">context</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">IOException</span><span class="o">,</span> <span class="n">InterruptedException</span> <span class="o">{</span>
<span class="n">Configuration</span> <span class="n">conf</span> <span class="o">=</span> <span class="n">context</span><span class="o">.</span><span class="na">getConfiguration</span><span class="o">();</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">protected</span> <span class="kt">void</span> <span class="nf">map</span><span class="o">(</span><span class="n">LongWritable</span> <span class="n">key</span><span class="o">,</span> <span class="n">Text</span> <span class="n">value</span><span class="o">,</span> <span class="n">Context</span> <span class="n">context</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">IOException</span><span class="o">,</span> <span class="n">InterruptedException</span> <span class="o">{</span>
<span class="n">String</span><span class="o">[]</span> <span class="n">volIdPathPair</span> <span class="o">=</span> <span class="n">value</span><span class="o">.</span><span class="na">toString</span><span class="o">().</span><span class="na">split</span><span class="o">(</span><span class="s">"\\s*,\\s*"</span><span class="o">);</span>
<span class="n">String</span> <span class="n">volId</span> <span class="o">=</span> <span class="n">volIdPathPair</span><span class="o">[</span><span class="mi">0</span><span class="o">];</span>
<span class="n">volId</span> <span class="o">=</span> <span class="n">volId</span><span class="o">.</span><span class="na">substring</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="n">volId</span><span class="o">.</span><span class="na">length</span><span class="o">()</span> <span class="o">-</span> <span class="mi">1</span><span class="o">);</span>
<span class="n">String</span> <span class="n">volPath</span> <span class="o">=</span> <span class="n">volIdPathPair</span><span class="o">[</span><span class="mi">1</span><span class="o">];</span>
<span class="n">volPath</span> <span class="o">=</span> <span class="n">volPath</span><span class="o">.</span><span class="na">substring</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="n">volPath</span><span class="o">.</span><span class="na">length</span><span class="o">()</span> <span class="o">-</span> <span class="mi">1</span><span class="o">);</span>
<span class="k">if</span><span class="o">(</span><span class="k">new</span> <span class="n">File</span><span class="o">(</span><span class="n">volPath</span><span class="o">).</span><span class="na">isFile</span><span class="o">())</span> <span class="o">{</span>
<span class="n">Put</span> <span class="n">p</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Put</span><span class="o">(</span><span class="n">volId</span><span class="o">.</span><span class="na">getBytes</span><span class="o">());</span>
<span class="n">p</span><span class="o">.</span><span class="na">addColumn</span><span class="o">(</span><span class="n">Constants</span><span class="o">.</span><span class="na">COMPRESSED_VOLUME_CF</span><span class="o">.</span><span class="na">getBytes</span><span class="o">(),</span> <span class="n">Constants</span><span class="o">.</span><span class="na">ZIP</span><span class="o">.</span><span class="na">getBytes</span><span class="o">(),</span> <span class="n">readVolumeContent</span><span class="o">(</span><span class="n">volPath</span><span class="o">));</span>
<span class="n">ZipFile</span> <span class="n">zip</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ZipFile</span><span class="o">(</span><span class="n">volPath</span><span class="o">);</span>
<span class="n">Enumeration</span> <span class="n">entries</span> <span class="o">=</span> <span class="n">zip</span><span class="o">.</span><span class="na">entries</span><span class="o">();</span>
<span class="k">while</span> <span class="o">(</span><span class="n">entries</span><span class="o">.</span><span class="na">hasMoreElements</span><span class="o">())</span> <span class="o">{</span>
<span class="n">ZipEntry</span> <span class="n">entry</span> <span class="o">=</span> <span class="o">(</span><span class="n">ZipEntry</span><span class="o">)</span> <span class="n">entries</span><span class="o">.</span><span class="na">nextElement</span><span class="o">();</span>
<span class="k">if</span> <span class="o">(!</span><span class="n">entry</span><span class="o">.</span><span class="na">isDirectory</span><span class="o">())</span> <span class="o">{</span>
<span class="n">InputStream</span> <span class="n">in</span> <span class="o">=</span> <span class="n">zip</span><span class="o">.</span><span class="na">getInputStream</span><span class="o">(</span><span class="n">entry</span><span class="o">);</span>
<span class="n">ByteArrayOutputStream</span> <span class="n">bo</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ByteArrayOutputStream</span><span class="o">();</span>
<span class="kt">int</span> <span class="n">n</span><span class="o">;</span>
<span class="k">while</span> <span class="o">((</span><span class="n">n</span> <span class="o">=</span> <span class="n">in</span><span class="o">.</span><span class="na">read</span><span class="o">(</span><span class="n">buf</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="mi">1024</span><span class="o">))</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">1</span><span class="o">)</span> <span class="o">{</span>
<span class="n">bo</span><span class="o">.</span><span class="na">write</span><span class="o">(</span><span class="n">buf</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="n">n</span><span class="o">);</span>
<span class="o">}</span>
<span class="n">p</span><span class="o">.</span><span class="na">addColumn</span><span class="o">(</span><span class="n">Constants</span><span class="o">.</span><span class="na">PAGES_CF</span><span class="o">.</span><span class="na">getBytes</span><span class="o">(),</span> <span class="n">entry</span><span class="o">.</span><span class="na">getName</span><span class="o">().</span><span class="na">getBytes</span><span class="o">(),</span> <span class="n">bo</span><span class="o">.</span><span class="na">toByteArray</span><span class="o">());</span>
<span class="n">bo</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="n">in</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="n">context</span><span class="o">.</span><span class="na">write</span><span class="o">(</span><span class="k">new</span> <span class="n">ImmutableBytesWritable</span><span class="o">(</span><span class="n">volId</span><span class="o">.</span><span class="na">getBytes</span><span class="o">()),</span> <span class="n">p</span><span class="o">);</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">volPath</span> <span class="o">+</span> <span class="s">" is not a regular file. Skipping."</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">private</span> <span class="kt">byte</span><span class="o">[]</span> <span class="nf">readVolumeContent</span><span class="o">(</span><span class="n">String</span> <span class="n">volPath</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">IOException</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">Files</span><span class="o">.</span><span class="na">readAllBytes</span><span class="o">(</span><span class="n">Paths</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">volPath</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre>
</div>
<p>The MapReduce driver code is shown below.</p>
<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">org.apache.hadoop.conf.Configuration</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.fs.Path</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.TableName</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.client.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.io.ImmutableBytesWritable</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.mapreduce.Job</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.mapreduce.lib.input.FileInputFormat</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.hadoop.mapreduce.lib.input.NLineInputFormat</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.slf4j.Logger</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.slf4j.LoggerFactory</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">BulkLoadDriver</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Logger</span> <span class="n">log</span> <span class="o">=</span> <span class="n">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">BulkLoadDriver</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">submitBulkLoadJob</span><span class="o">(</span><span class="n">String</span> <span class="n">hbaseTable</span><span class="o">,</span> <span class="n">String</span> <span class="n">input</span><span class="o">,</span> <span class="n">String</span> <span class="n">output</span><span class="o">,</span> <span class="n">String</span> <span class="n">hadoopConfDir</span><span class="o">,</span>
<span class="n">String</span> <span class="n">hadoopUser</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">setProperty</span><span class="o">(</span><span class="s">"HADOOP_USER_NAME"</span><span class="o">,</span> <span class="n">hadoopUser</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">setProperty</span><span class="o">(</span><span class="s">"mapreduce.job.maps"</span><span class="o">,</span> <span class="s">"2"</span><span class="o">);</span>
<span class="n">Configuration</span> <span class="n">conf</span> <span class="o">=</span> <span class="n">Utils</span><span class="o">.</span><span class="na">createHBaseMapRedConfiguration</span><span class="o">(</span><span class="n">hadoopConfDir</span><span class="o">);</span>
<span class="n">conf</span><span class="o">.</span><span class="na">set</span><span class="o">(</span><span class="s">"hbase.fs.tmp.dir"</span><span class="o">,</span> <span class="s">"/tmp"</span><span class="o">);</span>
<span class="n">Job</span> <span class="n">job</span> <span class="o">=</span> <span class="n">Job</span><span class="o">.</span><span class="na">getInstance</span><span class="o">(</span><span class="n">conf</span><span class="o">,</span> <span class="s">"HBase Bulk Importer for HTRC"</span><span class="o">);</span>
<span class="n">job</span><span class="o">.</span><span class="na">setJarByClass</span><span class="o">(</span><span class="n">BulkLoadMapper</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="n">job</span><span class="o">.</span><span class="na">setMapperClass</span><span class="o">(</span><span class="n">BulkLoadMapper</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="n">job</span><span class="o">.</span><span class="na">setMapOutputKeyClass</span><span class="o">(</span><span class="n">ImmutableBytesWritable</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="n">job</span><span class="o">.</span><span class="na">setMapOutputValueClass</span><span class="o">(</span><span class="n">Put</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="n">job</span><span class="o">.</span><span class="na">setInputFormatClass</span><span class="o">(</span><span class="n">NLineInputFormat</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="n">job</span><span class="o">.</span><span class="na">getConfiguration</span><span class="o">().</span><span class="na">setInt</span><span class="o">(</span><span class="s">"mapreduce.input.lineinputformat.linespermap"</span><span class="o">,</span> <span class="mi">100</span><span class="o">);</span>
<span class="n">Path</span> <span class="n">tmpPath</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Path</span><span class="o">(</span><span class="n">output</span><span class="o">);</span>
<span class="n">Connection</span> <span class="n">hbCon</span> <span class="o">=</span> <span class="n">ConnectionFactory</span><span class="o">.</span><span class="na">createConnection</span><span class="o">(</span><span class="n">conf</span><span class="o">);</span>
<span class="n">Table</span> <span class="n">hTable</span> <span class="o">=</span> <span class="n">hbCon</span><span class="o">.</span><span class="na">getTable</span><span class="o">(</span><span class="n">TableName</span><span class="o">.</span><span class="na">valueOf</span><span class="o">(</span><span class="n">hbaseTable</span><span class="o">));</span>
<span class="n">RegionLocator</span> <span class="n">regionLocator</span> <span class="o">=</span> <span class="n">hbCon</span><span class="o">.</span><span class="na">getRegionLocator</span><span class="o">(</span><span class="n">TableName</span><span class="o">.</span><span class="na">valueOf</span><span class="o">(</span><span class="n">hbaseTable</span><span class="o">));</span>
<span class="n">Admin</span> <span class="n">admin</span> <span class="o">=</span> <span class="n">hbCon</span><span class="o">.</span><span class="na">getAdmin</span><span class="o">();</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">HFileOutputFormat2</span><span class="o">.</span><span class="na">configureIncrementalLoad</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="n">hTable</span><span class="o">,</span> <span class="n">regionLocator</span><span class="o">);</span>
<span class="n">FileInputFormat</span><span class="o">.</span><span class="na">addInputPath</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="k">new</span> <span class="n">Path</span><span class="o">(</span><span class="n">input</span><span class="o">));</span>
<span class="n">HFileOutputFormat2</span><span class="o">.</span><span class="na">setOutputPath</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="n">tmpPath</span><span class="o">);</span>
<span class="n">job</span><span class="o">.</span><span class="na">waitForCompletion</span><span class="o">(</span><span class="kc">true</span><span class="o">);</span>
<span class="o">}</span> <span class="k">finally</span> <span class="o">{</span>
<span class="n">hTable</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="n">regionLocator</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="n">admin</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="n">hbCon</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre>
</div>
<p>Most of the sample code on the internet, including <a href="https://sreejithrpillai.wordpress.com/2015/01/08/bulkloading-data-into-hbase-table-using-mapreduce/">this</a>, calls <code class="highlighter-rouge">LoadIncrementalHFiles.doBulkLoad</code> after the original MapReduce job to load the generated HFiles into the HBase table. Somehow that didn't work with HDP 2.5, so I used the command line version of the same tool, via the <code class="highlighter-rouge">hbase</code> command, to load the HFiles into HBase as follows:</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code>hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles &lt;bulk_load_mapreduce_output_path&gt; &lt;table_name&gt;
</code></pre>
</div>
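<p>For completeness, the programmatic route those samples take looks roughly like the sketch below. Keep in mind this is the variant that did not work for me on HDP 2.5, so treat it as illustrative only; the argument names are placeholders.</p>

<div class="language-java highlighter-rouge"><pre class="highlight"><code>import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
    public static void main(String[] args) throws Exception {
        String hfileOutputPath = args[0];   // MapReduce output directory containing the HFiles
        String tableName = args[1];
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf(tableName));
             RegionLocator regionLocator = connection.getRegionLocator(TableName.valueOf(tableName));
             Admin admin = connection.getAdmin()) {
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // Moves the generated HFiles into the regions of the target table.
            loader.doBulkLoad(new Path(hfileOutputPath), admin, table, regionLocator);
        }
    }
}
</code></pre>
</div>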
<h3 id="references">References:</h3>
<ul>
<li><a href="http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/">How-to: Use HBase Bulk Loading, and Why</a></li>
<li><a href="https://sreejithrpillai.wordpress.com/2015/01/08/bulkloading-data-into-hbase-table-using-mapreduce/">Bulkloading Data Into HBase Table Using MapReduce</a></li>
<li><a href="http://blogs.perficient.com/delivery/blog/2015/09/09/some-ways-load-data-from-hdfs-to-hbase/">3 Ways to Load Data From HDFS to HBase</a></li>
<li><a href="https://github.com/jrkinley/hbase-bulk-import-example">hbase-bulk-import-example</a></li>
<li><a href="https://github.com/Paschalis/HBase-Bulk-Load-Example">HBase-Bulk-Load-Example</a></li>
</ul>
Sun, 11 Dec 2016 00:00:00 +0000
http://milinda.pathirage.org/2016/12/11/hbase-bulk-load.html

Hadoop HDFS on EC2 Tips
<p>I deployed a YARN cluster with HDFS on EC2 using a <a href="https://github.com/milinda/yarn-ec2">forked version</a> of <a href="https://github.com/tqchen/yarn-ec2">yarn-ec2</a> and submitted a YARN application. But the application submission failed with the following error:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>01:20:05.125 [main] INFO o.p.fdbench.yarn.YarnClientWrapper - Creating YARN application....
01:20:05.380 [main] INFO o.p.fdbench.yarn.YarnClientWrapper - YARN application was created with id application_1475642094699_0004
01:20:05.380 [main] INFO o.p.fdbench.yarn.YarnClientWrapper - Preparing to request resources for app id application_1475642094699_0004
01:20:05.691 [main] INFO o.p.fdbench.yarn.YarnClientWrapper - Copying file /Users/mpathira/PhD/Code/FDBench/fdbench-yarn/build/distributions/fdbench-yarn-0.1-SNAPSHOT-dist.tgz to hdfs://ec2-52-26-132-68.us-west-2.compute.amazonaws.com:9000/user/ubuntu/fdbench-producer-throughput-28/application_1475642094699_0004/package.tgz
16/10/05 01:21:06 INFO hdfs.DFSClient: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/172.31.12.45:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1508)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1284)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1237)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
16/10/05 01:21:06 INFO hdfs.DFSClient: Abandoning BP-822330611-172.31.12.45-1475643620638:blk_1073741826_1002
16/10/05 01:21:06 INFO hdfs.DFSClient: Excluding datanode DatanodeInfoWithStorage[172.31.12.45:50010,DS-6d3ed80c-8826-4510-a0c6-e17d914a45a5,DISK]
....
....
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/ubuntu/fdbench-producer-throughput-28/application_1475642094699_0004/package.tgz could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
....
</code></pre>
</div>
<p>It turns out that this happens mainly because of the use of EC2 private DNS names to identify data nodes. The host names identifying data nodes are discovered automatically during startup, and we always get the private DNS name when this happens inside EC2 VMs. It looks like Hadoop is not going to fix this (<a href="https://issues.apache.org/jira/browse/HADOOP-2776">HADOOP-2776</a>). The solution is to add the following to your hdfs-site.xml:</p>
<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;property&gt;</span>
<span class="nt">&lt;name&gt;</span>dfs.client.use.datanode.hostname<span class="nt">&lt;/name&gt;</span>
<span class="nt">&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;</span>
<span class="nt">&lt;/property&gt;</span>
</code></pre>
</div>
<p>and add an entry to your <code class="highlighter-rouge">/etc/hosts</code> file to resolve the data node's private DNS name to its public IP, like below.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>52.26.132.68 ip-172-31-12-45.us-west-2.compute.internal
</code></pre>
</div>
<p>The above makes sure the data nodes on EC2 are accessible from outside by telling the HDFS client to use hostnames instead of IPs and by resolving those hostnames to publicly accessible IPs. The only drawback of this approach is the need to add a <code class="highlighter-rouge">hosts</code> file entry for each data node you have.</p>
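<p>As a side note, if editing <code class="highlighter-rouge">hdfs-site.xml</code> on the client machine is not convenient, the same flag can be set programmatically on the client's Hadoop configuration. A minimal sketch, assuming a placeholder namenode address and file path:</p>

<div class="language-java highlighter-rouge"><pre class="highlight"><code>import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientFromOutsideEc2 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; replace with your namenode's public DNS name.
        conf.set("fs.defaultFS", "hdfs://ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:9000");
        // Tell the HDFS client to connect to datanodes by hostname instead of (private) IP.
        conf.setBoolean("dfs.client.use.datanode.hostname", true);

        try (FileSystem fs = FileSystem.get(conf);
             OutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
            out.write("hello from outside EC2".getBytes("UTF-8"));
        }
    }
}
</code></pre>
</div>

<p>You still need the <code class="highlighter-rouge">/etc/hosts</code> entries described above so that the private DNS names resolve to publicly reachable IPs.</p>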
Wed, 05 Oct 2016 00:00:00 +0000
http://milinda.pathirage.org/tech/2016/10/05/hadoop-ec2-datanode-exclusion.html
hadoop, hdfs, tech

Latex Tips and Tricks
<h2 id="horizontal-rule-abovebelow-figure-caption">Horizontal Rule Above/Below Figure Caption</h2>
<p>Source: <a href="http://tex.stackexchange.com/questions/14968/horizontal-line-below-figure-caption">tex.stackexchange.com</a></p>
<p>You can use the <a href="http://www.ctan.org/pkg/caption">caption</a> package if you want to draw a horizontal rule above or below a figure caption. <code class="highlighter-rouge">\DeclareCaptionFormat</code> can be used to create a custom caption format with the horizontal rule, and <code class="highlighter-rouge">\DeclareCaptionLabelFormat</code> can be used to customize the caption label format. Finally, <code class="highlighter-rouge">\captionsetup</code> tells LaTeX to use the newly defined caption format and label format.</p>
<div class="language-tex highlighter-rouge"><pre class="highlight"><code><span class="k">\usepackage</span><span class="p">{</span>caption<span class="p">}</span>
<span class="k">\DeclareCaptionFormat</span><span class="p">{</span>myformat<span class="p">}{</span><span class="k">\hrulefill\\</span>#1#2#3<span class="p">}</span>
<span class="k">\DeclareCaptionLabelFormat</span><span class="p">{</span>bf-parens<span class="p">}{</span><span class="k">\textbf</span><span class="p">{</span>#1~#2<span class="p">}}</span>
<span class="k">\captionsetup</span><span class="na">[figure]</span><span class="p">{</span>labelformat=bf-parens,format=myformat<span class="p">}</span>
</code></pre>
</div>
<p>If you want the horizontal rule below the caption, all you have to do is move the <code class="highlighter-rouge">\hrulefill</code> to the end and remove the <code class="highlighter-rouge">\\</code>, like <code class="highlighter-rouge">{#1#2#3\hrulefill}</code>. In the caption format, <strong>#1</strong> gets replaced with the label, <strong>#2</strong> gets replaced with the separator and <strong>#3</strong> gets replaced with the caption text. In the caption label format, <strong>#1</strong> gets replaced with the name (e.g. Figure) and <strong>#2</strong> gets replaced with the reference number.</p>
<h2 id="text-with-background-color">Text With Background Color</h2>
<p>The <a href="https://www.ctan.org/pkg/framed?lang=en">framed</a> package can be used to create a colored box around text.</p>
<div class="language-tex highlighter-rouge"><pre class="highlight"><code><span class="c">% In preamble</span>
<span class="k">\usepackage</span><span class="p">{</span>framed<span class="p">}</span>
<span class="k">\definecolor</span><span class="p">{</span>shadecolor<span class="p">}{</span>RGB<span class="p">}{</span>216,229,229<span class="p">}</span>
<span class="c">% In document</span>
<span class="nt">\begin{snugshade*}</span>
Text goes here.
<span class="nt">\end{snugshade*}</span>
</code></pre>
</div>
<h2 id="figure-word-wrapping">Figure Word Wrapping</h2>
<p>The <a href="https://www.ctan.org/pkg/wrapfig?lang=en">wrapfig</a> package can be used to have text wrap around figures.</p>
<div class="language-tex highlighter-rouge"><pre class="highlight"><code><span class="nt">\begin{wrapfigure}</span><span class="p">{</span>r<span class="p">}{</span>0.5<span class="k">\textwidth</span><span class="p">}</span>
<span class="nt">\begin{center}</span>
<span class="k">\includegraphics</span><span class="na">[keepaspectratio=true,width=0.4\textwidth]</span><span class="p">{</span>fig-file<span class="p">}</span>
<span class="nt">\end{center}</span>
<span class="k">\caption</span><span class="p">{</span>Caption Goes Here<span class="p">}</span>
<span class="k">\label</span><span class="p">{</span>fig:1<span class="p">}</span>
<span class="nt">\end{wrapfigure}</span>
</code></pre>
</div>
Sun, 17 Jul 2016 00:00:00 +0000
http://milinda.pathirage.org/tech/2016/07/17/latex-tips-n-tricks.html
latex, tech

Deploying Kafka on EC2 with Ansible
<p>Deploying Kafka is easy compared to the effort required to deploy a complete Hadoop system. Also, there are multiple Ansible and Vagrant based deployment scripts available for Kafka:</p>
<ul>
<li><a href="http://whatizee.blogspot.com/2015/05/ansible-playbook-setup-kafka-cluster.html">Ansible Playbook - Setup Kafka Cluster</a></li>
<li><a href="https://github.com/hpcloud-mon/ansible-kafka">https://github.com/hpcloud-mon/ansible-kafka</a></li>
<li><a href="https://github.com/rackerlabs/ansible-kafka">https://github.com/rackerlabs/ansible-kafka</a></li>
<li><a href="https://github.com/lloydmeta/ansible-kafka-cluster">https://github.com/lloydmeta/ansible-kafka-cluster</a></li>
</ul>
<p>None of the above solutions come with EC2 support, so creating EC2 VMs and then deploying Kafka on them using Ansible requires manual intervention. Alternatively, it's possible to automate this with a separate script that creates the VMs and then generates an Ansible inventory file that can be used with one of the above solutions.</p>
<p>But Ansible comes with a nice EC2 module that can be used to create EC2 VMs directly within an Ansible playbook and make those VMs available to the rest of the playbook. This Ansible <a href="https://github.com/milinda/KafkaOnEC2/blob/master/kafka.yml">playbook</a> from <a href="https://github.com/milinda/KafkaOnEC2">https://github.com/milinda/KafkaOnEC2</a> uses the Ansible EC2 module to create VMs and then deploy Zookeeper and Kafka onto those VMs. Most of the time I use EC2 spot instances, and I have written this <a href="https://github.com/milinda/KafkaOnEC2/blob/master/kafka.yml">playbook</a> to use spot instances. But you can customise it to use regular EC2 instances by removing the <code class="highlighter-rouge">spot_price</code> and <code class="highlighter-rouge">spot_wait_timeout</code> configurations from the Ansible EC2 task in <a href="https://github.com/milinda/KafkaOnEC2/blob/master/kafka.yml"><code class="highlighter-rouge">kafka.yml</code></a>.</p>
<p>You can use <a href="https://github.com/milinda/KafkaOnEC2/blob/master/group_vars/all"><code class="highlighter-rouge">group_vars/all</code></a> to customise the Kafka cluster size, Zookeeper cluster size, instance types, EC2 region and spot instance pricing limits.</p>
<div class="language-yml highlighter-rouge"><pre class="highlight"><code><span class="s">ec2</span><span class="pi">:</span>
<span class="s">key</span><span class="pi">:</span> <span class="s">2016july-ec2-keypair</span>
<span class="s">zookeeper_instance_type</span><span class="pi">:</span> <span class="s">m3.large</span>
<span class="s">kafka_instance_type</span><span class="pi">:</span> <span class="s">r3.large</span>
<span class="s">image</span><span class="pi">:</span> <span class="s">ami-9abea4fb</span>
<span class="s">region</span><span class="pi">:</span> <span class="s">us-west-2</span>
<span class="s">kafka_security_group</span><span class="pi">:</span> <span class="s">kafka</span>
<span class="s">zookeeper_security_group</span><span class="pi">:</span> <span class="s">zookeeper</span>
<span class="s">kafka_instance_count</span><span class="pi">:</span> <span class="s">1</span>
<span class="s">zookeeper_instance_count</span><span class="pi">:</span> <span class="s">1</span>
<span class="s">zk_spot_price</span><span class="pi">:</span> <span class="s">0.2</span>
<span class="s">kafka_spot_price</span><span class="pi">:</span> <span class="s">0.8</span>
</code></pre>
</div>
<p>Before using this playbook you have to make sure of the following:</p>
<ul>
<li>You have created an EC2 key pair (I have used <em>2016july-ec2-keypair</em>) in the AWS region you are going to use, and you have access to the private key *.pem file.</li>
<li>You have security groups created for the Kafka and Zookeeper deployments. I am using <em>kafka</em> and <em>zookeeper</em> security groups with all ports open to the public. For production deployments, you may need to use a secure configuration where only a particular set of ports is open to the public or to your network, based on your requirements.</li>
<li>You have to monitor spot instance pricing in the selected AWS region and decide on proper values for <code class="highlighter-rouge">zk_spot_price</code> and <code class="highlighter-rouge">kafka_spot_price</code>.</li>
</ul>
<h2 id="how-to-use-the-kafka-playbook">How to use the Kafka playbook</h2>
<p>You can use the following command to deploy a Kafka cluster on EC2 once you are done with the configuration.</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code><span class="gp">$ </span>ansible-playbook --private-key<span class="o">=</span>&lt;AWS_key_file&gt; -u ubuntu kafka.yml
</code></pre>
</div>
<p>If you are getting an error saying the <code class="highlighter-rouge">boto</code> Python module is missing (I got this error on Mac OS X El Capitan), please use the provided <a href="https://github.com/milinda/KafkaOnEC2/blob/master/inventory"><code class="highlighter-rouge">inventory</code></a> file with the proper Python interpreter location, as below.</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code><span class="gp">$ </span>ansible-playbook --private-key<span class="o">=</span>&lt;AWS_key_file&gt; -u ubuntu -i inventory kafka.yml
</code></pre>
</div>
<p><strong>Please note that I haven’t done extensive testing, and there may be bugs or unsupported scenarios. So please feel free to fork, modify/fix and send pull requests.</strong></p>
Wed, 06 Jul 2016 00:00:00 +0000
http://milinda.pathirage.org/tech/2016/07/06/kafka-on-ec2.html
kafka, cloud, ec2, ansible, tech

Performance Evaluation
<p>I started reading <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.7619&amp;rep=rep1&amp;type=pdf">Performance Evaluation for Parallel Systems: A Survey</a> to understand the basics of performance evaluation of parallel and distributed systems. Below are some important points from the paper's introduction that can be valuable to anyone who works on performance evaluation.</p>
<h3 id="objective-of-performance-evaluation">Objective of performance evaluation</h3>
<blockquote>
<p>Selecting a proper architecture for an application is problem oriented. One architecture that suits one kind of problems may not at all suit another. After a particular architecture is chosen, the following questions may be asked. How will the system perform? What criteria should be used to evaluate the performance? What techniques can/should we use to get the performance values? To answer these questions is the objective of performance evaluation.</p>
</blockquote>
<h3 id="definition">Definition</h3>
<blockquote>
<p>Performance evaluation can be defined as assigning quantitative values to the indices of the performance of the system under study.</p>
</blockquote>
<h3 id="what-is-performance">What is performance</h3>
<blockquote>
<p>To answer this question is not easy, for performance involves many aspects.</p>
</blockquote>
<blockquote>
<p>some factors that must be considered: <em>functionality, reliability, speed, and economicity</em>.</p>
</blockquote>
<p>Before considering <em>reliability</em>, <em>speed</em> or <em>economicity</em>, the system should be functional, and it should do what its designer wants it to do. Then the system must be <em>reliable</em>. It is hard to achieve 100% reliability, so the probability of errors should be studied. Once we have a functional and reliable system, we can consider evaluating it for speed and economicity (efficiency).</p>
<h3 id="performance-evaluation">Performance evaluation</h3>
<blockquote>
<p>To evaluate the performance of a system or to compare two or more systems, one must first choose some criteria. These criteria are called metrics.</p>
</blockquote>
<blockquote>
<p>To know metrics, their relationships and their effects on performance parameters is the first step in performance studies.</p>
</blockquote>
<blockquote>
<p>selecting proper workload is almost equally important.</p>
</blockquote>
<blockquote>
<p>After choosing proper metrics and workload, one must consider what technique or techniques should be used.</p>
</blockquote>
<blockquote>
<p>there are three techniques commonly used in performance evaluation. They are <em>measurement, simulation and analytical modelling</em>.</p>
</blockquote>
<blockquote>
<p>at the early design stage when the system has not yet been constructed, measurement is obviously impossible, instead, a simple <em>analytical model</em> is practical. As the design process goes on, more and more details about the system are obtained. At this stage, <em>simulation</em> or more <em>sophisticated analytical modelling</em> techniques could be used. Finally, when the system design has been completed and a real system constructed, <em>measuring</em> becomes possible.</p>
</blockquote>
<p>The <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.7619&amp;rep=rep1&amp;type=pdf">paper</a> is quite old, but I think it discusses fundamentals everyone should know about evaluating the performance of parallel systems.</p>
Fri, 17 Jun 2016 00:00:00 +0000
http://milinda.pathirage.org/tech/2016/06/17/performance-evaluation.html
perf-eval, perf-models, tech

Discussion about Constraint Programming Bin Packing Models [Paper Summary]
<p>This post summarizes the following paper, which I found while looking for pub-sub system performance modeling resources.</p>
<p><em>Jean-Charles Régin and Mohamed Rezgui. 2011.</em> <a href="http://dl.acm.org/citation.cfm?id=2908672">Discussion about Constraint Programming Bin Packing Models</a></p>
<h2 id="bin-packing-problem">Bin Packing Problem</h2>
<blockquote>
<p>In the bin packing problem, objects of different volumes must be packed into a finite number of bins or containers each of volume V in a way that minimizes the number of bins used.
<a href="https://en.wikipedia.org/wiki/Bin_packing_problem"><em>Wikipedia</em></a></p>
</blockquote>
<p>This <a href="http://dl.acm.org/citation.cfm?id=2908672">paper</a> talks about the bin packing problem in the context of <a href="https://en.wikipedia.org/wiki/Cloud_computing">cloud computing</a>. In the cloud context, bin packing is mainly about assigning virtual machines to bare-metal servers under a variety of constraints.</p>
<p>In cloud computing it is not always about minimizing the number of bare-metal servers used:</p>
<blockquote>
<p>However, the problem is more general than the classical bin packing problem: minimizing the number of used servers costs less energy but several side constraints must also be considered (e.g., agility, reliability, sustainability).</p>
</blockquote>
<h2 id="constraint-programming">Constraint Programming</h2>
<p>A robust method is required in scenarios such as clouds where there are external constraints that cannot be defined in advance.</p>
<blockquote>
<p>Constraint Programming is such a method having the capabilities to be easily adapted for considering new constraints.</p>
</blockquote>
<blockquote>
<p>In order to obtain a more robust and more general method capable to deal with large scale problems, we need first to clearly identify the advantages and the drawbacks of the current constraint programming models. Mainly, we need to identify what parts of the model are really important and what other parts are secondary. Then, we would like to study the scalability of the current models and identify the current limits.</p>
</blockquote>
<h2 id="basic-model">Basic Model</h2>
<ul>
<li>\( \mathbf{I}\) - The set of items</li>
<li>\( \mathbf{B}\) - The set of bins</li>
<li>\(\mathbf{ci}_{i} \) - The capacity associated with item \( \mathbf{i} \)</li>
<li>\(\mathbf{cb}_{j} \) - The capacity associated with bin \( \mathbf{j} \)</li>
<li>\( \mathbf{x}_{ij} \) - The membership variable. \( \mathbf{x}_{ij} \) is equal to 1 if item \( \mathbf{i} \) is assigned to bin \( \mathbf{j} \) and 0 otherwise.</li>
<li>\( \mathbf{y}_{j} \) - Is 1 if bin \( \mathbf{j} \) has been used</li>
</ul>
<p>Then the minimization problem becomes:</p>
<p>\[ min \displaystyle\sum_{j=1}^{m} y_j \]</p>
<p>With the following constraints:</p>
<p>\[ \forall j \displaystyle\sum_{i=1}^{n} ci_ix_{ij} \le cb_j y_j \tag 1 \]
\[ \forall i \displaystyle\sum_{j=1}^{m} x_{ij} = 1 \tag 2 \]</p>
<blockquote>
<p>Constraint (1) ensures that the capacity of a bin is not exceeded by the sum of the capacities of the items put in that bin. Constraint (2) states that an item is put in exactly one bin.</p>
</blockquote>
<blockquote>
<p>Some slack variables may be introduced in order to add some implicit constraints. Constraint (1) can be refined as below, where \( s_j \) is a variable expressing the capacity of bin j that has not been used.</p>
</blockquote>
<p>\[ \forall j \displaystyle\sum_{i=1}^{n} ci_ix_{ij} + s_j = cb_j y_j \]</p>
<p>Then we can introduce the following constraint on the whole system:</p>
<p>\[ \displaystyle\sum_{i=1}^{n} ci_i + \displaystyle\sum_{j=1}^{m} s_j = \displaystyle\sum_{j=1}^{m} cb_j \]</p>
<h2 id="solutions">Solutions</h2>
<p>We can use two approaches to find the minimum number of bins:</p>
<ol>
<li>Start with a large number of bins and try to decrease the number of bins until there is no viable solution.</li>
<li>Compute a lower bound and increase the number of bins until we find a solution.</li>
</ol>
<p>Then there are different ways of packing items into bins (a small sketch of the first heuristic follows the list):</p>
<ul>
<li>first fit decreasing - the first bin in which an item can be put is selected,</li>
<li>best fit decreasing - select a bin such that the remaining capacity after the item is packed is minimal,</li>
<li>worst fit decreasing - select a bin such that the remaining capacity after packing is the largest.</li>
</ul>
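<p>As a rough illustration of the first heuristic, here is a plain greedy first fit decreasing sketch. This is not the constraint programming model from the paper; the item capacities and bin capacity are arbitrary example values, and it assumes every item fits in an empty bin.</p>

<div class="language-java highlighter-rouge"><pre class="highlight"><code>import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FirstFitDecreasing {
    // Packs items into bins of the given capacity using first fit decreasing
    // and returns the remaining capacity of every bin that was opened.
    static List&lt;Integer&gt; pack(int[] items, int binCapacity) {
        int[] sorted = items.clone();
        Arrays.sort(sorted);                              // ascending ...
        List&lt;Integer&gt; remaining = new ArrayList&lt;&gt;();
        for (int i = sorted.length - 1; i &gt;= 0; i--) {    // ... traversed in decreasing order
            int item = sorted[i];
            boolean placed = false;
            for (int b = 0; b &lt; remaining.size(); b++) {
                if (remaining.get(b) &gt;= item) {           // first bin with enough room
                    remaining.set(b, remaining.get(b) - item);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                remaining.add(binCapacity - item);        // open a new bin
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        int[] itemCapacities = {5, 7, 5, 2, 4, 2, 5};     // example ci values
        System.out.println("Bins used: " + pack(itemCapacities, 10).size());
    }
}
</code></pre>
</div>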
<p>We may also need to perform symmetry breaking; for example, items may have the same capacity and be interchangeable. But in real life, additional constraints such as data locality limit the symmetry breaking strategies.</p>
<p>Capacitated variables are used to reduce the memory complexity by grouping items that have the same capacity.</p>
<p>The paper describes two concepts called <em>sum constraint</em> and <em>subset sum variables</em>, which I don't really understand.</p>
<p>And in the real world we will have more constraints than the above; for example, in a cloud environment we may have constraints such that some items (VMs) can only be assigned to some bins (bare-metal servers). For example, VMs with GPUs can only be assigned to servers with GPUs.</p>
<p>I think constraint-based bin packing is useful in many other situations as well, for example the Kafka partition balancing discussed <a href="http://blog.typeobject.com/a-constraint-based-approach-to-kafka-partition-balancing">here</a>.</p>
Wed, 15 Jun 2016 00:00:00 +0000
http://milinda.pathirage.org/tech/2016/06/15/cp-bin-packing.html
perf-models, tech

Beauty of Apache Samza
<p>It's not easy to deploy and manage stream processing applications written in Samza. Deploying and managing Storm topologies (stream processing applications written in Storm are called topologies) is easy compared to Samza. Samza doesn't provide something like a Storm topology out of the box; you have to build a topology by connecting multiple Samza jobs, writing output to Kafka topics and reading them downstream. In my opinion, we could add all of this to Samza by writing another layer on top of base Samza. But Samza is still attractive as a stream processing framework due to:</p>
<h2 id="samza-job-is-a-yarn-application">Samza job is a YARN application</h2>
<p>A Samza job is a YARN application, which takes care of resource allocation and fault-tolerance. Once you deploy a job, its life-cycle is entirely handled by YARN and the Samza application master specific to that job, so you don't need to run and manage master or daemon processes. If you already have a YARN and Kafka cluster running in your environment, you can quickly get started with Samza. This approach allows Samza to utilize any improvements that come to YARN, including resource isolation and security. A Samza job being a YARN application also reduces the number of systems you need to maintain: you only need to manage YARN and Kafka.</p>
<p>Each Samza application being a separate YARN job has some downsides too. If you need a Storm-like central dashboard for your streaming applications, you will have to build your own. But Samza provides an extensible metrics system and a per-job web app if you need to create a custom monitoring and management tool. Also, if you are not familiar with YARN, it may take some time to get used to Samza. But Samza comes with an excellent <a href="https://github.com/apache/samza-hello-samza">example</a> to help you get started.</p>
<h2 id="everything-is-a-stream">Everything is a stream</h2>
<p>Samza tries to use streams as much as possible to implement everything from metrics and fault-tolerance to stream-to-relation joins. Samza encourages you to write your metrics to a Kafka topic in production and provides the necessary tools for that. Samza implements checkpointing-based fault-tolerance, where Samza checkpoints to a Kafka topic (or, basically, to a stream). Then Samza has a concept called a bootstrap stream, where a stream is used to bootstrap a job. This stream may be a database change-log stream, and during the bootstrapping process you can load existing data from a table into task-local storage before starting to join incoming messages from the actual stream against the local storage.</p>
<p>I like the use of streams as described above and hope to utilize bootstrap streams to implement stream-to-relation joins in SamzaSQL.</p>
<h2 id="natively-durable-and-fault-tolerant">Natively durable and fault-tolerant</h2>
<p>When used with Kafka, Samza relies on Kafka to guarantee message delivery order (<a href="http://samza.apache.org">messages are processed in the order they were written to a partition</a>). This feature is crucial when processing time-based window aggregations and joins, and SamzaSQL uses this guarantee to implement window operators.</p>
<p>Samza takes care of snapshotting and restoration of local state; during task failures Samza restores the local state of the task to a known snapshot. Samza utilizes Kafka streams for snapshotting.</p>
<h2 id="good-integration-with-kafka">Good integration with Kafka</h2>
<p>Even though we can theoretically plug any messaging system into Samza, IMO the current implementation's design was heavily inspired by Kafka, and personally I think of Samza as a computational layer for Kafka. Utilizing the consumer-managed offset concept, Samza allows reading Kafka topics either from the beginning or from the first message that arrives after the job is started. The configurable initial offset enables replaying streams for processing historical data. It can also be used to deploy a new version of a job that starts from the beginning of the stream; when that job catches up, we can stop the old version. Checkpointing of tasks and local state also utilizes Kafka.</p>
<p>As with any software tool, there are some downsides as well:</p>
<ul>
<li>It's not easy to get started with Samza, at least when compared with the famous Storm.</li>
<li>Each streaming app is a separate YARN application/job, and Samza lacks inbuilt support for centrally monitoring and managing them.</li>
<li>Lack of support for describing a DAG of stream processing jobs is another issue. It's not impossible to build something, but it requires a considerable amount of effort.</li>
<li>Also, I would personally prefer a Java API or something similar to deploy Samza jobs easily.</li>
<li>Samza doesn't have spouts like Storm does. Samza users need to implement a custom stream if they want to generate a set of test data into a Kafka topic. I think there should be a special type of task that can act as a data source instead of asking users to implement a system.</li>
</ul>
<p>IMHO, the above are not deal breakers, and Samza is still a nice platform for developing stream processing applications.</p>
<p><strong>This post was originally posted in <a href="https://medium.com/@mpathirage/beauty-of-apache-samza-cfb88fb05982#.e3mjzxw8q">Medium</a>.</strong></p>
Wed, 09 Dec 2015 00:00:00 +0000
http://milinda.pathirage.org/tech/2015/12/09/samza.html
streaming, streaming-sql, fast-data, tech

CQL
<p><strong>This post was originally published in <a href="https://milinda.svbtle.com/cql-continuous-query-language">milinda.svbtle.com</a>.</strong></p>
<p>In today's data-driven economy, organizations depend heavily on data analytics to stay competitive. Advances in Big Data related technologies have transformed how organizations interact with data, and as a result more and more data is generated at ever increasing rates. Most of this data is available as continuous streams, and organizations utilize stream processing technologies to extract insights in real time (as data arrives). As a result of this change in how we collect and process data, stream processing platforms like Apache Storm, Spark Streaming and Apache Samza were created, building on about a decade of experience with Big Data processing technologies such as Hadoop.</p>
<p>But these modern platforms lack support for SQL-like declarative query languages and require sound knowledge of imperative-style programming and distributed systems to use effectively. For broader adoption, support for SQL-like continuous query languages, or SQL with streaming extensions, is required. In this post I'm going to discuss one such language, known as <a href="http://dl.acm.org/citation.cfm?id=1146463">CQL</a>, invented roughly 10 years ago for querying data streams. The theoretical framework and SQL extensions discussed in the CQL paper are still important, and we are using concepts from CQL as the foundation for <a href="https://issues.apache.org/jira/browse/SAMZA-390">Apache Samza's Streaming SQL</a> implementation.</p>
<h2 id="what-is-cql">What is CQL?</h2>
<p>CQL is not SQL, but a SQL-based declarative language for querying streaming and stored relations (a.k.a. database tables). The abstract semantics of CQL rely on three types of operations — <em>stream-to-relation</em>, <em>relation-to-relation</em> and <em>relation-to-stream</em> — on two types of data — <strong>streams</strong> and <strong>relations</strong>.</p>
<h3 id="streams-and-relations">Streams and relations</h3>
<ul>
<li><strong>Stream</strong> - a (possibly infinite) bag of elements (s, t), where s is a tuple and t is the timestamp of the element</li>
<li><strong>Relation</strong> - a mapping from time instants to finite but unbounded bags of tuples. This is different from the general definition of a relation, where there is no notion of time; relations in the context of CQL are known as instantaneous relations, which vary with time.</li>
</ul>
<h3 id="operators">Operators</h3>
<ul>
<li><strong>Stream-to-relation</strong> — produces an (instantaneous) relation from a stream. The window operator (there are different types, such as sliding and tumbling) is the only stream-to-relation operator available in CQL.</li>
<li><strong>Relation-to-relation</strong> — produces a relation from one or more relations. Selection, projection and aggregation operators in CQL are relation-to-relation operators.</li>
<li><strong>Relation-to-stream</strong> — produces a stream from a relation. The difference between the previous and current instantaneous relation is used to convert a relation to a stream.</li>
</ul>
<p>Stream-to-stream operators are absent; they must be constructed by combining the three types of operators defined above. The figure below, from the CQL paper, is the best visualization of the abstract semantics defined in CQL.</p>
<p><img src="/img/cql/cql-operators.png" alt="CQL Operators" /></p>
<h2 id="why-cql-is-interesting">Why CQL is interesting?</h2>
<p>Operators like join and some aggregation operators available in SQL are blocking and impossible to evaluate over streams. So a window operator, which divides the stream into possibly overlapping finite subsets, is applied after the stream scan to reduce the scope of the query to a window extent.</p>
<p>In CQL, the concept of a window is embedded into the semantics through instantaneous relations, and this allows query execution engines to implement operators such as joins and aggregations as if they were operating on ordinary relations. In addition, CQL allows stored relations to be integrated into streaming queries without any magic, because once a stream is converted to an instantaneous relation, we are basically working with relations.</p>
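<p>For example, a windowed stream can be joined directly with a stored relation. The sketch below (my own example, not from the paper) assumes the same PosSpeedStr stream and a hypothetical stored table VehicleOwners(vehicleId, owner); the [Now] window turns the stream into an instantaneous relation before the ordinary relational join and filter are applied.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Join the latest stream tuples with a stored relation, then stream the result out.
SELECT Rstream(PosSpeedStr.vehicleId, VehicleOwners.owner, PosSpeedStr.speed)
FROM PosSpeedStr [Now], VehicleOwners
WHERE PosSpeedStr.vehicleId = VehicleOwners.vehicleId
  AND PosSpeedStr.speed &gt; 65</code></pre></figure>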
<p>In addition to the semantic features mentioned above, the query execution strategy described in the CQL paper is also interesting.</p>
<h2 id="cql-query-execution">CQL query execution</h2>
<h3 id="streams-and-insertdelete-streams">Streams and Insert/Delete Streams</h3>
<p>In the CQL runtime, a stream is represented as a sequence of timestamped insert tuples, and a time-varying relation (a bag of tuples) is represented as a sequence of timestamped insert and delete tuples. These insertions and deletions represent the changing state of the relation, and this technique makes it easy to implement incremental processing of streams.</p>
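<p>A tiny trace (my own illustration, not from the paper) of how a sliding window’s changing contents map to insert/delete tuples:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Query over stream S keeping only the two most recent tuples:
SELECT * FROM S [Rows 2]

-- Arrivals on S:            (a,1)       (b,2)         (c,3)
-- Window contents:    t=1:  {a}   t=2:  {a,b}   t=3:  {b,c}
-- Insert/delete rep.: t=1:  +a    t=2:  +b      t=3:  +c, -a</code></pre></figure>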
<p>Synopses are used to maintain intermediate state, such as the current contents of a sliding window or the current state of a relation for a join operation.</p>
<p>More information about CQL query execution can be found in <em>Section 12</em> of the CQL <a href="http://dl.acm.org/citation.cfm?id=1146463">paper</a>.</p>
<h2 id="limitations">Limitations</h2>
<p>CQL is not an extension to standard SQL. That is its main limitation if we are planning to use it in a modern stream management system: extensions to SQL reduce the learning curve, and familiar semantics make it easier to use in production environments. Other than that, there are no major limitations I can think of, and the concepts presented in the CQL paper are still useful today when reasoning about streaming SQL systems.</p>
Sun, 01 Nov 2015 00:00:00 +0000http://milinda.pathirage.org/tech/2015/11/01/cql.html
http://milinda.pathirage.org/tech/2015/11/01/cql.htmlstreaming,streaming-sql,fast-datatechMonitoring Play Apps with InfluxDB<p>New Relic has awesome support for monitoring web applications written in the Play Framework, but I couldn’t find any open source libraries or tools that come close to it. Recently, I deployed an InfluxDB, collectd and Grafana based solution for monitoring a cluster of nodes in our <a href="http://d2i.indiana.edu">lab</a> and it worked out nicely. So the best solution I could think of was to publish Play app metrics to InfluxDB and build a dashboard using Grafana. I started browsing through the different Play application monitoring solutions available on the web and found <a href="https://github.com/kenshoo/metrics-play">metrics-play</a>, a Play plugin built on Dropwizard’s <a href="https://dropwizard.github.io/metrics/3.1.0/">Metrics</a> library, which has Graphite support (and InfluxDB has a native Graphite input plugin).</p>
<p>Even though <em>metrics-play</em> doesn’t include support for the Graphite reporter, I <a href="https://github.com/milinda/metrics-play">forked it</a> and added it. If anyone is interested, you can grab the code from <a href="https://github.com/milinda/metrics-play">here</a>; Graphite support is in the <em>graphite-publisher</em> branch. To add this to your Play application, you have to do the following.</p>
<ul>
<li>Add metrics-play with the Graphite reporter to your Play app dependencies.</li>
</ul>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">libraryDependencies</span> <span class="o">++=</span> <span class="nc">Seq</span><span class="o">(</span>
<span class="s">"com.kenshoo"</span> <span class="o">%%</span> <span class="s">"metrics-play"</span> <span class="o">%</span> <span class="s">"2.3.0_0.2.1-graphite"</span><span class="o">,</span>
<span class="n">javaJdbc</span><span class="o">,</span>
<span class="n">javaEbean</span><span class="o">,</span>
<span class="n">cache</span><span class="o">,</span>
<span class="n">javaWs</span><span class="o">)</span></code></pre></figure>
<ul>
<li>
<p>Next, add the Play plugin <strong>com.kenshoo.play.metrics.MetricsPlugin</strong> to your play.plugins file.</p>
</li>
<li>
<p>Enable the Graphite reporter by adding the following configuration to your application.conf.</p>
</li>
</ul>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="nx">metrics</span> <span class="p">{</span>
<span class="nx">graphite</span> <span class="p">{</span>
<span class="nx">enabled</span> <span class="o">=</span> <span class="kc">true</span>
<span class="nx">period</span> <span class="o">=</span> <span class="mi">1</span>
<span class="nx">unit</span> <span class="o">=</span> <span class="nx">MINUTES</span>
<span class="nx">host</span> <span class="o">=</span> <span class="nx">localhost</span>
<span class="nx">port</span> <span class="o">=</span> <span class="mi">2003</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>And make sure you have enabled the Graphite input plugin in your InfluxDB configuration, as shown below.</li>
</ul>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="pi">[</span><span class="nv">input_plugins</span><span class="pi">]</span>
<span class="c1"># Configure the graphite api</span>
<span class="pi">[</span><span class="nv">input_plugins.graphite</span><span class="pi">]</span>
<span class="s">enabled = true</span>
<span class="s"># address = "0.0.0.0"</span> <span class="c1"># If not set, is actually set to bind-address.</span>
<span class="s">port = 2003</span>
<span class="s">database = "play-influxdb"</span> <span class="c1"># store graphite data in this database</span>
<span class="s">udp_enabled = true</span> <span class="c1"># enable udp interface on the same port as the tcp interface</span></code></pre></figure>
<p>If you got all the configuration right, you should see the Play app monitoring data, including JVM stats, in your InfluxDB.</p>
<p>The current <em>metrics-play</em> implementation supports only limited monitoring capabilities. I am planning to improve it to publish route-level metrics such as the number of requests and response times, so keep watching my <a href="https://github.com/milinda/metrics-play">fork</a> for new features.</p>
Sun, 24 May 2015 00:00:00 +0000http://milinda.pathirage.org/tech/2015/05/24/monitoring-play-apps.html
http://milinda.pathirage.org/tech/2015/05/24/monitoring-play-apps.htmlmonitoring,play,java,InfluxDBtechFreshet: CQL based Clojure DSL for Streaming Queries<p>Interest in continuous queries over streams of data has increased over the last couple of years due to the need to derive actionable information as soon as possible in order to stay competitive in a fast-moving world. Existing Big Data technologies designed for batch processing couldn’t handle today’s near real-time requirements, so distributed stream processing systems like Yahoo’s <a href="http://incubator.apache.org/s4/">S4</a>, Twitter’s <a href="https://storm.apache.org/">Storm</a>, <a href="https://spark.apache.org/streaming/">Spark Streaming</a> and LinkedIn’s <a href="http://samza.incubator.apache.org/">Samza</a> were introduced into the fast-growing Big Data ecosystem to tackle them. These systems are robust, fault tolerant and scalable enough to handle massive volumes of streaming data, but they lack first-class support for SQL-like querying capabilities. All of these frameworks provide <a href="http://samza.incubator.apache.org/learn/documentation/0.8/api/overview.html">high-level</a> <a href="https://storm.apache.org/documentation/Trident-tutorial.html">programming</a> <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">API</a>s in JVM-compatible languages.</p>
<p>In the golden era of stream processing research, a lot of work was done on query engines and languages for stream processing, but we have yet to widely adapt that work on streaming query languages to the distributed stream processing systems in use today.</p>
<p>Also, with the transition from batch to real-time Big Data, different architectures were proposed to handle the <a href="http://lambda-architecture.net/">integration of batch and real-time systems (Lambda Architecture)</a> as well as to <a href="http://www.kappa-architecture.com/">revolutionize the way we build today’s systems (Kappa Architecture)</a>. Even though there aren’t any standards (like SQL and relational algebra for databases) for implementing these architectures, <a href="https://github.com/twitter/summingbird">Summingbird</a> implements <a href="http://lambda-architecture.net/">Lambda Architecture</a> based on <a href="http://en.wikipedia.org/wiki/Monoid">monoids</a>, and there are <a href="http://spark-summit.org/2014/talk/applying-the-lambda-architecture-with-spark">other ways</a> to implement <em>Lambda Architecture</em>, such as Spark’s Scala API for streaming and batch processing. Even though it is possible to implement <a href="http://www.kappa-architecture.com/">Kappa Architecture</a> manually using the frameworks mentioned above, there aren’t any high-level frameworks like Summingbird for this purpose. <a href="https://github.com/milinda/Freshet">Freshet</a> tries to fill this gap by adapting the continuous query semantics and execution planning methods discussed by Arasu et al. in their paper <a href="https://cs.uwaterloo.ca/~david/cs848/stream-cql.pdf">The CQL Continuous Query Language: Semantic Foundations and Query Execution</a> to implement <em>Kappa Architecture</em> on top of Apache Samza.</p>
<p>Before going into details about <a href="https://github.com/milinda/Freshet">Freshet</a>, it’s important to discuss <em>Kappa Architecture</em> and <em>CQL</em>. These are the fundamental ideas and technologies on which Freshet is based.</p>
<h2 id="kappa-architecture">Kappa Architecture</h2>
<p><em>Kappa Architecture</em>, built around the notion that everything is a stream, was <a href="http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html">proposed</a> as an alternative to Lambda Architecture. In the link above, the author argues that stream processing is a generalization of data-flow DAGs with support for check-pointing intermediate results and continuous output to the end user. He emphasizes that we can use a current distributed stream processing framework like Apache Samza, combined with a message queue like <a href="http://kafka.apache.org/">Kafka</a> that retains ordered data, to implement the use cases handled by Lambda Architecture. Reprocessing is accomplished by replaying the stream through a new version of the stream processing code, or through a completely new algorithm.</p>
<h2 id="cql---continuous-query-language">CQL - Continuous Query Language</h2>
<p><a href="https://cs.uwaterloo.ca/~david/cs848/stream-cql.pdf">CQL</a> - aka Continuous Query Language - is a SQL- based declarative language for expressing queries over data streams and time varying relations. CQL’s abstract semantics are based on two data types - <strong>streams</strong> and <strong>relations</strong> - and three types of operations - <em>stream-to-relations</em>, <em>relation-to-relation</em> and <em>relation-to-stream</em>. In CQL, stream is a infinite bag of tuples and relation is a mapping from time Τ to a finite but unbounded bag of tuples. This special variant of the standard relation is called <em>instantaneous relation</em> in the context of CQL, because a relation <strong>R</strong> in CQL represent a finite but unbounded bag of elements at a given time instance τ. CQL takes advantage of well understood relational semantics and keeps the language simple and queries compact by introducing minimal changes to SQL.</p>
<ul>
<li>A window specification derived from SQL-99 to transform streams into relations</li>
<li>Three new operators to transform time-varying relations into streams.</li>
</ul>
<h3 id="cql-sample---filtering-a-stream">CQL Sample - <em>Filtering A Stream</em></h3>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">Rstream</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">PosSpeedStr</span> <span class="p">[</span><span class="n">Now</span><span class="p">]</span>
<span class="k">WHERE</span> <span class="n">speed</span> <span class="o">&gt;</span> <span class="mi">65</span></code></pre></figure>
<p>CQL uses SQL for relation-to-relation transformations, but relations in CQL are different from relations in SQL. CQL relations vary with time. CQL introduces two new concepts: insert/delete streams, which encode both streams and relations in a unified way, and synopses, which contain state (e.g. a counter or buffer of messages) for an operator.</p>
<p>Another important point is that we can use traditional database relations in CQL queries, which enables things like stream-relation joins that are common in real-world applications.</p>
<h2 id="freshet">Freshet</h2>
<p>Let’s come back to <a href="https://github.com/milinda/Freshet">Freshet</a>. Freshet is a first step towards a complete implementation of Kappa Architecture based on <a href="https://cs.uwaterloo.ca/~david/cs848/stream-cql.pdf">CQL</a> to support continuous queries. Freshet implements a subset of CQL (select, windowing, aggregates) on top of Apache Samza: the RStream and IStream relation-to-stream operators, tuple- and time-based sliding windows to convert streams into relations, and basic relation-to-relation operators for implementing business logic. Following CQL, Freshet uses <em>insert/delete</em> streams to model instantaneous relations.</p>
<p><a href="http://img.svbtle.com/bzsmth0xzik1jq.jpg"><img src="https://d23f6h5jpj26xu.cloudfront.net/bzsmth0xzik1jq_small.jpg" alt="freshet-arch.jpg" /></a></p>
<p>As shown in the figure above, Freshet is built out of five main logical components.</p>
<ul>
<li><strong>Query DSL</strong>: Implemented as a Clojure DSL and used to express CQL queries against streams. Queries expressed in the Freshet DSL get compiled into a streaming relational algebra model and then converted into an execution plan consisting of a set of operators written as Samza <a href="http://samza.incubator.apache.org/learn/documentation/0.8/api/overview.html">stream tasks</a> connected together as a DAG via Kafka queues.</li>
<li><strong>Query Compiler</strong>: Compiles the SQL model generated from the DSL into an intermediate representation that can be converted into an execution plan.</li>
<li><strong>Execution Planner</strong>: Generates execution plans (Samza jobs connected via input, intermediate and output streams to form a DAG) based on the intermediate representation and the current status of the Freshet cluster.</li>
<li><strong>Scheduler</strong>: Does the actual scheduling of Samza Jobs.</li>
<li><strong>Query Operators</strong>: Samza stream tasks implementing CQL operators like window, select and aggregate, and view generation operators like rstream and istream. These operators, connected via intermediate streams, perform stream processing according to the query expressed in the Freshet DSL.</li>
</ul>
<h2 id="freshet-dsl">Freshet DSL</h2>
<p>The Freshet Clojure DSL is inspired by the <a href="http://sqlkorma.com/">Korma</a> Clojure DSL for SQL and follows the same style. There are two main constructs in the current Freshet DSL: <strong>defstream</strong> and <strong>select</strong> queries. These are the two forms I am planning to support in the initial version; other constructs will be added later.</p>
<h3 id="defstream">defstream</h3>
<p>Used to define a new stream. Streams defined using <em>defstream</em> represent Kafka topics in the current implementation; new modifiers will be added to <em>defstream</em> in the future to support different input sources. The most important modifier for <em>defstream</em> is <em>stream-fields</em>, which adds a field name/type mapping to the stream definition. Clojure <em>keywords</em> are used to specify field names, and these keywords get converted to strings internally. There are pre-defined keywords for specifying types, such as :string and :integer. Below is how we define a Wikipedia activity stream to use in stream queries.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">defstream</span><span class="w"> </span><span class="n">wikipedia-activity</span><span class="w">
</span><span class="p">(</span><span class="nf">stream-fields</span><span class="w"> </span><span class="p">[</span><span class="no">:title</span><span class="w"> </span><span class="no">:string</span><span class="w">
</span><span class="no">:user</span><span class="w"> </span><span class="no">:string</span><span class="w">
</span><span class="no">:diff-bytes</span><span class="w"> </span><span class="no">:integer</span><span class="w">
</span><span class="no">:diff-url</span><span class="w"> </span><span class="no">:string</span><span class="w">
</span><span class="no">:unparsed-flags</span><span class="w"> </span><span class="no">:string</span><span class="w">
</span><span class="no">:summary</span><span class="w"> </span><span class="no">:string</span><span class="w">
</span><span class="no">:is-minor</span><span class="w"> </span><span class="no">:boolean</span><span class="w">
</span><span class="no">:is-unpatrolled</span><span class="w"> </span><span class="no">:boolean</span><span class="w">
</span><span class="no">:is-special</span><span class="w"> </span><span class="no">:boolean</span><span class="w">
</span><span class="no">:is-talk</span><span class="w"> </span><span class="no">:boolean</span><span class="w">
</span><span class="no">:is-new</span><span class="w"> </span><span class="no">:boolean</span><span class="w">
</span><span class="no">:is-bot-edit</span><span class="w"> </span><span class="no">:boolean</span><span class="w">
</span><span class="no">:timestamp</span><span class="w"> </span><span class="no">:long</span><span class="p">])</span><span class="w">
</span><span class="p">(</span><span class="nf">ts</span><span class="w"> </span><span class="no">:timestamp</span><span class="p">))</span></code></pre></figure>
<h3 id="select">select</h3>
<p>Used to define select queries over streams. Stream filtering using <em>where</em> and <em>aggregators</em> will be supported in the initial version, and <em>joins</em> will be added next. Below is a sample select query which filters a stream.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nb">select</span><span class="w"> </span><span class="n">wikipedia-activity</span><span class="w">
</span><span class="p">(</span><span class="nf">modifiers</span><span class="w"> </span><span class="no">:istream</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">window</span><span class="w"> </span><span class="p">(</span><span class="nf">unbounded</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="nf">where</span><span class="w"> </span><span class="p">(</span><span class="nb">&gt;</span><span class="w"> </span><span class="no">:diff-bytes</span><span class="w"> </span><span class="mi">100</span><span class="p">)))</span></code></pre></figure>
<h2 id="query-execution">Query Execution</h2>
<p>Freshet follows the same execution semantics as CQL, and the flow of execution is shown below.</p>
<p><a href="http://img.svbtle.com/csdnd7s46rbvq.jpg"><img src="https://d23f6h5jpj26xu.cloudfront.net/csdnd7s46rbvq_small.jpg" alt="freshet-query-execution.jpg" /></a></p>
<p>The window operator converts the input stream into a <em>time-varying relation</em>, which is encoded as an <em>insert/delete</em> stream to make it easy to implement relational operators as streaming operators. The relational part of the query is converted into a DAG of Samza operators that operate on insert/delete streams according to the query definition. Finally, the stream materializer materializes the output stream according to the specification of the original query.</p>
<h2 id="why-samza">Why Samza</h2>
<p>Freshet chose Samza for its initial implementation mainly because</p>
<ul>
<li>Samza is fully integrated with Kafka</li>
<li>Samza supports and encourages <a href="http://samza.incubator.apache.org/learn/documentation/0.8/container/state-management.html">stateful stream processing</a></li>
<li>Samza’s local storage is really useful for implementing CQL synopses</li>
</ul>
<p>Samza’s property-based job configuration is the only limitation when it comes to Freshet. A Storm-like topology builder would have come in handy for Freshet-like layers on top of Samza.</p>
<h2 id="current-status-and-future-work">Current Status and Future Work</h2>
<p>I am currently working on bridging the Clojure DSL and the CQL operator layer (Samza stream tasks). I plan to do the initial release within a couple of weeks. After the initial release of Freshet, I am planning to contribute to Apache Samza’s <a href="https://issues.apache.org/jira/browse/SAMZA-390">Stream Query implementation</a>, which is also based on CQL. Once that is finished, Freshet can be updated to use Apache Samza’s CQL operators directly, rather than having its own.</p>
Wed, 07 Jan 2015 00:00:00 +0000http://milinda.pathirage.org/2015/01/07/freshet.html
http://milinda.pathirage.org/2015/01/07/freshet.html