~ Clojure, Kafka, Spark, Hadoop and AlgoAutomation

Monthly Archives: June 2014

It’s been staring me in the face for a while, through all the training, advising and talking about this Hadoop thing. It needs an AppStore, a seamless platform for one-click installation and running of Hadoop applications within the framework and cluster.

Up until now most MapReduce work has been hand coded by in-house development teams working to a specific requirement. Hadoop 2 and its YARN component extend on that and make applications easier to distribute across the cluster, like I said in yesterday’s post about deploying point of sale software across kiosks.

The issue is that Hadoop takes an amount of skill to get things running: thinking about getting data in, running scripts and so on. We’re a way off from a one touch solution for app and data deployment and processing.

It’s coming, I’m sure it is. The next two years will be very interesting.

MapR got the ball rolling with an application gallery, not an AppStore per se but a great start.

Point Of Sale (POS) is one of the hotly contested areas for startups. While many feel that point of sale should shift from the desktop/kiosk based Windows machine to something a bit more funky (iOS and Android, think AirPOS), there is a legacy of hundreds of thousands of kiosks out there in the field.

One of the major headaches for POS vendors is updating the software on the kiosk. The majority is built on Windows, runs locally and uses local storage. Only recently has the advent of broadband connectivity meant that stock updates and deployments can happen as required.

Updates are tiresome especially when you have large numbers of kiosks deployed within a store.

With web based POS this isn’t really much of an issue as the software is essentially a SaaS implementation. iOS and Android update their own operating systems via their own ecosystems, but this puts the emphasis on the user to update their own devices, not on the vendor to monitor and measure deployments across the site. This can become an issue from a customer support point of view when there are multiple versions of the software in use across the retail field.

Hadoop 2 and YARN

Most people think of Hadoop as the BigData thing, crunching volumes of data and spitting out the answers. For Hadoop 1.x this was true; with the advent of Hadoop 2.x it’s not so much the case now.

The differences between Hadoop version 1 and version 2 are vast, and it all boils down to the YARN element performing resource management control. This means that Hadoop is no longer just about crunching data in the MapReduce paradigm but more a cluster control tool, where the YARN element (Yet Another Resource Negotiator) can play a key role in deploying software across the whole cluster.

One Master, Many Slaves

The diagram below shows six POS kiosks in operation. Traditionally updates would happen manually or via some remote software like TeamViewer, remotely copying the files in and restarting the POS application on the kiosk. Introducing Hadoop 2.x into the mix changes the landscape slightly. The Hadoop master is on another server, this could be within the same data centre, or within the control of the POS company for example. The POS kiosks now become slaves to the Hadoop master.

In Hadoop terms every POS kiosk becomes a Node with a Container. With YARN we can make a deployment to all the kiosks at the same time. Ultimately YARN isn’t concerned about what it is running, just that it can be published to the required containers across the available nodes. The software required for the POS would be stored on the master’s HDFS or the master’s local filesystem. When the YARN client is run it’s then copied across to the addressable nodes.

This means the master deployment of the updated software only needs to happen once. Then it’s up to YARN to distribute the software to the other kiosks and get it started. The beauty of this is that the output and error logs are stored on the master server, so they are all in one place.
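As a sketch of that flow, here’s a toy Python simulation. To be clear, this is not the real Hadoop YARN API — the class names, node IDs and update payload are all invented for illustration. The master stores one copy of the update, fans it out to a container on every registered kiosk node, and aggregates each node’s log back in one place.

```python
# Toy simulation of the one-master, many-slaves deployment described above.
# NOT the Hadoop YARN API; all names and payloads here are made up.

class Master:
    def __init__(self):
        self.software = {}   # stands in for HDFS on the master
        self.logs = {}       # aggregated node logs, all in one place

    def store_update(self, name, payload):
        # The updated software only needs to land on the master once.
        self.software[name] = payload

    def deploy(self, name, nodes):
        # One submission; the "resource negotiator" pushes the same
        # payload to a container on every addressable node and pulls
        # each node's log back to the master.
        payload = self.software[name]
        for node in nodes:
            self.logs[node.node_id] = node.run_container(name, payload)

class KioskNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.installed = {}

    def run_container(self, name, payload):
        self.installed[name] = payload
        return f"{self.node_id}: installed {name} ({len(payload)} bytes)"

master = Master()
kiosks = [KioskNode(f"kiosk-{i}") for i in range(1, 7)]  # six kiosks
master.store_update("pos-app-v2.jar", b"\x00" * 1024)    # dummy payload
master.deploy("pos-app-v2.jar", kiosks)
print(len(master.logs))  # → 6: one log per kiosk, all held on the master
```

The point of the sketch is the shape of the thing: one upload, one submission, and the fan-out and log collection are somebody else’s problem.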

Taking It Further

As YARN acts as our resource manager it’s also easier to carry out data functions, heartbeat monitoring (no one likes a POS kiosk that’s down, you can’t sell) and end of day functions, with the data stored on the master. From there it would be a fairly trivial task to perform some batch Hadoop jobs on the day’s trading data. Insight comes quickly when the nodes are connected together.

Like it or loathe it, the Northern Ireland funding situation is either a lifeline or an unrequited flatline for companies. Not much seems to have changed in the last couple of years while the actual sector landscape has changed a lot with the advent of new sectors and ventures.

The Proof Of Concept (PoC) Fund

The £10K and £40K PoC funds have been the starting point for many-a-startup to actually get a minimum product built. I personally think it needs a bit of an overhaul.

Tiering The Available 40K Amount

The amount awarded should depend on the type of project you are building. Enterprise projects cost a fortune to put together; consumer apps not so much. So iPhone development, for example, shouldn’t really be touching the 40K at all and should only be eligible for the 10K.

Three Solid Quotes From Each Supplier

I’m not even sure this happens; if not, then it should. Northern Ireland is blessed with a lot of mobile developers and hundreds (and I mean hundreds) of web design firms. The quality of these is pretty good across the board. So it makes sense, before any funding is passed, for each applicant to have three solid quotes. This should happen across both funding amounts.

A New 100K Fund

For larger scale companies who need specific development in the enterprise (think BigData/Hadoop/High Performance Computing etc) with a solid proven requirement. Building these components is complex and time consuming, and the skills are not always sourced in the province. Consumer applications do not apply; this is an enterprise (B2B) only category with a proven opportunity to build a product.

Why do I mention the BigData aspect? Well, it’s not going to go away for one, and while the suits love talking about it they don’t like the fact that the average seed round is about $5M to make a small dent in the market. While it’s predominantly services/support-subscription based, the returns can be huge. VCs are happy to plough $17M into an open source BigData company that makes no money on downloads but generates huge revenue on support fees (the old Red Hat model).

In Summary

The PoC funds have started many an idea but once the money leaves the account it’s really down to the applicant to see the delivery through. At this point it gets a bit hazy. A solid plan and specification is required for any software/hardware/business project and a concrete set of deliverables needs to be put in place.

Focus needs to be placed on the larger scale enterprise companies that need more of a boost in the initial stages. While the customer ratio may be lower, they tend to yield longer lasting relationships and a larger long term value.

Consumer apps do not, on the whole, generate the returns that an overseas VC would be looking for (with rare exceptions). The days of making a fortune off apps are sort of over; the real revenue comes from the usage data and in app sales around that.

Enterprise is designed with the long run in mind and not a quick exit. A £100K fund would be a better starting point instead of wasting time hunting for small scraps under the table.

The commonality with most data projects is that they start with this rather ambiguous, “well, we’ve got this data”, statement. And without a care in the world it’s spin up the Amazon EMR instances, yes instances, and whack all the data up there. The mainstream tech media will focus on the commodity hardware and Yahoo having 50,000 nodes in its cluster, but the fact remains that for most customers we’re talking small numbers of nodes.

The peak cluster counts really only apply to a small number of the companies using Hadoop (the Googles, Facebooks, Twitters and Yahoos of this world). For most people a single core may actually be fine when dealing with batch processing, especially when those runs are planned during quiet times when the data isn’t going to be changing fast.

Adding cores adds latency, regardless of whether you are within the local network or across a wide area network. When you combine a number of latency points, such as network access, disk i/o and the actual processing, the total time added can start to hurt.
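To put rough numbers on that stacking effect, here’s a back-of-the-envelope sketch. The millisecond figures are made up for illustration, not measured from any real cluster; the only point is that the network term, multiplied across every block you ship, adds up fast.

```python
# Illustrative per-block costs (invented numbers, not measurements)
# for shipping a 64MB block to a remote node versus processing it
# where the data already lives.
NETWORK_MS = 120   # copy the block over the wire
DISK_MS = 45       # read/write on the node's local disk
CPU_MS = 200       # the actual map work

per_block_remote = NETWORK_MS + DISK_MS + CPU_MS  # 365ms
per_block_local = DISK_MS + CPU_MS                # 245ms, data in place

blocks = 160  # e.g. a hypothetical 10GB input at 64MB per block
overhead_ms = (per_block_remote - per_block_local) * blocks
print(overhead_ms / 1000.0)  # → 19.2 seconds added purely by the network
```

Nearly twenty seconds of pure network cost on a modest 10GB job, before you’ve done anything clever at all.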

Hadoop is usually copying 64MB blocks of data to a node for processing, done with TCP and RPC. Why 64MB? Well, it’s a Goldilocks number: not too big, not too small, but just right.
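The block arithmetic is worth doing once. As a worked example (the 1GB file size is hypothetical, and replication factor 3 is the HDFS default): a 1GB file at 64MB per block splits into 16 blocks, and with three-way replication that’s 48 block copies the cluster has to hold and move.

```python
import math

BLOCK_MB = 64     # the Goldilocks block size discussed above
REPLICATION = 3   # HDFS default replication factor

def block_count(file_mb, block_mb=BLOCK_MB):
    # Number of HDFS blocks needed to hold a file of this size;
    # the last block is padded up to a whole block.
    return math.ceil(file_mb / block_mb)

file_mb = 1024  # a hypothetical 1GB input file
blocks = block_count(file_mb)
copies = blocks * REPLICATION
print(blocks, copies)  # → 16 48
```

Every one of those copies is a TCP transfer somewhere, which is why the block size matters so much.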

So the key to powering up any single core system is to find ways of reducing any latency you can. Whether that’s making HDFS a complete in-memory implementation so block reads and writes are faster, or just adding a ton more machine RAM (well, as much as the JVM will let you add), it will give big time differences; machines with 48GB will find other uses for it, like disk caching and so on. It all adds to the performance.

Once you run out of options you are entering the realms of hardware acceleration, and there are a few companies working on this now; a notable NI company is Analytics Engines. At that point it stops being a Hadoop issue and becomes a matter of eking the juices out of the machine you are working on.

The Hadoop cluster count has become the ego point for Hadoop developers, without much thinking about the cost considerations of running such a complex cluster of machines (devops won’t save you now). I’ve found that even the mid-tier racks will give you up to 6TB of storage and enough RAM to sink a small startup. In the grand scheme of things SMEs don’t have much to worry about; perhaps we don’t need everything on the cloud after all…

The book has been on pre-order for a little while already but I didn’t want to say anything until there was a cover. So here we are: the book, coming on nicely, and I’ve tried to keep things more developer focused than mathematical. A get-things-done book more than a theory book.

General release isn’t until November in the US and December elsewhere though I’m trying to get it launched for StrataConf Europe in November.