Sunday, August 07, 2011

Although it's been a few years in the making, the noise / buzz around NoSQL has now reached fever pitch. Or to be more precise, the promise of something better / faster / cheaper / more scalable than standard RDBMSs has sucked in a lot of people (plus getting to use MapReduce in an application, even if it's not needed, is a temptation very hard to resist..). And pretty recently, the persistence hydra has grown another head - NewSQL. NewSQL adherents essentially believe that NoSQL is a design pig and that a better approach is to fix relational databases. In turn, NewSQL claims have been met by counter-claims about the constraints inherent in the NewSQL approach. It's all very fascinating (props for working Lady Gaga into a technical article as well..).

As it turns out, traditional RDBMSs are sometimes slow for valid reasons, and while you can certainly speed things up by relaxing constraints or optimising heavily for a specific use case, that's no panacea - no general solution to the problem of storing and accessing structured data both generically and fast. On the other hand, the assertion that Oracle, MySQL and SQL Server have become fat and inefficient because of backwards compatibility requirements definitely strikes a chord with me personally.

The sheer variety of NoSQL candidates (this web page lists ~122!) is evidence that the space is still immature. I don't have a problem with that (every technology goes through the same cycle), but it does raise one nasty problem: what happens if the candidate you back now has disappeared by 2015?

The current NoSQL marketplace demands a defensive architecture approach - it's reasonable to expect that over the next three years some promising current candidates will lose momentum and support, others will merge and still others will be bought up by a commercial RDBMS vendor, and become quite costly to license.

What we need is a good, implementation-independent abstraction layer to model the reading and writing from and to a NoSQL store. No hard-coding of specific implementation details into multiple layers of your application - instead, segregate the reading and writing code into a layer written with change in mind. We're talking about pluggable modules and sensible use of interfaces and design patterns, to make the replacement of your current NoSQL squeeze as low-pain as possible if and when that replacement is ever needed.
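To make that concrete, here is a minimal sketch of such a layer in Java. All names here (`DocumentStore`, `InMemoryDocumentStore`) are illustrative assumptions, not drawn from any particular NoSQL product:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// A deliberately minimal, implementation-independent view of a NoSQL store.
// Application code depends only on this interface; swapping one engine for
// another means writing one new adapter class, not trawling every layer.
interface DocumentStore {
    void put(String key, Map<String, String> document);
    Optional<Map<String, String>> get(String key);
    void delete(String key);
}

// A trivial in-memory adapter - useful for tests, and proof that the
// application layer never needs to know which engine sits behind it.
class InMemoryDocumentStore implements DocumentStore {
    private final Map<String, Map<String, String>> data = new ConcurrentHashMap<>();

    public void put(String key, Map<String, String> document) {
        data.put(key, Map.copyOf(document)); // defensive copy of the document
    }

    public Optional<Map<String, String>> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    public void delete(String key) {
        data.remove(key);
    }
}
```

The application holds a `DocumentStore` reference only; the concrete class behind it is chosen in exactly one place.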

If the future shows that the current trade-offs made in the NoSQL space (roughly summed up as a weaker take on Atomicity, Consistency, Isolation or Durability, plus your own favourite blend of Brewer's CAP theorem) are rendered unnecessary by software and hardware advances (as is very likely to be the case), then the API should ideally insulate our application code from that change.

There are interesting moves afoot that demonstrate that the community is actively thinking about this, specifically the very recent announcement of UnQL (the NoSQL equivalent to SQL - i.e. a unified NoSQL Query Language). That's good, but UnQL is young enough to shrivel and die just like any of the NoSQL implementations themselves. Also, we know that what has inspired UnQL - SQL - is itself fragmented, with vendor-specific extensions like T-SQL from Microsoft and PL/SQL from Oracle.

So then, in part one of this two-parter, I've worked to justify what's coming in part two - a minimal set of Java classes and interfaces to provide a concrete implementation of the abstract ideas discussed above.

Sunday, July 31, 2011

The new UI for Google Analytics has a distinctly Cromwellian vibe to it, as the screenshot below shows. Is this just my GA account, or does everyone else see Galway and Sligo a bit more surrounded by the Atlantic than normal?

Wednesday, July 06, 2011

Back in August of last year I wrote a step-by-step article on how to get Umbraco running on Windows Azure (the Microsoft cloud computing platform). It got a lot of hits from people looking to do exactly that.

There were a few loose ends in that piece, notably using VM-local rather than shared storage (which rules out Umbraco clustering) and using the .NET 3.5 runtime rather than .NET 4.0 (4.0 was a recent addition to Azure in Aug 2010 and just didn't work out of the box - missing sections in machine.config).

Saturday, May 14, 2011

Oh the perils of making predictions when there is still a conference keynote to go!

It turns out that Chrome OS and the associated hardware haven't been read the last rites after all. Rather, v1.0 is almost ready for primetime (scheduled for release in mid-June - about a month away). You have to imagine over time though that Google will want one code base for phones, tablets and chromebooks. At the very least, they will want to make it as easy as possible for developers to write their applications once and have them "just work" on devices with radically different screen sizes and input methods, something that Android developers today are already doing. Nonetheless, a very brave play, especially in targeting the enterprise space, where significant replacement costs exist. If it pays off, it will be huge.

Moving on from Chrome, two sessions I attended yesterday were really interesting - Full Text Search and Smart App Design.

Full Text Search is Google's take on Lucene / Solr, and it's integrated into the App Engine Datastore as well, so it will be compelling for developers who just want to start indexing and scoring documents quickly. The "fully automatic" mode of operation with the Datastore should also be a timesaver.

Smart App Design covered material of a completely different color. I had already read about the Prediction API in the blogosphere but I hadn't realised exactly what it did until this session. Essentially, Google offers the discerning developer the ability to add machine learning techniques to their application by leveraging a cloud-based service.

At first glance, I had thought that the API gave access to the same model that Google uses to predict search terms, and I guess that is one use case. But Google has done much more than that - they have effectively white-labelled their machine learning technology and made it available to non-Google developers to use with their own data, i.e. learn what's important for their application / business.

As with all machine-learning techniques, the nub of the matter remains the correct selection and efficient representation of the key attributes in the training set, and that is quite simply a problem that requires deep domain knowledge. One announcement yesterday was quite interesting however: Google are now allowing good model authors to sell their models to others. So if I come up with a model that predicts shopping basket behavior on leisure travel websites, and a tour operator uses it to bump their online conversion rate by 33%, then that model has a lot of value and it's a win-win situation for the model author and the model user.

So an API with a lot of promise. But also with two potential flies in the ointment, one commercial and one cultural:

(a) Commercial - Google are trying to charge for use of the API from day one, which will stymie adoption in the earliest stage

(b) Cultural - an endemic problem with a lot of machine learning techniques is their black box nature. As someone who spent a fair bit of time working with artificial neural networks at university, I know that quite often a machine learning approach will yield the correct answer but the researcher can't exactly explain why! That's not a Google-specific weakness, but what is Google-specific is that the models you access via the Prediction API (the man behind the curtain, if you will) are not open at all. So can a company really invest time in building, training and using models that it doesn't understand and can never hope to? Only time will tell.

So to recap then, Google IO was definitely worth attending this year - and not just for the hardware gifts! The main items on my research list post the event are:

Wednesday, May 11, 2011

The official Google code site has the lowdown on all of the announcements that came thick and fast today (some 11 major items last time I checked and plenty of API revs and upgrades) and I won't replay them all here.

Specific announcements that interested me today:

Google Go is about to become an officially supported language on App Engine, alongside Python and Java (it's currently in "Trusted Tester" mode).

Rhetorical question: what value does a complete end-to-end technology stack with no overhanging IPR issues or blockers have to Google, as a potential insurance policy in case the Oracle lawsuit does not go in their favor or get settled reasonably? Two things I heard today convinced me that there is now serious engineering investment going into Go (as opposed to a small, talented team cranking things out as they work down the list):

(a) The aforementioned App Engine support (this won't have been trivial to implement - for one thing, Go is the first compiled language to run on App Engine)

(b) The info that a "comprehensive" Go library for ultimately all of the Google APIs is in development and will be with us "soon".

Go is a very nice language to write in, and the App Engine support announced today addresses one of the major gaps I identified when I took a look at Go when it was first released in Nov 2009.

Three final comments on day one:

1. Press articles I read in March / April this year about the +1 button being a make or break deal for Google to compete with Facebook seem overblown. The +1 button has merited just one session so far and apart from that you wouldn't even know Google had it. Either that or the memo didn't make it to the IO organisers in time.

2. It's instructive to watch Google recognise the mistake that companies like Sun Microsystems made, and impressive to watch how studiously they avoid it. It's not enough to develop great code / software / hardware - you have to have people **using** it. Google's continued push into content ensures that usage. Google is not just the place you go to find content on the web, it's also where you consume that content (first YouTube, but now books, movies and music too). I'm glad Google don't have a social network offering in their portfolio of services - they would be simply too powerful if they did.

3. Google IO seems to be **all** about Android so far - it's absolutely everywhere you look and consumed the entire keynote this morning (Ice Cream in Q4 that unifies tablet and phone, Futures (Android @ Home), open accessories etc.). Barring some crazy and unforeseen announcement tomorrow, I'd say Chrome OS has been given the last rites internally. But then again, who knows what day two will bring?

Sunday, March 06, 2011

[This is an article for people working in leisure travel technology / ecommerce online conversion who visit this blog, although many of the take-home points are transferable to other industry verticals.]

Data is big, and getting bigger. The more we track and log, the more storage is needed to warehouse it, and the more CPU horsepower is needed to mine it to answer questions posed by the business. As an aside, everyone is facing this issue and it's sink or swim, with the swimmers sure to get a competitive advantage over the sinkers. In this article, I'll examine the main data feeds that matter in leisure travel, and propose an architecture to collect, manage and mine them for business benefit. The end goal is to propose a vision, explaining why and how to collect data to better inform and drive business decisions that improve ecommerce performance.

But why now - hasn't this always been an issue? Yes, but now more than ever, leisure travel is poised on the cusp of another big game-changer. Companies like Google and Microsoft are clearly already focusing more on travel as a segment, and their data gathering and mining capabilities are considerable. But tour operators and online travel agencies (OTAs) have a significant competitive advantage over pure play technology companies as we'll see a little later.

Important data sources in leisure travel ecommerce

First, let's examine the primary data sources that affect leisure travel ecommerce. There are some obvious entries in the table that follows, and some less so.

Tripadvisor is the poster child here, but user generated content (UGC) can be in-house too - but it must be perceived as unbiased by the consumer, otherwise it becomes a negative.

| # | Data source | Internal / External | Controllable | Notes |
|---|---|---|---|---|
| 9 | Meta data | Both | Yes | Every business tags its own data - timestamps, version numbers, # revisions, author, approver, when last yielded. The more meta data you have the merrier - it often helps to tie disparate data sources together and enriches the overall data pool |
| 10 | Search, cost, book funnel | Internal | Yes | Traditionally the core of any ecommerce strategy - measures the complete search, cost and book journey. Needs to be fully instrumented to collect data so that A/B and multivariate testing can be used to fine-tune performance over time. Google Analytics does this very, very well |
| 11 | Offline (shop) interactions | Internal | Yes | Few businesses try to tie shop activity back to online activity, but for a bricks and mortar plus clicks business, this is an opportunity missed |
| 12 | Online advertising (SEO) | Internal | Partially | SEO can be thought of as PPC you don't pay for! Critical to making the cost of acquisition online as efficient as possible. Only partially controllable, because businesses are at the mercy of search engine scoring (which both Google and Microsoft (Bing) keep as a black box algorithm) |
| 13 | Online advertising (PPC) | Internal | Yes | Where Google makes its money! PPC has pride of place in every well-constructed ecommerce campaign, but the cost and effectiveness should be continuously monitored, challenged and tuned. CSV exports out of AdWords provide a good way to do this |
| 14 | Personalisation | Internal | Yes | Personalisation - both anonymous and known - is a great way to learn what kind of holiday / vacation people want to buy from you and how they want to find and buy it. Just don't try to build personalisation before you have (10) working well - personalisation needs a really solid foundation to work well.. |
| 15 | Social media | External | No | The rising star that no-one really knows how to handle. The Facebook API contains a lot of potential for travel ecommerce |
| 16 | Offline / traditional advertising | External | Yes | Measuring the efficacy (or not) of ad spend must extend to traditional / offline as well as the more easily measurable online variant, otherwise you don't know where all of your marketing £s / $s / €s are going |
| 17 | Post-booking interactions | Internal | Yes | Not traditionally seen as an ecommerce data source, but savvy businesses are now looking at post-booking amendments, cancellation rates etc. to identify patterns that can feed back into the search experience |
| 18 | Customer Relationship Management (CRM) | Internal | Yes | Both pre and post travel - it's key to have a good view of what the customer experiences on holiday and feed that back into what holidays are sold going forward. Is that picture of the pool misleading - change it! If the service is great, promote it more! |

Two important characteristics of data are whether you control it or not (and hence can change it if you need to) and whether it is sourced from an internal system or an external system (and thus how trustworthy / accurate the data is and whether it is unique to you or if other business entities can see it too). We have added these two characteristics to the table above for clarity.

What should be obvious to the reader is that a holistic picture of ecommerce performance requires multiple data sources, some of which traditionally would not be seen as impacting the effectiveness of a leisure travel ecommerce system. Gone are the days of simply looking at the web logs to see how effective (or leaky) the conversion funnel is! In fact, there are probably some sources that I've inadvertently omitted, and indeed as new systems come on stream, new sources will be added to this table / taxonomy.

Finally, it's interesting from a barrier to entry perspective to note that only the well-placed tour operator or OTA actually has the wherewithal and access to collate data from all of the sources noted in the table. Other new entrants simply do not have access to many of the sources listed. The data itself is now a valuable commodity (and is increasing in value), and an asset that leisure travel businesses would do well to guard jealously.

What we need - Systems and Data working together

At present, I contend that the average tour operator / OTA is collecting some, but not all, of the data sources identified, and that no tour operator or OTA has yet constructed a system that provides a holistic, joined-up view of the data back to the business function to inform decision-making activities. Why not? Because it's not easy to do! The IT estate behind these data sources is fragmented (core res system, yielding system, multiple content management systems, external systems, separate booking repositories / agency management systems, Google Analytics, Google AdWords, Excel spreadsheets), often owned by different companies and wasn't designed to provide the kind of view that is now needed. Ominously, new entrants into the space do not have a lot of the legacy baggage that incumbents do, meaning their velocity of implementation and ongoing change creates a hard-to-ignore imperative for all sellers of leisure travel to innovate quickly and learn from their data, or be left behind.

The technical challenge is four-fold:

1. Collection and storage - gather and store as much data as possible for each data source in the table, with that data being as clean and structured as possible (and in the real world, every data set will have some noise to it)

2. Build a holistic, joined-up data set - identify ways to link the data sources together - version number, unique keys, foreign keys, link backs, tagging etc. The more your data sources are joined up, the more holistic a view of the business you are building (and can provide back to the business). Conversely, disconnected data sets (data islands) are of much less value to the business and introduce the risk of an incomplete / inaccurate view of what's really happening now being used to influence what's going to happen next

3. Answering the questions - provide a mechanism to answer questions over this corpus of data in near real-time to allow the business to modify its behaviour and focus to maximise profits, yield and margin

4. Suggesting the questions - once the above three points have been implemented to a mature and repeatable level, the final logical step is for the data function to actually suggest areas of improvement and further exploration based on emergent patterns in the data, using techniques such as artificial neural networks and self-organising maps (SOMs)
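Point 2 above - building the joined-up data set - hinges entirely on shared keys. As a toy sketch in Java (the field names, such as `bookingRef`, are hypothetical; real systems will each have their own linking keys), two data islands become one view only because they share a key:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of linking data sources: web-analytics sessions and
// bookings become a single joined-up view via a shared (hypothetical)
// bookingRef key. Without such a key they remain disconnected data islands.
class DataLinker {
    record WebSession(String bookingRef, String landingPage) {}
    record Booking(String bookingRef, double value) {}
    record JoinedView(String bookingRef, String landingPage, double value) {}

    static List<JoinedView> join(List<WebSession> sessions, List<Booking> bookings) {
        // index bookings by their key, then match sessions against the index
        Map<String, Booking> byRef = new HashMap<>();
        for (Booking b : bookings) byRef.put(b.bookingRef(), b);

        List<JoinedView> out = new ArrayList<>();
        for (WebSession s : sessions) {
            Booking b = byRef.get(s.bookingRef());
            if (b != null) out.add(new JoinedView(s.bookingRef(), s.landingPage(), b.value()));
        }
        return out;
    }
}
```

The join itself is trivial; the hard, valuable work is agreeing and propagating the keys across systems owned by different companies.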

Putting it all together - a suggested framework

There are many ways to construct a view over the data sources identified in the previous section, and in fact multiple views are encouraged depending on the goal of the business. Here however, a hybrid of time and business function is used to derive a reasonable framework to hold the data. This framework is depicted in the following diagram.

Figure 1. High-level schematic of the big data system for leisure travel ecommerce.

A concrete implementation of the framework

The question naturally arises - how would this system be constructed, not just initially but also maintained and extended going forward?

Some natural candidates already exist, chief among them Cassandra and Hadoop. In the author's opinion, a hybrid architecture - Cassandra for data storage, with its innate simplicity and high availability, coupled with Hadoop's MapReduce framework - offers the best blend of performance, scalability, availability / resilience, querying and extensibility. A separate follow-on instalment to this article is warranted to provide a detailed technical treatise on the underpinnings of the system outlined here.
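For readers unfamiliar with the MapReduce side of that pairing, the shape of the computation can be sketched in plain Java. (Hadoop's real API is richer - Mapper and Reducer classes, input formats, job configuration, distribution across nodes - so treat this purely as a single-machine sketch of the pattern, counting searches per destination:)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine sketch of the MapReduce pattern: each log record is
// "mapped" to a (key, 1) pair, and the pairs are "reduced" per key by
// summation. Hadoop does the same thing, but partitioned across a cluster.
class SearchCounter {
    static Map<String, Integer> countByDestination(List<String> searchLog) {
        Map<String, Integer> counts = new HashMap<>();
        for (String destination : searchLog) {
            // map phase emits (destination, 1); merge() performs the reduce
            counts.merge(destination, 1, Integer::sum);
        }
        return counts;
    }
}
```

The value of the framework is that exactly this kind of aggregation keeps working unchanged as the data grows from megabytes to terabytes.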

Conclusion

The dominant data sources that impact the effectiveness of a leisure travel ecommerce strategy are identified, named and classified. Developing this classification further, a model is used to create a framework to house the data sources, and a concrete implementation is suggested.

About the author: Humphrey is the Chief Technology Officer for Comtec Group, a company that specializes in leisure travel technology.

(1) has been a long time coming and it's good to see the log jam moving. Simply shipping JDK 7 is good in its own right but it also means that the team will move onto working on JDK 8, which contains some key language features omitted from JDK 7 so that the team could JGIOTFD (Just Get It Out The (reader exercise to complete the acronym)).

(2) looks to be Oracle really making the JEE stack cloud-based / cloud-friendly by default rather than a technology stack that merely facilitates cloud computing. This dynamic should see Oracle formalising exactly what constitutes "JEE in the cloud" via a JSR and thus wresting that intellectual responsibility back from Google's App Engine platform, which is pretty much the de facto standard for "JEE in the cloud" at present.

Looking beyond JEE 7, JEE 8 looks to be embracing Big Data / NoSQL systems like Hadoop and Cassandra, although we can expect to have seen significant consolidation in this space by 2013, making the integration and platform support task easier to accomplish.

All in all, two nice moves, and good news for the Java eco system / economy. You might or might not like Oracle, but they are getting stuff out the door in a way that Sun kind of forgot how to do.

In a nutshell, here's where it is (covering each of the three parts in turn):

Parts two and three of the exam (the practical elements) will remain very similar to how they operate today - these elements test your ability to design and document (part two) a solution to a well-defined business problem using the JEE platform and then challenge you (part three) to self-critique and justify key design decisions taken, especially on how non-functional requirements will be adequately satisfied. Parts two and three are pretty much independent of the current JEE revision, because the candidate is given a good degree of latitude in how they use JEE to solve the problem. Were a candidate to use J2EE 1.4 features, let's say, then the examiner is going to question the logic of that decision closely, but that's about it. Writing Ruby code and then having it compile to Java bytecodes at runtime using JRuby is also not recommended (don't laugh, someone did ask..)!

Part one of the exam (the multiple-choice exam) **will** change for JEE 6 - it has to because part one is more tightly coupled to a specific JEE revision - currently JEE 5 (with ~5% of J2EE 1.4 content).

The last time we revised part one, ~ten architects got together in Broomfield, CO for a week to design and critique the corpus of questions used. After that, Sun Microsystems (as they were then), brought in some external testing folks to benchmark the exam and to critique the overall marking strategy we intended to employ. That was an intense week and overall a fairly involved process, because you want to write difficult, tricky questions that will challenge an architect but at the same time, be fair. Part one of the architect exam is also not allowed to test your ability to memorize APIs or specifications - that is the primary task for the lower certifications. You very quickly find that a lot of difficult / tricky questions in JEE revolve around the APIs and specifications!

I think with the benefit of hindsight we erred on the side of fairness over toughness. I think we'll look to toughen up the questions for JEE 6.

I don't expect Oracle to reconvene the team of architects to do this refresh - the last refresh of the exam was a major refresh whereas we would consider this refresh to be more minor. Therefore the time taken to update should be shorter. Once the part one refresh is scheduled in, I'll post again on this topic. For now, the JEE 5 architect exam remains the most current and up to date architect exam you can take.

Sunday, January 09, 2011

Readers of this blog can wax lyrical on how to build a great B2C ecommerce site - either in JEE or .NET. First we get the technology stack right, then frameworks using that technology stack, comprehensive functional and technical specs, testing plans, coding standards + reviews with daily scrum meetings, hardware / cloud estimation and then load / penetration testing - this is bread and butter to the software architect.

What a lot of software architects don't understand (or underestimate) is what needs to happen to their site after it goes live. After the go-live of a B2C ecommerce site, a whole other team (which is fairly non-technical) takes it over. This team is really exercised by and focused on three core goals:

1. Get qualified visitors to the site as cost-effectively as possible

2. Enable those visitors to find the product they want quickly and easily

3. Convert the visitor into a customer - convince them to buy on your site

These goals are completely measurable in monetary terms, and hence you will find senior management taking a serious interest in them as well.

I work in leisure travel, and there are some very specific nuances to achieving these goals in my industry sector (every industry sector will have their own nuances). But there is also a generic model to be found and some very useful (and free!) tools that you can use to put the model in place.

Turns out the model is pretty simple. Essentially it consists of three components:

1. Analytics - where we measure what's happening on our target site - how is the user interacting with the site and can we infer what they do and don't like based on measuring and studying those interactions

2. Hypothesis testing (aka A/B and / or multivariate testing) - Analytics will give us lots of data to generate ideas on how to improve interactions, therefore we need a mechanism to test out hypotheses in a semi-automated way (if I change X, I bet the conversion rate will increase by Y%)

3. Efficient prospect capture - we want the best native SEO score possible on all of the search engines and when we spend money on ad campaigns, we want the best return for that investment.
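Component 2 (hypothesis testing) presupposes one mechanical detail worth understanding even if a tool does it for you: visitors must be split between variants consistently, so a returning visitor always sees the same page. A common trick, sketched below under the assumption of a stable visitor id (Google Website Optimizer handles this internally), is to hash the id into a bucket:

```java
// Sketch of the A/B split behind hypothesis testing: hash a stable visitor
// id so the same visitor always lands in the same variant bucket. Tools like
// Google Website Optimizer do this for you; this only shows the mechanism.
class AbSplitter {
    static String variantFor(String visitorId, int variantCount) {
        // floorMod guards against negative hashCode values
        int bucket = Math.floorMod(visitorId.hashCode(), variantCount);
        return "variant-" + (char) ('A' + bucket);
    }
}
```

Because the assignment is a pure function of the visitor id, no per-visitor state needs storing, and the split stays stable across sessions.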

So that's the high-level model - it's pretty simple.

Many companies (and especially Google) make an awful lot of money around online ecommerce. And that's where the "free!" I noted above comes in. It makes sense for Google to give away the tools enabling Analytics (1) and Hypothesis testing (2) for free, as they make so much revenue on selling ad campaigns in Efficient Prospect Capture (3). Unkind souls might claim that if you spend any kind of money with Google AdWords at all, then you're not really getting (1) or (2) for free, but you won't find a nefarious cheap shot like that on this blog.

Let's look at how we can implement the model then:

1. Analytics - use Google Analytics. Brian Clifton's book is an excellent treatise on the application, and the online training videos are of a high standard as well. It's well worth having a couple of developers on your team get Analytics certified to understand what the tool can do - it really is very powerful

2. Hypothesis (A/B, multivariate) testing - use Google Website Optimizer. There's less information about this tool, I guess because it's a bit simpler than Analytics, but a good overview is available. Being able to change content and see the impact on the fly is a key part of the model - that's why we use a CMS like Umbraco!

3. Efficient prospect capture - SEO, SEO and more SEO. The Art of SEO is a great read. My opinion here is that as long as you're doing a great job on your own SEO, you should begrudge a search engine every penny. By using tagging in conjunction with Google Analytics (make sure you associate your AdWords account with your Analytics account to get all this done for you automagically), you can continually check that your ROI on ad campaigns is worth the spend, and stop buying terms that don't make money.

And that's pretty much it. A three-component generic model for online ecommerce, followed by the simplest (with zero cost) way to implement that model for your B2C site. I intimated that each industry sector has its own quirks and foibles above and beyond this base model, and I'll focus on the leisure travel industry in more detail in a future post or two. For now, enjoy!