Posts Tagged ‘client’

Last week I attended Enterprise Search & Discovery 2014, part of the KMWorld conference in Washington DC. I’d been asked to speak on Turning Search Upside Down and luckily had the first slot after the opening keynote: thanks to all who came and for the great feedback (there are slides available to conference attendees, I’ll publish them more widely soon, but this talk was about media monitoring, our Luwak library and how we have successfully replaced Autonomy IDOL and Verity with a powerful open source solution for a Scandinavian monitoring firm).

Since ESSDC is co-located with KMWorld, Sharepoint Symposium and Taxonomy Bootcamp, it feels like a much larger event than the similar Enterprise Search Europe, although total numbers are probably comparable. It was clear to me that the event is far more focused on a business rather than technical audience, with most of the talks being high-level (and some being simply marketing pitches, which was a little disappointing). Mentions of open source search were common (from Dion Hinchcliffe’s use of it as an example of a collaborative community, to Kamran Kahn’s example of Apache Solr being used for very large scale search at the US National Archives). Unfortunately a lot of the presenters started with the ’search sucks, everyone hates search’ theme (before explaining of course that their own solution would suck less) which I’m personally becoming a little tired of – if we as an industry continue pursuing this negative sentiment we’re unlikely to raise the profile of enterprise search: perhaps we should concentrate on more positive stories as they certainly do exist.

I spent a lot of time networking with other attendees and catching up with some old contacts (a shout out to Miles Kehoe, Eric Pugh, Jeff Fried and Alfresco founder John Newton, great to see you all again). My favourite presentation was Dave Snowden’s fantastic and very funny debunking of knowledge management myths (complete with stories about London taxi drivers and a dig at American football) and I also enjoyed Raytion’s realistic case studies (‘no-one is searching for the sake of searching – except us [search integrators] of course’). Presentations I enjoyed somewhat less included Brainspace (who stressed Transparency as a key value, then when I asked if their software was thus open source, explained that they would love it to be so but then they wouldn’t be able to get any investment – has anyone told Elasticsearch?) and Hewlett Packard, who tried to tell us that their new API to the venerable IDOL search engine was ‘free software’ – not by any definition I’m aware of, sorry. Other presentation themes included graph/semantic search – maybe this is finally something we can consider seriously, many years after Tim Berners Lee’s seminal paper [PDF].

We recently implemented a search solution for a customer using Elasticsearch. Most of their requirements were fairly standard, however they also wanted to be able to search for IP addresses embedded in the document text, using a flexible and precise search syntax, e.g. given the following document fragment:

... the API can be accessed at 167.87.3.201 on port 8700 ...

the following searches should all find the document:

167.87.3.201
*.87.3.201
*.87.*.201
167.[80-100].3.*
etc.

While it would have been possible to implement the multiple wildcard requirement with Elasticsearch/Lucene regular expression queries, there is no simple way to handle the numeric range requirement without constructing some fairly complex regexps. Furthermore, regular expression queries can be slow to run (depending on the complexity of the expression and the size of the index), and this application had a large index.

The obvious thing to do here is to parse the IP address into separate numbers and index it into numeric fields. e.g.:

(please note that these queries are for illustrative purposes only, and are untested).

Unfortunately, this approach fails when there is more than one IP address per document (as there generally was in this case), since if multiple values exist for the ipN fields the relationship between each component is lost. For example, a document containing:

... servers at 167.133.88.1 and 176.90.3.10 are load balanced ...

would spuriously match the user query above, despite the fact that neither IP address matches the query exactly. One possibility would be to use dynamic fields to index each address to a different set of fields:

However, queries would have to cover all possible IP fields with repeated OR subqueries, which would quickly become ugly and unmanagable.

Luckily, Elasticsearch nested documents provide exactly the mechanism we need to preserve the IP address structure within the main document (Solr does too, though this post does not go into the details). This is most easily explained with a JSON example with two IP addresses:

The child documents are created by the indexer script, which uses a regular expression to find all IP addresses in the document content and parses them into separate numbers. IP addresses can then be searched for using the nested query type, e.g:

This query selects parent documents containing at least one ipaddr child document which matches the query. Internally, children are stored as separate documents from parents, but the join is done transparently and extremely fast.

Nested queries can, of course, be combined with text queries etc. The application we built for the client (in AngularJS and Python/Flask) parses user queries to extract IP query expressions and builds combined text, boolean and nested queries to implement the required search logic.

One slight problem with this approach is that IP addresses are not included in any highlighted summaries generated by Elasticsearch as part of search results. This is because the highlighter does not know where in the text the matching IP address is. There is no simple way around this, so to generate highlighted search summaries we used our own standalone highlighter component, extending it to ‘understand’ the IP query syntax. This code is Apache 2 licensed and is free to download and use.

To sum up, this post outlines how we used Elasticsearch’s nested document type to implement a flexible and fast IP address search syntax. Of course, the same approach could be used to search any other type of structured entity in document text, such as social security numbers, ISBNs etc.

A recent client wished to convert documents to and from Microsoft Office formats, using a web form as an intermediate step for editing the content. The documents were read in, imported to a Solr search engine, and could then be searched over, cloned, edited and transformed in batches, before being exported to Office once more.

The content itself was broken down into fields, some of which were simple text or date entry boxes, while others were more complex rich text fields. We opted to use TinyMCE as our rich text editor of choice – it’s small, open source, and easy to extend (we already knew we wanted to write at least one plugin).

The problem arose when the client explained to us that they wanted to use the tab key in rich text fields to create consistent spacing in the text. These needed to display as closely as possible to the original document format, and convert to actual tabs in the Office documents. This presented a number of problems:
By default, the tab key moves the user to the next field on a web page, and needs special handling to prevent this behaviour, especially when it only needs to be applied to certain fields on the page. The spacing had to be consistent, like a word processor’s tab stop. This is tricky when working with proportional fonts, especially in a web form.

The client didn’t want to use an indent feature. The tab only came at the start of the paragraph – beyond that point the text could wrap around to the start of the line. The tab needed to be recognisable in our processing code, so it could be converted to a real tab when it was exported to MS Office.

The preferred solution would have been a document editor like that used for Google Docs. Unfortunately, we didn’t have the time to write the whole input and presentation layer in Javascript as Google have! We also wanted to keep the editing function inside the web application if possible, rather than forcing the user to edit the documents in Microsoft Office and then re-import them every time they needed to make changes.

I started with TinyMCE’s “nonbreaking” plugin, which captures the tab key and converts it to a number of non-breaking spaces. This wasn’t directly suitable for our needs – I discovered that the number of spaces is not always consistent, and they are sometimes converted to regular (rather than non-breaking) spaces. In addition, it doesn’t act like a tab stop – it inserts four spaces wherever you are on the line, which didn’t match the client’s requirement.

I adapted the plugin to insert a <span> into the text, using variable padding to ensure it was the right width. This worked reasonably well, after a not insignificant amount of head scratching trying to work around issues with spacing and space handling. Unfortunately, we struck usability problems when trying to backspace over the tab. The ideal situation would be that a single backspace would remove the entire tab, leaving the user at the start of the line (or the point before they hit the tab key). In fact, a single backspace would leave the user inside the span – two backspaces were required to visibly remove the tab from the editor, and the user could not tell that they were inside the span either. You couldn’t reliably select the “tab” with the mouse either. In addition, Firefox started to behave oddly at this point, putting the cursor in unexpected positions.

My final solution was ugly but workable. We switched to using a monospace font in the rich text editor and, after discussion with the client, started using a variable number of arrow characters to represent the tabs (we actually used ›, or a closing single quote, if you are reading and writing in German). This made life immediately simpler – dropping the proportional font meant that we didn’t have to worry about getting the width right, just the number of characters to insert. It does mean that in order to remove the tab, the user has to backspace over up to four characters, but the characters are clearly visible: you don’t find yourself inside a span that can’t be seen without viewing the underlying HTML.

While I’m sure this isn’t a unique problem, I couldn’t find anyone else that had been trying to do something similar. I am also not sure whether our choice of rich text editor affected how tricky this problem turned out to be. If anybody reading has suggestions of better approaches to this, we’d be interested to hear from them.

ArnoldIT wondered today why we were bothering to announce an upgrade to the venerable dtSearch engine, when they “weren’t aware of too many people still using that software”. Perhaps it’s time for a quick reality check here – we regularly see clients with search engines that many would consider prehistoric still in active use. Here’s some reasons why that might be so:

Search isn’t seen as essential. If your accounting software goes down, nobody gets paid: but if the search engine has gradually degraded in accuracy, doesn’t always contain the most recent documents and is generally too hard to use then most of your users will try and find a way around it – they’ll Google for content on the corporate website, dig slowly through the filestores or call up a colleague to ask. Of course, all of this will take time and there’s the risk they won’t find anything useful (or worse, find something inaccurate or out-of-date), but time is only money, surely?

The magic has gone. The sharp suited salesman who told you all the magical things your search engine could do – it could understand concepts, human language and the meaning of life – is a distant memory. Somehow those magical features were never implemented, perhaps the unexpected extra cost put you off (surely the magic came as standard? No?). You’ve also probably turned off a lot of the clever features of your engine as either no-one could understand how to use them, or they affected performance so much that search results took minutes to appear.

Upgrading search is hard and expensive. Small changes to the existing engine can cost huge consultancy fees but if you change supplier, you’ll have a whole new team of salesmen to meet, lots more buzzwords to learn, there’s expensive new license fees to pay, you’ll also have to overhaul your content management system, your metadata, your front ends…better to leave everything alone, surely?

There are search engines out there, chugging away quietly behind a corporate firewall, whose antiquity would astonish. Any chance of a support contract has long gone as the supplier would prefer it if you upgraded to their latest-and-greatest version – that’s if the supplier still exists at all. However there is always a way to upgrade that reduces the risk and cost – an incremental, agile and open-source based approach will prevent future lock-in to a single supplier and give you more control of the code your search engine depends on. Recently we’ve used this approach to help clients successfully upgrade search applications based on dtSearch, FAST ESP and Oracle and in the near future we’ll be doing the same for clients with several other well-known engines – and a few lost in the mists of time!

One of the best things about the increased use of open source search technology is that features that were previously unattainable for clients with small budgets are now within reach. Our client Bride and Groom Direct, a UK-based business selling wedding gifts and stationery, asked us if we could help improve the search features on their website and in particular the auto-suggest – and they asked us to take a look at the website of US mega-retailer Sears.com for inspiration. They particularly liked the way that while you type, Sears’ website doesn’t just show you suggested words but also clickable picture previews of products you might be looking for.

Using Apache Solr and in under two days we built them a similar feature for their website: since we didn’t have direct access to their development servers we provided both Solr configuration files and a simple JQuery/Javascript demo of the features they needed (it’s about 170 lines of code). Their own developers then integrated these changes based on our notes. I think it’s safe to say that Bride and Groom Direct are a rather smaller business than Sears, but with open source they can have access to equally good search facilities. They’ve been kind enough to let us feature them on our Clients page and as you can see, they’re happy with the results.

The most well known open source search engine, Apache Lucene/Solr, has a rival in Elasticsearch, also based on Apache Lucene. Or maybe it doesn’t. I’m not convinced that there’s an actual battle going on here, above and beyond the fact that the commercial companies formed to support each technology (Lucidworks and Elasticsearch [the company]) are obviously competitors. Let’s look at the evidence:

Elasticsearch contains (by some measures) 64 years of effort, Solr only 55 years….a point to Elasticsearch!

Elasticsearch commits are 31% down on last year, Solr commits are 85% up…a point to Solr!

Elasticsearch, sorry elasticsearch, has a cool lower case logo and fancy website…a point to Elasticsearch!

This is of course before we get to any actual technical differences in terms of performance, scalability, ease-of-use etc. which are probably a lot more important than the list above. There are vocal critics and supporters of each project on Twitter and other media, but the great thing in our view is that there is a choice of two such excellent search technologies, both open source, so for real world applications one can try both at little cost and choose whichever is most appropriate (there are even proven migration routes between the two – we’ve helped one client with this process).

2012 has been a fascinating and stormy year for those of us in the search business. We’ve seen a raft of further acquisitions of commercial closed source search companies by bigger players, some convinced that what used to be called Enterprise Search is now a solution to Big Data (like Stephen Arnold we wonder what will succeed Big Data as the next marketing term – I love his phrase “In a quest for revenue, the vendors will wrap basic ideas in a cloud of unknowing”). One acquisition hasn’t gone so smoothly: Autonomy, bought by HP for a price that no-one in the search business thought was remotely sensible, has been accused of being oversold vapourware: this is a story that will continue to develop in 2013. If you want a great overview of the current market read Martin White’s latest research note.

Here in the slightly calmer waters of open source search, we’ve seen a huge rise in enquiries from often blue-chip companies, no longer needing persuasion that open source is a serious contender for even the largest search and content projects. Often these companies have considered large commercial solutions but are put off by both the price and high-pressure marketing tactics – in a world of reduced budgets you simply can’t sell magic beans for a pile of gold. We’ve also seen increased interest in related technologies such as machine learning and automatic categorisation – search really isn’t just about search any more.

At Flax we’re busier than we have ever been and we’re expected the trend to continue. We’re looking forward to running more Cambridge Search Meetups, visiting and helping organise conferences such as Enterprise Search Europe and Lucene Revolution, building our network of carefully chosen partners and of course working on exciting and cutting-edge development projects.

As the storms in our sector continue to rage overhead we’ll simply be getting on with what we do best, building effective search.

I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).

Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking; although most other search technologies use the vector space model there are a few other examples of this approach: Muscat, a company founded a few years before and literally across the hall from Autonomy in a Cambridge incubator, grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly Muscat was a casualty of the dot-com years but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.

Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.

Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking and we’ve used this successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 has introduced the idea of ‘pluggable’ ranking models, with one option being the Bayesian BM25. It’s important to remember though that Bayesian ranking is only one way to approach a search problem and in many cases, simply unnecessary.

It’s now eleven years since we started Flax (initially as Lemur Consulting Ltd) in late July 2001, deciding to specialise in search application development with a focus on open source software. At the time the fallout from the dotcom crash was still evident and like today the economic picture was far from rosy. Since few people even knew what a search engine was (Google was relatively new and had only started selling advertising a year before) it wasn’t always easy for us to find a market for our services.

When we visited clients they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There’s also a rise in those clients considering applications and techniques outside the traditional site search or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.

So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.

We’ve just published a case study on our work for C Spencer Ltd., a UK-based civil engineering company who take a pro-active approach to document management – instead of taking the default Sharepoint route or buying another product off the shelf, they decided to create their own in-house system based on open source components, hosted on the Amazon AWS Cloud. We’ve helped them integrate Apache Solr to provide full text search across the millions of items held in the document management system, with a sub-second response. Their staff can now find letters, contracts, emails and designs quickly via a web interface.

C Spencer are known for their innovative and modern approach – they’re even building their own green power station on a brownfield site in Hull. It’s thus not surprising that they chose cutting-edge open source technology for search: tracking and managing documents correctly is extremely important to their business.