Search Solutions 2017 review

Search Solutions is one of my favourite search events of the year – small, focused and varied, with presentations from both the largest and smallest players in the world of search, drawn from both industry and academia.

This year’s event started with Edgar Meij of Bloomberg, who Flax have helped in the past with their large-scale search and alerting systems. I’d seen most of the details in this talk before so I won’t dwell on them, but I will thank Bloomberg again for their commitment and contributions to the open source community, particularly to Solr and our Luwak stored search library. Mark Fea of LexisNexis was up next with a talk about taxonomies and how they have built a semi-automated classification system combining supervised machine learning with Boolean rules-based systems. This is a pragmatic way to get the strengths of both, as machine learning isn’t always as clever as one might want, and Boolean rules can be hard to build and maintain. Like Bloomberg they are working at large scale: Mark mentioned taxonomies of 21,000 terms and 9 levels, applied to over 1 billion documents.
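To make the hybrid idea concrete, here is a minimal sketch of combining Boolean rules with a scoring model. It is purely illustrative and not LexisNexis's system: the taxonomy terms, the rules, the `model_scores` stand-in (a keyword-overlap score in place of a trained classifier) and the threshold are all assumptions.

```python
from typing import Callable, Dict, List, Set

# Hypothetical Boolean rules: each taxonomy term maps to a predicate
# over the set of tokens in a document.
Rule = Callable[[Set[str]], bool]

RULES: Dict[str, Rule] = {
    "mergers": lambda t: "merger" in t or ("acquisition" in t and "company" in t),
    "litigation": lambda t: "lawsuit" in t and "sports" not in t,
}

# Hypothetical cue words used by the stand-in model below.
CUES: Dict[str, Set[str]] = {
    "mergers": {"merger", "takeover", "bid"},
    "litigation": {"court", "lawsuit", "judge"},
}


def model_scores(tokens: Set[str]) -> Dict[str, float]:
    """Stand-in for a trained supervised classifier (an assumption):
    the score is the fraction of a term's cue words found in the document."""
    return {term: len(tokens & words) / len(words) for term, words in CUES.items()}


def classify(text: str, threshold: float = 0.5) -> List[str]:
    """Assign a taxonomy term if a Boolean rule fires OR the model is confident."""
    tokens = set(text.lower().split())
    labels = {term for term, rule in RULES.items() if rule(tokens)}
    labels |= {term for term, score in model_scores(tokens).items() if score >= threshold}
    return sorted(labels)


print(classify("The judge ruled on the lawsuit in court"))
print(classify("A takeover bid was announced"))
```

Here a rule catches the first document, while the second is only picked up by the model score, which is the kind of complementary coverage the talk described.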

Mark Harwood of Elastic was up next with one of his always fascinating talks on discovering unknown patterns in data with Elasticsearch. He showed how he had explored ‘toxic’ content (far-right music and those who like it) and fake reviews on Amazon with some great visual demonstrations. An interesting conclusion was how ‘bad actors’ make strange, recognisable shapes in visualised data. [Mark later won the Best Presentation award, richly deserved!]. Anna Kolliakou of King’s College London spoke next on ‘veracity intelligence’ tools to help monitor terms connected to mental health across news media and social networks: an interesting example was ‘mephedrone’ around the time of reclassification of this particular recreational drug. Next up was independent consultant Phil Bradley with a detailed, well-researched and passionate talk on fake news and how one cannot trust any web search engine to present the full picture. Phil is obviously extremely concerned about this issue and his talk spurred discussion amongst the audience about how user education is essential to counter the usual viewpoint of ‘it’s on Google, it must be true’.

Coincidentally, Filip Radlinski of Google started the next session, describing a model for conversational information retrieval. He spoke about how the user and IR system reveal information about themselves as the conversation progresses, how the system may need a memory of past interactions and how it may present a set of potential answers. This is a useful model for the future, although most current ‘conversational’ systems are simplistic. Fabrizio Silvestri then spoke on the various types of search Facebook provides, mostly related to finding people but also images, video and news. He explained how every search operation needs to consider privacy and how Facebook use query rewriting to expand and enhance the terms provided by the user. Nicola Cancedda of Microsoft was next with a talk on automated query extraction from emails, to help the user find and attach relevant documents in response (for example, after a colleague asks ‘can you send me the cost projections for 2017’). The approach involves training machine learning models after extracting candidate terms with high TF-IDF values from the email. [Interestingly this reminded me of work I carried out nearly 20 years ago on an email signature that when clicked would search for content relevant to the email – although this relied on JavaScript working in an email client which is rather a security problem!].
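The candidate extraction step in the Microsoft talk can be sketched roughly as follows. This is a minimal illustration using a textbook TF-IDF weighting, not the method presented: the function names, the tiny background corpus and the choice of `k` are assumptions, and the machine learning stage that would then rank the candidates is left out.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_terms(email: str, background: List[str], k: int = 3) -> List[str]:
    """Return the k terms of `email` with the highest TF-IDF score.

    Document frequency is counted over the background emails plus the
    email itself, so every term has a document frequency of at least 1.
    """
    docs = [set(tokenize(d)) for d in background]
    n = len(docs) + 1
    tf = Counter(tokenize(email))
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(term in d for d in docs)
        scores[term] = freq * math.log(n / df)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical background corpus of earlier emails.
background = [
    "can you send me the agenda",
    "please send me the report for the meeting",
    "can you review the report",
]
email = "can you send me the cost projections for 2017"
print(candidate_terms(email, background))
```

Common words such as 'the' score zero because they appear in every email, while the terms specific to this request rise to the top and become candidates for the search query.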

Last of our scheduled talks was from Mark Stanger of Search Technologies (recently acquired by Accenture) about their work on Elsevier’s DataSearch platform. He described how they developed a Phrase Service that identifies phrases in the user’s query using various methods including acronym detection, dictionary lookup and natural language processing, then expands these phrases as necessary to provide enhanced search. Once identified, these key terms can be boosted appropriately for search (DataSearch itself is based on Solr).
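As a rough sketch of how phrase identification and boosting might fit together, the example below does greedy dictionary lookup and acronym expansion, then emits a query string in Solr's standard query syntax with `^` boosts. It is not the DataSearch Phrase Service: the dictionaries, function names and boost value are hypothetical, the natural language processing step is omitted, and no escaping of special query characters is attempted.

```python
from typing import List

# Hypothetical phrase dictionary and acronym table.
PHRASES = {"climate change", "sea level"}
ACRONYMS = {"mri": "magnetic resonance imaging"}


def identify_units(query: str, max_len: int = 3) -> List[str]:
    """Split a query into units, preferring the longest dictionary phrase."""
    tokens = query.lower().split()
    units = []
    i = 0
    while i < len(tokens):
        match = None
        # Try multi-word phrases from longest to shortest.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in PHRASES:
                match = candidate
                i += n
                break
        if match is None:  # no phrase found: fall back to a single token
            match = tokens[i]
            i += 1
        units.append(match)
    return units


def to_solr_query(query: str, phrase_boost: float = 2.0) -> str:
    """Build a query string with quoted, boosted phrases and expanded acronyms."""
    parts = []
    for unit in identify_units(query):
        expansion = ACRONYMS.get(unit)
        if expansion:
            parts.append(f'({unit} OR "{expansion}"^{phrase_boost})')
        elif " " in unit:
            parts.append(f'"{unit}"^{phrase_boost}')
        else:
            parts.append(unit)
    return " ".join(parts)


print(to_solr_query("MRI climate change data"))
```

The resulting string could be passed as the `q` parameter of a Solr request, so that documents matching the whole phrase or the expanded acronym rank above those matching only individual words.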

The DataSearch project is impressive, and later on it won the Best Search Project award (I am proud to say I served as part of the judging panel for these awards this year). The other winner, of the Most Promising Search Startup award, was Search|hub by CXP Commerce Experts GmbH.

We finished with some lightning talks and a brief Fishbowl session, dominated this time by discussions on Fake News and how it affects the world of search technology. Thanks to the BCS IRSG again for a fascinating and enlightening day.
