
When using nested documents and the Apache Solr Block Join functionality, it is a common requirement to query for an entity (for example, the parent entity) and then retrieve, for each search result, all (or some) of the related children.

Let’s see the most important aspects of such functionality and how to apply complex queries when retrieving children of search results.

How to Index Nested Documents

If we are providing the documents in JSON format, the syntax is quite intuitive:

The child documents are passed as an array of JSON nodes, each one with its own ID.

N.B. If you rely on Apache Solr to assign the ID for you, using the UUIDUpdateProcessorFactory, this doesn’t work with child documents yet. In such a scenario you should implement your own Update Request Processor that iterates over the children and assigns an ID to each one of them (and then contribute it to the community 🙂).
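As a minimal sketch of such a payload: the field names (`content_type`, `title`, `comments`) and the ID values below are assumptions made for illustration, while `_childDocuments_` is the key used by Solr's JSON update format for anonymous child documents. The helper also assigns an ID on the client side to every child that lacks one, which is one possible way to work around the UUID limitation, not the Update Request Processor suggested above.

```python
import json
import uuid


def assign_child_ids(doc):
    """Recursively give every child document an "id" if it lacks one."""
    for child in doc.get("_childDocuments_", []):
        child.setdefault("id", str(uuid.uuid4()))
        assign_child_ids(child)
    return doc


# Hypothetical parent document: field names and values are illustrative only.
parent = {
    "id": "post-1",
    "content_type": "parentDocument",
    "title": "Block Join in practice",
    "_childDocuments_": [
        {"id": "post-1-c1", "content_type": "childDocument", "comments": "SolrCloud works fine"},
        {"content_type": "childDocument", "comments": "No id on this one yet"},
    ],
}

# Parent and children travel together, so Solr can index them as one block.
payload = json.dumps([assign_child_ids(parent)], indent=2)
print(payload)
```

The resulting string is what you would send as the body of a JSON update request.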

If you are using SolrJ and you plan to index and retrieve child documents via code, the situation is a little bit more difficult. First of all, let’s annotate the POJO properly:

How to Query and Retrieve Nested Documents

OK, we covered the indexing side. It’s not straightforward, but at this point we should have nested documents in the index, stored in adjacent blocks with their parent to allow fast retrieval at query time. First of all, let’s see how we can query parents and children and get an appropriate response.

Query Children and Retrieve Parents

N.B. allParents is a query that matches all the parents. If you want to filter some parents later on, you can use filter queries or some additional clause, e.g. q= +title:join +{!parent which="content_type:parentDocument"}comments:SolrCloud

The child query must always return only child documents.
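A small sketch of how such a request could be assembled on the client side. The `{!parent which=...}` syntax and the field values come from the example above; the helper name and the `fl` value are assumptions made for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Must match every parent document and no child document.
ALL_PARENTS = "content_type:parentDocument"


def parent_block_join(child_query, all_parents=ALL_PARENTS):
    """Build a {!parent} clause: match children, return their parents."""
    return '{!parent which="%s"}%s' % (all_parents, child_query)


params = {
    # An extra mandatory clause on the parents plus the block join clause.
    "q": "+title:join +" + parent_block_join("comments:SolrCloud"),
    "fl": "id,title",  # hypothetical field list
}
query_string = urlencode(params)

# Decoding the query string gives back the original q parameter.
print(parse_qs(query_string)["q"][0])
```

Encoding through `urlencode` matters here because the clause contains `+`, spaces, quotes and braces, all of which are easy to mangle when concatenating URLs by hand.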

Query Parents and Retrieve Children

N.B. The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents. The parameter someParents identifies a query that will match some of the parent documents. The output is the children.
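The opposite direction can be sketched in the same way. The `{!child of=...}` syntax is Solr's child block join parser; the helper name and the field values are assumptions that reuse the earlier example.

```python
def child_block_join(some_parents, all_parents="content_type:parentDocument"):
    """Build a {!child} query: match some parents, return their children."""
    # all_parents identifies every parent; some_parents selects a subset of them.
    return '{!child of="%s"}%s' % (all_parents, some_parents)


q = child_block_join("title:join")
print(q)
```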

How to Retrieve Children Independently of the Query

If you have a query that returns parents, regardless of whether it is a Block Join Query or just a plain query, you may be interested in retrieving the child documents as well. This is possible through the Child Transformer.

[child] – ChildDocTransformerFactory

fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter]

When using this transformer, the parentFilter parameter must be specified unless the schema declares _nest_path_. It works the same as in all Block Join Queries. Additional optional parameters are:

childFilter: A query to filter which child documents should be included. This can be particularly useful when you have multiple levels of hierarchical documents. The default is all children. This query supports a special syntax to match nested doc patterns so long as _nest_path_ is defined in the schema and the query contains a / preceding the first :. Example: childFilter=/comments/content:recipe

limit: The maximum number of child documents to be returned per parent document. The default is 10.

fl: The field list which the transformer is to return. The default is the top level fl. There is a further limitation: the fields here should be a subset of those specified by the top level fl parameter.

Complex childFilter queries

Let’s focus on the childFilter query. This query must match only child documents. Beyond that, it can be as complex as you like, in order to retrieve only a specific subset of child documents. Unfortunately, passing complex queries here is less intuitive than expected, because by default spaces will work against you: the local params of the transformer are split on whitespace.

… childFilter=field:(classic OR boolean AND query)]

… childFilter=field: I am a complex query]

You can certainly try complex approaches involving text analysis and debugging the parsed query, but I recommend using local params placeholders and substitution, which will solve most of your issues:

Using placeholder substitution solves the whitespace splitting problems of local params and helps you formulate complex queries that retrieve only a subset of the child documents from the parent results.
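A minimal sketch of the placeholder approach, assuming the `doc_type:book` parent filter from the transformer example above. The parameter name `childQuery`, the main query and `limit=100` are arbitrary choices made for illustration; `$name` is Solr's parameter dereferencing syntax for local params.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "q": "doc_type:book",  # hypothetical main query returning parents
    # The transformer only sees the placeholder, so there is nothing to split.
    "fl": "id,[child parentFilter=doc_type:book childFilter=$childQuery limit=100]",
    # The complex query, spaces included, lives in its own request parameter.
    "childQuery": "field:(classic OR boolean AND query)",
}
query_string = urlencode(params)

decoded = parse_qs(query_string)
print(decoded["fl"][0])
print(decoded["childQuery"][0])
```

Because the complex query is a normal request parameter, it can also be built programmatically without worrying about how the transformer tokenises its arguments.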

Retrieve Child Documents in SolrJ

Once you have a query that is returning child documents (and potentially also parents) let’s see how you can use it in SolrJ to get back the Java objects.

In this way you’ll obtain the Parent objects that satisfy your query including all the requested fields and the nested children.

Conclusion

Working with Nested Documents is great fun and can solve a lot of problems and tricky user requirements, but they are also not easy to master, so I hope this blog post helps you navigate the rough sea of Block Join and Nested Documents in Apache Solr!

After the very warm reception of the first edition, the second London Information Retrieval Meetup is approaching (25/06/2019) and we are excited to add more details about our speakers and talks! The event is free and you are invited to register:

René has been working as a freelance search consultant for clients in Germany and abroad for more than ten years. Although he is interested in all aspects of search and NLP, key areas include search relevance consulting and e-commerce search. His technological focus is on Solr/Lucene. René co-organises MICES (Mix-Camp E-Commerce Search, Berlin, 19 June). He maintains the Querqy open source library.

Query Relaxation – a Rewriting Technique between Search and Recommendations

In search quality optimisation, various techniques are used to improve recall, especially in order to avoid empty search result sets. In most of the solutions, such as spelling correction and query expansion, the search query is modified while the original query intent is normally preserved. In my talk, I shall describe my experiments with different approaches to query relaxation. Query relaxation is a query rewriting technique which removes one or more terms from multi-term queries that would otherwise lead to zero results. In many cases the removal of a query term entails a change of the query intent, making it difficult to judge the quality of the rewritten query and hence to decide which query term should be removed. I argue that query relaxation might be best understood if it is seen as a technique on the border between search and recommendations. My focus is on a solution in the context of e-commerce search which is based on using Word2Vec embeddings.

This blog post is a quick summary of my (subjective) experience at Haystack 2019: the Search Relevance Conference, hosted in Charlottesville (Virginia, USA) from 24/04/2019 to 25/04/2019. References to the slides will be updated as soon as they become available.

First of all, my feedback on the Haystack Conference is extremely positive. From my perspective the conference has been a success.

Charlottesville is a delightful small city in the heart of Virginia: clean, organised, spacious and definitely relaxing. It has been a pleasure to spend my time there. The venue chosen for the conference was a cinema. Initially I was surprised, but it worked really well; kudos to OpenSource Connections for the idea. The conference and talks were meticulously organised, on time and with a relaxed pace, which definitely helped both the audience and the speakers to enjoy it more: thanks to the whole organisation for such quality!

Let’s take a look at the conference itself now: it has been 2 days of very interesting talks, exploring the latest trends in the industry with regard to search relevance, with a delightful tech-agnostic approach. That’s been one of my favourite aspects of the conference: no one was trying to sell their product, it was just a genuine discussion of interesting problems and practical solutions. No comparison between Apache Solr and Elasticsearch, just pure reasoning on challenging problems. That’s brilliant!

Last but not least, the conference allowed amazing search people from all over the world and from different cultures to meet, interact and discuss search problems and technologies. It may sound obvious for a conference, but it’s a great achievement nonetheless!

Keynote: What is Search Relevance?

Max Irwin opened the conference with his keynote on the meaning of Search Relevance. The talk was a smooth and pleasant introduction to the topic, making sure everyone was on the same page and ready for the following talks. A good part of the opening was dedicated to the problem of collecting ground truth ratings (from explicit to implicit and hybrid approaches).

After the keynote it was our turn: it has been an honour to open the track sessions in theatre 5 with our talk “Rated Ranking Evaluator: An Open Source Approach to Search Quality Evaluation”. Our talk was a revised version of the introduction to RRE, with a focus on the whole picture and on how our software fits industry requirements. Building on the introduction, we explored what search quality evaluation means for a generic information retrieval system and how you can apply the fundamental concepts of the topic to the real world, with a full journey of assessing your system quality in an open source ecosystem. The last part of the session was reserved for a quick demo, showing the key components of the RRE framework. We were really happy with the reception from the audience. I take the occasion to say a big thank you to everyone present in the theatre that day: this really encourages us to continue our work and make RRE even better.

Making the Case for Human Judgement Relevance Testing

After our talk, it was the turn of LexisNexis, with an overview of judgement-based relevance testing in the talk by Tito Serra and Tara Diedrichsen, “Making the Case for Human Judgement Relevance Testing”. The talk was quite interesting and explored practical ways to set up a human relevance testing programme. When dealing with humans, reaching or estimating consensus is not trivial, and it is also quite important to detail as much as possible why a document is rated in a certain way (the reason is as important as the rating).

After the lunch break we were back to business with “Query Relaxation – a Rewriting Technique between Searching and Recommendations” by René Kriegler. This one has personally been one of my favourites: starting from a clear definition of the problem (reducing the occurrence of zero-result searches), the speaker illustrated various approaches, from naive techniques (based on random removal of terms or removal based on term frequencies) to the final word2vec + neural network system, which is able to drop words so as to maximise the probability of presenting a query reformulation that appeared in past sessions. The overview of the entire journey was detailed and direct, especially because all the iterations were described and not only the final successful steps.

Beyond the Search Engine: Improving Relevancy through Query Expansion

To conclude the first day I chose “Beyond the Search Engine: Improving Relevancy through Query Expansion”, a journey to improve relevance in an e-commerce domain, by Taylor Rose and David Mitchell from Ibotta. The focus of the talk was a successful inter-team collaboration, in which a curated knowledge base used by the Machine Learning team proved quite useful for improving the mechanics of synonym matching and product categorisation.

Lightning Talks

After the sessions, the first day ended with lightning talks. They were very quick and thought-provoking. Some of those that caught my attention:

Addressing Variance in AB Tests: Interleaved Evaluation of Rankers

The second day opened for me with “Addressing Variance in AB Tests: Interleaved Evaluation of Rankers”, where Erik Bernhardson went through the way the Wikimedia Foundation faced the necessity of speeding up their AB tests by reducing the data necessary to validate the statistical significance of such tests. The concept of interleaving results to assess rankers is well known to the academic community, but it was extremely useful to see a real-life application and a comparison of some of the available techniques. Especially useful was the description of the 2 approaches attempted:

– Balanced Interleaving
– Team Draft Interleaving

To learn more about the topic, Erik recommended this very interesting blog post by Netflix: Innovating Faster on Personalization Algorithms at Netflix Using Interleaving. In addition to that, for people curious to explore the topic further, I would recommend this GitHub project: https://github.com/mpkato/interleaving. It offers Python implementations of various interleaving algorithms and presents a valid bibliography of solid publications on the matter.
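To give an idea of how Team Draft Interleaving works, here is a minimal sketch written for this summary. It is not the code presented in the talk nor the one from the project linked above, and the document IDs are made up. The two rankers act like team captains: at each round the team with fewer picks chooses its best document not yet selected, with a coin flip on ties, and clicks are later credited to the team that contributed the clicked document.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Interleave two rankings; return the mixed list and each doc's team."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}

    def next_unused(ranking):
        return next((d for d in ranking if d not in team_of), None)

    while len(interleaved) < length:
        # The team with fewer picks goes first; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        doc, team = None, None
        for candidate in order:
            doc = next_unused(rankings[candidate])
            if doc is not None:
                team = candidate
                break
        if doc is None:
            break  # both rankings are exhausted
        interleaved.append(doc)
        team_of[doc] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(clicked_docs, team_of):
    """Count clicks per team; the ranker with more clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins


a = ["d1", "d2", "d3", "d4"]
b = ["d3", "d1", "d5", "d6"]
mixed, teams = team_draft_interleave(a, b, length=4, rng=random.Random(42))
print(mixed, credit_clicks([mixed[0]], teams))
```

Aggregating the per-impression winners over many searches is what allows two rankers to be compared with less traffic than a classic AB test split.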

Solving for Satisfaction: Introduction to Click Models

Then it was Elizabeth Haubert’s turn with “Solving for Satisfaction: Introduction to Click Models”, a very interesting talk, cursed by some technical issues that didn’t prevent Elizabeth from performing brilliantly and detailing to the audience various approaches to modelling the attractiveness and utility of search results from user interactions. If you are curious to learn more about click models, I recommend this interesting survey: Click Models for Web Search, which explores in detail some of the models introduced by Elizabeth.

Last in the morning was “Custom Solr Query Parser Design Option, and Pros & Cons” [8] from Bertrand Rigaldies: a live manual on customising the Apache Solr query parsing capabilities to your needs, including a bit of coding to show the key components involved in writing a custom query parser. The example illustrated was a slight customisation of the proximity search behaviour (parsing the user query and building Lucene Span Queries to satisfy a specific requirement on distance tolerance) and capitalisation support. The code and slides used in the presentation are available here: https://github.com/o19s/solr-query-parser-demo

After lunch, John Berryman (co-author of Relevant Search), with “Search Logs + Machine Learning = Auto-Tagging Inventory”, faced content tagging from a different perspective: can we use query and click logs to guess tags for documents? The idea makes sense: when, given a query, you interact with a document, you are effectively generating a correlation between the two entities, and this can definitely be used to help in the generation of tags! In the talk John went through a few iterative approaches (one based on just a query-clicked docs training set and one based on queries grouped by session). You can find the Jupyter notebooks here for your reference, try them out!

– First implementation
– Query collapsing
– Second implementation
– Third implementation

Learning To Rank Panel

Following the unfortunate absence of one of the speakers, a panel on Learning To Rank industry applications took place, with interesting discussions about one of the hottest technologies right now, which still presents a lot of challenges. Various people were involved in the session and it was definitely pleasant to participate in the discussion. The main takeaway from the panel has been that, even if LTR is an extremely promising technology, few adopters are really ready right now to proceed with the integration: garbage in, garbage out is still valid, and extra care is needed when starting an LTR project.

Search with Vectors

Before the conference wrap-up, the last session I attended was “Search with Vectors” from Simon Hughes, a beautiful survey of vectorised similarity calculation strategies and how to use them in search nowadays, in combination with word2vec and similar approaches. The focus of the talk was to describe how vector-based search can help with synonymy, polysemy, hypernyms/hyponyms and related concepts. The related code and slides from previous talks are available in the Dice repo: https://github.com/DiceTechJobs/VectorsInSearch

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies. With more than 15 years of experience in various software engineering areas, his adventure with the search domain began in 2010, when he met Apache Solr and later Elasticsearch… and it was love at first sight. Since then, he has been involved in many projects across different fields (bibliographic, e-government, e-commerce, geospatial).

In 2015 he wrote “Apache Solr Essentials”, a book about Solr, published by Packt Publishing. He’s an open source lover; he’s currently involved in several (too many!) projects, always thinking about a “big” idea that will change his (developer) life.

Introduction to Music Information Retrieval

Music Information Retrieval is about retrieving information from music entities. This high-level definition relates to a complex discipline with many real-world applications. Being a former bass player, Andrea will give a high-level overview of Music Information Retrieval and will analyse, from a musician’s perspective, a set of challenges that the topic offers. We will introduce the basic concepts of the music language; then, passing through different kinds of music representations, we will end up describing some useful low-level features that are used when dealing with music entities.

Elia Porciani

Elia is a Software Engineer passionate about algorithms and data structures concerning search engines and efficiency. He is currently involved in many research projects, both at CNR (National Research Council, Italy) and for personal purposes. Before joining Sease he worked at Intecs and List, where he experienced different fields and levels of computer science, from low-level programming problems such as embedded systems and networking up to high-level trading algorithms. He graduated with a dissertation about data compression and query performance in search engines. He is an active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR. His most recent publication is: “Faster BlockMax WAND with Variable-sized Blocks”, SIGIR 2017, Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.

Modern search engines have to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problems to deal with is finding the best k relevant documents for a given query. This operation has to be fast, and this is possible only by using specialised technologies.
Block-Max WAND is one of the best-known algorithms for solving this problem without any degradation in ranking effectiveness.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017) that, applied to the BlockMaxWand data, has made it possible to speed up the algorithm execution by almost 2x.
Then another optimisation of the BlockMaxWand algorithm will be presented (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019), which reduces the execution time of short queries.

Sambhav Kothari

Sambhav is a software engineer at Bloomberg, working in the News Search Experience team.

Learning To Rank: Explained for Dinosaurs

Internet search has long evolved from days when you had to string up your query in just the right way to get the results you were looking for. Search has to be smart and natural, and people expect it to “just work” and read what’s on their minds.

On the other hand, anyone who has worked behind-the-scenes with a search engine knows exactly how hard it is to get the right result to show up at the right time. Countless hours are spent tuning the boosts before your user can find his favorite two-legged tiny-armed dinosaur on the front page.

When your data is constantly evolving and updating, it’s only realistic that your search engines do too. Search teams are thus in constant pursuit of refining and improving the ranking and relevance of their search results. But working smart is not the same as working hard. There are many techniques we can employ that can help us dynamically improve and automate this process. One such technique is Learning to Rank.

Learning to Rank was initially proposed in academia around 20 years ago and almost all commercial web search-engines utilize it in some form or other. At Bloomberg, we decided that it was time for an open source search-engine to support Learning to Rank, so we spent more than a year designing and implementing it. The result of our efforts has been accepted by the Solr community and our Learning to Rank plugin is now available in Apache Solr.

This talk will serve as an introduction to the LTR (Learning-to-Rank) module in Solr. No prior knowledge about Learning to Rank is needed, but attendees will be expected to know the basics of Python, Solr, and machine learning techniques. We will be going step-by-step through the process of shipping a machine-learned ranking model in Solr, including:

how you can engineer features and build a training data-set as per your needs

how you can train ranking models using popular Python ML (machine learning) libraries like scikit-learn

In this post we’ll cover two additional synonym scenarios and we’ll try to summarise all the previous tips in a concise form. Following the approach of the previous posts [1] [2] [3], everything can be applied both to Apache Solr and Elasticsearch.

Preconditions

Synonyms and stopwords at query time: this is not just a “theoretical” constraint; imagine if you have to manage a deployment context belonging to the same customer with a lot of small / medium indexes: you cannot re-build from scratch everything each time a synonym or a stopword changes.

Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could be also valid in those contexts, the broader or narrower scope introduced with hypernyms, hyponyms or related concepts can have some weird side-effect on the scoring phase.

#1: How can I define Multi-terms Concepts?

If you want to manage a multi-term concept as a whole, regardless of whether it has synonyms or not, you can use the synonyms file. Here are a couple of examples: the first is a concept with one synonym, the second one doesn’t have any synonym:

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.
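As an illustration of the file format only: the two rules below are assumptions built from terms used elsewhere in this post, and the parser is a toy written for this example, not the Lucene one. Each line is a comma-separated equivalence rule, and a concept without synonyms is simply listed twice.

```python
SYNONYMS_TXT = """\
# concept with one synonym
out of warranty, oow
# concept without synonyms: the multi-term concept is simply repeated
transfer phone number, transfer phone number
"""


def parse_synonyms(text):
    """Parse comma-separated equivalence rules, skipping comments and blanks."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Each entry becomes a tuple of terms; lists keep repeated entries.
        rules.append([tuple(entry.split()) for entry in line.split(",")])
    return rules


rules = parse_synonyms(SYNONYMS_TXT)
multi_term_concepts = sorted({e for rule in rules for e in rule if len(e) > 1})
print(multi_term_concepts)
```

Note that the toy parser keeps each rule as a list, so the repeated entry survives; a set would collapse it.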

Solr users only: don’t forget the following things:

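the request handler should use an edismax or lucene query parser, and the SplitOnWhiteSpace flag (sow) must be set to false, so that the whole term sequence reaches the analysis chain and multi-term synonyms can be detected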

the field type which includes the synonyms graph filter must have the autoGeneratePhraseQueries set to true

You can read more here [1] about this approach.

Note: this will work as long as the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. If and when the implementation switches to a Set-like approach, there’s a high chance this trick will stop working.

#2: What if the query contains multi-terms concepts with stopwords?

Imagine a query like this

q=my car is out of warranty. What can I do?

Well, with the configuration above the stopwords removal after the synonyms detection causes a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore has a wider scope, for this specific problem we proposed a solution [2], consisting of a specialised StopFilter which is aware of synonym tokens. The result is that terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

#3: What if the document contains multi-terms concepts with “intruder” stopwords?

We have a document like this:

{
"id": 1,
"title": "how do I transfer my phone number?"
}

and the query:

q=transfer phone number procedure

At query time, the synonym is correctly detected and phrase clauses are generated, but unfortunately the query doesn’t match the document above because of the intermediate “my” stopword:

You can read here [3] the proposed solution for this scenario, which basically consists of a two-step query plan: in the first step, the detected synonyms generate phrase clauses, while in the second they are destructured into term clauses.

#4: What if the query contains multi-terms concepts with “intruder” stopwords?

And here we are in the opposite case. We have a document like this:

{
"id": 1,
"title": "transfer phone number procedure"
}

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because of the “my” stopword between the terms. While the document above could still be part of the response of the generated query, here we are focusing on the missing synonym detection.

A possible solution is to double the synonym filter before and after the stopwords filter:

In the first iteration the synonym is not detected, then the StopFilter removes the “my” stopword so in the second iteration the synonym will be correctly recognized. Note the StopFilter is still the custom class we introduced in #2 because we want to cover also that scenario.

What is the drawback of this approach? This is something which worked in my specific case, but be aware that the SynonymGraphFilter documentation states this explicit warning:

NOTE: this cannot consume an incoming graph; results will be undefined.
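The double-pass chain described above can be sketched with a deliberately simplified model, in which tokens are a flat list of strings and a detected concept is collapsed into a single token. Real Lucene filters work on a token graph, which is exactly what the warning is about, so this only illustrates the ordering idea. The function names are made up for this example.

```python
def mark_concepts(tokens, concepts):
    """Greedy left-to-right detection of multi-term concepts (tuples of terms)."""
    marked, i = [], 0
    while i < len(tokens):
        for concept in concepts:
            if tuple(tokens[i:i + len(concept)]) == concept:
                marked.append(" ".join(concept))  # collapse into one token
                i += len(concept)
                break
        else:
            marked.append(tokens[i])
            i += 1
    return marked


def remove_stopwords(tokens, stopwords):
    # Collapsed concepts are never equal to a stopword, so they survive.
    return [t for t in tokens if t not in stopwords]


CONCEPTS = [("transfer", "phone", "number")]
STOPWORDS = {"my"}
query = ["how", "do", "i", "transfer", "my", "phone", "number"]

single_pass = remove_stopwords(mark_concepts(query, CONCEPTS), STOPWORDS)
double_pass = mark_concepts(single_pass, CONCEPTS)
print(single_pass)
print(double_pass)
```

In the first pass the intruder hides the concept; once the stopword is gone, the second pass finds it.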

#5 (UNSOLVED) What if the query contains multi-terms concepts with more than one “intruder” stopword?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.

We’re still working on this case, so unfortunately there’s no proposal here. If you have any ideas or feedback, they are warmly welcome.

The Context

Brief recap of where we arrived in the preceding article: we had the following synonyms and stopwords settings:

synonyms = {“out of warranty”, “oow”}

stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue because of the removal of the “of” term in the query string (e.g. “my device ran out of warranty”). For that reason, we introduced a custom StopFilter subclass which is aware of stopwords in synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

synonyms = {test code, tdd, testing}

stopwords = {my, your, how, to, in}

Still here, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

Unfortunately there’s no match, because the document title contains an intruder: the “your” term between “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we underlined the role of the autoGeneratePhraseQueries flag. It is responsible for creating phrase clauses for all detected multi-term synonyms. If this flag is set to false (or is missing altogether), the generated query won’t have any phrase clause, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it could be a valid alternative for dealing with such mismatching: a first request would require the “synonym phrasing” behaviour, a second one wouldn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query will be sent, targeting another (similar) field whose field type has the autoGeneratePhraseQueries parameter set to false. That would generate the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!

A couple of notes:

in the second try we only require that the two terms (“test” and “code”) are both present, in whatever order and with whatever proximity, so the increased recall could produce some unexpected results. If we are using the edismax query parser, a “pf” parameter would be helpful for moving up those results which adhere better to the entered query, in terms of proximity and term order.

we could put the stop filter at index time, but that violates the precondition: we want a pure query-time management.

How can we implement such a search workflow? In Solr, we need a couple of fields: the first one is exactly the field + field type we described in the preceding article; the second is similar, the only difference being the autoGeneratePhraseQueries parameter, which is set to false:

Another option, which moves the search workflow on Solr side, is our CompositeRequestHandler [1], a Solr component which invokes in chain a set of RequestHandler instances: a first request handler, targeting the title_with_synonyms_phrases would be invoked and, in case of zero results, the same query will be sent to another request handler, which would target the title_without_synonyms_phrases.
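The two-step plan can be sketched client-side as follows. This is a toy in-memory model and not the CompositeRequestHandler: the indexed title is an assumption, the two matchers hard-code the two generated queries shown above, and every top-level clause is assumed to be mandatory, as the "no match" outcome described above implies. Only the two field names come from the text.

```python
def contains_phrase(tokens, phrase):
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def with_synonym_phrases(tokens):
    # (title:tdd title:testing "test code") title:java
    concept = "tdd" in tokens or "testing" in tokens or contains_phrase(tokens, ["test", "code"])
    return concept and "java" in tokens


def without_synonym_phrases(tokens):
    # (title:testing title:tdd (+title:test +title:code)) title:java
    concept = "tdd" in tokens or "testing" in tokens or ("test" in tokens and "code" in tokens)
    return concept and "java" in tokens


PLAN = [
    ("title_with_synonyms_phrases", with_synonym_phrases),
    ("title_without_synonyms_phrases", without_synonym_phrases),
]


def search_with_fallback(titles, plan=PLAN):
    """Run each step in order; stop at the first one that returns hits."""
    for field, matches in plan:
        hits = [t for t in titles if matches(t.split())]
        if hits:
            return field, hits
    return None, []


titles = ["how to test your code in java"]  # hypothetical indexed title
field, hits = search_with_fallback(titles)
print(field, hits)
```

The stricter step always runs first, so the relaxed one only pays its recall cost when the phrase version finds nothing.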

Note for Elasticsearch users: you will find some difference in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is an attribute which is not related with field types, it is a query attribute [2] [3] and it is called auto_generate_synonyms_phrase_query.

The Context

The scenario description is quite simple: we want to use synonyms and stopwords.

Following the path of our previous article, we will introduce an additional component in the analysis chain: a StopFilter, which, as the name suggests, removes a set of words from an incoming token stream.

We will use the following data through the examples:

synonyms = [“out of warranty”, “oow”]

stopwords = [“of”]

Token filters can be configured at index and/or query time. In this context we are focused on the query side: both synonyms and stopwords will be configured only in the query analyzer.

Working exclusively at query time has a great benefit: we can change things at runtime without any reindex need. At the same time, no stopwords filtering will be executed at index time so those terms will be uselessly part of the dictionary.

The Problem: synonyms followed by stopwords

We have the following analyzers:

index analyzer

standard-tokenizer

lowercase

query analyzer

standard-tokenizer

lowercase + synonyms + stopwords

Theoretically, in the query analyzer we would have two options: the stopwords filter could be defined before or after the synonym filter. However, the first way (before) doesn’t make much sense, because terms that are stopwords and that are, at the same time, part of a synonym will be removed before the synonym detection. As a consequence, those synonyms won’t be detected: in the example data, issuing a query like

q=out of warranty

the “of” term will be removed by the StopFilter, the subsequent filter would receive [“out”, “warranty”], which doesn’t match the configured synonym (“out of warranty”).

Elasticsearch users: Elasticsearch doesn’t allow this scenario at all; if you try to use the PUT Settings API with a chain defined as above (first stopwords then synonyms with some term intersection), it will throw an illegal argument exception saying “term: out of warranty analyzed to a token (warranty) with position increment != 1 (got: 2)” .

Apache Solr instead uses a lenient approach: no errors at index creation, but the problem remains (personally I prefer the Elasticsearch approach).

So the obvious choice is to postpone the stopwords management after the synonym filter. Unfortunately, here there’s an issue: the stopword(s) removal has some unwanted side-effect in the generated token graph and the query parser generates a wrong query because it consumes the token stream at the end of the chain.

The synonym (out of warranty -> oow) is correctly detected, but the stopwords filter removes all the “of” tokens, even though the first occurrence is part of a synonym. The generated query shows the sneaky effect: the “hole” created by removing the first “of” causes the next available token in the stream (“something”, in the example) to be included in the phrase query.

In other words, the oow synonym token is marked with positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty); later, the query parser includes the next three available terms when generating the synonym phrase query, but since the 2nd token (of) is no longer there, that count also includes “something”, which is now the 3rd available token in the stream.
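The effect can be reproduced with a deliberately simplified model in plain Java. It does not use any Lucene class: tokens are plain strings, the query text is an assumed example, and the hypothetical phraseFor helper only mimics the "take the next positionLength available terms" behaviour described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model (no Lucene classes): shows how removing a stopword that
// belongs to a multi-term synonym shifts the phrase built for that synonym.
class PositionLengthHole {

    // Drops every token contained in the stopword list, like a plain stop filter.
    static List<String> removeStopwords(List<String> tokens, List<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Takes the next `positionLength` terms still available in the stream.
    static List<String> phraseFor(List<String> available, int positionLength) {
        int end = Math.min(positionLength, available.size());
        return new ArrayList<>(available.subList(0, end));
    }

    public static void main(String[] args) {
        // Assumed example query; "oow" spans out + of + warranty.
        List<String> query = Arrays.asList("out", "of", "warranty", "something");
        int oowPositionLength = 3;

        // Without stopword removal: [out, of, warranty]
        System.out.println(phraseFor(query, oowPositionLength));

        // After "of" is removed: [out, warranty, something]
        List<String> filtered = removeStopwords(query, Arrays.asList("of"));
        System.out.println(phraseFor(filtered, oowPositionLength));
    }
}
```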

Before proceeding: this is a known problem, a long-standing issue [1] in Lucene which has a broader domain because it is related to the FilteringTokenFilter, the superclass of StopFilter.

The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?

A Solution

A note first: the token filter we are going to create is something that deals only with Lucene classes. However, when things need to be plugged in a runtime container (e.g. Apache Solr or Elasticsearch) the deployment procedure depends on the target platform: we won’t cover this part here.

The proposed solution is to create a StopFilter subclass which will be “synonym-aware”; it will check the tokenType and positionLength attributes before deciding if a token needs to be removed from the stream. The goal is to avoid removing those terms which have been defined in the stopwords list but are part of a synonym definition.

The class that we are going to extend is org.apache.lucene.analysis.core.StopFilter. This is an empty class, because all the filtering logic is in the superclasses (org.apache.lucene.analysis.StopFilter and the more generic org.apache.lucene.analysis.FilteringTokenFilter). The stopwords logic resides in the accept() method, which is very simple.

If the stopwords list contains the current term, the token is removed. So far, so good. We need to extend (actually we could also decorate) the StopFilter class so that it does something else before calling the logic above.

First we need to check the token type: if a token has been marked as a SYNONYM, our filter must not remove it. Then we need to check the positionLength attribute because, within a synonym detection context, a position length greater than 1 means we are traversing a multi-term synonym.
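The two checks can be sketched in plain Java. This is only a model of the decision, not the actual Lucene subclass: the Token class stands in for the term, type and positionLength attributes, all names are hypothetical, and it assumes the synonym token is emitted before the original tokens it spans.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of a "synonym-aware" stop filter decision (not the Lucene API).
class SynonymAwareStopSketch {

    // Minimal stand-in for the attributes the real filter would read.
    static final class Token {
        final String term;
        final String type;          // "SYNONYM" or "word"
        final int positionLength;

        Token(String term, String type, int positionLength) {
            this.term = term;
            this.type = type;
            this.positionLength = positionLength;
        }
    }

    private final Set<String> stopwords;
    private int protectedTokens = 0; // original tokens still covered by a synonym

    SynonymAwareStopSketch(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    // Plays the role of accept(): true keeps the token, false drops it.
    boolean accept(Token token) {
        if ("SYNONYM".equals(token.type)) {
            // A multi-term synonym protects the original tokens it spans.
            protectedTokens = token.positionLength > 1 ? token.positionLength : 0;
            return true;
        }
        if (protectedTokens > 0) {
            protectedTokens--;
            return true;
        }
        // Outside a synonym span, fall back to the plain stopword check.
        return !stopwords.contains(token.term);
    }

    static List<String> filter(List<Token> stream, Set<String> stopwords) {
        SynonymAwareStopSketch sketch = new SynonymAwareStopSketch(stopwords);
        List<String> kept = new ArrayList<>();
        for (Token token : stream) {
            if (sketch.accept(token)) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    // Assumed token stream for "out of warranty of something".
    static List<Token> demoStream() {
        return Arrays.asList(
                new Token("oow", "SYNONYM", 3),
                new Token("out", "word", 1),
                new Token("of", "word", 1),
                new Token("warranty", "word", 1),
                new Token("of", "word", 1),
                new Token("something", "word", 1));
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("of"));
        // Prints [oow, out, of, warranty, something]
        System.out.println(filter(demoStream(), stopwords));
    }
}
```

In this model the first “of” survives because it falls inside the span of the synonym, while the second one is removed as a regular stopword.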

Everything seems to work as expected! This is probably just one specific scenario among those addressed by LUCENE-4065; however, it helped me a lot because this is (at least in my experience) a frequent use case.

A Software Engineer is always required to give customers concrete evidence about the quality of deliverables. A Search Engineer deals with a specialisation of such generic Software Quality, which is called Search Quality.

What is Search Quality? And why is it so important in a search infrastructure? After all, “Software Quality” should be all-encompassing, it should always include everything (and actually it does), but when we are dealing with search systems, quality is a very abstract term, which is very hard to define in advance.

The functional correctness of a search infrastructure (assuming the correctness is the only factor which influences the system quality – and it isn’t) is naturally associated with human judgments, with opinions, and unfortunately we know opinions can be different among people.

The business stakeholders, who will get value from a search system, can belong to different categories, can have different expectations, and can have in mind different ideas about the expected system correctness.

In this scenario a Search Engineer faces many challenges in terms of choices and, in the end, has to provide concrete evidence about the functional coverage of those choices.

This is the context where we developed the Rated Ranking Evaluator (hereafter RRE).

What is it?

The Rated Ranking Evaluator (RRE) is a search quality evaluation tool which evaluates the quality of results coming from a search infrastructure.

It helps Search Engineers in their daily job. Are you a Search Engineer? Are you tuning/implementing/changing/configuring a search infrastructure? Do you want something that gives you evidence of the improvements between changes? RRE could give you a hand with that.

RRE formalises how well a search system satisfies the user information needs, at “technical” level, combining a rich tree-like domain model with several evaluation measures, but also at “functional” level, providing human-readable outputs that could target the business stakeholders.

It encourages an incremental/iterative/immutable approach during the development and the evolution of a search system. Assume we’re starting our system from version x.y: when it’s time to apply some relevant change to its configuration, instead of applying changes to x.y, it is better to clone it and apply those changes to the fresh new version.

In this way, RRE will execute the evaluation process on all available versions and provide the delta/trend between subsequent versions, so you can immediately get a fine-grained picture of where the system is going in terms of relevance.

This post is only a brief summary about RRE. You can find more detailed information in the project Wiki.

In a few words, what can I get from RRE?

You can configure RRE as a compounding part of your project build cycle. That means, every time a build is triggered, an evaluation process will be executed.

RRE is not tied to a given search platform: it provides a mini-framework for plugging-in different search platforms. At the moment we have two available bindings: Apache Solr and Elasticsearch (see here for supported versions).

The output evaluation data will be available:

as a JSON file: for further elaborations

as a spreadsheet: for delivering the evaluation results to someone else (e.g. a business stakeholder)

in a Web Console where metrics and their values get refreshed in real time (after each build)

How it works

RRE provides a rich, composite, tree-like, domain model, where the evaluation concept can be seen at different levels.

The Evaluation at the top level is just a container of the nested entities. Note that all entities relationships are 1 to many. In this context, a Corpus is defined as a test dataset. RRE will use it for executing the evaluation process; in a single evaluation process you can have multiple datasets.

A Topic is an information need: it defines a functional requirement from the end-user perspective. Within a topic we can have several queries, which express the same need but closer to the technical layer. RRE provides a further abstraction in the middle: query groups. A Query Group is a group of queries which are supposed to produce the same results (and therefore are associated with the same judgments set).

Queries, which are the technical leaves of the RRE domain model, are further decomposed into several perspectives, one for each available version of our system. A query itself is of course a single entity, but during an evaluation session its concrete execution happens several times, once for each available version. That is because RRE needs to measure the search results (i.e. the query executions) against all versions.

For each version we will finally have one or more metrics, depending on the configuration. Last but not least, even if metrics are computed at query/version level, RRE will aggregate those values at upper levels (see the dashed vertical lines in the diagram) so each entity/level in the domain model will offer an aggregate perspective of all available metrics (e.g. I could be interested in the NDCG for a given query, or I could just stop my analysis at topic level).

Input

In order to execute an evaluation process, RRE needs the following things:

One or more corpora / test collections: these are the representative datasets of a specific domain, which will be used for populating and querying a target search platform

One or more configuration sets: although there’s nothing against having one single configuration, a minimum of two versions is required in order to provide a comparison between evaluation measures.

One or more ratings sets: this is where judgments are defined, in terms of relevant documents for each query group.

Output

The RRE concrete output depends on the runtime container where it is running. The RRE core itself is just a library, so when used programmatically within a project, it outputs a set of objects corresponding to the domain model described above.

When it is used as a Maven plugin, it primarily outputs the same structure in JSON format. This data is then used for producing further outputs, like a spreadsheet. The same payload can be sent to another module called RRE Server, which offers an AngularJS based web console that gets automatically refreshed.

The RRE console is very useful when we are doing internal iterations / tries around some issue, which usually requires very short edit-and-immediately-check cycles. Imagine if you can have a couple of monitors on your desk: in the first there’s your favourite IDE, where you change things, run builds. In the second there’s the RRE Console (see below). After each build, just have a look on the console in order to get an immediate feedback of your changes.

Where can I start?

The project repository on Github offers all you need: detailed documentation about how it works and how to get started quickly with RRE.

If you need some help, feel free to contact us! We appreciate any feedback, suggestion and, last but not least, contribution.

Future works

As you can imagine, the topic is quite huge. We have a lot of interesting ideas about the platform evolution.

The Apache Lucene/Solr suggesters are important to Sease: we explored the topic in the past[1] and we strongly believe the autocomplete feature to be vital for a lot of search applications.
This blog post explores in detail the current status of the Lucene BlendedInfixSuggester, some bugs in the most recent version (with the solution attached) and some possible improvements.

BlendedInfixSuggester

The BlendedInfixSuggester is an extension of the AnalyzingInfixSuggester with the additional functionality to weight prefix matches of your query across the matched documents.
It scores higher if a hit is closer to the start of the suggestion.
N.B. at the current stage only the first term in your query will affect the suggestion score.

Let’s see some of the configuration parameters from the official wiki:

blenderType: used to calculate the positional weight coefficient using the position of the first matching word. Can be one of:

position_linear: weightFieldValue*(1 – 0.10*position): Matches to the start will be given a higher score (Default)

position_reciprocal: weightFieldValue/(1+position): Matches to the start will be given a score which decays faster than linear

position_exponential_reciprocal: weightFieldValue/pow(1+position,exponent): Matches to the start will be given a score which decays faster than reciprocal

exponent: an optional configuration variable for the position_exponential_reciprocal blenderType, used to control how fast the score will increase or decrease. Default 2.0.
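The three coefficients above can be written down as a minimal Java sketch. The class and method names are hypothetical, the position is assumed to be zero-based, and the weightFieldValue factor is left out so that only the positional part is shown.

```java
// Sketch of the three positional coefficients listed above.
// `position` is assumed to be the zero-based position of the first matching word.
class BlenderCoefficients {

    // position_linear: decreases by 0.10 for each position
    static double linear(int position) {
        return 1 - 0.10 * position;
    }

    // position_reciprocal: decays faster than linear
    static double reciprocal(int position) {
        return 1.0 / (1 + position);
    }

    // position_exponential_reciprocal: decay speed controlled by the exponent
    static double exponentialReciprocal(int position, double exponent) {
        return 1.0 / Math.pow(1 + position, exponent);
    }

    public static void main(String[] args) {
        double exponent = 2.0; // default value quoted above
        for (int position = 0; position <= 2; position++) {
            System.out.printf("position=%d linear=%.2f reciprocal=%.2f expReciprocal=%.2f%n",
                    position, linear(position), reciprocal(position),
                    exponentialReciprocal(position, exponent));
        }
    }
}
```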

Description

Data Structure: Auxiliary Lucene Index

Building: For each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered. Finally an auxiliary index is built with those tokens.

Lookup strategy: The query is analysed according to the suggestAnalyzerFieldType. Then a phrase search is triggered against the Auxiliary Lucene index. The suggestions are identified starting at the beginning of each token in the field content.

Suggestions returned: The entire content of the field.

This suggester is really common nowadays, as it allows suggestions to be provided from the middle of a field’s content, taking advantage of the analysis chain provided with the field.

In this way it is possible to provide suggestions that take into account synonyms, stop words, stemming and any other token filter used in the analysis, and to match the suggestion based on internal tokens.
Finally, the suggestion is scored based on the position of the match.

The simple corpus of documents for the examples will be the following:

The input query is analysed, and the tokens produced are the following: “game”.

In the Auxiliary Index, for each field content we have the EdgeNgram tokens:

“v”, “vi”, “vid”… , “g”, “ga”, “gam”, “game”.

So the match happens and the suggestions are returned.
N.B. The first two suggestions are ranked higher, as the matched term happens to be closer to the start of the suggestion.

Let’s explore the score of each Suggestion given various Blender Types :

Query: gaming

Suggestion | First Position Match | Position Linear | Position Reciprocal | Position Exponential Reciprocal
Video gaming: the history | 1 | 1-0.1*position = 0.9 | 1/(1+position) = 1/2 = 0.5 | 1/(1+position)^2 = 1/4 = 0.25
Video game: multiplayer gaming | 1 | 1-0.1*position = 0.9 | 1/(1+position) = 1/2 = 0.5 | 1/(1+position)^2 = 1/4 = 0.25
Nowadays Video games are a phenomenal economic business | 2 | 1-0.1*position = 0.8 | 1/(1+position) = 1/3 ≈ 0.33 | 1/(1+position)^2 = 1/9 ≈ 0.11

The final score of the suggestion will be :

long score = (long) (weight * coefficient)

N.B. the reason I highlighted the data type is because it’s directly affecting the first bug we discuss.

Suggestion Score Approximation

The optional weightField parameter is extremely important for the Blended Infix Suggester.
It assigns the value of the suggestion weight (extracted from the mentioned field).
e.g. The suggestion may come from the product name field, but the suggestion weight depends on how profitable the suggested product is.

Bug 1 – WeightField Not Defined -> Zero suggestion score

How To Reproduce It: Don’t define any weightField in the suggester config.
Effect: the suggestion ranking is lost, all the suggestions have a score of 0, and the position of the match no longer matters.
The weightField is not a mandatory configuration for the BlendedInfixSuggester.
Your use case may not involve any weight for your suggestions, and you may be interested only in the positional scoring (the main reason the BlendedInfixSuggester exists in the first place).
Unfortunately, this is not possible at the moment:
If the weightField is not defined, each suggestion will have a weight of 0.
This is because the weight associated with each document in the document dictionary is a long. If the field to extract the weight from is not defined (null), the weight returned will just be 0.
This makes it impossible to tell a weight of 0 (a value extracted from the field) from null (no value at all).
A solution has been proposed here[3].

Bug 2 – Bad Approximation Of Suggestion Score For Small Weights

There is a misleading data type casting in the score calculation for the suggestion :

long score = (long) (weight * coefficient)

This apparently innocent cast actually has very nasty effects if the weight associated with a suggestion is unitary or small enough.

Basically you risk losing the ranking of your suggestions, reducing the score to only a few possible values: 0 or 1 (in edge cases).

A solution has been proposed here[3].
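A small Java sketch makes the truncation visible. The score expression is the one quoted above; the class name, the weights and the coefficient values are just illustrative assumptions.

```java
// Demonstrates how the cast to long truncates small scores.
class ScoreTruncation {

    // Same expression as quoted above: the fractional part is discarded.
    static long score(long weight, double coefficient) {
        return (long) (weight * coefficient);
    }

    public static void main(String[] args) {
        double[] coefficients = {1.0, 0.9, 0.5, 0.25};
        for (double coefficient : coefficients) {
            System.out.println("weight=1    coefficient=" + coefficient
                    + " -> " + score(1, coefficient));
            System.out.println("weight=1000 coefficient=" + coefficient
                    + " -> " + score(1000, coefficient));
        }
    }
}
```

With a weight of 1, only the coefficient 1.0 survives (score 1); every other coefficient collapses to 0, so the positional ranking is lost.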

Multi Term Matches Handling

It is quite common to have multiple terms in the autocomplete query, so your suggester should be able to manage multiple matches in the suggestion accordingly.

Given a simple corpus (composed just of the following suggestions) and the query: “Mini Bar Frid”

You see these suggestions:

1000 | Mini Bar something Fridge

1000 | Mini Bar something else Fridge

1000 | Mini Bar Fridge something

1000 | Mini Bar Fridge something else

1000 | Mini something Bar Fridge

This is because at the moment, the first matching term wins all ( and the other positions are ignored).
This brings a lot of possible ties (1000), that should be broken to give the user a nice and intuitive ranking.

But intuitively I would expect in the results something like (note that allTermsRequired=true and the schema weight field always returns 1000)

Mini Bar Fridge something

Mini Bar Fridge something else

Mini Bar something Fridge

Mini Bar something else Fridge

Mini something Bar Fridge

Let’s see a proposed Solution [4] :

Positional Coefficient

Instead of taking into account just the first term position in the suggestion, it’s possible to use all the matching positions from the matched terms [“mini”,”bar”,”fridge”].
Each position match will affect the score according to:

How distant the matched term position is from the ideal position match

If we compare the suggestion score for both these queries, it would seem unfair to penalise the first one just because it matches 2 (consecutive) terms while the second query has just one match (positioned worse than the first match in query1).

Query : Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
26 |Mini Bar something Fridge
22 |Mini Bar something else Fridge
17 |Mini something Bar Fridge
8 |something Mini Bar Fridge
7 |something else Mini Bar Fridge

There is still a tie for the exact prefix matches, but let’s see if we can finalise that improvement as well .

Token Count Coefficient

Let’s focus on the first three ranking suggestions we just saw :

Query : Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a

Intuitively we want this order to break the ties.
The closer the number of matched terms is to the total number of terms in the suggestion, the better.
Ideally we want our top scoring suggestion to contain just the matched terms, if possible.
We also don’t want to introduce strong inconsistencies for the other suggestions: we should ideally only affect the ties.
This is achievable by calculating an additional coefficient, dependent on the term counts:
Token Count Coefficient = matched terms count / total terms count

Then we can scale this value accordingly :
90% of the final score will derive from the positional coefficient
10% of the final score will derive from the token count coefficient

Query : Mini Bar Fri
90 * 1.0 + 10*3/4 = 97|Mini Bar Fridge something
90 * 1.0 + 10*3/5 = 96|Mini Bar Fridge something else
90 * 1.0 + 10*3/25 = 91|Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
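The arithmetic above can be checked with a short Java sketch. The class and method names are hypothetical, and the 90/10 split is hard-coded exactly as in the example rather than taken from the code in the pull request.

```java
// Blends the positional coefficient (90%) with the token count coefficient (10%).
class TokenCountBlend {

    static long score(double positionalCoefficient, int matchedTerms, int totalTerms) {
        double positionalPart = 90 * positionalCoefficient;
        // 10 * (matched terms count / total terms count)
        double tokenCountPart = (10.0 * matchedTerms) / totalTerms;
        return (long) (positionalPart + tokenCountPart);
    }

    public static void main(String[] args) {
        // Query: Mini Bar Fri (3 matched terms, positional coefficient 1.0)
        System.out.println(score(1.0, 3, 4) + " | Mini Bar Fridge something");
        System.out.println(score(1.0, 3, 5) + " | Mini Bar Fridge something else");
        System.out.println(score(1.0, 3, 25) + " | Mini Bar Fridge followed by 22 more terms");
    }
}
```

The three calls return 97, 96 and 91: the tie between exact prefix matches is broken while the positional part still dominates the score.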

It will require some additional tuning, but the overall idea should bring a better ranking function to the BlendedInfix when multiple term matches are involved!
If you have any suggestions, feel free to leave a comment below!
Code is available in the Github Pull Request attached to the Lucene Jira issue[4].

Scenario

You’re working as a search engineer for XYZ Ltd, a company which sells electric components. XYZ provided you with the application logs of the last six months, and some business requirements.

Two kinds of customers, two kinds of requirements, two kinds of search

The log analysis shows that XYZ has mainly two kinds of customers. The first group is made up of “expert” users (e.g. electricians, resellers, shops) who query the system by product identifiers and codes (e.g. SKU, model codes, things like Y-M8GB, 140-213/A and ABD9881); it’s clear, or at least it seems so, that they already know what they want and what they are looking for. However, you noticed that a lot of such queries produce no results. After investigating, the problem seems to be that codes and identifiers are definitely hard to remember: queries use a lot of disparate forms for pointing to the same product. For example:

y-m8gb (lowercase)

YM8GB (no delimiters)

YM-8GB (delimiter in a wrong place)

Y/M8GB (wrong delimiter)

Y M8GB (whitespace instead of delimiter)

y M8/gb (a combination of cases above)

This kind of scenario, where there’s only one relevant document in the collection, is usually referred to as “Known Item Search”: our first requirement is to make sure this “product identifier intent” is satisfied.

The other group of customers are end-users, like you and me. Not being so familiar with product specs like codes or model codes, their behaviour is different: they use a plain keyword search, trying to match products by entering terms which represent names, brands, manufacturers. And here comes the second requirement, which can be summarized as follows: people must be able to find products by entering plain free-text queries.

As you can imagine, in this case search requirements are different from the other scenario: the focus here is more “term-centric”, therefore involving different considerations about the text analysis we’d need to apply.

While the expert group query is supposed to point to one and only one product (we are in a black / white scenario: match or not), the needs on the other side require the system to provide a list of “relevant” documents, according to the terms entered.

An important thing / assumption before proceeding: for illustration purposes we will consider those two queries / user groups as disjoint: that is, a given user belongs only to one of the mentioned groups, not both. Better, a given user query will contain product identifiers or terms, not both.

Schema & configuration notes

The expert group, and the “Known Item Search”

The “product identifier” intent, which is assumed to be implicit in the query behaviour of this group, can be captured, both at index and query time, by applying the following analyzer, which basically treats the incoming value as a whole, normalizes it to lower case, removes all delimiters and finally collapses everything in a single output token.

In the following table you can see the analyzer in action with some examples:

As you can see, the analyzer doesn’t declare a type attribute because it is supposed to be applied both at index and query time. However, there’s a difference in the incoming value: at index time the analyzer is dealing with a field content (i.e. the value of a field of an incoming document), while at query time the value which flows through the pipeline is composed by one or more terms entered by the user (a query, briefly).
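The normalisation performed by that analyzer (treat the value as a whole, lowercase it, remove delimiters, emit a single token) can be sketched in plain Java. This is not the Solr analyzer definition: the class name is hypothetical and the sketch assumes that every character which is not an ASCII letter or digit, whitespace included, counts as a delimiter.

```java
import java.util.Locale;

// Plain-Java model of the "product identifier" normalisation.
class ProductCodeNormalizer {

    // Lowercases the whole value and strips anything that is not a letter or digit,
    // so the result is always a single token.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        String[] variants = {"y-m8gb", "YM8GB", "YM-8GB", "Y/M8GB", "Y M8GB", "y M8/gb"};
        for (String variant : variants) {
            // Every variant collapses to the same token: ym8gb
            System.out.println(variant + " -> " + normalize(variant));
        }
    }
}
```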

While at index time everything works as expected, at query time the analyzer above requires a feature that has been introduced in Solr 6.5: the “Split On Whitespace” flag [1]. When it is set to “false” (as we need here in this context), it causes the incoming query text to be kept as a single whole unit, when sent to the analyzer.

Prior to Solr 6.5 we didn’t have such control, and the analyzers received tokens already “pre-tokenized-by-whitespaces”; in other words, the unit of work of the query-time analysis was the single term: the analyzer chain (including the tokenizer itself) was invoked for each term output by that whitespace pre-tokenization. As a consequence, our analyzer couldn’t work as expected at query time: if we take examples #5 and #6 from the table above, you can see the user entered a whitespace. With the “Split on Whitespace” flag set to true (explicitly, or when using Solr < 6.5), the pre-tokenization described above produces two tokens:

#5 = {“Y”, ”M8GB”}

#6 = {“y”, “M8/gb”}

so our analyzer would receive 2 tokens (in each case) and there wouldn’t be any match with the single term ym8gb stored in the index. So, prior to Solr 6.5 we had two ways of dealing with this requirement:

client side: wrapping the whole query with double quotes, escaping whitespaces with “\”, or replacing them with a delimiter like “-“. Easy, but it requires a control on the client code, and this is not always possible.

Solr side: applying to the incoming query the same transformations as above but this time at query parser level. Easy, if you know some Lucene / Solr internals. In addition it requires a context where you have permissions for installing custom plugins in Solr. A similar effect could be obtained also using an UpdateRequestProcessor which would create a new field with the same value of the original field but without any whitespace.
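The difference between the two values of the “Split on Whitespace” flag can be simulated in a few lines of Java. This is a model of the behaviour described above, not Solr code: the class and method names are hypothetical, and the per-token normalisation assumes that anything other than an ASCII letter or digit is a delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simulates query-time analysis with and without whitespace pre-tokenization.
class SplitOnWhitespaceSketch {

    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // splitOnWhitespace=true: the analyzer is invoked once per whitespace-separated chunk.
    // splitOnWhitespace=false: the analyzer receives the query text as a whole.
    static List<String> analyze(String query, boolean splitOnWhitespace) {
        List<String> tokens = new ArrayList<>();
        if (splitOnWhitespace) {
            for (String chunk : query.trim().split("\\s+")) {
                String token = normalize(chunk);
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        } else {
            tokens.add(normalize(query));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Y M8GB", true));   // [y, m8gb]: no match with ym8gb
        System.out.println(analyze("Y M8GB", false));  // [ym8gb]: matches the indexed term
    }
}
```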

The end-users group, and the full-text search query

In this case we are within a “plain” full-text search context, where the analysis identified a couple of target fields: product names and brands.

Differently from the previous scenario, here we don’t have a unique and deterministic way to satisfy the search requirement. It depends on a lot of factors: the catalog, the terms distribution, the implementor experience, the customer expectations in terms of user search experience. All these things can lead to different answers. Just for example, here’s a possible option:

The focus here is not on the schema design itself: the important thing to underline is that this requirement needs a completely different configuration from the “Known Item Search” previously described.

Specifically, let’s assume we ended up following a “term-centric” approach for satisfying the second requirement. The approach requires a different value for the “Split on Whitespace” parameter, which has to be set to true, in this case.

The “sow” parameter can be set at SearchHandler level, so it is applied at query time. It can be declared within the solrconfig.xml and, depending on the configuration, it can be overridden using a named (HTTP) query parameter.

A “split on whitespace” pre-tokenisation leads us to a scenario which is really different from the “Known Item Search”, where instead we “should” be in a field-centric search; “should” is in quotes because, although we are actually using a field-centric search, we are in an edge case where we’re querying one single field with one single query term (the first analyzer in this post always outputs one term).

The implementation

Where?

Although one might think the first thing to address is how to combine those two different query strategies, the question we need to answer first is: where should we implement the solution? Clearly, regardless of the approach we decide to follow, we will have to implement a (search) workflow, which can be summarised in the following diagram:

On Solr side, each “search” task needs to be executed in a different SearchHandler, so returning to our question: where do we want to implement such workflow? We have three options: outside, between or inside Solr.

#1: Client-side implementation

The first option is to implement the flow depicted above in the client application. That assumes you have the required control and programming skills on that side. If this assumption is true, then it’s relatively easy to code the workflow: you can choose one of the client API bindings available for your language and then implement the double + conditional search illustrated above.

Cons: the search workflow / logic is moved to the client side. Programming is required, so you must be in a context where this can be done and where the client application code is under your control.

#2: Man-in-the-middle

Moving things outside the client sphere, another popular option, which can be still seen as a client-side alternative (from the Solr perspective), is a proxy / adapter / facade. Whatever is the name you want to give to this stuff, this is a new module which sits between the client application and Solr; it would intercept all requests and it would implement the custom logic by orchestrating the search endpoints exposed in Solr.

Being a new module, it has several advantages:

it can be coded using your preferred language

it is completely decoupled from the client application, and from Solr as well

but for the same reason, it has also some disadvantages:

it must be created: designed, implemented, tested, installed and maintained

it is a new piece in your system, which necessarily increases the overall complexity of the architecture

Solr exposes a lot of (index & search) services. With this option, all those services should be proxied, therefore resulting in a lot of unnecessary delegations (i.e. delegate services that don’t add any value to the execution chain).

#3: In Solr

The last option moves the workflow implementation (and the search logic) in the place where, in my opinion, it should be: in Solr.

Note that this option is usually not only a “philosophical” choice: if you are a search engineer, most probably you will be hired for designing, implementing and tuning the “search-side of the cake”. That means it’s perfectly possible that, for a lot of reasons, you must think of the client application as an external (sub)system over which you don’t have any kind of control.

The main drawback of this approach is that, as you can imagine, it requires programming skills plus knowledge of the Solr internals.

In Solr, a search request is consumed by a SearchHandler, a component which is in charge of executing the logic associated with a given search endpoint. In our example, we would have the following search handlers matching the two requirements:

On top of that, we would need a third component, which would be in charge to orchestrate the two search handlers above. I’ll call this component a “Composite Request Handler”.

The composite handler would also provide the public search endpoint called by clients. Once a request is received, the composite request handler implements the search workflow: it invokes the handlers that compose its chain, in order, and it stops when one of the invocation targets produces the expected result.

On the client side, that would require only one request because the entire workflow will be implemented in Solr, by means of the composite request handler. In other words, imagining a GUI with a search bar, the client application, when the search button is pressed, would have to retrieve the term(s) entered by the user and send just one request (to the composite handler endpoint), regardless of the intent of the user (i.e. regardless of the group the user belongs to).
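The workflow can be sketched in plain Java as a chain of functions. Nothing here is Solr API: the handlers are hypothetical in-memory stand-ins for the two search handlers, the catalog is invented for illustration, and “expected result” is assumed to mean a non-empty result list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Model of a composite handler: try each handler in order, stop at the first hit.
class CompositeHandlerSketch {

    // Invented catalog, used only by the full-text stand-in below.
    static final List<String> CATALOG =
            Arrays.asList("Acme circuit breaker", "Acme cable 3m", "Volt relay");

    // Stand-in for the "known item" handler: exact match on the normalised code.
    static List<String> knownItem(String query) {
        String code = query.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        if (code.equals("ym8gb")) {
            return Collections.singletonList("Y-M8GB");
        }
        return Collections.emptyList();
    }

    // Stand-in for the full-text handler: naive substring match on product names.
    static List<String> fullText(String query) {
        List<String> hits = new ArrayList<>();
        String term = query.toLowerCase(Locale.ROOT);
        for (String name : CATALOG) {
            if (name.toLowerCase(Locale.ROOT).contains(term)) {
                hits.add(name);
            }
        }
        return hits;
    }

    static List<Function<String, List<String>>> defaultChain() {
        List<Function<String, List<String>>> chain = new ArrayList<>();
        chain.add(CompositeHandlerSketch::knownItem);
        chain.add(CompositeHandlerSketch::fullText);
        return chain;
    }

    // Returns the results of the first handler that produces a non-empty list.
    static List<String> search(String query, List<Function<String, List<String>>> chain) {
        for (Function<String, List<String>> handler : chain) {
            List<String> results = handler.apply(query);
            if (!results.isEmpty()) {
                return results;
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(search("y M8/gb", defaultChain())); // answered by knownItem
        System.out.println(search("acme", defaultChain()));    // falls through to fullText
    }
}
```

The client sends a single query; which handler answers it is decided inside the chain.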

The composite request handler introduced in this section has already been implemented: you can find it in our Github account, here[2].
