The IsGeneric property can be used to mark an IceCream as a generic brand. The answer given to the question, which aligns with the approach the asker wanted to take, is to apply a slight negative boost to documents that have IsGeneric set to false, in order to slightly favour generic ice cream brands. This will work OK, but I think there are more, and better, signals with which to improve relevancy.

Signal modelling

There are a few different signals we may wish to incorporate into our search relevancy solution

Disabling norms

Norms, short for normalization factors, contribute to the relevancy score calculation for a field. A field's norm is the inverse square root of the number of terms in the field. Concretely, a term appearing in a shorter field applies a higher normalization factor to the relevancy score than the same term appearing in a longer field, yielding a higher relevancy for a match in the short field than in the long one. This can be useful when searching across multiple fields of disparate length, for example, the chapter title of a book and also the content of the chapter, or across a single field containing values of disparate length. Often however, particularly for queries on single fields, it is unwanted.

For this example, disabling norms on the description field is probably a good idea, since a match on the description field for the user query "Double Choc" should produce the same score for a document with the description "Cole's Double Choc Ice cream" as one with the description "Willy Wonka's Marvellous Neverending Double Choc Chip Ice Cream"; both descriptions contain the query "Double Choc" and we wouldn't want to favour the former over the latter purely because it has a shorter description.
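As a sketch of what this looks like at the index level, norms can be disabled on a text field in the mapping; the index and type names here are illustrative:

```json
PUT icecreams
{
  "mappings": {
    "icecream": {
      "properties": {
        "description": {
          "type": "text",
          "norms": false
        }
      }
    }
  }
}
```

With norms disabled, field length no longer factors into the score for matches on description, at the cost of discarding index-time information that cannot be regained without reindexing.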

Custom analyzers

My read on the question is that it seems to really be about favouring exact matches to a query over partial matches of one or more terms within the query. Often when people talk about exact matches, they don't actually mean exact matches; some leniency in matching may be desired such as

lowercasing characters to make search case-insensitive

removing accents and other diacritics by folding characters into their ASCII or Unicode character equivalent, minus the diacritics e.g. è, à, ù into e, a and u, respectively

filtering characters to expand symbols into their common word counterparts e.g. & into and

filtering characters that may often be skipped or misused e.g. apostrophes '

catering for domain-specific common misspellings e.g. "Jerry's" in the question, misspelled as "Jerries"

This is where analysis and custom analyzers come in, and it is not uncommon to wish to analyze a given field in more than one way, to satisfy different search needs.
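For instance, the diacritics case alone can be handled with the built-in asciifolding token filter; a minimal illustrative analyzer:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```

Running crème brûlée through this analyzer yields the tokens creme and brulee. The other leniencies map onto mapping character filters and pattern_replace and synonym token filters.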

Combining queries

The match query is a good default search query to start with when looking to implement full-text search with Elasticsearch. For the question here however, a single match query is going to be an insufficient tool to home in on the two, potentially three, search signals in play. The first signal is that of an exact match to the user query. The second is a match for one or more terms in the user query. And the third potential signal is whether a given document represents a generic ice cream brand; if it does, give it a slight boost over a branded ice cream.

To incorporate these different signals into a single query, we'll need to compose a composite query containing three different queries, and then control how the scores calculated for each different query contribute to the overall score for a given document. The bool query is our friend here.

There is another potential signal that we may wish to incorporate as well; when a set of matching documents have the same relevancy score, we may wish to add a small element of randomness to the order in which these documents appear within the results, to provide an element/perception of freshness to them. If someone is paying us to serve up their particular ice cream brands, they might not want to be the brand that is always listed last amongst an equally scored set of documents! The function_score query's random_score function can help us with this.
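As a sketch, a random_score function in isolation looks like this; the seed value and the use of boost_mode sum are my assumptions, chosen so that summing a random value between 0 and 1 only nudges otherwise-equal scores apart:

```json
GET icecreams/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        { "random_score": { "seed": 1337 } }
      ],
      "boost_mode": "sum"
    }
  }
}
```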

An example with NEST

Let's put the previous pieces together into an example. For this, I'm going to use the high-level .NET Elasticsearch client, NEST, to make requests to Elasticsearch, and I'm going to run them against Elasticsearch 5.6.6.

The number of shards is set to 1 and replicas to 0. Relevancy scores are calculated per shard, meaning that an index composed of several shards distributed across multiple nodes within an Elasticsearch cluster can yield different relevancy scores for documents within each individual shard, when compared to the relevancy score that may be calculated when looking at the entire corpus of documents. In practice, with an even distribution of documents amongst shards, differences in relevancy scores across shards tend to diminish. For the purposes of this example, and likely for 100,000 small documents as stated in the question, a single shard will be enough. If search throughput and redundancy are needed, they can be achieved by adding replicas.
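The settings portion of the index creation request amounts to something like the following; the index name is illustrative:

```json
PUT icecreams
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
```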

A custom analyzer, exact_icecream, is created and will be used to find exact matches. We want to perform a little normalization on the input to

replace the incorrect spelling "jerries" with "jerrys". The pattern_replace token filter uses word boundaries in the regular expression around "jerries", because the filter runs both on terms that may contain the word "jerries", as in the exact_icecream analyzer, and on terms that consist solely of the word "jerries", as in the following standard_icecream analyzer
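Reconstructing this as raw index settings, exact_icecream could look something like the following; the keyword tokenizer and ampersand character filter are implied by the analysis output discussed later, while the filter names and the exact regular expression are my own:

```json
"analysis": {
  "char_filter": {
    "ampersand": {
      "type": "mapping",
      "mappings": ["& => and"]
    },
    "strip_apostrophes": {
      "type": "mapping",
      "mappings": ["' => "]
    }
  },
  "filter": {
    "jerries_spelling": {
      "type": "pattern_replace",
      "pattern": "\\bjerries\\b",
      "replacement": "jerrys"
    }
  },
  "analyzer": {
    "exact_icecream": {
      "type": "custom",
      "char_filter": ["ampersand", "strip_apostrophes"],
      "tokenizer": "keyword",
      "filter": ["lowercase", "jerries_spelling"]
    }
  }
}
```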

A custom analyzer, standard_icecream, is created and will be used for analysis on the description field. This uses the standard tokenizer to tokenize the input into terms, then will

lowercase terms

replace incorrect spelling "jerries" with "jerrys"

add synonyms for "choc" and "chocolate"
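In raw index settings, standard_icecream could look something like this; the filter names are illustrative, and jerries_spelling stands for the pattern_replace filter described above, defined in the same analysis settings:

```json
"analysis": {
  "filter": {
    "choc_synonyms": {
      "type": "synonym",
      "synonyms": ["choc, chocolate"]
    }
  },
  "analyzer": {
    "standard_icecream": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "jerries_spelling", "choc_synonyms"]
    }
  }
}
```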

Map the IceCream type using automapping, then override the inferred mapping for description to

disable norms

use the standard_icecream analyzer on the description field

map the description field as a multi-field, with a text sub-field, exact, that also disables norms and uses the exact_icecream analyzer
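Put together, the resulting mapping for the document type would look roughly like this; the type name and the camel-cased field names follow NEST's defaults and are assumptions on my part:

```json
"mappings": {
  "icecream": {
    "properties": {
      "description": {
        "type": "text",
        "norms": false,
        "analyzer": "standard_icecream",
        "fields": {
          "exact": {
            "type": "text",
            "norms": false,
            "analyzer": "exact_icecream"
          }
        }
      },
      "isGeneric": {
        "type": "boolean"
      }
    }
  }
}
```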

As an aside, if you wish to observe the requests and responses to Elasticsearch whilst developing, a good way is to output them somewhere such as standard output or trace, using the OnRequestCompleted() method on ConnectionSettings

The standard tokenizer has tokenized the text into terms according to the rules laid out in Unicode® Standard Annex #29. Notice also that the synonym token "chocolate" has been included in the same position as "choc" in both responses. Additionally, the token "jerrys" is included where "Jerries" appeared in the input.
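You can reproduce this yourself with the _analyze API; based on the behaviour described above, an input containing "Jerries" and "Choc" should come back with jerrys in place of jerries and chocolate in the same position as choc:

```json
GET icecreams/_analyze
{
  "analyzer": "standard_icecream",
  "text": "Ben & Jerries Double Choc"
}
```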

The character filters have replaced the & with the word and, the keyword tokenizer produced only one token in each case, and the token filters have applied their logic. This is looking like a reasonable output for exact matching. We can also try a few different variant inputs for Ben & Jerry's Double Choc to check that they all analyze to the same token.
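For example, several sloppy variants of the input can be passed to the _analyze API in one request; with the analyzer behaving as described, each should reduce to the same single token, something like ben and jerrys double choc:

```json
GET icecreams/_analyze
{
  "analyzer": "exact_icecream",
  "text": [
    "Ben & Jerry's Double Choc",
    "ben and jerrys double choc",
    "Ben & Jerries Double Choc"
  ]
}
```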

The outer function_score query allows us to run a query and then run one or more functions over the resulting documents to compute a new relevancy score. The function_score query can be great for using features of the documents themselves to influence the relevancy score. In this example, only a single random_score function is applied, to compute a random number between 0 and 1 for each document. The function is seeded with a random integer, but in a real system the seed could be the ID of the logged-in user, or similar. A small amount of randomness means that same-scoring documents will not always appear in the same order in the results. In the two search responses above, the documents in second, third and fourth position appear in a different order and have slightly different scores in each response. If the random_score function were taken out, you'd see that the scores for second, third and fourth position in each response are the same.

The query part of the function_score query is a bool query composed of a single must clause, a bool query with two should clauses. The two clauses are a match query on the description.exact field with a large boost of 5, and a match query on the description field. For a document to be considered a hit for this query, it needs to satisfy at least one of the match queries, and the large boost on the match query on the description.exact field means that a match here will contribute significantly to the relevancy score computed for the document. You might be wondering why a bool query with two should clauses is nested in a bool query must clause. You'd be right to wonder, as it's not actually necessary in this example, but I'll explain shortly why it's written this way.
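In the raw query DSL, the whole request looks roughly like this; the search text, seed and boost_mode are illustrative:

```json
GET icecreams/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "match": {
                      "description.exact": {
                        "query": "Ben & Jerry's Double Choc",
                        "boost": 5
                      }
                    }
                  },
                  {
                    "match": {
                      "description": "Ben & Jerry's Double Choc"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "functions": [
        { "random_score": { "seed": 1337 } }
      ],
      "boost_mode": "sum"
    }
  }
}
```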

For both sets of search results, the document with an exact match on description for the input is the top result, followed by the three documents that partially match, in a slightly random order. You can see that the score for the top result is considerably larger than the score for documents in second position onwards, due to the boost applied to the match query on the description.exact field that's using the exact_icecream analyzer.

Referring back to the search signals, the original asker and answerer were looking at favouring generic brands by slightly boosting them over branded ice creams, or inversely, slightly negatively boosting non-generic brands. Coming back to how the search query has been constructed, we can add this signal quite easily, by adding a term query as a should clause on the outermost bool query.
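With the term query added as a should clause on the outermost bool query, the query portion becomes something like the following; the isGeneric field name is assumed from NEST's camel-casing, and the search text, seed and boost_mode are illustrative:

```json
"query": {
  "function_score": {
    "query": {
      "bool": {
        "must": [
          {
            "bool": {
              "should": [
                {
                  "match": {
                    "description.exact": {
                      "query": "Double Choc",
                      "boost": 5
                    }
                  }
                },
                {
                  "match": {
                    "description": "Double Choc"
                  }
                }
              ]
            }
          }
        ],
        "should": [
          { "term": { "isGeneric": true } }
        ]
      }
    },
    "functions": [
      { "random_score": { "seed": 1337 } }
    ],
    "boost_mode": "sum"
  }
}
```

This structure is also why the two match queries were wrapped in their own bool query inside must: it keeps the matching logic separate from the purely boosting should clause on the outer bool query.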

As expected, the exact match document appears at the top, followed by the generic brand in second position. Here, the should clause with the term query acts as a boosting signal for the relevancy score. That is, a document need only match the must clause, but if it also matches the should clause, it will receive a slightly higher relevancy score. I don't know whether this term query provides much value compared to the match queries that we have, but it's interesting to see how it affects results. If we did intend to use it, we may want to tone down the influence of the random_score function on the overall score, to allow the term query's influence to come through more prominently.

I hope this has been useful in exploring some of the features available within Elasticsearch to satisfy your search needs :) I've added all the code as a gist for you to play with.