The Beginner’s Guide to Search Relevance with Elasticsearch
A rundown of analytical concepts to get started with optimizing search functionality using this mighty engine.
Elasticsearch is a distributed, scalable search and analytics engine that supports complex aggregations of unstructured data. It’s free and open-source, built on Apache Lucene, a Java search engine library.
The robust, encapsulated architecture makes it easy to horizontally scale and manage different parts of the system. As in NoSQL datastores, Elasticsearch’s primary data format is JSON (JavaScript Object Notation), which allows for flexible data storage. In this article, I want to discuss how Elasticsearch improves search functionality.
In practice, we often opt for a naive implementation of search, where we iterate over a dataset looking for records that match the user’s query. However, that iterative solution takes O(n) time on every query, which is not optimal for large datasets. In the grand scheme of things, the naive solution rarely meets users’ needs or expectations. It only asks a very general, yes-or-no question: “Is this document relevant?” Unfortunately, that’s not a fulfilling question if we want the most relevant data. Furthermore, the naive solution can be costly to maintain, especially when our data is continuously growing.
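To make that concrete, here’s a minimal sketch of the naive approach in Python; the documents and the matching rule are made up for illustration:

```python
# Naive search: scan every document and keep the ones that contain
# every word in the query. O(n) over the whole dataset, re-run
# from scratch on every query.
documents = [
    "a purple fish swims by",
    "a yellow crab hides in the sand",
    "purple jellyfish drift with the current",
]

def naive_search(query: str) -> list[str]:
    terms = query.lower().split()
    # Keep a document only if every query term appears as a whole word.
    return [doc for doc in documents if all(t in doc.split() for t in terms)]

print(naive_search("purple fish"))  # ['a purple fish swims by']
```

Notice the yes-or-no check: a document is either kept or dropped, with no sense of how well it matches.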
You may be wondering, how can we improve this?
The answer is METRICS.
Search Relevance
Search relevance is a measure of accuracy between a search query and search results. Instead of asking, “Is this document relevant?” we can ask, “How relevant is this document?”
It depends on two fundamental metrics: Precision and Recall.
Recall is like a measure of quantity. We want to ensure all relevant documents of a dataset are included in the search results.
Precision, on the other hand, is like a measure of quality. We want to ensure all data in the search results are relevant.
These might sound similar, so let’s go through an example.
Imagine a sea of colorful aquatic creatures: fish, crabs, jellyfish, sharks, dolphins, and maybe even SpongeBob. And within that sea, you’re looking for purple fish.
In a high precision situation, we’d only fetch fish of a solid purple color. That may seem desirable, but what if there were purple polka-dotted fish, or fish that were half purple and half yellow? Those are technically still relevant fish to our search.
Conversely, in a high recall situation, we’d fetch anything purple or anything that’s a fish. So that would include purple crabs, sharks, dolphins, and jellyfish. Although we specified “fish,” the other creatures are relevant because they’re purple. But, it would also include regular fish of different colors simply because they’re fish. That’s not an ideal result.
In both cases, we fetched relevant results, but we either over-captured or under-captured.
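Sticking with our sea of creatures, here’s how the two metrics are computed; the document IDs are made up for illustration:

```python
# The "high precision" scenario from above: we only fetched solid purple fish.
relevant = {"solid_purple", "polka_dot_purple", "half_purple_half_yellow"}  # every purple fish in the sea
retrieved = {"solid_purple"}  # what our search actually returned

true_positives = relevant & retrieved

precision = len(true_positives) / len(retrieved)  # everything we fetched is relevant...
recall = len(true_positives) / len(relevant)      # ...but we missed most relevant fish

print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.33
```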
Make sense?
Precision and recall are often at odds with each other because improving one can impair the other.
However, the balance of the two can help us generate a score that we can use to fetch relevant results.
You may be wondering, how do we calculate that score? Good question.
Elasticsearch uses search relevance to score the documents in a dataset. It returns an ordered list of data sorted by a relevance score. We can customize the score by adding and modifying variables that shift the scale between precision and recall.
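As a sketch of what that looks like in practice, here’s a simple query using the official Python client (the local cluster URL, the fish index, and its description field are assumptions for illustration); each hit carries a _score, and the hits arrive already sorted by it:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Ask "how relevant is this document?" -- Elasticsearch answers
# with _score and sorts the hits by it, highest first.
response = es.search(
    index="fish",
    query={"match": {"description": "purple fish"}},
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["description"])
```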
How does it work?
Elasticsearch uses text analyzers to convert bodies of text into optimized, searchable data. They’re used in two instances: when querying for data and when inserting data.
There are three parts to Elasticsearch’s text analyzer: the character filter, the tokenizer, and the token filter. The text analyzer is highly customizable, meaning each part of its anatomy can be custom-made or modified to suit a particular use-case.

The character filter is responsible for adding, removing, and transforming elements of the text. For instance, it can remove HTML characters and replace occurrences of certain strings.
The tokenizer breaks up text into tokens, also known as terms. Elasticsearch’s default is the standard tokenizer, which splits text on word boundaries. Other tokenizers behave differently: the whitespace tokenizer breaks up text whenever it encounters whitespace, while the letter tokenizer divides text whenever it encounters a non-letter character.
Lastly, the token filter is similar to the character filter: it can add, remove, and transform tokens. There are cool filters, such as the synonym token filter, that can add synonymous words.
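Putting the three parts together, here’s a sketch of a custom analyzer defined through the official Python client; the index name, analyzer name, and synonym list are placeholders I made up:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A custom analyzer wiring up all three parts:
# character filter -> tokenizer -> token filters.
es.indices.create(
    index="fish",
    settings={
        "analysis": {
            "analyzer": {
                "sea_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],  # strip HTML tags
                    "tokenizer": "standard",        # split on word boundaries
                    "filter": ["lowercase", "sea_synonyms"],
                }
            },
            "filter": {
                "sea_synonyms": {  # the synonym token filter mentioned above
                    "type": "synonym",
                    "synonyms": ["purple, violet"],
                }
            },
        }
    },
)

# Inspect what the analyzer produces for a piece of text.
tokens = es.indices.analyze(
    index="fish", analyzer="sea_analyzer", text="<b>Purple</b> fish!"
)
print([t["token"] for t in tokens["tokens"]])  # e.g. ['purple', 'violet', 'fish']
```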
Inverted Index
The resulting tokens are added to an inverted index, which maps tokens to the documents that contain them. This architectural design allows us to map a term to multiple documents and makes searching for documents by term much more efficient than the iterative solution.
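Here’s a toy inverted index in plain Python. (A dict stands in for the real structure purely for illustration; as we’ll see in a moment, Lucene’s actual structure is more specialized.)

```python
from collections import defaultdict

documents = {
    1: "purple fish swim fast",
    2: "purple crabs walk sideways",
    3: "yellow fish swim slow",
}

# Build the inverted index: term -> set of IDs of documents containing it.
inverted_index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Lookup is now a couple of dictionary reads instead of a full scan.
print(inverted_index["purple"] & inverted_index["fish"])  # {1}
```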

However, don’t mistake this data structure for a hash table (that was my first assumption). Under the hood, Apache Lucene uses a specialized data structure called the BlockTree Term Dictionary.
A BlockTree Term Dictionary helps us find terms by their prefixes using a prefix tree.
This kind of data structure helps with use-cases like matching substrings or handling languages with compound words, such as German or Norwegian. It can also power helpful suggestions like “Did you mean <Word>?” or correct a user’s spelling mistakes.
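Lucene’s BlockTree Term Dictionary is far more sophisticated than this, but a bare-bones prefix tree shows the core idea of finding terms by walking their prefixes (the Norwegian terms are just example data):

```python
# A bare-bones prefix tree (trie): each node maps a character to a
# child node, and a "$" key marks the end of a complete term.
def insert(trie: dict, term: str) -> None:
    node = trie
    for ch in term:
        node = node.setdefault(ch, {})
    node["$"] = True

def terms_with_prefix(trie: dict, prefix: str) -> list[str]:
    # Walk down to the prefix's node, then collect every term below it.
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results: list[str] = []

    def collect(n: dict, path: str) -> None:
        if "$" in n:
            results.append(prefix + path)
        for ch, child in n.items():
            if ch != "$":
                collect(child, path + ch)

    collect(node, "")
    return results

trie: dict = {}
for term in ["fisk", "fiskene", "fisker"]:  # Norwegian: fish, the fish, fisherman
    insert(trie, term)
print(terms_with_prefix(trie, "fisk"))  # ['fisk', 'fiskene', 'fisker']
```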
TF/IDF and Field Length Norm
The inverted index contributes to the measure of TF/IDF.
TF stands for term frequency, which measures the number of times a term appears within a particular document. Think of it like a CTRL+F search on a page: it finds every occurrence of the query and tells you how many there are.
IDF stands for inverse document frequency, which looks across all the documents within Elasticsearch; if a term appears too frequently across them, there’s a higher probability it’s not as relevant.
For instance, if we refer to the index of a book, certain terms are deliberately excluded. Terms like “about,” “there,” and “from” don’t appear in the index because they occur so many times that they lose relevance.
Finally, field-length norm is the length of a field, and by field, I mean a title field or description field. The shorter the field's length, the more relevant the terms within that field are, whereas the longer the field's length, the less relevant the terms.
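As a rough sketch, here are simplified versions of those three factors, modeled on classic Lucene’s formulas (tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), and the norm as 1 / sqrt(terms in the field)); the real scoring function has more moving parts, and the numbers below are made up:

```python
import math

# Simplified per-term factors, modeled on classic Lucene's TF/IDF similarity.
def tf(term_count_in_field: int) -> float:
    return math.sqrt(term_count_in_field)

def idf(total_docs: int, docs_containing_term: int) -> float:
    return 1.0 + math.log(total_docs / (docs_containing_term + 1))

def field_length_norm(terms_in_field: int) -> float:
    return 1.0 / math.sqrt(terms_in_field)

# "purple" appears twice in a 10-term description field, and in
# 100 of the 10,000 documents in the index.
weight = tf(2) * idf(10_000, 100) * field_length_norm(10)
print(f"{weight:.3f}")  # ~2.502
```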
Elasticsearch combines these three metrics to calculate a weight, which it stores for each term. However, since queries can contain multiple terms, such as “purple-spotted fish,” Elasticsearch uses vector space models to compare multi-term queries against documents.

A vector space model represents a multi-term query (and each document) as a vector containing the weight of each term in the query; comparing those vectors tells us how well a document matches the query as a whole.
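A common way to compare those vectors is cosine similarity: the more closely a document’s vector points in the same direction as the query’s, the better the match. A small sketch with made-up weights:

```python
import math

# Query and documents as vectors of term weights, one dimension per
# term in the query "purple spotted fish" (the weights are made up).
query = [1.2, 0.8, 1.5]  # purple, spotted, fish
doc_a = [1.1, 0.9, 1.4]  # mentions all three terms
doc_b = [0.0, 0.0, 1.6]  # only mentions "fish"

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norms

print(f"doc_a: {cosine_similarity(query, doc_a):.3f}")  # close to 1.0
print(f"doc_b: {cosine_similarity(query, doc_b):.3f}")  # noticeably lower
```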
Similarity Algorithms
Essentially, all of these components combined create a similarity algorithm that Elasticsearch calls the Lucene Practical Scoring Function. This function generates the relevance score that Elasticsearch uses to sort documents when data is requested.

Other similarity algorithms use similar metrics, like TF/IDF, but apply different ranking functions or statistical frameworks. And thankfully, Elasticsearch has support for some of those similarity algorithms. For example, Elasticsearch supports Okapi BM25, which uses a probabilistic model rather than the vector space model.
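Elasticsearch even lets us choose and tune the similarity per field. Here’s a sketch configuring a tuned BM25 similarity through the official Python client; the index name, field name, and parameter values are just illustrative (k1 and b are BM25’s two main tuning knobs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Define a tuned BM25 similarity and apply it to a single field.
es.indices.create(
    index="fish_bm25",
    settings={
        "similarity": {
            "my_bm25": {
                "type": "BM25",
                "k1": 1.2,  # how quickly term-frequency gains saturate
                "b": 0.75,  # how strongly field length penalizes the score
            }
        }
    },
    mappings={
        "properties": {
            "description": {"type": "text", "similarity": "my_bm25"}
        }
    },
)
```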
Although Elasticsearch introduces some complexity in comparison to the naive solution, the trade-off is a scalable, intelligent system that has the potential to sustain itself. Regardless of user behavior, how our data changes, or how big our data grows, we can be confident this system will deliver optimal results.
Ultimately, we’re setting ourselves up for success when we take the time to implement things in a scalable environment. And thanks to infrastructures like Elasticsearch, we’re not bogged down by the overhead of learning data analytics and complicated mathematical models. All of those things are conveniently abstracted, and it’s sufficient just to understand the theory so we can modify the exposed variables.
In the end, the business is happy because the users are happy, and we’re happy because we don’t have to continuously revisit our code every time there’s a minor change or inconvenience.
Doesn’t that sound nice?
I hope you enjoyed this article! Happy Coding!