What is Trie Data Structure in Lucene numeric range query

Trie is a clever data structure. It is used in Lucene to make numeric range query more efficient.

The vocabulary, ontology, search engine and information retrieval

The Trie data structure is inherently about search, actually the name comes from the word part in retrieval.

Information retrieval is a complex topic, but a simple model of IR is split information to basic units like English words and build connections between them.

The basic units constitute vocabulary, a Trie data structure is always about an vocabulary. Based on this vocabulary we create all kinds of information. We have many other names for these basic units: word, term , jargon, terminology, concept, entity, name, noun, phrase, etc.

Let's talk about corpus, a corpus is a collection of information, for example a collection of articles, books, product descriptions, etc. And we would like to conduct searching on the corpus. I won't go into details about how the search should be implemented, but one thing for sure is we need to build an vocabulary for the corpus.

A simple method is to collect all English words from all documents in the corpus and add them to a set. Notice in a set, all items are unique, there is no duplication. All documents may contains word "the", probably occurred many times, but in the final set, there is only one item for "the".

The set is the vocabulary of the corpus. This is a very usual vocabulary. There are also unusual vocabularies. For example, think about a corpus that contains only one kind of information: days in a year. The vocabulary will be a set with 365 items from 1 to 365, by the way, the size of this vocabulary set is so called cardinality, we will talk about it later. You would think the two are not related, but they are actually using the same idea in the context of IR.

Another example: Google. The corpus is the whole internet, and Google also built the vocabulary for the corpus in the same sense of the examples above. A simple model rules all searches.

This model is fundamental in context of IR, you would get a better understanding when see concept like inverted index and truely understand the trie data structure. Many people feel hard to grasp the trie data structure, not because it's hard, it's just very different and strange, most programming tasks involves no full-text search or information retrieval, in most cases the search is a minor problem, the trie is rarely used in day to day programming.

As the big data gets more and more pervasive, this may change. Have a firm understanding of IR will definitely a big help for your career.

Range searches

Trie is used to make numeric range searches in search engines like Lucene more efficient. It did this by using prefix to represent a sub range.

How this works? Take the English dictionary as example, the prefix "th" represent all words start with "th" it will include "the", "think", "theft", etc. If you are using a dictionary software, it probably used a trie internally, whenever you type a letter it display all words with the prefix you input. If your month format is something like "201701", the prefix "2017" will represent all months of the year 2017. The sub range defined by prefix will be precaculated and stored as special term in index, they will be used when do the range query so Lucene don't have to calculate at query time, thus more efficient.

The whole range of course is the whole vocabulary of the numeric field. A sub range is the collection of terms with same prefix, it actually an internal node in the trie data structure. Thus a range query will consists of a few hands of sub ranges, instead of the whole collection of all single term between the range specified by the query. Actually the latter is the original implementation which was used for many years, the trie implementation was contributed by open source community. See The Evolution of Numeric Range Filters in Apache Lucene.

The key point here is those sub ranges can be generated by looking at the trie data structure and it's very efficient.

The sub ranges also called predefined bracket, or special term. An index without trie:

num1 = [doc1, doc2, doc3]
num2 = [doc2, doc3]
numN = [doc5, doc6]

An index with trie

num1 = [doc1, doc2, doc3]
num2 = [doc2, doc3]
num400-num499 = [doc1, doc3,...]
numN = [doc5, doc6]

This is how trie data structure works in numeric range search in Lucene, there are more details about how to construct the trie data structure from an vocabulary and how the query is parsed to the combination of a few sub ranges, this is the magic done by Lucene for you , as a user you need to understand it on a high level and what the configuration means, for example what it means to specify the precision step, it's just the minimal length of the prefix node of the trie. So you know what you are doingn when tuning a particular parameter.


This post Grokking Solr Trie Fields has a detailed illustration about how trie data structure works.

This post Lucene: The Good Parts has an image of a trie data structure.