How to use Lucene DocValues

Lucene's main data structure is the inverted index, which works like a big hash map that uses the term as the key and the list of documents containing that term as the value. It is very good at searching by terms.
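To make the hash map analogy concrete, here is a minimal sketch in plain Java. The class `InvertedIndexSketch` and its methods are hypothetical names made up for illustration. The sketch only mirrors the description above and is not how Lucene implements its index internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: term -> list of ids of the documents containing it.
// Illustrative only; this is not how Lucene implements its index internally.
class InvertedIndexSketch {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Split the text on whitespace and record docId under each lower-cased term.
    // Assumes documents are added in increasing docId order.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Skip the id if this document already recorded the term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Searching by term is a single map lookup.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "lucene is a search library");
        index.add(1, "doc values speed up sorting in lucene");
        System.out.println(index.search("lucene"));  // prints [0, 1]
        System.out.println(index.search("sorting")); // prints [1]
    }
}
```

Note that the map answers "which documents contain this term?" in one lookup, but it offers no cheap way to answer "what is the value of this field in document 1?".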

But it is not so good for other tasks, like sorting, faceting, or highlighting. Lucene 4.0 introduced a new structure for these tasks: DocValues.

For example, our sorting code example uses a DocValues field to sort the results: How to sort Lucene search results

The idea is to uninvert the inverted index. This is essentially what term vectors do. Any performance problem in Lucene can be solved by another inversion; that is what Lucene is all about.

Searching is about finding document ids for given terms. Other tasks are the inverse: given document ids, find the values contained in those documents. By their nature, the inverted index and stored fields are not good at that, because the values of any one field are scattered across the index. Like a relational database, Lucene's stored fields use a row-based model: all fields of a document are stored together in a contiguous disk block.

When we are only interested in a specific field, this storage model is very inefficient. The solution is column-based storage: store all the data of a field in one compact column. Data belonging to the same field is stored contiguously, and with the help of the free cache that the operating system provides for memory-mapped files, reading all the data from a column is very efficient. The memory-mapped file mechanism is very useful; MongoDB's original MMAPv1 storage engine also took advantage of it, by the way. Even if you read the field of only one document, you may already have pulled the surrounding values of the same field into memory. This generally incurs far fewer disk seeks for field-centric tasks. When you run aggregation queries like max or avg, a column database can be very fast.
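The difference between the two layouts can be sketched in a few lines of plain Java. Everything here (`ColumnStoreSketch`, `Row`, the `price` field) is a hypothetical example, and in-memory objects only approximate what happens on disk, so treat it as an illustration of the access pattern rather than a benchmark.

```java
import java.util.Arrays;
import java.util.List;

// Toy comparison of a row-based and a column-based layout (illustrative only).
class ColumnStoreSketch {
    // Row-based: every field of a document is kept together.
    static final class Row {
        final String title;
        final String body;
        final long price;

        Row(String title, String body, long price) {
            this.title = title;
            this.body = body;
            this.price = price;
        }
    }

    // Row scan: visits every whole row just to read one field of it.
    static long maxPriceFromRows(List<Row> rows) {
        long max = Long.MIN_VALUE;
        for (Row row : rows) {
            max = Math.max(max, row.price);
        }
        return max;
    }

    // Column scan: all prices sit in one contiguous array indexed by document id.
    static long maxPriceFromColumn(long[] priceColumn) {
        long max = Long.MIN_VALUE;
        for (long price : priceColumn) {
            max = Math.max(max, price);
        }
        return max;
    }

    // Average over the column; assumes the column is not empty.
    static double avgPriceFromColumn(long[] priceColumn) {
        long sum = 0;
        for (long price : priceColumn) {
            sum += price;
        }
        return (double) sum / priceColumn.length;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", "long body text ...", 30),
                new Row("b", "long body text ...", 10),
                new Row("c", "long body text ...", 20));
        long[] priceColumn = {30, 10, 20}; // the same prices, stored as a column

        System.out.println(maxPriceFromRows(rows));          // prints 30
        System.out.println(maxPriceFromColumn(priceColumn)); // prints 30
        System.out.println(avgPriceFromColumn(priceColumn)); // prints 20.0
    }
}
```

Both scans return the same answer. The point is that the column version never has to step over titles and bodies to reach the next price.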

Serious columnar databases are usually dedicated to data analytics, business intelligence, data mining, reporting, and OLAP systems. Lucene is a search engine and is generally not supposed to do serious analytic jobs, but there are cases in which we need to do some analytic tasks in Lucene. Some would argue that we should use the right tool for the right job: let a real database do the data storage, let a real columnar database do the analytics, and let Lucene focus on full-text search. This is so-called polyglot persistence. But if you think about it, that is an idealized situation. Nowadays many databases tend to be hybrid: each specializes in one area and borrows ideas from other data solutions. Strictly speaking, this is called multi-model. Lucene DocValues is one such example. Another example is MongoDB: it is not designed for heavy transactional workloads, but it has a minimal level of support for them.

Take sorting as an example. Once all the matching documents are found, Lucene needs to get the value of one field for each of them, usually a non-analyzed, single-term field such as a number or a date, and then sort by it. The naive method is to retrieve the field value for each document id in the result, but this has performance problems. The result set can be very large, and the data may be widely scattered on disk. Retrieving values from a row-based storage model is inefficient because it involves a lot of disk seeks.

Sorting itself may be very fast, but loading all this data from disk is slow.

The solution is to uninvert: build another miniature index from document id to value. The document id becomes the term, and the single non-analyzed field value, which used to be the term, now becomes the document; the values are stored contiguously on disk. Even better, the contiguous disk block can be laid out so that it is aligned with the document ids, so you can perform random access on the file. It is like an on-disk array with the document id as the index of the item. You can legitimately treat it as a hash table stored on disk that also takes advantage of the memory-mapped files provided by the operating system. That is about the best optimization you can achieve for this kind of problem, where you cannot load everything into memory at once but still want to keep disk seeks to a minimum. This is the trick used by major columnar databases.
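The "on-disk array indexed by document id" idea can be sketched with the standard Java NIO classes. This is a simplified illustration under stated assumptions, not Lucene's actual DocValues file format: the class `MappedColumnSketch` and its methods are hypothetical, every value is assumed to be a fixed-width 8-byte long, and document ids are assumed to run from 0 upward without gaps.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Toy "column file": one fixed-width value per document, in document id order.
// Illustrative only; this is not Lucene's DocValues file format.
class MappedColumnSketch {
    // Write the column to a temporary file, 8 bytes per document.
    static Path writeColumn(long[] valuesByDocId) {
        try {
            Path file = Files.createTempFile("column", ".bin");
            file.toFile().deleteOnExit();
            ByteBuffer buffer = ByteBuffer.allocate(valuesByDocId.length * Long.BYTES);
            for (long value : valuesByDocId) {
                buffer.putLong(value);
            }
            Files.write(file, buffer.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Memory-map the file; the operating system pages the data in on demand.
    static ByteBuffer openColumn(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Random access: the value of docId lives at byte offset docId * 8.
    static long valueOf(ByteBuffer column, int docId) {
        return column.getLong(docId * Long.BYTES);
    }

    // Sort the matched document ids by their column value, ascending.
    static Integer[] sortByValue(ByteBuffer column, int[] matchedDocIds) {
        Integer[] sorted = Arrays.stream(matchedDocIds).boxed().toArray(Integer[]::new);
        Arrays.sort(sorted, (a, b) -> Long.compare(valueOf(column, a), valueOf(column, b)));
        return sorted;
    }

    public static void main(String[] args) {
        Path file = writeColumn(new long[] {50L, 10L, 40L, 20L});
        ByteBuffer column = openColumn(file);
        System.out.println(valueOf(column, 2)); // prints 40
        // Documents 0, 2 and 3 matched the query; order them by value.
        System.out.println(Arrays.toString(sortByValue(column, new int[] {0, 2, 3}))); // prints [3, 2, 0]
    }
}
```

Looking up a value needs no search at all, only a multiplication to find the offset, and neighbouring document ids share the same pages of the mapped file.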

In this sense it is indeed uninverted, but it does not have to use the same structure as the inverted index Lucene uses for search. Instead, it uses the kind of columnar data structure found in column-oriented databases, in which each document id points to exactly one value.

Here is how to add a field that can be sorted on:

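        // Index the date as a DocValues field so that results can be sorted on it.
        doc.add(new SortedDocValuesField("date", new BytesRef(date)));
        // Also store the date so that it can be returned with the search results.
        doc.add(new StoredField("date", date));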