How to use Lucene DocValues

Lucene's main data structure is the inverted index: conceptually a big map that uses the term as the key and the list of documents containing that term as the value. It is very good at searching by term.
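To make the term-to-documents mapping concrete, here is a minimal plain-Java sketch. It is not how Lucene stores its index internally; the class name `InvertedIndexSketch` and the whitespace tokenizer are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Assumes documents are added in increasing docId order.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Skip the id if this document was already recorded for the term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Searching by term is a single map lookup.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "lucene is fast");
        index.addDocument(1, "Lucene sorts fast fast");
        System.out.println(index.search("fast"));
    }
}
```

Note that the map answers "which documents contain this term?" directly, but gives no cheap way to ask "which value does document 1 have?".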

But it is not so good for other tasks, such as sorting, faceting, or highlighting. For these, Lucene 4.0 introduced a new structure, DocValues.

For example, our sorting code example uses a DocValues field to sort the results: How to sort Lucene search results

The idea is to uninvert the inverted index, which is much like what term vectors do. In Lucene, a performance problem can often be solved by building another index in the opposite direction; that is what Lucene is all about.

Searching is about finding document ids for given terms. The other tasks are the inverse: given document ids, find the values contained in those documents. By nature, the inverted index and stored fields are not good at that, because the values of a single field are scattered across the storage instead of being kept together.

When we are only interested in one column, this data format is very inefficient. The solution is a column-based data structure, which stores all the data of a field in one compact column. The data is stored contiguously, so combined with the operating system's memory-mapped files, reading all the data of a column is very efficient. In fact, even if you read only one field of one record, you have probably already pulled the same field of the neighbouring records into memory. For aggregation queries like max or avg, a column database can be very fast.
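The column layout described above can be sketched in plain Java as one contiguous array per field, indexed by document id. This is only an illustration of the idea, not Lucene's on-disk format; the class `ColumnStoreSketch` and the `price` field are hypothetical.

```java
class ColumnStoreSketch {
    // One contiguous array per field; the array index is the document id.
    private final long[] price;

    ColumnStoreSketch(long[] price) {
        this.price = price.clone();
    }

    // Reading one document's value is a plain array lookup.
    long valueOf(int docId) {
        return price[docId];
    }

    // Aggregations scan a single array and never touch the other fields.
    long max() {
        long best = Long.MIN_VALUE;
        for (long v : price) {
            best = Math.max(best, v);
        }
        return best;
    }

    double avg() {
        if (price.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (long v : price) {
            sum += v;
        }
        return (double) sum / price.length;
    }

    public static void main(String[] args) {
        ColumnStoreSketch column = new ColumnStoreSketch(new long[] {30, 10, 20});
        System.out.println(column.max() + " " + column.avg());
    }
}
```

Because the scan walks memory sequentially, it benefits from the same locality that the paragraph above attributes to memory-mapped column files.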

Column databases are usually used in data analytics, business intelligence, data mining, reporting and OLAP systems.

Take sorting, for example. Once all the matching documents are found, Lucene needs to get the value of a field for each of them, usually a single-term field that is not analyzed, and sort on those values. The document set can be very large, and retrieving the values from the index is expensive because it involves a lot of disk seeks.

The sorting itself may be very fast, but loading all this data from disk is slow.

The solution is to uninvert: build another index from document id to value. The document id becomes the term, and the single non-analyzed value, which used to be the term, now becomes the document.

In this sense it is indeed uninverted, but it does not have to use the same structure as the inverted index Lucene uses for search. Because each document id points to exactly one value, it instead uses a columnar data structure of the kind found in column-oriented databases.
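As a rough illustration of uninverting, the sketch below turns a term-to-documents map for a single-valued field into a docId-to-value array, then sorts a set of hits by looking values up in that array. It is a conceptual model in plain Java, not Lucene's implementation; `UninvertSketch`, `uninvert` and `sortHits` are hypothetical names, and it assumes every hit has a value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class UninvertSketch {
    // Build the docId -> value column from term -> docIds postings.
    // Assumes a single-valued field: each document appears under one term.
    static String[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    // Sort matching documents by their value; each lookup is an array access.
    static List<Integer> sortHits(List<Integer> hits, String[] valueByDoc) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Integer docId) -> valueByDoc[docId]));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = Map.of(
                "2020-01-03", List.of(0),
                "2020-01-01", List.of(1, 3),
                "2020-01-02", List.of(2));
        String[] column = uninvert(postings, 4);
        System.out.println(sortHits(List.of(0, 2, 3), column));
    }
}
```

The expensive part, collecting one value per document, is done once when the column is built, so sorting a large result set needs no further term lookups.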

Here is how to add a field that will be used for sorting:

 
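        // DocValues copy of the date, used for sorting.
        doc.add(new SortedDocValuesField("date", new BytesRef(date)));
        // Stored copy, so the date can be read back from a matching document.
        doc.add(new StoredField("date", date));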