A term is a fundamental concept in Lucene. It is like a token, but more powerful than a plain token. In most cases a term is just an English word, delimited by whitespace and punctuation. That is the common form, but a term can be anything you want it to be: a URL, a date, a number, or even a whole piece of text (using NOT_ANALYZED; this rarely makes sense, but it is perfectly valid).

Generally, term matching is case insensitive, because most analyzers lowercase terms at index time, and query text goes through the same analyzer and is lowercased at query time.

There is one exception: if a field is set as "not analyzed", the whole field value is a single term and is not processed by any analyzer. To search it, your query must match the field's text exactly.
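To make the difference concrete, here is a minimal plain-Java sketch. It does not use Lucene: the class `ToyAnalyzer` and its two methods are hypothetical names, and the "analyzed" path simply lowercases the text and splits on non-alphanumeric characters, which only approximates what a real analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ToyAnalyzer {
    // Analyzed: lowercase the text, then split on anything that is not a letter or digit.
    public static List<String> analyzed(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Not analyzed: the whole field value becomes one term, unchanged.
    public static List<String> notAnalyzed(String text) {
        return List.of(text);
    }

    public static void main(String[] args) {
        String value = "Brown Fox, Lazy Dog";
        System.out.println(analyzed(value));     // four lowercase terms
        System.out.println(notAnalyzed(value));  // one term, original case kept
    }
}
```

With the analyzed field, a query for "brown" finds the document. With the not-analyzed field, only the exact string "Brown Fox, Lazy Dog" does.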

When we talk about indexing or searching, we always mean terms. In Lucene, the only way to make something searchable is to break it into terms. Put simply, the core data structure of Lucene can be thought of as a big map: the keys are terms and the values are lists of document ids. Looking up a term is very fast, although the real term dictionary is a sorted structure rather than a hash table, so a lookup is not strictly O(1).
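The "big map" mental model can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Lucene stores its index: `ToyIndex`, `add`, and `search` are hypothetical names, and splitting on whitespace stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ToyIndex {
    // term text -> ids of the documents that contain it
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Break the text into lowercase terms and record the document id under each one.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document once per term, even if the term repeats in it.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Searching is a lookup by term; unknown terms give an empty list.
    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "brown fox lazy dog");
        index.add(2, "about lazy dog");
        System.out.println(index.search("lazy")); // documents 1 and 2
        System.out.println(index.search("fox"));  // document 1 only
    }
}
```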

How Lucene represents a term

In Lucene's implementation, a term consists of two parts: a field and a term text value. This means that two terms with the same text but different fields are different terms. In Lucene's architecture, a document consists of fields, fields contain text, and the text is broken up into terms at index time. The text itself can optionally be stored, but the really useful thing is the terms. A Term object can be constructed like this:

 
Term t = new Term("doc_body", "lucene");
 

The toString method of a Term object returns a string containing the field and the term text separated by a colon:

 
"doc_body:lucene"
 

The Term class provides field and text methods to retrieve the two parts of a term object:

 
System.out.println(t.text());   // the term text: lucene
System.out.println(t.field());  // the field name: doc_body
 

Term in inverted index and searching

The term is the basic unit for both indexing (terms are stored in the inverted index) and searching (the user's query text is converted to terms).

In Lucene's inverted index format, a term points to a list of documents; this list is called a posting list. Because the index maps from terms to documents, rather than from documents to terms, it is called an inverted index.

The following diagram shows what a term looks like:

[Figure: lucene term]

Term Dictionary in Lucene

The most important data structure that keeps terms and their associated information is the Term Dictionary. It is the central data structure of the Lucene index format. In the Lucene 3.x index format, the Term Dictionary is saved in the .tis file (newer versions use a different file layout).

Suppose we have two documents, each with two fields, title and content:

 
doc 1: {
  "title" : "brown fox",
  "content" : "brown fox lazy dog"
}
 
doc 2: {
  "title" : "lazy dog",
  "content" : "about lazy dog"
}
 

How many terms do we have?

 
["content", "about"],1
["content", "brown"],1
["content", "dog"],2
["content", "fox"],1
["content", "lazy"],2
["title", "brown"],1
["title", "dog"],1
["title", "fox"],1
["title", "lazy"],1
 

The number after each term is its document frequency, that is, how many documents contain the term. Also notice that the terms are ordered first by field name, then by term text.

This is just a logical view. The actual storage does not duplicate the field name like this: field names are stored in the .fnm file and identified by a number, and other places use that number to refer to the field name string. These are storage details, though; we only need to focus on the logical view.
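The logical view of the term list can be reproduced with a small plain-Java sketch. It is not Lucene code: `ToyTermDictionary`, `addField`, and `listing` are hypothetical names, nested sorted maps stand in for the real on-disk format, and splitting on single spaces stands in for an analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyTermDictionary {
    // field name -> term text -> ids of documents containing that term in that field.
    // TreeMap keeps entries sorted: first by field name, then by term text.
    private final Map<String, Map<String, Set<Integer>>> dict = new TreeMap<>();

    public void addField(int docId, String field, String text) {
        for (String term : text.split(" ")) {
            dict.computeIfAbsent(field, f -> new TreeMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
        }
    }

    // One line per term, formatted as ["field", "text"],documentFrequency
    public List<String> listing() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Map<String, Set<Integer>>> f : dict.entrySet()) {
            for (Map.Entry<String, Set<Integer>> t : f.getValue().entrySet()) {
                lines.add("[\"" + f.getKey() + "\", \"" + t.getKey() + "\"]," + t.getValue().size());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        ToyTermDictionary d = new ToyTermDictionary();
        d.addField(1, "title", "brown fox");
        d.addField(1, "content", "brown fox lazy dog");
        d.addField(2, "title", "lazy dog");
        d.addField(2, "content", "about lazy dog");
        d.listing().forEach(System.out::println);
    }
}
```

Running `main` prints nine lines, one per distinct (field, text) pair, in the same order as the term list for the two example documents.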

Term frequency

Each term has an associated list of all the documents that contain it. This list is also called the posting list.

Every item in this list contains the document id and the frequency of the term in that document. It may also contain additional information, such as the position of every occurrence of the term. This information allows Lucene to calculate the relevance of the document, or to tell the user exactly where the term shows up in the document.

The term dictionary and its associated posting lists make up the core of the Lucene inverted index. Together they connect the field, term text, document, frequency, and position.

Term "doc_body:lucene" -> (doc1, doc2, doc3) means that documents 1, 2, and 3 all have a field called "doc_body", and that field contains the word "lucene" in each of them.
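A posting list entry with frequency and positions can be sketched as follows. Again this is plain Java under simplifying assumptions, not Lucene's actual format: `ToyPostings`, `Posting`, `addField`, and `postings` are hypothetical names, the sample texts are invented, and positions are simple token offsets starting at 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyPostings {
    // One posting: a document id plus the positions where the term occurs in it.
    public static final class Posting {
        public final int docId;
        public final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }

        // Term frequency is the number of occurrences in this document.
        public int frequency() {
            return positions.size();
        }
    }

    // "field:text" -> postings, one per document containing the term
    private final Map<String, List<Posting>> index = new HashMap<>();

    public void addField(int docId, String field, String text) {
        String[] tokens = text.split(" ");
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> list = index.computeIfAbsent(field + ":" + tokens[pos], k -> new ArrayList<>());
            // Start a new posting the first time this document is seen for the term.
            if (list.isEmpty() || list.get(list.size() - 1).docId != docId) {
                list.add(new Posting(docId));
            }
            list.get(list.size() - 1).positions.add(pos);
        }
    }

    public List<Posting> postings(String field, String text) {
        return index.getOrDefault(field + ":" + text, List.of());
    }

    public static void main(String[] args) {
        ToyPostings idx = new ToyPostings();
        idx.addField(1, "doc_body", "lucene in action lucene");
        idx.addField(2, "doc_body", "about lucene");
        for (Posting p : idx.postings("doc_body", "lucene")) {
            System.out.println("doc " + p.docId + " freq=" + p.frequency() + " positions=" + p.positions);
        }
    }
}
```

The frequency is what a relevance calculation would use, and the positions are what make it possible to report where in the document the term occurs.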

Conclusion

The term is the central component in Lucene. Terms are stored in the Term Dictionary, in the file with the .tis suffix. Several values are associated with each term: the field name, document frequency, term frequencies, and so on.