What is a term vector in Lucene

Vector is one of the most overloaded words in math, programming, and many other fields.

Etymology of vector

The root of the word vector originally means "to carry". You can see it in words with the same root, such as vehicle, wagon, and way. Vector also refers to a human or animal that carries and spreads a pathogen, for example: "Children aren't major vectors of the Covid virus."

"How does this relate to vectors in mathematics?" you may ask. A vector in math is a quantity with both size and direction. Math is all about numbers, and every number already has a size, so what really makes a vector special is its sense of direction. This is the key that connects the origin of the word to the mathematical concept. It's especially obvious with the word way. When we talk about a way, we are always talking about direction: this way, that way, way back, way forward. Norway literally means "north way". It's not hard to see the similarity between a road and an arrow.

The vector is a very useful and powerful thinking model. It's used extensively in math, but you will encounter it even more in programming and software engineering. A whole branch of artificial intelligence is based on vectors.

To understand vectors we can lean on geometry, because it's the most intuitive approach. But you should not stop there: a vector is much more than just geometry.

Consider any number. You can think of it as a one-dimensional vector. In geometry, it is a point on a one-dimensional coordinate axis. To illustrate it, draw an arrow from the origin to the point; this is a vector.

Consider two numbers. They define a point on a plane, and again there is an arrow from the origin to that point.

Then comes three-dimensional space, where a vector contains three numbers. This is heavily used in 3D graphics programming.

And so on. As a data structure, a vector is represented as an array.

Consider a map:

 
{
:name "Bob"
:age 20
:address "xx street"
:floor "10"
}
 

This is a typical domain object in software. If you treat every key as an axis and every value as a point on that axis, then the object is a vector.

Thus, data modeling is a process of vectorizing real-world objects. Most of these vectors end up in a database of some kind: relational, key-value, document, or columnar. Whenever you want to describe something in a computer, you have to vectorize it.

Programming languages like C++ use the name vector for a particular data structure, but obviously the concept is much more abstract than that. The C++ vector is basically just a dynamic array: you don't have to specify its size up front, and it grows automatically as you keep adding items. This has nothing to do with the concept of a vector.

The object can be a document. We can then manipulate documents as if they were geometric objects in a multi-dimensional space, for example by calculating the angle between two vectors. This is used extensively in text search and information retrieval.

The object can also be an image or a sound. There are AI and deep learning algorithms that use vectors to recognize objects or speech. In essence, it's a similarity calculation problem.

In fact, the vector model is a fundamental way of human thinking, sometimes applied unconsciously. When we make decisions we consider multiple aspects of an event, and many computer programs are just trying to simulate this process. In deep learning, everything is vectorized into so-called thought vectors or word vectors, and complex geometric transformations are then applied to those vectors.

Lucene's Javadoc defines a term vector this way: "A term vector is a list of the document's terms and their number of occurrences in that document." This suggests that each document has one term vector, which is a list.

What is a term?

A term is the basic searchable unit in Lucene. In the analysis and indexing phase, text is broken into a stream of tokens, and each element of the stream becomes a term. In the query phase, the query is first parsed into terms, which are then used to search the Lucene index.

A term is a pair. The first element is a string naming the field the term belongs to. The second is a string holding the literal text of the term, which can be an English word, a URL, an email address, or anything else your analyzer generates at index time. The following code creates a term:

 
Term t = new Term("field", "TermText");
 

A term is always associated with a field. Technically, when you search a Lucene index you are searching terms. This is different from a grep-like search, which searches characters: whether it is a plain text search or a regular expression search, the smallest unit of search is the character. In a typical Lucene search, the user supplies a field and keywords (terms), and Lucene returns the documents whose fields contain those keywords.

Many important concepts in Lucene relate to terms, such as term frequency, term dictionary, and term vector. So it's very important to understand terms clearly.

The confusing definition of term vector

If term vectors are enabled for a field of a document, all terms in that field are added to the document's term vector. I think the description in the Javadoc is a little confusing. Every term belongs to exactly one field (the same term text in two different fields makes two different terms), so the number of occurrences of a term in a document is always the number of occurrences of the term's text in its field. The Javadoc makes it sound as if terms belong directly to the document and the document is just a collection of terms. In fact, terms are first grouped by field, and the fields belong to the document.

Also notice that term vectors are enabled or disabled at the field level, so a single document may contain both fields with term vectors enabled and fields with term vectors disabled. Here is how you retrieve a term vector. You need a doc id and a field name:
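To make the point about fields concrete, here is a small plain-Java sketch that does not use Lucene at all. The FieldTerm class is a hypothetical stand-in for Lucene's Term, and whitespace splitting stands in for a real analyzer. It counts occurrences per (field, text) pair, so the same text in two fields is counted separately.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class FieldTermSketch {
    /** Hypothetical stand-in for a term: a (field, text) pair. */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FieldTerm)) return false;
            FieldTerm other = (FieldTerm) o;
            return field.equals(other.field) && text.equals(other.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, text);
        }
    }

    /** Adds the whitespace-separated tokens of one field to the running counts. */
    static void count(Map<FieldTerm, Integer> counts, String field, String value) {
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(new FieldTerm(field, token), 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<FieldTerm, Integer> counts = new HashMap<>();
        count(counts, "title", "quick fox brown fox");
        count(counts, "body", "quick fox run faster");
        // "fox" in title and "fox" in body are two different terms.
        System.out.println("title:fox = " + counts.get(new FieldTerm("title", "fox")));
        System.out.println("body:fox = " + counts.get(new FieldTerm("body", "fox")));
    }
}
```

Here "fox" is counted twice under title and once under body, which is why a term's count only makes sense relative to its field.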

 
Terms terms = idxReader.getTermVector(0, "title");
 

Here is another definition that I think is better. For each field in each document, the term vector (sometimes called the document vector) may be enabled or disabled. If it is enabled, all terms in that field are added to the document's term vector list. The list contains each term along with other information about it: its frequency, positions, or offsets.

Index options and term vector

In Lucene, you add documents to an index. A document consists of fields, just as a database table row consists of columns. For each field you can set various options that control how Lucene handles it when indexing the document.

There are three kinds of field options in Lucene: indexing, storing, and term vectors. Indexing and storing are easy to understand. The index option controls how the field is indexed, and thus how it can be searched. Is it broken up into tokens that are searched individually? Or is it searched as a whole value, like an ID number string? The store option decides whether the actual field data is kept in the index. The term vector, like stored data, can be kept in the index, but it holds different information, which is generated during indexing. The index is a map from terms to documents. A term vector is also a map, but from terms to the position, offset, and frequency information in the document that the term belongs to.

 
term -> (frequency, position, offset)
 

In a document's term vector, we know the following for each term: the document id, the field name, the text of the term, and its frequency, positions, and offsets. With this information computed, we can do a lot of interesting things when searching.

The indexed option means that the inverted index information is calculated and stored. But this information only tells you which documents or fields contain a particular term. In other words, it only lets you do simple matching, which can be just a map lookup where the key is a term and the value is a document id. To get more useful search results, we need the more detailed information that a term vector provides.

Sometimes we need the "uninverted" index: given a document, find all its terms and the position information for those terms. The index tells us which documents matched; the term vector tells us how and where they matched. You can think of a term vector as a miniature inverted index for just one document.
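As a rough illustration of the term -> (frequency, position, offset) mapping, the following plain-Java sketch builds that structure for a single field value. It does not use Lucene: the TermInfo class and the whitespace tokenization are assumptions made for the example, and a real analyzer may lowercase, stem, or drop tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TermVectorSketch {
    /** Per-term data: token positions and [start, end) character offsets. */
    static final class TermInfo {
        final List<Integer> positions = new ArrayList<>();
        final List<int[]> offsets = new ArrayList<>();

        int frequency() {
            return positions.size();
        }
    }

    /** Splits on whitespace and records where each occurrence of a term is. */
    static Map<String, TermInfo> build(String text) {
        Map<String, TermInfo> vector = new TreeMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                TermInfo info = vector.computeIfAbsent(text.substring(start, i), k -> new TermInfo());
                info.positions.add(position++);
                info.offsets.add(new int[] {start, i});
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, TermInfo> e : build("quick fox brown fox").entrySet()) {
            TermInfo info = e.getValue();
            StringBuilder offsets = new StringBuilder();
            for (int[] o : info.offsets) {
                offsets.append('[').append(o[0]).append(',').append(o[1]).append(')');
            }
            System.out.println(e.getKey() + " -> freq=" + info.frequency()
                    + " positions=" + info.positions + " offsets=" + offsets);
        }
    }
}
```

Positions count tokens, while offsets count characters. That is why the two are stored separately and can be enabled independently.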
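The contrast between the two directions can be sketched in plain Java without Lucene. The class and method names here are made up for the illustration, and whitespace splitting stands in for real analysis. invertedIndex answers "which documents contain this term?", while termPositions answers "which terms does this document contain, and where?".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedVsTermVector {
    /** Whitespace tokenizer standing in for a real analyzer. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Inverted index: term -> ids of the documents that contain it. */
    static Map<String, SortedSet<Integer>> invertedIndex(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : tokens(docs.get(docId))) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    /** "Uninverted" view of one document: term -> token positions in that document. */
    static Map<String, List<Integer>> termPositions(String doc) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        List<String> toks = tokens(doc);
        for (int pos = 0; pos < toks.size(); pos++) {
            vector.computeIfAbsent(toks.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("quick fox brown fox", "lucene runs like fox");
        System.out.println("docs containing fox: " + invertedIndex(docs).get("fox"));
        System.out.println("terms of doc 0: " + termPositions(docs.get(0)));
    }
}
```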

A classic example is search result highlighting. Suppose a user types a query string and Lucene finds some documents. How are we going to present the results to the user? Most search engines show the title, the URL, and a digest, with the searched keywords highlighted in them. The title and URL can be stored in the document, so they are not a problem. The digest cannot be stored, because for the same document different queries generate different digests. To generate a digest we need to know where the search terms occur in the document, so that we can select the pieces of text around each search term and optionally highlight the term. The term vector contains all the information needed to do this. For a worked example that highlights a blog post with Lucene 6.0.0 and a Gradle build, see the post "How to do Lucene search highlight example".

The term vector is truly the final inch on the user's way to the search target: the spots where the searched terms appear in the target document.

Another interesting thing we can do with term vectors is find documents similar to a particular document. An example is the "related posts" feature of a blog entry, which is a list of links to other documents similar to the current entry. With term vector information we can calculate how similar two documents are to each other with a simple formula.

Term vectors also play an important role when scoring matching documents in the vector space model.

A term vector is like a micro version of the inverted index that covers only one document. This index answers queries such as: for a given search term, how many times does it occur in this document, and where does it show up? Or simply: frequencies and positions.

The term vector is generated during analysis. When the analyzer generates tokens, it also provides position and offset information. You can specify whether to store this information in term vectors:
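To show why character offsets are enough to highlight a match, here is a minimal plain-Java sketch. It is not Lucene's highlighter. The highlight method and the hard-coded offsets are assumptions for the example; in a real application the offsets would come from a term vector stored with offsets.

```java
import java.util.Arrays;
import java.util.List;

final class HighlightSketch {
    /**
     * Wraps each [start, end) character range in <b>...</b>.
     * Assumes the ranges are sorted by start and do not overlap.
     */
    static String highlight(String text, List<int[]> offsets) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] range : offsets) {
            out.append(text, cursor, range[0]); // text before the match
            out.append("<b>").append(text, range[0], range[1]).append("</b>");
            cursor = range[1];
        }
        out.append(text, cursor, text.length()); // text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Offsets of "fox" in the title, as a term vector with offsets would record them.
        List<int[]> foxOffsets = Arrays.asList(new int[] {6, 9}, new int[] {16, 19});
        System.out.println(highlight("quick fox brown fox", foxOffsets));
    }
}
```

With offsets available, no re-tokenizing of the text is needed at display time. That is the trade-off against the extra disk space the offsets occupy.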

TermVector.YES: Store only the number of occurrences.

TermVector.WITH_POSITIONS: Store the number of occurrences and the positions of terms, but no offsets.

TermVector.WITH_OFFSETS: Store the number of occurrences and the offsets of terms, but no positions.

TermVector.WITH_POSITIONS_OFFSETS: Store the number of occurrences, the positions, and the offsets of terms.

TermVector.NO: Don't store any term vector information.

If this information is not stored, you can still compute it on the fly when searching.

Code example to create a field with term vector enabled

 
Document doc = new Document();
doc.add(new Field("title", "quick fox brown fox",
        Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
 

You can always generate term vectors dynamically

The only requirement is that the text of the field is stored. At query time, you can then generate term vectors as needed. For example, to highlight text fragments you can use getAnyTokenStream of the TokenSources class to get a token stream. It uses the term vectors if they were computed at index time and re-analyzes the stored text if not.

Term vector and similarity measurement

But why call it a vector, when vector is a concept from mathematics and geometry? The reason is that the design of the term vector data structure in Lucene rests on a solid mathematical and information retrieval (IR) foundation called the vector space model. Suppose a document contains only two terms. We can then picture its term vector as a vector in two-dimensional space.

 
 
[ term1:3, term2:4 ]
 
 

The numbers are the term frequencies.

If you write term1 as x and term2 as y:

 
[x:3, y:4]
 

[Figure: Lucene term vector illustration]

This can be considered a vector in a two-dimensional coordinate system.

If a document contains three terms, the vector is three-dimensional, and so on. Every document can be represented as a vector in a common vector space. The query is also a document, so it is also a vector in this space.

Each term represents an axis, and the number on the axis is the term frequency. All the distinct terms across all documents generate a space with that many axes. It sounds complicated, but it is no different from ordinary 2D or 3D space. The only difference is the number of axes.

A document is then simply a dot in this space. Draw an arrow from the origin to the dot and you get a term vector. The query you type in the search box, after analysis, can also be treated as a document, so it also has a position in this space.
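The idea of a common space with one axis per term can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions (whitespace tokens, raw term frequencies, made-up class and method names), not how Lucene represents documents internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class VectorSpaceSketch {
    /** Collects every distinct term across all documents, sorted: one axis per term. */
    static List<String> axes(List<String> docs) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String doc : docs) {
            for (String token : doc.trim().split("\\s+")) {
                if (!token.isEmpty()) vocabulary.add(token);
            }
        }
        return new ArrayList<>(vocabulary);
    }

    /** Coordinates of one document: its term frequency along each axis. */
    static int[] toVector(String doc, List<String> axes) {
        int[] v = new int[axes.size()];
        for (String token : doc.trim().split("\\s+")) {
            int axis = axes.indexOf(token);
            if (axis >= 0) v[axis]++; // terms outside the space are ignored
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("quick fox brown fox", "lucene runs like fox");
        List<String> axes = axes(docs);
        System.out.println("axes:  " + axes);
        for (String doc : docs) {
            System.out.println(Arrays.toString(toVector(doc, axes)) + "  <- " + doc);
        }
        // A query is placed in the same space as the documents.
        System.out.println(Arrays.toString(toVector("fox", axes)) + "  <- query: fox");
    }
}
```

Most coordinates are zero because each document uses only a few of the available terms. Real systems therefore keep such vectors in sparse form.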

The interesting thing is the angle between two vectors.

The angle provides a measure of similarity (cosine similarity) between two documents, which is very useful when you want to present related documents to users. So this is the second use of term vectors: finding all documents similar to a matched document. We simply build an n-dimensional space, represent each document as a vector in that space, and look for the vectors that form a small angle with the current document's vector.
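Cosine similarity is the dot product of two vectors divided by the product of their lengths. It is close to 1 when the angle is small and 0 when the documents share no terms. The plain-Java sketch below computes it over raw term-frequency maps. The class name, the whitespace tokenization, and the use of unweighted frequencies are assumptions for the example. Real systems usually weight terms, for instance by tf-idf.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {
    /** Term -> frequency for one piece of text, splitting on whitespace. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int f : v.values()) sum += (double) f * f;
        return Math.sqrt(sum);
    }

    /** cos(angle) = (a . b) / (|a| * |b|); returns 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other; // only shared terms contribute
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0 || normB == 0) return 0;
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> current = termFrequencies("quick fox brown fox");
        String[] candidates = {"lucene runs like fox", "lucene in action", "quick fox brown fox"};
        for (String candidate : candidates) {
            System.out.printf("%.3f  %s%n", cosine(current, termFrequencies(candidate)), candidate);
        }
    }
}
```

Ranking the candidates by this value, highest first, gives a simple "related documents" list.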

Alternative way to get term vector

Storing term vector information may consume a lot of disk space. An alternative is to re-analyze the document and compute the term vector information on the fly. If the document is tiny, this may be the better solution. The analysis is the same as the index-time analysis. It adds overhead at search time but saves disk space.

Example code in Clojure to display term vector information

To make this visible, I experimented in the Clojure REPL with Lucene 4.10. The code is listed below:

 
(def rd (org.apache.lucene.store.RAMDirectory.))
(def iw (org.apache.lucene.index.IndexWriter.
          rd
          (org.apache.lucene.index.IndexWriterConfig. org.apache.lucene.util.Version/LUCENE_47 (org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer. (org.apache.lucene.analysis.core.WhitespaceAnalyzer.) 1000))
          )
)
 
(defn create-doc [title content]
  (let [ doc (org.apache.lucene.document.Document.)]
    (.add doc (org.apache.lucene.document.Field. "title" title org.apache.lucene.document.Field$Store/YES org.apache.lucene.document.Field$Index/ANALYZED org.apache.lucene.document.Field$TermVector/WITH_POSITIONS_OFFSETS))
    (.add doc (org.apache.lucene.document.Field. "body" content org.apache.lucene.document.Field$Store/YES org.apache.lucene.document.Field$Index/ANALYZED org.apache.lucene.document.Field$TermVector/WITH_POSITIONS_OFFSETS))
    doc  
  )
)
 
(def doc1 (create-doc "quick fox brown fox" "quick fox run faster"))
(def doc2 (create-doc "lucene in action" "lucene runs like fox"))
 
 
(.addDocument iw doc1)
(.addDocument iw doc2)
(.close iw)
 
(def is (org.apache.lucene.search.IndexSearcher. (org.apache.lucene.index.DirectoryReader/open rd )))
(def tops (.search is (org.apache.lucene.search.TermQuery. (org.apache.lucene.index.Term. "title" "fox")) 10))
(def tops (.search is (org.apache.lucene.search.TermQuery. (org.apache.lucene.index.Term. "title" "lucene")) 10))
(def scoreTops (. tops scoreDocs ))
(.toString (first scoreTops) )
(. (first scoreTops) doc)
 
 
(def ir (org.apache.lucene.index.DirectoryReader/open rd))
 
(def terms (.getTermVector ir 0 "title"))
 
(let [ termsEnum (.iterator terms nil)
     ]
  (loop [text (.next termsEnum)]
    (if (nil? text) nil
      (do
        (println "freq: " (.totalTermFreq termsEnum) " text: " (.utf8ToString text))
        (recur (.next termsEnum))
      )
    )
  )
)
 
 
 

The output looks like this:

 
freq:  1  text:  brown
freq:  2  text:  fox
freq:  1  text:  quick
 

You can easily convert the code to Java if you want, but experimenting in the REPL is much more fun.

What do we do here? First we create a RAMDirectory rd to store our index and an IndexWriter iw. Next comes a function that generates a document with two fields: title and body.

Then we write the documents to the index.

Then we create an IndexReader object and call its getTermVector method, which returns the term vector of the title field of the first document. The final let expression loops through the term vector and prints the text and frequency of each term.

From the code we can see that each field has its own term vector. The whole document has a map of term vectors, where the key is the field name and the value is the term vector of that field.

Java version:

 
package com.makble.lucenetest;
 
import java.io.IOException;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
 
public class TestTermVector {
    public static Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    public static IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
 
    public static void main (String [] args ){
        Document doc = new Document();
        doc.add(new Field("title", "quick fox brown fox", 
                Field.Store.YES, 
                Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("body", "quick fox run faster", Field.Store.YES, Field.Index.ANALYZED,Field.TermVector.WITH_POSITIONS_OFFSETS));
 
        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
 
        try {
            IndexReader idxReader = DirectoryReader.open(ramDirectory);
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
            // The returned Fields instance acts like a single-document 
            // inverted index (the docID will be 0).
            Terms terms = idxReader.getTermVector(0, "title"); 
            TermsEnum termsEnum = terms.iterator(null);
            BytesRef bytesRef = termsEnum.next();
            while(bytesRef  != null){
                System.out.println("BytesRef: " + bytesRef.utf8ToString());
                System.out.println("docFreq: " + termsEnum.docFreq());
                System.out.println("totalTermFreq: " + termsEnum.totalTermFreq());
                bytesRef = termsEnum.next();
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
 

Build this example with the following build.gradle:

 
apply plugin: 'java'
apply plugin: 'eclipse'
 
ext.luceneVersion= "4.0.0"
 
sourceCompatibility = 1.5
version = '1.0'
jar {
    manifest {
        attributes 'Implementation-Title': 'Gradle Quickstart', 'Implementation-Version': version
    }
}
 
repositories {
    mavenCentral()
}
 
dependencies {
    compile group: 'commons-collections', name: 'commons-collections', version: '3.2'
    testCompile group: 'junit', name: 'junit', version: '4.+'
    compile "org.apache.lucene:lucene-core:${luceneVersion}"
    compile "org.apache.lucene:lucene-analyzers-common:${luceneVersion}"
    compile "org.apache.lucene:lucene-queryparser:${luceneVersion}"
}
 
test {
    systemProperties 'property': 'value'
}
 
uploadArchives {
    repositories {
       flatDir {
           dirs 'repos'
       }
    }
}
 
task debug << {
    println "java.home is " + System.properties['java.home']
    configurations.compile.each { println it }
}
 
 
