Lucene field boost and query time boost example

For a search engine or search application, search quality is one of the most important factors in the user experience. By default, Lucene uses a predefined formula to score each matched document and ranks the results accordingly. Obviously there is no one-size-fits-all solution for search relevance: every search application has its own special cases and domain knowledge.

This is where boost comes in. Boosting in Lucene is the process of tuning search results by changing components of the scoring formula. The two main ways are index time boost and query time boost. More advanced methods involve changing the default formula itself; this post focuses on the default formula.

For a general-purpose search engine like Google, boosting is far more complex: there can be hundreds of factors affecting the rank of a web page, and they change constantly. For example, Google's formula may rank a page with more backlinks higher; other factors include CTR, length of content, the age of the domain, and so on. These are all boosting mechanisms of a kind. Setting up a Lucene application is actually very easy, but that is not the end of the job: providing high-quality, relevant search results is the harder part.

What is boost

In Lucene, boost is both a noun and a verb. As a noun, it represents a number, usually a float. Lucene supports several boost numbers, for example the document boost, the field boost, and the query boost. They take part in the calculation of the document score when ranking the boolean-matched documents. A Lucene search has two phases: first find the matched documents, which is a boolean matching process, then calculate a score for each match and rank them. Boosting comes in at the second phase.

The default value of a boost is 1.0. Since the boost is multiplied by other factors in the scoring formula, the default value effectively leaves the score unchanged. When it is set larger than 1.0, the final score increases, which means the document ranks higher in the result set; for example, a field boost of 2.0 roughly doubles that field's contribution to the score.

As a verb, it means increasing the boost number of the subject: boosting a field means increasing the boost number of a field, and boosting a document means increasing the boost number of a document.

Boost is one way to adjust the relevance of a search application. There are many other methods, but boost is the most basic one.

How boost works in Lucene

To understand what we are doing when we change boost numbers, we must know how boost affects the ranking. Let's start with the score calculation formula:

 
score(q,d) = coord(q,d) * queryNorm(q)
             * Σ for t in q: [ tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) ]

norm(t,d) = d.getBoost() * t.getField().getBoost() * lengthNorm(t.getField())
 

In this formula we can see three boosts: t.getBoost(), d.getBoost(), and t.getField().getBoost(), which represent the query time boost, the document boost, and the field boost respectively. Document boost and field boost are both index time boosts. For now you can safely ignore the other factors in the formula; they don't affect our understanding of these boosts.

Actually, for index time boost Lucene only supports field-level boost. Prior to Lucene 4.0, the document boost was implemented on top of field boost; as of 4.0, the setBoost method of the Document class has been removed.
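
If you still need a per-document boost in Lucene 4.x, the usual workaround is to multiply the desired document boost into the boost of every indexed field. Here is a minimal sketch using the same TextField API as the example below (createBoostedDoc and docBoost are illustrative names, not Lucene API):

 
    // Emulate a document-level boost by boosting every field of the document.
    public static Document createBoostedDoc(String author, String title, float docBoost) {
        Document doc = new Document();
 
        Field authorField = new TextField("author", author, Field.Store.YES);
        authorField.setBoost(docBoost);
        doc.add(authorField);
 
        Field titleField = new TextField("title", title, Field.Store.YES);
        titleField.setBoost(docBoost);
        doc.add(titleField);
 
        return doc;
    }
 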

Index time boost and Field boost

Index time boost means you set the boost number on a field or document before the document is written to the index. We focus on field boost here.

What does it mean to boost a field, and how does it affect the ranking order of search results? Looking at the formula, we can see that when we search against a field, the boost number of that field is included in the score calculation. Suppose there are 10 documents that all contain the field "title", and for a given search against this field the results come back in the order [doc1, doc2, doc3, doc4, doc5]. If we increase the title field boost of doc5 at index time, then for the same search doc5 will move up in the results.

This moving up and down of search results is the really interesting and valuable part of a search engine. If you are an internet marketer, blogger, or SEO specialist, making your page move up, or even reach the first position, for certain keywords on Google is a challenging and profitable business.

What you need to do is find out the boost factors and improve them. The principle is the same; Google just has its own formula for scoring web pages.

The reason we want to change a field boost may be that the default scoring algorithm doesn't reflect the real relevance of the documents, perhaps because it misses some important factors, and these factors can be domain specific. For example, suppose you already know at index time that the author or source of doc5 is more authoritative. If we simply search against the title field with the default algorithm, this important factor cannot contribute to the relevance of the results.

So when should you boost a field? When you think the default algorithm is not enough to determine a field's importance or relevance, and you want the documents containing that field to rank higher. Boosting a field in a document simply means you consider the piece of text in that field more important, for whatever reason: when users search against this field, among all matched documents, the ones containing the boosted field should rank higher.

Lucene field boost example

Now let's create an example Lucene project to illustrate field boost. Our project will be based on a starter Gradle project with Lucene support; see Create an starter Eclipse project to test Lucene API.

 [Image: the Lucene field boost example project set up with Gradle]

Add a new class TestFieldBoost in the package com.makble.lucenetest:

 
package com.makble.lucenetest;
 
import java.io.IOException;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
 
public class TestFieldBoost {
    public static Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    public static IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
 
 
    // Index five sample documents with no boost applied.
    public static void createIndexWithoutBoost() {
        try {
                indexWriter = new IndexWriter(ramDirectory, config);    
                createDoc("Sam", "brown fox jump");    
                createDoc("Sam", "fox jumps over lazy dog");    
                createDoc("Jack", "brown fox jump");
                createDoc("Smith", "brown fox jump");
                createDoc("Smith", "brown fox jumps over lazy dog");
 
                indexWriter.close();
        } catch (IOException | NullPointerException ex) {
            System.out.println("Exception : " + ex.getLocalizedMessage());
        } 
    }
 
    // Index the same five documents, but boost the title field of the
    // two documents authored by Smith.
    public static void createIndex() {
        try {
                indexWriter = new IndexWriter(ramDirectory, config);    
                createDoc("Sam", "brown fox jump");    
                createDoc("Sam", "fox jumps over lazy dog");    
                createDoc("Jack", "brown fox jump");
                createDocBoostTitle("Smith", "brown fox jump", 2.0f);
                createDocBoostTitle("Smith", "brown fox jumps over lazy dog", 2.0f);
 
                indexWriter.close();
        } catch (IOException | NullPointerException ex) {
            System.out.println("Exception : " + ex.getLocalizedMessage());
        } 
    }
 
    public static void createDoc(String author, String title) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("author", author, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
 
        indexWriter.addDocument(doc);
    }
 
    public static void createDocBoostTitle(String author, String title, float titleBoost) throws IOException {
        Document doc = new Document();
 
        doc.add(new TextField("author", author, Field.Store.YES));
 
        // Boost the title field at index time; the boost takes effect when
        // the document is added to the index writer.
        Field titleField = new TextField("title", title, Field.Store.YES);
        doc.add(titleField);
        titleField.setBoost(titleBoost);
 
        indexWriter.addDocument(doc);
    }
 
    public static void searchIndex() {
        try {
            IndexReader idxReader = DirectoryReader.open(ramDirectory);
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
 
            TermQuery query = new TermQuery(new Term("title", "fox"));
 
            TopDocs docs = idxSearcher.search(query, 10);
            System.out.println ("length of top docs: " + docs.scoreDocs.length);
            for( ScoreDoc doc : docs.scoreDocs) {
                Document thisDoc = idxSearcher.doc(doc.doc);
                System.out.println(doc.doc + "\t" + thisDoc.get("author") + "\t" + thisDoc.get("title"));
                Explanation explanation = idxSearcher.explain(query, doc.doc);
                System.out.println("----------");
                System.out.println(explanation.toString());
                System.out.println("----------");
                System.out.println("");
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } finally {
            ramDirectory.close();
        }
    }
 
    public static void main(String[] args) {
        // Search against an index built without boost...
        createIndexWithoutBoost();
        searchIndex();
 
        // ...then against a fresh index where Smith's title fields are boosted.
        ramDirectory = new RAMDirectory();
        createIndex();
        searchIndex();
    }
 
}
 

In this example we have two index creation methods. They add exactly the same documents to the index, but one of them changes the field boost of two documents by calling setBoost on the field. The intention is that for the author "Smith" the title field is more important, so we boost the title field when the author is "Smith". The following is the document ranking for the query title:fox against the two indexes.

 
length of top docs: 5
0    Sam    brown fox jump
2    Jack    brown fox jump
3    Smith    brown fox jump
1    Sam    fox jumps over lazy dog
4    Smith    brown fox jumps over lazy dog
length of top docs: 5
3    Smith    brown fox jump
4    Smith    brown fox jumps over lazy dog
0    Sam    brown fox jump
2    Jack    brown fox jump
1    Sam    fox jumps over lazy dog
 
 

As you can see, after boosting, the documents containing the boosted field rank higher: the documents authored by "Smith" are listed in the first and second positions.

As the formula shows, norm(t,d) is the product of the field boost, the document boost, and lengthNorm. In Lucene's implementation, this product is computed at index time and encoded into a single byte. At query time, norm(t,d) appears as fieldNorm. Here is the explanation output for the query:

 
length of top docs: 5
0    Sam    brown fox jump
----------
0.40883923 = (MATCH) weight(title:fox in 0) [DefaultSimilarity], result of:
  0.40883923 = fieldWeight in 0, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.5 = fieldNorm(doc=0)
 
----------
 
2    Jack    brown fox jump
----------
0.40883923 = (MATCH) weight(title:fox in 2) [DefaultSimilarity], result of:
  0.40883923 = fieldWeight in 2, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.5 = fieldNorm(doc=2)
 
----------
 
3    Smith    brown fox jump
----------
0.40883923 = (MATCH) weight(title:fox in 3) [DefaultSimilarity], result of:
  0.40883923 = fieldWeight in 3, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.5 = fieldNorm(doc=3)
 
----------
 
1    Sam    fox jumps over lazy dog
----------
0.35773432 = (MATCH) weight(title:fox in 1) [DefaultSimilarity], result of:
  0.35773432 = fieldWeight in 1, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.4375 = fieldNorm(doc=1)
 
----------
 
4    Smith    brown fox jumps over lazy dog
----------
0.30662942 = (MATCH) weight(title:fox in 4) [DefaultSimilarity], result of:
  0.30662942 = fieldWeight in 4, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.375 = fieldNorm(doc=4)
 
----------
 
length of top docs: 5
3    Smith    brown fox jump
----------
0.81767845 = (MATCH) weight(title:fox in 3) [DefaultSimilarity], result of:
  0.81767845 = fieldWeight in 3, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    1.0 = fieldNorm(doc=3)
 
----------
 
4    Smith    brown fox jumps over lazy dog
----------
0.61325884 = (MATCH) weight(title:fox in 4) [DefaultSimilarity], result of:
  0.61325884 = fieldWeight in 4, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.75 = fieldNorm(doc=4)
 
----------
 
0    Sam    brown fox jump
----------
0.40883923 = (MATCH) weight(title:fox in 0) [DefaultSimilarity], result of:
  0.40883923 = fieldWeight in 0, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.5 = fieldNorm(doc=0)
 
----------
 
2    Jack    brown fox jump
----------
0.40883923 = (MATCH) weight(title:fox in 2) [DefaultSimilarity], result of:
  0.40883923 = fieldWeight in 2, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.5 = fieldNorm(doc=2)
 
----------
 
1    Sam    fox jumps over lazy dog
----------
0.35773432 = (MATCH) weight(title:fox in 1) [DefaultSimilarity], result of:
  0.35773432 = fieldWeight in 1, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    0.81767845 = idf(docFreq=5, maxDocs=5)
    0.4375 = fieldNorm(doc=1)
 
----------
 
 
 

The fieldNorm of doc3 and doc4 before and after boosting:

 
    0.5 = fieldNorm(doc=3)
    0.375 = fieldNorm(doc=4)
 
 
    1.0 = fieldNorm(doc=3)   
    0.75 = fieldNorm(doc=4)
 
 
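These numbers follow from the formula. Under DefaultSimilarity, lengthNorm is 1/sqrt(number of terms in the field), the field boost is multiplied in, and the product is truncated down to a value the one-byte norm encoding can represent. A rough calculation (assuming the default similarity) that matches the explain output above:

 
doc3 "brown fox jump" (3 terms):                 lengthNorm ≈ 1/sqrt(3) ≈ 0.577
    no boost:   0.577 * 1.0 ≈ 0.577  -> stored as 0.5
    boost 2.0:  0.577 * 2.0 ≈ 1.155  -> stored as 1.0
 
doc4 "brown fox jumps over lazy dog" (6 terms):  lengthNorm ≈ 1/sqrt(6) ≈ 0.408
    no boost:   0.408 * 1.0 ≈ 0.408  -> stored as 0.375
    boost 2.0:  0.408 * 2.0 ≈ 0.816  -> stored as 0.75
 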

Lucene Query time boost

You can achieve the same effect with query time boost. For example, suppose you are searching for books whose title contains the term "fox" and whose author can match "smith" or "sam", but you want the author "smith" to rank higher.

 
    // Requires two additional imports:
    //   import org.apache.lucene.search.BooleanClause.Occur;
    //   import org.apache.lucene.search.BooleanQuery;
    public static void searchIndexWithQueryBoost() {
        try {
            IndexReader idxReader = DirectoryReader.open(ramDirectory);
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
 
            TermQuery query = new TermQuery(new Term("title", "fox"));
            TermQuery query2 = new TermQuery(new Term("author", "smith"));
            TermQuery query3 = new TermQuery(new Term("author", "sam"));
            query2.setBoost((float) 2.0);
 
            BooleanQuery booleanQuery = new BooleanQuery();
            booleanQuery.add(query, Occur.MUST);
            booleanQuery.add(query2, Occur.SHOULD);
            booleanQuery.add(query3, Occur.SHOULD);
 
            TopDocs docs = idxSearcher.search(booleanQuery, 10);
            System.out.println ("length of top docs: " + docs.scoreDocs.length);
            for( ScoreDoc doc : docs.scoreDocs) {
                Document thisDoc = idxSearcher.doc(doc.doc);
                System.out.println(doc.doc + "\t" + thisDoc.get("author") + "\t" + thisDoc.get("title"));
                Explanation explanation = idxSearcher.explain(booleanQuery, doc.doc);
                System.out.println("----------");
                System.out.println(explanation.toString());
                System.out.println("----------");
                System.out.println("");
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } finally {
            ramDirectory.close();
        }
    }
 
 
 

When you use a query parser, query time boost can be more flexible; for example, you can specify it this way:

 
author:smith^2
 
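In code, the same syntax can be handled by the classic query parser. A minimal sketch, assuming the lucene-queryparser module is on the classpath (it needs import org.apache.lucene.queryparser.classic.QueryParser and org.apache.lucene.search.Query, and parse throws ParseException):

 
    QueryParser parser = new QueryParser(Version.LUCENE_40, "title", analyzer);
    // +title:fox is required; author:smith gets a query time boost of 2
    Query parsed = parser.parse("+title:fox author:smith^2 author:sam");
    TopDocs docs = idxSearcher.search(parsed, 10);
 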

In the main function:

 
        ramDirectory = new RAMDirectory();
        createIndexWithoutBoost();
        searchIndexWithQueryBoost();
 

You can see the ranking before and after adding the query time boost:

 
0    Sam    brown fox jump
3    Smith    brown fox jump
1    Sam    fox jumps over lazy dog
4    Smith    brown fox jumps over lazy dog
2    Jack    brown fox jump
 
after
 
3    Smith    brown fox jump
4    Smith    brown fox jumps over lazy dog
0    Sam    brown fox jump
1    Sam    fox jumps over lazy dog
2    Jack    brown fox jump
 

And here is how the scores are calculated:

 
length of top docs: 5
3    Smith    brown fox jump
----------
0.93971837 = (MATCH) product of:
  1.4095775 = (MATCH) sum of:
    0.09617749 = (MATCH) weight(title:fox in 3) [DefaultSimilarity], result of:
      0.09617749 = score(doc=3,freq=1.0 = termFreq=1.0
), product of:
        0.23524526 = queryWeight, product of:
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.28769898 = queryNorm
        0.40883923 = fieldWeight in 3, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.5 = fieldNorm(doc=3)
    1.3134 = (MATCH) weight(author:smith^2.0 in 3) [DefaultSimilarity], result of:
      1.3134 = score(doc=3,freq=1.0 = termFreq=1.0
), product of:
        0.869326 = queryWeight, product of:
          2.0 = boost
          1.5108256 = idf(docFreq=2, maxDocs=5)
          0.28769898 = queryNorm
        1.5108256 = fieldWeight in 3, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.5108256 = idf(docFreq=2, maxDocs=5)
          1.0 = fieldNorm(doc=3)
  0.6666667 = coord(2/3)
 
----------
 
4    Smith    brown fox jumps over lazy dog
----------
0.92368877 = (MATCH) product of:
  1.3855331 = (MATCH) sum of:
    0.07213312 = (MATCH) weight(title:fox in 4) [DefaultSimilarity], result of:
      0.07213312 = score(doc=4,freq=1.0 = termFreq=1.0
), product of:
        0.23524526 = queryWeight, product of:
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.28769898 = queryNorm
        0.30662942 = fieldWeight in 4, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.375 = fieldNorm(doc=4)
    1.3134 = (MATCH) weight(author:smith^2.0 in 4) [DefaultSimilarity], result of:
      1.3134 = score(doc=4,freq=1.0 = termFreq=1.0
), product of:
        0.869326 = queryWeight, product of:
          2.0 = boost
          1.5108256 = idf(docFreq=2, maxDocs=5)
          0.28769898 = queryNorm
        1.5108256 = fieldWeight in 4, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.5108256 = idf(docFreq=2, maxDocs=5)
          1.0 = fieldNorm(doc=4)
  0.6666667 = coord(2/3)
 
----------
 
0    Sam    brown fox jump
----------
0.5019183 = (MATCH) product of:
  0.7528775 = (MATCH) sum of:
    0.09617749 = (MATCH) weight(title:fox in 0) [DefaultSimilarity], result of:
      0.09617749 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
        0.23524526 = queryWeight, product of:
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.28769898 = queryNorm
        0.40883923 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.5 = fieldNorm(doc=0)
    0.6567 = (MATCH) weight(author:sam in 0) [DefaultSimilarity], result of:
      0.6567 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
        0.434663 = queryWeight, product of:
          1.5108256 = idf(docFreq=2, maxDocs=5)
          0.28769898 = queryNorm
        1.5108256 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.5108256 = idf(docFreq=2, maxDocs=5)
          1.0 = fieldNorm(doc=0)
  0.6666667 = coord(2/3)
 
----------
 
1    Sam    fox jumps over lazy dog
----------
0.49390358 = (MATCH) product of:
  0.74085534 = (MATCH) sum of:
    0.084155306 = (MATCH) weight(title:fox in 1) [DefaultSimilarity], result of:
      0.084155306 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.23524526 = queryWeight, product of:
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.28769898 = queryNorm
        0.35773432 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.4375 = fieldNorm(doc=1)
    0.6567 = (MATCH) weight(author:sam in 1) [DefaultSimilarity], result of:
      0.6567 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.434663 = queryWeight, product of:
          1.5108256 = idf(docFreq=2, maxDocs=5)
          0.28769898 = queryNorm
        1.5108256 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.5108256 = idf(docFreq=2, maxDocs=5)
          1.0 = fieldNorm(doc=1)
  0.6666667 = coord(2/3)
 
----------
 
2    Jack    brown fox jump
----------
0.032059163 = (MATCH) product of:
  0.09617749 = (MATCH) sum of:
    0.09617749 = (MATCH) weight(title:fox in 2) [DefaultSimilarity], result of:
      0.09617749 = score(doc=2,freq=1.0 = termFreq=1.0
), product of:
        0.23524526 = queryWeight, product of:
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.28769898 = queryNorm
        0.40883923 = fieldWeight in 2, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.81767845 = idf(docFreq=5, maxDocs=5)
          0.5 = fieldNorm(doc=2)
  0.33333334 = coord(1/3)
 
----------
 
 
 

Both approaches achieve the same goal by changing different parts of the formula. The advantage of query time boost is that you don't have to store the boost in the index norms, but you may need to adjust your search interface, for example by adding a checkbox that boosts a field, as sketched below.
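
For example, the query-building code could take a flag from the search form and only apply the boost when the checkbox is checked. A small sketch based on the queries above (the boostSmith flag is hypothetical):

 
    // Apply the query time boost only when the user asked for it.
    public static BooleanQuery buildQuery(boolean boostSmith) {
        TermQuery title = new TermQuery(new Term("title", "fox"));
        TermQuery smith = new TermQuery(new Term("author", "smith"));
        if (boostSmith) {
            smith.setBoost(2.0f); // same effect as author:smith^2
        }
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(title, Occur.MUST);
        booleanQuery.add(smith, Occur.SHOULD);
        return booleanQuery;
    }
 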

The downside of index time boost

To change an index time boost, you have to reindex the documents. For a large index, that can take a long time.

The field boost is combined with other factors like the length norm and encoded into a single byte, which loses precision. As a result, Lucene cannot distinguish tiny differences between field lengths.
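
You can see the precision loss directly with the SmallFloat utility that DefaultSimilarity uses to encode norms in Lucene 4.x. A minimal sketch (NormPrecisionDemo is just an illustrative class name):

 
import org.apache.lucene.util.SmallFloat;
 
public class NormPrecisionDemo {
    public static void main(String[] args) {
        // lengthNorm = 1/sqrt(numTerms) under DefaultSimilarity
        for (int terms = 1; terms <= 6; terms++) {
            float norm = (float) (1.0 / Math.sqrt(terms));
            byte encoded = SmallFloat.floatToByte315(norm);     // encode as Lucene does at index time
            float decoded = SmallFloat.byte315ToFloat(encoded); // what the scorer sees at query time
            System.out.println(terms + " terms: lengthNorm = " + norm
                    + " -> stored norm decodes to " + decoded);
        }
    }
}
 

For example, title fields with 3 terms and with 4 terms both decode to the same stored norm of 0.5, so that length difference is invisible to the scorer.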