Lucene search highlight example

When you search in a major search engine like Google or Yandex, you type your search terms, press Enter or click the search button, and the results are displayed:

(Screenshot: Yandex search results with the query keywords highlighted in bold)

You can see the title, URL, and a text fragment of the target document, with the search terms highlighted in bold where they appear. Highlighting is a genuinely useful feature: a glimpse of the context around the submitted keywords helps users decide whether to take a closer look at the matching document.

What do we want to do?

This tutorial will show you how to do the same thing as the major search engines, using the Lucene highlight package.

The documents I want to highlight are blog posts. Most blogging software, like WordPress, simply presents a title and a short excerpt of the content on the blog entry list page. I don't think that is an optimal way to show a list of blog posts. Why can't this list look just like the SERPs we flip through every day in Google? It would carry all the information the old style has, plus valuable context that a simple list can never provide.

This post solves that problem by highlighting each post based on its title and content.

The idea is simple. We need two fields for each document, the title and the content; the text will be indexed with term vectors enabled. We then extract, or manually specify, the main keywords of the blog post, and construct a query from those keywords. To learn more about term vectors, see What is term vector.

Usually, the title itself can serve as the query for a blog post; a good title should already contain the main keywords of the content. For example, for the title How to read UTF8 text file into String in Java, the keywords include read, UTF8, java, and text file. If you parse this title with QueryParser, most of them will be identified, and stop words like to and in will be removed if you use a standard analyzer. For simplicity, this tutorial uses this method.
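As a rough illustration of that analysis step (this is plain Java, not Lucene's actual StandardAnalyzer, and the stop word list below is only a small subset of Lucene's default English set), lower-casing the title, splitting on non-alphanumeric characters, and dropping stop words leaves roughly the keywords described above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TitleKeywords {
    // A subset of the English stop words a standard analyzer removes by default.
    static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "in", "into", "is", "it", "of", "on", "the", "to", "with");

    // Lower-case, split on runs of non-alphanumeric characters, drop stop words.
    public static List<String> keywords(String title) {
        return Arrays.stream(title.toLowerCase().split("[^\\p{Alnum}]+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [how, read, utf8, text, file, string, java]
        System.out.println(keywords("How to read UTF8 text file into String in Java"));
    }
}
```

Note that a real analyzer does more than this (token normalization, configurable stop sets, and so on); this sketch only shows why most of the title's meaningful words survive parsing.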

The second method is to add a new field to the document containing keywords specified by the author of the content. These are the keywords usually placed in the keywords meta tag for SEO.

The third is to extract keywords by analyzing the title and content with some kind of SEO software or WordPress SEO plugin.

With a proper query, the Lucene search highlighter will find the best text fragments that contain those keywords and highlight them by rendering them in bold.

In essence, this is not a very hard problem: given a token stream that contains positional information, the query keywords, and the original text, it is easy to come up with an algorithm that locates the precise positions in the original text and extracts text fragments by selecting the text around those positions.
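To make that concrete, here is a naive sketch of such an algorithm in plain Java (no Lucene, purely illustrative): find the earliest occurrence of any keyword, cut a fixed-size window of text around it, and wrap every keyword occurrence inside the window in <B> tags. Lucene's highlighter is far more sophisticated (it scores fragments and uses real token offsets), but the basic shape is the same:

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class NaiveHighlighter {
    // Extract a window of `radius` chars around the first keyword hit,
    // then bold every keyword occurrence inside it (case-insensitive).
    // Naive: no fragment scoring, no handling of overlapping keywords.
    public static String bestFragment(String text, List<String> keywords, int radius) {
        String lower = text.toLowerCase(Locale.ROOT);
        int first = -1;
        for (String kw : keywords) {
            int pos = lower.indexOf(kw.toLowerCase(Locale.ROOT));
            if (pos >= 0 && (first < 0 || pos < first)) first = pos;
        }
        if (first < 0) return ""; // no keyword appears in the text
        int from = Math.max(0, first - radius);
        int to = Math.min(text.length(), first + radius);
        String frag = text.substring(from, to);
        for (String kw : keywords) {
            frag = frag.replaceAll("(?i)" + Pattern.quote(kw), "<B>$0</B>");
        }
        return frag;
    }

    public static void main(String[] args) {
        System.out.println(bestFragment(
                "A file, in its nature, is a byte array, even if it is a text file.",
                List.of("file", "byte"), 30));
    }
}
```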

Thankfully, the Lucene search highlight package already provides optimized algorithms for this, and they are easy to use.

What do we need to get highlighted text fragments?

The only prerequisite is that the text of the field is stored; everything else is optional: term vectors, tokenization, indexing, offsets.

If you don't store the text, make sure you can retrieve it from the data source: the token stream will be read from the index and the text from the data source, and the text must be identical to the text that was indexed.

 
            Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(queryToSearch));
            TokenStream tokenStream = TokenSources.getTokenStream(field, text, analyzer);
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
 

If you stored the text but did not index it, the token stream will be computed on the fly and the text can be retrieved from the index. Just call the overloaded getAnyTokenStream instead.

Lucene already provides very convenient classes and methods for generating highlighted documents; the following snippet shows, in brief, all the classes and methods we need:

 
                Highlighter highlighter = new Highlighter(htmlFormatter,
                    new QueryScorer(queryToSearch));
                TokenStream tokenStream = TokenSources.getAnyTokenStream(idxReader, id, "content", analyzer);
                TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);                    
 

All we need is a query, the token stream (retrieved by document id), and the text content of the field (also retrieved by document id); calling getBestTextFragments gives us an array of text fragments ready to display as HTML.

Just make sure the text is stored; Lucene handles everything else. If you didn't analyze the field at index time, Lucene will do it for you at query time.

Step 1 Create a Gradle project with the Lucene dependencies

Create a Java Quickstart Gradle project:

Gradle build file

 
apply plugin: 'java'
apply plugin: 'eclipse'
 
ext.luceneVersion= "6.0.0"
 
sourceCompatibility = 1.8 // Lucene 6.x requires Java 8
version = '1.0'
jar {
    manifest {
        attributes 'Implementation-Title': 'Gradle Quickstart', 'Implementation-Version': version
    }
}
 
repositories {
    mavenCentral()
}
 
dependencies {
    compile group: 'commons-collections', name: 'commons-collections', version: '3.2'
    testCompile group: 'junit', name: 'junit', version: '4.+'
    compile "org.apache.lucene:lucene-core:${luceneVersion}"
    compile "org.apache.lucene:lucene-analyzers-common:${luceneVersion}"
    compile "org.apache.lucene:lucene-queryparser:${luceneVersion}"
    compile "org.apache.lucene:lucene-highlighter:${luceneVersion}"
}
 
test {
    systemProperties 'property': 'value'
}
 
uploadArchives {
    repositories {
       flatDir {
           dirs 'repos'
       }
    }
}
 
 

This example uses Lucene 6.0.0.

Step 2 Index and search with highlighting

Create a new package com.makble.lucenesearchhighlight and add a new class:

 
package com.makble.lucenesearchhighlight;
 
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.RAMDirectory;
 
public class Test {
 
    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(
            analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
 
    public static String readFileString(String file) {
        StringBuilder text = new StringBuilder();
        // try-with-resources ensures the reader is closed even on error;
        // the original's three exception types are all subclasses of IOException
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(new File(file)), "UTF8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line + "\r\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        return text.toString();
    }
 
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        Document doc = new Document(); // create a new document
 
        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);
 
        Field field = new Field("title",
                "How to read UTF8 text file into String in Java", type); //term vector enabled
        Field f = new TextField("content", readFileString("c:\\tmp\\content.txt"),
                Field.Store.YES); 
        doc.add(field);
        doc.add(f);
 
        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();
 
            IndexReader idxReader = DirectoryReader.open(ramDirectory);
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
            Query queryToSearch = new QueryParser("title", analyzer).parse("read file string utf8");
            TopDocs hits = idxSearcher
                    .search(queryToSearch, idxReader.maxDoc());
            SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
            Highlighter highlighter = new Highlighter(htmlFormatter,
                    new QueryScorer(queryToSearch));
 
            System.out.println("reader maxDoc is " + idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);
            for (int i = 0; i < hits.totalHits; i++) {
                int id = hits.scoreDocs[i].doc;
                Document docHit = idxSearcher.doc(id);
                String text = docHit.get("content");
                TokenStream tokenStream = TokenSources.getAnyTokenStream(idxReader, id, "content", analyzer);
                TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
                for (int j = 0; j < frag.length; j++) {
                    if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                        System.out.println((frag[j].toString()));
                    }
                }
 
                System.out.println("start highlight the title");
                // Term vector
                text = docHit.get("title");
                tokenStream = TokenSources.getAnyTokenStream(
                        idxSearcher.getIndexReader(), hits.scoreDocs[i].doc,
                        "title", analyzer);
                frag = highlighter.getBestTextFragments(tokenStream, text,
                        false, 4);
                for (int j = 0; j < frag.length; j++) {
                    if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                        System.out.println((frag[j].toString()));
                    }
                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InvalidTokenOffsetsException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
 
 

In this example, I created a document with two fields, one for the title and another for the content. Notice that we enabled term vectors on the title field.

This is the output we get:

 
I was trying to <B>read</B> <B>utf8</B> text from a text <B>file</B>
, <B>string</B> is just a byte array. The function can just <B>read</B> the <B>file</B> raw data to memory and reference
( new FileInputStream(new <B>File</B>(<B>file</B>)), "<B>UTF8</B>") );
            <B>String</B> line;
            while ( (line = in.readLine
. 
 
A <B>file</B>, in its nature its a byte array, even it is a text <B>file</B>. To get a <B>String</B> from the <B>file</B>
start highlight the title
How to <B>read</B> <B>UTF8</B> text <B>file</B> into <B>String</B> in Java
 

What it looks like in the browser:

Improve the code

To support search highlighting, we don't need to enable term vectors or other index options manually; the only requirement is that the text of the field is stored.

This is the minimum we need to prepare. Highlighting needs two inputs: the text, and the token stream derived from the text; the latter can be computed dynamically from the former.

Lucene will perform whatever computation is necessary to obtain term vectors, positions, and offsets, and produce the highlighted text fragments.
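Conceptually, once the token stream has yielded character offsets for the matching terms, highlighting reduces to inserting tags at those offsets in the stored text. Here is a hypothetical plain-Java sketch (not Lucene's API) that inserts from the end of the text backwards, so that earlier offsets stay valid as the string grows:

```java
import java.util.Comparator;
import java.util.List;

public class OffsetHighlighter {
    // A matched term's character range in the original text, as a token
    // stream with offsets would report it. Half-open: [start, end).
    public record Offset(int start, int end) {}

    public static String highlight(String text, List<Offset> matches) {
        StringBuilder sb = new StringBuilder(text);
        // Insert from the end backwards so earlier offsets are not shifted.
        matches.stream()
                .sorted(Comparator.comparingInt(Offset::start).reversed())
                .forEach(m -> {
                    sb.insert(m.end(), "</B>");
                    sb.insert(m.start(), "<B>");
                });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "read" at [0,4), "file" at [7,11)
        // Prints <B>read</B> a <B>file</B>
        System.out.println(highlight("read a file",
                List.of(new Offset(0, 4), new Offset(7, 11))));
    }
}
```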

Another thing I noticed is that, for highlighting, it doesn't matter what you pass as the first parameter of QueryParser (the default field); what matters is the query string you pass to the parse method.

The following all generate the same highlighting results:

 
Query queryToSearch = new QueryParser("asddf", analyzer).parse("read text file string utf8");
 
Query queryToSearch = new QueryParser("", analyzer).parse("read text file string utf8");
 
Query queryToSearch = new QueryParser(null, analyzer).parse("read text file string utf8");
 
Query queryToSearch = new QueryParser("title", analyzer).parse("read text file string utf8"); 
 

It looks like the highlighter completely ignores this parameter, apparently because a QueryScorer constructed with only a query matches terms regardless of field. The default field would still matter for an actual search.

To make the code more concise, I refactored it:

 
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        buildIndex();
        DoQuery2();
    }
    public static void DoQuery2(){
        try {
            IndexReader idxReader = DirectoryReader.open(ramDirectory);
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
            Query queryToSearch = new QueryParser("asddf", analyzer).parse("read text file string utf8"); 
            SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
            Highlighter highlighter = new Highlighter(htmlFormatter,
                    new QueryScorer(queryToSearch));
 
            highLight(0, idxSearcher, idxReader, "content", highlighter);
            highLight(0, idxSearcher, idxReader, "title", highlighter);    
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    public static void DoQuery(){
        try {
            IndexReader idxReader = DirectoryReader.open(ramDirectory);
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
            Query queryToSearch = new QueryParser("title", analyzer).parse("read file string utf8");
            TopDocs hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
            Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(queryToSearch));
 
            System.out.println("reader maxDoc is " + idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);
            for (int i = 0; i < hits.totalHits; i++) {
                int id = hits.scoreDocs[i].doc;
                System.out.println("doc id : " + i);
                highLight(id, idxSearcher, idxReader, "content", highlighter);    
                System.out.println("start highlight the title");
                highLight(id, idxSearcher, idxReader, "title", highlighter);    
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
 
    public static void buildIndex () {
        Document doc = new Document(); 
 
        FieldType type = new FieldType();
        type.setStored(true); // stored text is all you need for highlighting
 
        Field field = new Field("title",
                "How to read UTF8 text file into String in Java", type); 
        Field f = new TextField("content", readFileString("c:\\tmp\\content.txt"),
                Field.Store.YES); 
        doc.add(field);
        doc.add(f);
 
        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
 
    }
 
    public static void highLight(int id, IndexSearcher idxSearcher, IndexReader idxReader, String field, Highlighter highlighter) {
        try {
            Document doc = idxSearcher.doc(id);
            String text = doc.get(field);
            TokenStream tokenStream = TokenSources.getAnyTokenStream(idxReader, id, field, analyzer);
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null)) {
                    System.out.println("score: " + frag[j].getScore() + ", frag: " + (frag[j].toString()));
                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InvalidTokenOffsetsException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

We don't even need to perform an actual search to get highlighted text fragments: if you know the document id and the text is stored, you are ready to go. If all you indexed is a list of blog posts, you can simply loop over each document and highlight it.

The simplest highlighter can be just a few lines of code:

 
    public static void highlight(String text, String query) {
        try {
            Query queryToSearch;
            queryToSearch = new QueryParser("", analyzer).parse(query);
            TokenStream tokenStream = TokenSources.getTokenStream("default", text, analyzer);
            Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),new QueryScorer(queryToSearch));
 
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null)) {
                    System.out.println("score: " + frag[j].getScore() + ", frag: " + (frag[j].toString()));
                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InvalidTokenOffsetsException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
 
 

There are no documents, fields, or indexing involved; it just highlights a piece of text based on the given query.