In Lucene, search highlighting consists of two components: the fragmenter and the highlighter. The first step is fragmenting; we can then optionally apply highlighting to each text fragment.
Fragmenting selects the pieces of text that best match the searched keywords from the full text of the document. It gives the user a small amount of context around the searched terms, which helps them judge how relevant the document is to their search.
The top-level abstraction for all fragmenters supported by Lucene is the Fragmenter interface:
Lucene 6.0.0 currently supports three styles of fragmenting, implemented by three concrete classes: NullFragmenter, SimpleFragmenter, and SimpleSpanFragmenter. Each serves a different purpose.
NullFragmenter
Returns the entire text of a field as a single fragment; it does no fragmentation at all. It is suitable for short fields such as a title.
SimpleFragmenter
This is the default fragmenter, used when you don't specify one while creating the Highlighter instance.
Here is how to specify the fragmenter for the highlighter explicitly:
highlighter.setTextFragmenter(fragmenter);
This class simply splits the text into fixed-size fragments. Here is an example of what the fragments generated by this class look like:
The image above shows the score, the fragment text, and the highlighted terms. Notice that each fragment is roughly 100 characters long, which is the default fragment size.
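To make the idea of fixed-size fragmenting concrete, here is a minimal, self-contained sketch in plain Java. It is not Lucene's SimpleFragmenter source; the class name FixedSizeFragmenterSketch, the fragment method, and the whitespace tokenization are all assumptions made for illustration. It starts a new fragment at a token once that token's end offset reaches the current size budget, so fragments come out close to the requested size without cutting a word in half.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration only (hypothetical class, not part of Lucene).
class FixedSizeFragmenterSketch {

    // Splits text into fragments of roughly fragmentSize characters.
    // A new fragment may only begin at the start of a whitespace-delimited token.
    static List<String> fragment(String text, int fragmentSize) {
        List<String> fragments = new ArrayList<>();
        Matcher tokens = Pattern.compile("\\S+").matcher(text);
        int fragmentStart = 0;  // offset where the current fragment begins
        int fragmentCount = 1;  // number of fragments started so far
        while (tokens.find()) {
            // When a token ends at or past the size budget, it opens the next fragment.
            if (tokens.end() >= fragmentSize * fragmentCount) {
                if (tokens.start() > fragmentStart) {
                    fragments.add(text.substring(fragmentStart, tokens.start()));
                    fragmentStart = tokens.start();
                }
                fragmentCount++;
            }
        }
        // Whatever remains after the last break is the final fragment.
        if (fragmentStart < text.length()) {
            fragments.add(text.substring(fragmentStart));
        }
        return fragments;
    }

    public static void main(String[] args) {
        for (String f : fragment("aaaa bbbb cccc dddd", 10)) {
            System.out.println("[" + f + "]");
        }
    }
}
```

Concatenating the returned fragments gives back the original text, which is a handy sanity check when experimenting with different fragment sizes.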
Term highlighting is done by a formatter. Here each term is surrounded by a bold tag, which is produced by SimpleHTMLFormatter.
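The effect of such a formatter can be imitated in a few lines of plain Java. The sketch below is only an illustration: BoldFormatterSketch and its highlight method are hypothetical names, and it matches a single term with a case-insensitive regular expression instead of using Lucene's scored tokens. It wraps each match in the `<B>` and `</B>` tags that SimpleHTMLFormatter uses by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration only (hypothetical class, not part of Lucene).
class BoldFormatterSketch {

    // Wraps every whole-word, case-insensitive occurrence of term in <B>...</B>.
    static String highlight(String fragment, String term) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(fragment);
        // $0 refers to the matched text, so the original casing is preserved.
        return m.replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is a search library", "lucene"));
    }
}
```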
SimpleSpanFragmenter
The final fragmenter is aware of span queries: it makes sure that a span match is not split across two fragments.
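The difference from plain fixed-size fragmenting can be sketched in self-contained Java. This is not how SimpleSpanFragmenter is implemented internally (Lucene works with token positions from the query scorer); here, as a simplifying assumption, span matches are passed in as character ranges, and SpanAwareFragmenterSketch, fragment, and breaksSpan are hypothetical names. The only change from a fixed-size splitter is that a fragment break is postponed while it would fall inside a span match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration only (hypothetical class, not part of Lucene).
class SpanAwareFragmenterSketch {

    // spans: each int[]{start, end} is the character range [start, end) of one span match.
    static List<String> fragment(String text, int fragmentSize, List<int[]> spans) {
        List<String> fragments = new ArrayList<>();
        Matcher tokens = Pattern.compile("\\S+").matcher(text);
        int fragmentStart = 0;  // offset where the current fragment begins
        int fragmentCount = 1;  // number of fragments started so far
        while (tokens.find()) {
            boolean overBudget = tokens.end() >= fragmentSize * fragmentCount;
            // Postpone the break if it would cut through a span match.
            if (overBudget && !breaksSpan(tokens.start(), spans)) {
                if (tokens.start() > fragmentStart) {
                    fragments.add(text.substring(fragmentStart, tokens.start()));
                    fragmentStart = tokens.start();
                }
                fragmentCount++;
            }
        }
        if (fragmentStart < text.length()) {
            fragments.add(text.substring(fragmentStart));
        }
        return fragments;
    }

    // A break at offset splits a span when it falls strictly inside the span's range.
    static boolean breaksSpan(int offset, List<int[]> spans) {
        for (int[] s : spans) {
            if (offset > s[0] && offset < s[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "aaaa bbbb cccc dddd";
        // Assume the span query matched "bbbb cccc", i.e. characters 5 to 14.
        List<int[]> spans = Collections.singletonList(new int[] {5, 14});
        System.out.println(fragment(text, 10, spans));
    }
}
```

With no spans, the same input breaks between "bbbb" and "cccc"; with the span supplied, the break moves to after "cccc" so the match stays in one fragment.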