Lucene fuzzy search with FuzzyQuery
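Lucene lets you search for a term that does not exactly match a term in the index but is similar to one. To measure the similarity between two terms, Lucene uses edit distance, specifically the Levenshtein distance: the number of single-character insertions, deletions, or substitutions needed to turn one term into the other.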
For example, Lucene and Lucece have an edit distance of 1, because changing a single character makes them identical. You can specify the maximum allowed edit distance when creating a FuzzyQuery; if you don't, the default value of 2 is used, which is also the largest value FuzzyQuery accepts.
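To make the edit distance concrete, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance. The class and method names (EditDistance, levenshtein) are made up for this illustration and are not part of Lucene; Lucene's own implementation works differently.

```java
// Illustration only: a plain Levenshtein distance, not Lucene's internal code.
final class EditDistance {
    private EditDistance() {}

    /** Returns the number of single-character insertions, deletions,
     *  or substitutions needed to turn a into b. */
    static int levenshtein(String a, String b) {
        // d[i][j] = distance between the first i chars of a and the first j chars of b
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // One substitution (n -> c) separates the two terms.
        System.out.println(levenshtein("Lucene", "Lucece")); // prints 1
    }
}
```

The comparison here is case-sensitive, so the terms should be normalized the same way on both sides before comparing.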
Conceptually, FuzzyQuery compares the query term with the terms in the index and selects those within the maximum edit distance. This means a search for Lucece will also match Lucene if that term is in the index. If no indexed term is within the threshold, no results are returned.
The performance of FuzzyQuery improved significantly when Lucene started using a Levenshtein automaton instead of computing the distance for every term by brute force.
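For comparison, the brute-force approach mentioned above can be sketched as a scan over every term. This is only an illustration of the idea that the automaton replaces, not Lucene code; the class BruteForceFuzzyMatch, its methods, and the hard-coded term list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: naive fuzzy matching by scanning every term.
final class BruteForceFuzzyMatch {
    private BruteForceFuzzyMatch() {}

    /** Levenshtein distance using two rolling rows instead of a full matrix. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns every term whose distance to queryTerm is at most maxEdits. */
    static List<String> termsWithin(List<String> indexTerms, String queryTerm, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : indexTerms) {
            if (distance(term, queryTerm) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical, already lowercased terms of a "title" field.
        List<String> terms = Arrays.asList("lucece", "luaene", "lxeeci", "lucene", "vector");
        System.out.println(termsWithin(terms, "lucece", 2)); // prints [lucece, luaene, lucene]
    }
}
```

Every term costs a full distance computation here, so the work grows with the size of the term dictionary, which is why this approach is slow on a large index.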
The following Java example shows how to use FuzzyQuery to perform a fuzzy search in Lucene. To set up the Gradle project, refer to How to do term query in Lucene index example.
package com.makble.luceneexample;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class FuzzyQueryExample {
    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;

    public static void main(String[] args) {
        createIndex();
        System.out.println("fuzzy search:");
        searchFuzzyQuery();
        ramDirectory.close();
    }

    // Adds one document with an analyzed, stored author and title field.
    public static void createDoc(String author, String title) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("author", author, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        indexWriter.addDocument(doc);
    }

    // Builds the in-memory index. The first three titles contain
    // deliberate misspellings of "Lucene" to exercise the fuzzy match.
    public static void createIndex() {
        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            createDoc("Sam", "Lucece index option analyzed vs not analyzed");
            createDoc("Sam", "Luaene field boost and query time boost example");
            createDoc("Jack", "How to do Lxeeci search highlight example");
            createDoc("Smith", "Lucene BooleanQuery is depreacted as of 5.3.0");
            createDoc("Smith", "What is term vector in Lucene");
            indexWriter.close();
        } catch (IOException ex) {
            System.out.println("Exception : " + ex.getLocalizedMessage());
        }
    }

    // Runs the query and prints the document id, author, and title of each hit.
    public static void searchIndexAndDisplayResults(Query query) {
        // try-with-resources closes the reader once the search is done
        try (IndexReader idxReader = DirectoryReader.open(ramDirectory)) {
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
            TopDocs docs = idxSearcher.search(query, 10);
            System.out.println("length of top docs: " + docs.scoreDocs.length);
            for (ScoreDoc doc : docs.scoreDocs) {
                Document thisDoc = idxSearcher.doc(doc.doc);
                System.out.println(doc.doc + "\t" + thisDoc.get("author") + "\t" + thisDoc.get("title"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void searchFuzzyQuery() {
        // The query term is not analyzed, so it is written in lowercase to
        // match the terms that StandardAnalyzer lowercased at index time.
        // No maximum edit distance is given, so the default of 2 applies.
        FuzzyQuery fuzzyQuery = new FuzzyQuery(new Term("title", "lucece"));
        searchIndexAndDisplayResults(fuzzyQuery);
    }
}
Output
fuzzy search:
length of top docs: 4
3	Smith	Lucene BooleanQuery is depreacted as of 5.3.0
4	Smith	What is term vector in Lucene
0	Sam	Lucece index option analyzed vs not analyzed
1	Sam	Luaene field boost and query time boost example
The edit distance between Lxeeci and the query term lucece is 3, which is larger than the default maximum of 2, so that document is not matched.