How to delete documents from an index in Lucene

As with any database, deleting entries from a data store is a common operation. It may not be as common as adding or querying data, but it is a necessary one.

Lucene is no exception.

The pattern is the same: specify a criterion that matches documents, then delete them. No scoring or relevance is involved here, just a boolean match.

Lucene provides several ways to delete documents: by term, by a list of terms, by query, or by a list of queries.
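Only the single-term form is walked through in the next section, so here is a minimal sketch of all four call shapes side by side. It assumes a Lucene version in which IndexWriter.deleteDocuments accepts Term... and Query... varargs (as in the 5.x line this article targets). The class name DeleteVariantsSketch, the tag field, and the temporary index directory are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class DeleteVariantsSketch {

    // Indexes six one-field documents, deletes five of them in four different
    // ways, and returns how many live documents remain.
    static int remainingAfterDeletes() {
        // Assumption: a throwaway temp directory is used as the index store
        // (it is left on disk for brevity).
        try (Directory dir = FSDirectory.open(Files.createTempDirectory("delete-variants"))) {
            try (IndexWriter writer =
                    new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                for (String tag : new String[] {"a", "b", "c", "d", "e", "f"}) {
                    Document doc = new Document();
                    // StringField is indexed as one exact, unanalyzed term.
                    doc.add(new StringField("tag", tag, Field.Store.YES));
                    writer.addDocument(doc);
                }

                // 1. By a single term.
                writer.deleteDocuments(new Term("tag", "a"));
                // 2. By a list of terms: a document matching any of them is deleted.
                writer.deleteDocuments(new Term("tag", "b"), new Term("tag", "c"));
                // 3. By a single query.
                writer.deleteDocuments(new TermQuery(new Term("tag", "d")));
                // 4. By a list of queries; a query with no matches deletes nothing.
                writer.deleteDocuments(
                        new TermQuery(new Term("tag", "e")),
                        new TermQuery(new Term("tag", "zzz")));
            } // closing the writer commits the deletions

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs(); // live (non-deleted) documents only
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("documents left: " + remainingAfterDeletes());
    }
}
```

Only the document tagged f survives here. Any Query can be passed to the query forms; TermQuery is used only to keep the sketch short.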

Delete document by term

The example Java code below illustrates how to delete documents by matching a single term.

For setting up the Gradle project, see: How to do term query in Lucene index example

 
package com.makble.luceneexample;
 
import java.io.IOException;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
 
public class DeleteDocumentExample {
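    public static Analyzer analyzer = new StandardAnalyzer();
    // A fresh IndexWriterConfig is built for each writer in getIndexWriter();
    // a config instance cannot be shared across IndexWriter instances.
    // All documents live in an in-memory index for this example.
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;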
 
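    public static void main(String args []) throws ParseException, IOException {
 
        createIndex();
        // Before the delete: every title contains "lucene", so all five documents match.
        searchSingleTerm("title", "lucene");
        // The term is not analyzed at delete time, so it must equal the indexed
        // (lowercased) token exactly.
        deleteByTerm("title","example");
        // After the delete: only the titles without "example" remain.
        searchSingleTerm("title", "lucene");
        ramDirectory.close();
 
    }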
 
    public static IndexWriter getIndexWriter() throws IOException{
        return new IndexWriter(ramDirectory , new IndexWriterConfig(analyzer));
    }
 
    public static void createDoc(String author, String title) throws IOException {
        indexWriter = getIndexWriter();
        Document doc = new Document();
        doc.add(new TextField("author", author, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
 
        indexWriter.addDocument(doc);
        indexWriter.close();
    }
 
    public static void createIndex() {
        try {
                createDoc("Sam", "Lucene index option analyzed vs not analyzed");    
                createDoc("Sam", "Lucene field boost and query time boost example");    
                createDoc("Jack", "How to do Lucene search highlight example");
                createDoc("Smith","Lucene BooleanQuery is depreacted as of 5.3.0" );
                createDoc("Smith","What is term vector in Lucene" );
 
        } catch (IOException | NullPointerException ex) {
            System.out.println("Exception : " + ex.getLocalizedMessage());
        } 
    }
 
    public static void searchIndexAndDisplayResults(Query query) {
        // try-with-resources closes the reader even if the search throws.
        try (IndexReader idxReader = DirectoryReader.open(ramDirectory)) {
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
 
            TopDocs docs = idxSearcher.search(query, 10);
            System.out.println("length of top docs: " + docs.scoreDocs.length);
            for (ScoreDoc doc : docs.scoreDocs) {
                Document thisDoc = idxSearcher.doc(doc.doc);
                System.out.println(doc.doc + "\t" + thisDoc.get("author")
                        + "\t" + thisDoc.get("title"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    public static void searchSingleTerm(String field, String termText){
        Term term = new Term(field, termText);
        TermQuery termQuery = new TermQuery(term);
 
        searchIndexAndDisplayResults(termQuery);
    }
 
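    public static void deleteByTerm(String field,  String termText) throws IOException    {
        Term term = new Term(field,termText);
        indexWriter = getIndexWriter();
        // Deletes every document whose field contains exactly this term.
        indexWriter.deleteDocuments(term );
        // close() commits the deletion, so readers opened afterwards no longer see the documents.
        indexWriter.close();
 
    }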
}
 
 

Output

 
length of top docs: 5
3    Smith    Lucene BooleanQuery is depreacted as of 5.3.0
4    Smith    What is term vector in Lucene
0    Sam    Lucene index option analyzed vs not analyzed
1    Sam    Lucene field boost and query time boost example
2    Jack    How to do Lucene search highlight example
length of top docs: 3
1    Smith    Lucene BooleanQuery is depreacted as of 5.3.0
2    Smith    What is term vector in Lucene
0    Sam    Lucene index option analyzed vs not analyzed
 
 

Delete document by id

Deleting documents by a general term against a text field is not a good idea. The problem is that you cannot be sure which documents will match, so the result is unpredictable. You could accidentally delete a large number of documents with a wrong term.
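To make the risk concrete, the sketch below previews which of the five example titles a given term would hit. It uses only the JDK: the tokens method is a crude stand-in for an analyzer (lowercase, then split on non-alphanumeric characters), not Lucene's StandardAnalyzer, and the class and method names are hypothetical. The point it illustrates is that the term passed to a delete is compared to the indexed tokens as-is, without being analyzed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TermMatchPreview {

    // The five titles indexed by createIndex() in the example above.
    static final List<String> TITLES = Arrays.asList(
            "Lucene index option analyzed vs not analyzed",
            "Lucene field boost and query time boost example",
            "How to do Lucene search highlight example",
            "Lucene BooleanQuery is depreacted as of 5.3.0",
            "What is term vector in Lucene");

    // Crude stand-in for an analyzer: lowercase, then split on anything that
    // is not a letter or digit. A real analyzer tokenizes differently in the
    // details (for example around "5.3.0").
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Titles whose token list contains the term exactly; the term itself is
    // deliberately NOT lowercased or tokenized.
    static List<String> matching(String term) {
        List<String> hits = new ArrayList<>();
        for (String title : TITLES) {
            if (tokens(title).contains(term)) {
                hits.add(title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matching("example").size()); // 2
        System.out.println(matching("lucene").size());  // 5
        System.out.println(matching("Example").size()); // 0
    }
}
```

A delete on title:example removes two documents, which matches the drop from five hits to three in the output above. A delete on title:lucene would empty the index, and title:Example would silently delete nothing.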

A more sensible approach is to give each document a unique id. This is common practice in almost any database system, but Lucene does not enforce it or provide a default implementation. You have to decide for yourself how to assign the unique id.

Lucene wrappers like Solr or Elasticsearch have built-in support and let you simply configure it. See: Automatic generate id for Solr document.

In MySQL, an auto-increment unique integer column is usually fine. Lucene offers nothing like it, and implementing it yourself is not a good idea.

Below is an example using a UUID as the document ID.

 
package com.makble.luceneexample;
 
import java.io.IOException;
import java.nio.file.Paths;
import java.util.UUID;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.BytesRef;
 
public class DeleteDocumentExample {
    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static FSDirectory fsDirectory ;
    public static Directory indexDirectory ;
    public static IndexWriter indexWriter;
 
    public static void main(String args []) throws ParseException, IOException {
        fsDirectory = new SimpleFSDirectory(Paths.get("testindex"));
        indexDirectory = fsDirectory;
        //indexDirectory = ramDirectory;
 
        // First run: uncomment createIndex() to build the index on disk. Then copy one
        // of the printed ids into the three calls below and comment createIndex() out again.
        //createIndex();
        searchSingleTerm("title", "lucene");
        searchSingleTerm("id", "6d6252ba-d733-4638-915c-1fa4b34ebf1a");
        deleteByTerm("id", "6d6252ba-d733-4638-915c-1fa4b34ebf1a");
        searchSingleTerm("id", "6d6252ba-d733-4638-915c-1fa4b34ebf1a");
 
        indexDirectory.close();
 
    }
 
    public static IndexWriter getIndexWriter() throws IOException{
        return new IndexWriter(indexDirectory , new IndexWriterConfig(analyzer));
    }
 
    public static void createDoc(String author, String title) throws IOException {
        indexWriter = getIndexWriter();
        Document doc = new Document();
        doc.add(new TextField("author", author, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
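        String id = UUID.randomUUID().toString();
        // Stored copy, so the id can be read back with thisDoc.get("id").
        doc.add(new StoredField("id", id ));
        // Indexed as a single untokenized term, so the id can be matched exactly.
        doc.add(new Field("id", new BytesRef(id),StringField.TYPE_STORED )); 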
 
        indexWriter.addDocument(doc);
        indexWriter.close();
    }
 
    public static void createIndex() {
        try {
                createDoc("Sam", "Lucene index option analyzed vs not analyzed");    
                createDoc("Sam", "Lucene field boost and query time boost example");    
                createDoc("Jack", "How to do Lucene search highlight example");
                createDoc("Smith","Lucene BooleanQuery is depreacted as of 5.3.0" );
                createDoc("Smith","What is term vector in Lucene" );
 
        } catch (IOException | NullPointerException ex) {
            System.out.println("Exception : " + ex.getLocalizedMessage());
        } 
    }
 
    public static void searchIndexAndDisplayResults(Query query) {
        // try-with-resources closes the reader even if the search throws.
        try (IndexReader idxReader = DirectoryReader.open(indexDirectory)) {
            IndexSearcher idxSearcher = new IndexSearcher(idxReader);
 
            TopDocs docs = idxSearcher.search(query, 10);
            System.out.println("length of top docs: " + docs.scoreDocs.length);
            for (ScoreDoc doc : docs.scoreDocs) {
                Document thisDoc = idxSearcher.doc(doc.doc);
                System.out.println(doc.doc + "\t" + thisDoc.get("author")
                        + "\t" + thisDoc.get("title")
                        + "\t" + thisDoc.get("id"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    public static void searchSingleTerm(String field, String termText){
        Term term = new Term(field, termText);
        TermQuery termQuery = new TermQuery(term);
 
        searchIndexAndDisplayResults(termQuery);
    }
 
    public static void deleteByTerm(String field,  String termText) throws IOException    {
        Term term = new Term(field,termText);
        indexWriter = getIndexWriter();
        indexWriter.deleteDocuments(term );
        indexWriter.close();
 
    }
}
 
 

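This example should use a file system directory as the index store. Run the program once with the createIndex call enabled, then run it again with that line commented out and with one of the generated ids substituted into main. Each call to createIndex generates different UUIDs.

You will see the first id search return one hit; after the delete, it returns zero. The output below appears to come from a later run, made after the document had already been deleted: the title search returns four of the five documents and both id searches return zero.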

 
length of top docs: 4
3    Smith    What is term vector in Lucene    62ed3e2b-9d95-464b-bfff-d639d9478be6
0    Sam    Lucene index option analyzed vs not analyzed    3189adb2-273f-4181-b166-626dc0f0f382
1    Sam    Lucene field boost and query time boost example    57e8e377-7d22-491e-8379-4437dd4faa73
2    Jack    How to do Lucene search highlight example    1f8525ea-782e-494c-9488-d595e31bd9c1
length of top docs: 0
length of top docs: 0
 
 

The confusing part is that the following line, despite using StringField.TYPE_STORED, does not give back the id as a stored string (the value is passed as a BytesRef), so a separate StoredField has to be added to store it.

 
doc.add(new Field("id", new BytesRef(id),StringField.TYPE_STORED )); 
 

The id field can also be added this way:

 
doc.add(new StringField("id", id,Field.Store.YES ));
 

This is exactly what we want: a stored, indexed, but not analyzed string field. No separate StoredField is needed.
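Because createIndex produces new random UUIDs on every run, the id in main has to be edited by hand each time. If the documents happen to have a natural key, one possible alternative is a name-based UUID derived from that key, which comes out the same on every run. The sketch below uses only the JDK. The stableId helper and the choice of author plus title as the key are assumptions for illustration, not something the example above does.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class StableIdSketch {

    // Hypothetical helper: derives a repeatable id from author and title.
    // UUID.nameUUIDFromBytes builds a name-based (version 3) UUID, so the
    // same input bytes always produce the same UUID.
    static String stableId(String author, String title) {
        // The newline separator keeps ("ab", "c") and ("a", "bc") distinct.
        String key = author + "\n" + title;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String first = stableId("Sam", "Lucene index option analyzed vs not analyzed");
        String again = stableId("Sam", "Lucene index option analyzed vs not analyzed");
        String other = stableId("Smith", "What is term vector in Lucene");

        System.out.println(first.equals(again)); // true: same key, same id
        System.out.println(first.equals(other)); // false: different key
        System.out.println(first.length());      // 36, the usual UUID text form
    }
}
```

This only makes sense if the chosen key really is unique in your data; two documents with the same author and title would end up with the same id, and deleting by that id would remove both.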