Lucene index options: analyzed vs. not analyzed

When you index a field in Lucene, you choose how its value is indexed: analyzed or not analyzed. Both are useful and serve different purposes, so make sure you understand the difference and use each one correctly.

ANALYZED and NOT_ANALYZED are enum values defined in Field.Index that specify the index option of a field.

 
public class Field implements IndexableField {
...
      public static enum Index {
 
        /** Index the tokens produced by running the field's
         * value through an Analyzer.  This is useful for
         * common text. */
        ANALYZED ,
 
        /** Index the field's value without using an Analyzer, so it can be searched.
         * As no analyzer is used the value will be stored as a single term. This is
         * useful for unique Ids like product numbers.
         */
        NOT_ANALYZED ,
 
        ...
    }
...
}
 

The other enum values are NO, NOT_ANALYZED_NO_NORMS, and ANALYZED_NO_NORMS. This post focuses on ANALYZED and NOT_ANALYZED.

Note that this API exists only for compatibility with the pre-4.0 Lucene API. The way a field's type is specified changed with the release of Lucene 4.0.0; see LUCENE-2308 (Separately specify a field's type) for details. Field.Index has been removed from Lucene 6.0. With the new API, you configure a FieldType instead:

 
FieldType type = new FieldType();
type.setTokenized(true);
 

This marks the field as analyzed. In fact, analyzed (tokenized) is the default, so you only need to call setTokenized when you want to set it to false.

With the older API:

 
  doc.add(new Field("contents",                    
                    "content text goes here",
                    Field.Store.NO,
                    Field.Index.ANALYZED));
 

Analyzed means the field's text is run through the analyzer you provide at indexing time, which breaks it into tokens that are indexed as terms. This is what you want when the field contains ordinary text, such as a title or body content.

Besides the main body of a document, there is often metadata associated with it that should not be treated as plain text: the ISBN of a book, the serial code of a product, an email address, a ZIP code, and so on.

When a field acts as a unique identifier or key for the document, use NOT_ANALYZED. The whole field value is indexed as a single, case-sensitive term. Remember that in Lucene only terms are searchable, so indexing such metadata as NOT_ANALYZED is what makes it searchable.

Most analyzers in Lucene lowercase all terms, so a search on an analyzed field is case insensitive as long as the query text goes through the same analysis. With a NOT_ANALYZED field no analyzer is involved, so the query text must match the field value exactly; otherwise you get an empty result set.
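To make the difference concrete, here is a rough sketch in plain Java with no Lucene dependency. It is only a toy model: ToyIndex, analyze, add and search are made-up names for illustration, and the "analyzer" is simply assumed to split on non-alphanumeric characters and lowercase each token, which is much cruder than a real Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy inverted index (NOT Lucene code) contrasting analyzed and not-analyzed values. */
public class ToyIndex {
    // term -> ids of the documents that contain that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Assumed stand-in for an Analyzer: split on non-alphanumerics, lowercase each token. */
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Index a value either as analyzed tokens or as one verbatim term. */
    public void add(int docId, String value, boolean analyzed) {
        List<String> terms = analyzed ? analyze(value) : Collections.singletonList(value);
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(docId);
        }
    }

    /** Exact term lookup, the toy counterpart of a term-level query. */
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        ToyIndex contents = new ToyIndex();
        contents.add(1, "Lucene in Action", true);   // analyzed: three lowercase terms

        ToyIndex serial = new ToyIndex();
        serial.add(1, "AB-83004102", false);         // not analyzed: one verbatim term

        System.out.println(contents.search("lucene"));           // [1]
        System.out.println(contents.search("Lucene in Action")); // []
        System.out.println(serial.search("AB-83004102"));        // [1]
        System.out.println(serial.search("ab-83004102"));        // []
    }
}
```

The analyzed value can be found by any of its lowercase tokens but not by the full original string, while the not-analyzed value can only be found by the exact, case-sensitive string that was indexed.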

When querying a NOT_ANALYZED field, the best Query class to use is TermQuery. It works much like selecting a row by its id in SQL, with the field acting as a primary key in a relational database.

 
Term t = new Term("serial_code", "83004102");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 1);