What Are Lucene Norms
"Norm" is a strange name to many Lucene beginners, and even to some experienced users, but the idea behind it is very simple.
To understand norms, you first need to know what a boost is in Lucene. See the Lucene field boost example.
There are several kinds of boost: query-time boost, index-time boost, field boost, document boost, and so on.
Norms are the mechanism Lucene uses to store the boost factors for index-time boosts; they are just numbers.
You can set a field boost or a document boost at index time, and Lucene itself also sets a boost value according to the length of a field. All of these values are encoded together and saved as the norm.
In general usage, a norm is an authoritative standard. In the context of Lucene search, it is a normalization value: a one-byte number, calculated at indexing time, that represents a boost factor. The boost factor expresses the importance and relevance of a match, and it affects the score of a result document at search time. The scores determine the ranking of the search results.
Suppose two documents match a query, but one document matches in its title field while the other matches only in its body field. You would probably like to rank the document with the title match higher. You do this by setting the boost factor of the title field higher than that of the body field.
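The title-versus-body case can be sketched in a few lines of plain Java. This is only an illustration: the `Hit` class, the boost values 2.0 and 1.0, and the rule "rank by the boost of the matched field" are assumptions made for the example, not Lucene's API or its real scoring formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not Lucene's API or its scoring formula.
final class BoostRankingSketch {

    // A matched document together with the field in which the query matched.
    static final class Hit {
        final String docId;
        final String matchedField;

        Hit(String docId, String matchedField) {
            this.docId = docId;
            this.matchedField = matchedField;
        }
    }

    // Assumed boost factors: a title match counts twice as much as a body match.
    static float boostOf(String field) {
        return "title".equals(field) ? 2.0f : 1.0f;
    }

    // Returns document ids ordered from the highest boost to the lowest.
    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> boostOf(h.matchedField)).reversed());
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("doc-body", "body"));
        hits.add(new Hit("doc-title", "title"));
        System.out.println(rank(hits)); // the title match is listed first
    }
}
```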
A field can have several boost factors: some are specified explicitly by the user with setBoost, and some are applied implicitly by Lucene (for example, the shorter the field's value, the higher the boost factor). When indexing, Lucene packs all of this information into one byte and stores it per field; this byte is the norm. When searching, the byte is decoded back into a boost value and used to calculate the scores of the results.
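To make the "pack into one byte, decode at search time" step concrete, here is a minimal plain-Java sketch. It is modeled on the `floatToByte315` routine in classic Lucene's `SmallFloat` class and on the default "boost times 1/sqrt(number of terms)" length factor, but the class and method names below are illustrative, and the details should be treated as assumptions rather than as Lucene's exact implementation.

```java
// Sketch of a one-byte norm. Names are illustrative, not Lucene's API.
final class NormByteSketch {

    // Combine the explicit boosts with an implicit length factor:
    // the more terms a field has, the smaller the factor.
    static float computeNorm(float docBoost, float fieldBoost, int numTerms) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return docBoost * fieldBoost * lengthNorm;
    }

    // Lossy float -> byte: keep only the high-order exponent and mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        int zeroPoint = (63 - 15) << 3;
        if (small <= zeroPoint) {
            // Zero or negative input maps to 0; tiny positive input to the smallest value.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zeroPoint + 0x100) {
            return (byte) -1; // too large: clamp to the biggest representable value
        }
        return (byte) (small - zeroPoint);
    }

    // byte -> float: rebuild the float bit pattern from the stored byte.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // A boosted 4-term title versus an unboosted 400-term body.
        byte title = encode(computeNorm(1.0f, 2.0f, 4));
        byte body = encode(computeNorm(1.0f, 1.0f, 400));
        System.out.println("title norm: " + decode(title));
        System.out.println("body norm:  " + decode(body));
    }
}
```

Because only one byte is kept, the decoded value is an approximation of the original product; nearby field lengths can end up with the same norm.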
To disable norms or not: the trade-off
Norms give you the ability to set index-time boosts on documents and to let shorter fields rank higher. But they cost memory and disk space, and they also affect query performance.
When you have no need for index-time boosts and the lengths of your fields are similar, you can turn norms off.
Computing and storing norms is optional; you control it through the field's index options. For each of the two basic options, analyzed and not analyzed, there is a variant that is almost the same but omits norms. Norms consume approximately 1 byte per string field per document in the index, and they also need to be loaded into memory and decoded at search time, so you may sometimes want to disable them. The options are listed here:
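As a rough back-of-the-envelope check of the "about 1 byte per field per document" figure, the following sketch multiplies it out. The document count and field count are made-up numbers for illustration, and the estimate ignores any other index overhead.

```java
// Rough estimate only: assumes ~1 byte per field with norms, per document.
final class NormsMemoryEstimate {

    static long estimateBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // Hypothetical index: 10 million documents, 5 fields that keep norms.
        long bytes = estimateBytes(10_000_000L, 5);
        System.out.println(bytes); // prints 50000000, i.e. roughly 50 MB
    }
}
```

Dropping norms on fields that do not need them removes their share of this total.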
Index.ANALYZED: The field's value is broken into tokens, and boost information is stored as well. This is for typical fields that need to be both searchable and boostable, such as the title and body of a document.
Index.NOT_ANALYZED: The field's value is searched as a whole, without tokenization. Boost information is also stored. It is used for fields that are logically a single unit, such as a URL, file path, date, or serial number.
Index.ANALYZED_NO_NORMS: Same as Index.ANALYZED but without boost information, which saves the disk space and RAM that norms would otherwise consume. It is useful for fields where you only care whether they match, not how relevant the match is.
Index.NOT_ANALYZED_NO_NORMS: Same as Index.NOT_ANALYZED but without boost information.
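The four options form a simple two-by-two grid: tokenized or not, norms or not. The sketch below captures that decision in plain Java. The `IndexOption` enum and the `choose` helper are hypothetical stand-ins that only mirror the option names listed above; they are not Lucene's own classes.

```java
// Hypothetical stand-in for the four index options; not Lucene's own enum.
final class IndexOptionSketch {

    enum IndexOption {
        ANALYZED(true, true),
        NOT_ANALYZED(false, true),
        ANALYZED_NO_NORMS(true, false),
        NOT_ANALYZED_NO_NORMS(false, false);

        final boolean tokenized;   // is the value broken into tokens?
        final boolean storesNorms; // is the one-byte boost information kept?

        IndexOption(boolean tokenized, boolean storesNorms) {
            this.tokenized = tokenized;
            this.storesNorms = storesNorms;
        }
    }

    // Pick an option from two yes/no questions about the field.
    static IndexOption choose(boolean tokenize, boolean needsIndexTimeBoost) {
        if (tokenize) {
            return needsIndexTimeBoost ? IndexOption.ANALYZED : IndexOption.ANALYZED_NO_NORMS;
        }
        return needsIndexTimeBoost ? IndexOption.NOT_ANALYZED : IndexOption.NOT_ANALYZED_NO_NORMS;
    }

    public static void main(String[] args) {
        // A document title: tokenized and boostable.
        System.out.println(choose(true, true));
        // A serial number: matched as a whole, relevance does not matter.
        System.out.println(choose(false, false));
    }
}
```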