In this document, we'll introduce the basic concepts of how Lucene/Solr ranks documents, as well as how to tune the way Solr ranks and returns search results.
To be adept at tuning search relevancy, it helps to understand the Lucene scoring algorithm, also known as the tf.idf model. This scoring model involves a number of scoring factors:
tf - Term Frequency. The frequency with which a term appears in a document. Given a search query, the higher the term frequency, the higher the document score.
idf - Inverse Document Frequency. The rarer a term is across all documents in the index, the higher its contribution to the score.
coord - Coordination Factor. The more query terms that are found in a document, the higher its score.
fieldNorm - Field Length Normalization. The more terms a field contains, the lower the score of a match in that field. This factor penalizes documents with longer field values.
The exact scoring formula that brings these factors together is documented on the Lucene site.
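To make the factors above concrete, here is a minimal Python sketch of how they could combine for a single field. It assumes the formulas of Lucene's classic TF-IDF similarity (square-root tf, logarithmic idf, overlap-ratio coord, inverse-square-root length norm) and leaves out queryNorm and boosts, so treat it as an illustration rather than Lucene's actual implementation. The function names and the toy data are made up for this example.

```python
import math

def tf(freq):
    # Term frequency: grows with the number of occurrences, but sub-linearly.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms get a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    # Coordination factor: fraction of the query terms found in the document.
    return overlap / max_overlap

def field_norm(num_terms):
    # Length normalization: longer fields get a smaller value.
    return 1.0 / math.sqrt(num_terms)

def score(query_terms, field_terms, doc_freqs, num_docs):
    # Simplified score for one field; queryNorm and boosts are omitted.
    matched = [t for t in query_terms if t in field_terms]
    if not matched:
        return 0.0
    norm = field_norm(len(field_terms))
    total = sum(
        tf(field_terms.count(t)) * idf(doc_freqs[t], num_docs) ** 2 * norm
        for t in matched
    )
    return coord(len(matched), len(query_terms)) * total

# Toy data (hypothetical): the same query against a short and a long title.
short_title = ["solr", "ranking"]
long_title = ["solr", "ranking", "and", "many", "other", "unrelated", "words", "here"]
doc_freqs = {"solr": 10, "ranking": 50}

print(score(["solr", "ranking"], short_title, doc_freqs, 1000))
print(score(["solr", "ranking"], long_title, doc_freqs, 1000))
```

With these inputs the short title scores higher than the long one, which is the fieldNorm penalty at work.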
Beyond the scoring factors mentioned above, the primary way to modify document scores is boosting.
There are two kinds of boosts: index-time and query-time.
Index-time boosts are applied when adding documents, and apply to the entire document or to specific fields.
Query-time boosts are applied when constructing a search query, and apply to specific query clauses, such as a term or phrase in a given field.
Query boosts are applied by appending the caret character ^ followed by a positive number to query clauses.
title:foo OR (title:foo AND title:bar)^2.0 OR title:"foo bar"^10
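If you build queries programmatically, a small helper keeps the boost syntax consistent. The sketch below reconstructs the query above and URL-encodes it as request parameters. The boost() helper is a hypothetical name introduced for this example, and fl=id,score is included only so that scores appear in the response; nothing is sent to a server.

```python
from urllib.parse import urlencode

def boost(clause, factor):
    # Append ^factor to a query clause; only positive boosts are accepted.
    if factor <= 0:
        raise ValueError("boost must be a positive number")
    return f"{clause}^{factor}"

clauses = [
    "title:foo",
    boost("(title:foo AND title:bar)", 2.0),
    boost('title:"foo bar"', 10),
]
q = " OR ".join(clauses)

# URL-encode for use in a Solr select request (the request itself is not made here).
params = urlencode({"q": q, "fl": "id,score"})

print(q)
print(params)
```

Note that the caret must be URL-encoded (as %5E) when the query is passed in a URL, which urlencode takes care of.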
Whilst Lucene allows negative boosts, Solr does not.
The only way to meaningfully perform a negative boost is to apply a positive boost to a negative query. For example:
(*:* -title:foo)^2.0
This boosts all documents which don't have "foo" in the title by 2.0, thereby effectively applying a down boost to documents which do.
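The effect on ranking can be illustrated with a toy model. The Python sketch below assumes, purely for illustration, that the boosted clause adds a fixed bonus to every document whose title lacks the term; real Solr scores are computed differently, and the rerank() function, scores, and titles are all made up.

```python
def rerank(base_scores, titles, term, bonus):
    # Add a bonus to every document whose title does NOT contain the term,
    # then return document ids ordered by adjusted score, best first.
    adjusted = {}
    for doc_id, base in base_scores.items():
        lacks_term = term not in titles[doc_id].lower().split()
        adjusted[doc_id] = base + (bonus if lacks_term else 0.0)
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Hypothetical base scores: document "a" ranks first before the boost.
base_scores = {"a": 1.2, "b": 1.0, "c": 0.8}
titles = {"a": "foo basics", "b": "bar guide", "c": "baz notes"}

order = rerank(base_scores, titles, "foo", 2.0)
print(order)
```

Document "a" starts with the highest base score, but because it is the only one that misses the bonus, it ends up last.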
Solr provides another way of boosting documents via function queries.
See http://wiki.apache.org/solr/FunctionQuery for a list of function queries and how to apply them to your query.
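As one possible illustration, the sketch below builds request parameters that add a recency boost through the bf (boost function) parameter of the edismax query parser, using the recip and ms functions. The field names published_date and title are hypothetical, and the constant 3.16e-11 is roughly one divided by the number of milliseconds in a year, so the boost halves for a document that is about a year old. No request is sent.

```python
from urllib.parse import urlencode

# Hypothetical date field; substitute a date field from your own schema.
date_field = "published_date"

# recip(x, m, a, b) computes a / (m * x + b); with these constants the value
# is close to 1 for brand-new documents and about 0.5 after one year.
recency_boost = f"recip(ms(NOW,{date_field}),3.16e-11,1,1)"

params = {
    "defType": "edismax",   # query parser that supports the bf parameter
    "q": "foo bar",
    "qf": "title",          # hypothetical field to search
    "bf": recency_boost,    # function whose value is added to the score
    "fl": "id,score",
}
query_string = urlencode(params)

print(query_string)
```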
© Copyright 2024 Kelvin Tan - Solr and Elasticsearch consultant