Highlighting
Highlighting in Solr allows fragments of documents that match the user’s query to be included with the query response.
The fragments are included in a special section of the query response (the highlighting section), and the client uses the formatting clues also included to determine how to present the snippets to users.
A fragment is a portion of a document field that contains matches from the query. Fragments are sometimes also referred to as "snippets" or "passages".
Highlighting is extremely configurable, perhaps more than any other part of Solr. There are many parameters each for fragment sizing, formatting, ordering, backup/alternate behavior, and more options that are hard to categorize. Nonetheless, highlighting is very simple to use.
Usage
Highlighting requires that you have a uniqueKey defined in your schema.
Common Highlighter Parameters
You need to set the hl and often hl.fl parameters to enable highlighting results to be returned.
The following list documents these and some other supported parameters.
Note that many highlighting parameters support per-field overrides, such as: f.title_txt.hl.snippets.
hl
-
Optional
Default:
false
Use this parameter to enable or disable highlighting. If you want to use highlighting, you must set this to true.
hl.method
-
Optional
Default:
unified
The highlighting implementation/engine to use. Acceptable values are: unified, original, fastVector.
See the Choosing a Highlighter section below for more details on the differences between the available highlighters.
hl.fl
-
Optional
Default: value of
df
Specifies a list of fields to highlight, either comma- or space-delimited. These must be "stored". A wildcard of * (asterisk) can be used to match field globs, such as text_* or even * to highlight on all fields where highlighting is possible. When using *, consider adding hl.requireFieldMatch=true.
Note that the field(s) listed here ought to have compatible text-analysis (defined in the schema) with field(s) referenced in the query to be highlighted. It may be necessary to modify hl.q and hl.qparser and/or modify the text analysis.
The following example uses the local params syntax and Extended DisMax (eDisMax) Query Parser to highlight fields in hl.fl:
&hl.fl=field1 field2&hl.q={!edismax qf=$hl.fl v=$q}&hl.qparser=lucene&hl.requireFieldMatch=true
The default is the value of the df parameter, which in turn has no default.
hl.q
-
Optional
Default: value of
q
A query to use for highlighting. This parameter allows you to highlight different terms or fields than those being used to search for documents. When setting this, you might also need to set hl.qparser.
The default is the value of the q parameter (already parsed).
hl.qparser
-
Optional
Default: see description
The query parser to use for the hl.q query. It only applies when hl.q is set.
The default is the value of the defType parameter, which in turn defaults to lucene.
hl.requireFieldMatch
-
Optional
Default:
false
If false, all query terms will be highlighted for each field to be highlighted (hl.fl) no matter what fields the parsed query refers to. If set to true, only query terms aligning with the field being highlighted will in turn be highlighted.
If the query references fields different from the field being highlighted and they have different text analysis, the query may not highlight query terms it should have and vice versa. The analysis used is that of the field being highlighted (hl.fl), not the query fields.
hl.queryFieldPattern
-
Optional
Default:
none
Similar to hl.requireFieldMatch but allows for multiple fields to match, e.g.:
q=fieldA:one OR fieldB:two OR fieldC:three
hl.fl=fieldA
hl.queryFieldPattern=fieldA,fieldB
Also allows for the hl.fl field to be absent in the query, e.g.:
q=fieldA:one OR fieldB:two
hl.fl=fieldZ
hl.queryFieldPattern=fieldA
If hl.queryFieldPattern and hl.requireFieldMatch=true are both specified, then hl.queryFieldPattern is silently ignored.
hl.usePhraseHighlighter
-
Optional
Default:
true
If set to true, Solr will highlight phrase queries (and other advanced position-sensitive queries) accurately as phrases. If false, the parts of the phrase will be highlighted everywhere instead of only when they form the given phrase.
hl.highlightMultiTerm
-
Optional
Default:
true
If set to true, Solr will highlight wildcard queries (and other MultiTermQuery subclasses). If false, they won’t be highlighted at all.
hl.snippets
-
Optional
Default:
1
Specifies the maximum number of highlighted snippets to generate per field. Any number of snippets from zero to this value may be generated.
hl.fragsize
-
Optional
Default:
100
Specifies the approximate size, in characters, of fragments to consider for highlighting. Using 0 indicates that no fragmenting should be considered and the whole field value should be used.
hl.tag.pre
-
Optional
Default:
<em>
(hl.simple.pre for the Original Highlighter) Specifies the “tag” to use before a highlighted term. This can be any string, but is most often an HTML or XML tag.
hl.tag.post
-
Optional
Default:
</em>
(hl.simple.post for the Original Highlighter) Specifies the “tag” to use after a highlighted term. This can be any string, but is most often an HTML or XML tag.
hl.encoder
-
Optional
Default: empty
If blank, then the stored text will be returned without any escaping/encoding performed by the highlighter. If set to html, then special HTML/XML characters will be encoded (e.g., & becomes &amp;). The pre- and post-snippet characters are never encoded.
hl.maxAnalyzedChars
-
Optional
Default:
51200
The character limit to look for highlights, after which no highlighting will be done. This is mostly a performance concern only for an analysis-based offset source, since it is the slowest. See Schema Options and Performance Considerations.
There are more parameters supported as well, depending on the highlighter chosen (via hl.method).
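As a rough illustration of how the common parameters combine, here is a minimal Python sketch that assembles a request URL. The helper name `build_highlight_url` is hypothetical, the base URL is the one used in the example request on this page, and the field names are only placeholders; adapt them to your own schema.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_highlight_url(base_url, query, fields, snippets=1, fragsize=100,
                        per_field=None):
    """Return a /select URL with highlighting enabled (illustrative helper)."""
    params = {
        "q": query,
        "hl": "true",                 # highlighting is off unless hl=true
        "hl.fl": ",".join(fields),    # comma-delimited list of stored fields
        "hl.snippets": snippets,
        "hl.fragsize": fragsize,
    }
    # Per-field overrides follow the f.<field>.<param> naming convention.
    for field, overrides in (per_field or {}).items():
        for name, value in overrides.items():
            params[f"f.{field}.{name}"] = value
    return f"{base_url}/select?{urlencode(params)}"


url = build_highlight_url(
    "http://localhost:8983/solr/gettingstarted",
    "apple",
    ["manu", "title_txt"],
    per_field={"title_txt": {"hl.snippets": 3}},
)
print(url)
# Decode the query string again to inspect what would be sent.
print(parse_qs(urlsplit(url).query))
```

The sketch only builds the URL; sending it with an HTTP client of your choice is left out so that it runs without a Solr instance.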
Highlighting in the Query Response
In the response to a query, Solr includes highlighting data in a section separate from the documents. It is up to a client to determine how to process this response and display the highlights to users.
Using the example documents included with Solr, we can see how this might work:
In response to a query such as:
http://localhost:8983/solr/gettingstarted/select?hl=on&q=apple&hl.fl=manu&fl=id,name,manu,cat
we get a response such as this (truncated slightly for space):
{
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{
      "id": "MA147LL/A",
      "name": "Apple 60 GB iPod with Video Playback Black",
      "manu": "Apple Computer Inc.",
      "cat": [
        "electronics",
        "music"
      ]
    }]
  },
  "highlighting": {
    "MA147LL/A": {
      "manu": [
        "<em>Apple</em> Computer Inc."
      ]
    }
  }
}
Note the two sections docs and highlighting.
The docs section contains the fields of the document requested with the fl parameter of the query (only "id", "name", "manu", and "cat").
The highlighting section includes the ID of each document, and the field that contains the highlighted portion.
In this example, we used the hl.fl parameter to say we wanted query terms highlighted in the "manu" field.
When there is a match to the query term in that field, it will be included for each document ID in the list.
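Because the highlights live in a separate section keyed by document ID, a client usually has to join the two sections itself. The following Python sketch shows one way this might be done, using the sample response above. The function name `attach_highlights` and the `_snippets` key are hypothetical, and the uniqueKey field is assumed to be `id`.

```python
import json

# The sample response from above, reproduced as a string.
response_text = """
{
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{
      "id": "MA147LL/A",
      "name": "Apple 60 GB iPod with Video Playback Black",
      "manu": "Apple Computer Inc.",
      "cat": ["electronics", "music"]
    }]
  },
  "highlighting": {
    "MA147LL/A": {
      "manu": ["<em>Apple</em> Computer Inc."]
    }
  }
}
"""


def attach_highlights(response, id_field="id"):
    """Return copies of the docs with their highlight snippets attached."""
    highlighting = response.get("highlighting", {})
    merged = []
    for doc in response["response"]["docs"]:
        # A document with no highlighted field gets an empty mapping.
        snippets = highlighting.get(doc[id_field], {})
        merged.append({**doc, "_snippets": snippets})
    return merged


for doc in attach_highlights(json.loads(response_text)):
    for field, fragments in doc["_snippets"].items():
        print(doc["id"], field, fragments)
```

Copying each document rather than mutating it keeps the parsed response intact in case other code still needs the original structure.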
Choosing a Highlighter
Solr provides a HighlightComponent (a SearchComponent) and it’s in the default list of components for search handlers.
It offers a somewhat unified API over multiple actual highlighting implementations / engines (or simply "highlighters") that do the business of highlighting.
There are many parameters supported by more than one highlighter, and sometimes the implementation details and semantics will be a bit different, so don’t expect identical results when switching highlighters.
You should use the hl.method parameter to choose a highlighter.
There are three highlighters available that can be chosen at runtime with the hl.method parameter, in order of general recommendation:
- Unified Highlighter
-
(hl.method=unified) The Unified Highlighter is the newest highlighter (as of Solr 6.4), and it stands out as the most performant and accurate of the options. It can handle typical requirements and others possibly via plugins/extension. We recommend that you use this highlighter as a first choice.
The UH highlights a query very accurately and thus is true to what the underlying Lucene query actually matches. Other highlighters highlight terms more liberally (over-highlight). Esoteric/custom queries are more likely to be supported by this highlighter than by the others.
A strong benefit to this highlighter is that you can opt to configure Solr to put more information in the underlying index to speed up highlighting of large documents; multiple configurations are supported, even on a per-field basis. There is little or no such flexibility of offset sources for the other highlighters. More on this below.
There are some reasons not to choose this highlighter: Passage scoring does not consider boosts in the query. Some users want more/better passage breaking flexibility. The "alternate" fallback options are more primitive.
- Original Highlighter
-
(hl.method=original) The Original Highlighter, sometimes called the "Standard Highlighter" or "Default Highlighter", is Lucene’s original highlighter, a venerable option with many customization options. Its query accuracy is good enough for most needs, although it’s not quite as good/perfect as the Unified Highlighter.
The Original Highlighter will normally analyze stored text on the fly in order to highlight. It will use full term vectors if available. If the text isn’t "stored" but is in doc values (docValues="true"), this highlighter can work with it.
Where this highlighter falls short is performance; it’s often twice as slow as the Unified Highlighter. And despite being the most customizable, it doesn’t have a BreakIterator based fragmenter (all the others do), which could pose a challenge for some languages.
- FastVector Highlighter
-
(hl.method=fastVector) The FastVector Highlighter requires full term vector options (termVectors, termPositions, and termOffsets) on the field, and is optimized with that in mind. It is nearly as configurable as the Original Highlighter with some variability.
This highlighter notably supports multi-colored highlighting such that different query words can be denoted in the fragment with different marking, usually expressed as an HTML tag with a unique color.
This highlighter’s query-representation is less advanced than the Original or Unified Highlighters: for example it will not work well with the surround parser, and there are multiple reported bugs pertaining to queries with stop-words.
Both the FastVector and Original Highlighters can be used in conjunction in a search request to highlight some fields with one and some the other. In contrast, the Unified Highlighter can only be chosen exclusively.
The Unified Highlighter is exclusively configured via search parameters.
In contrast, some settings for the Original and FastVector Highlighters are set in solrconfig.xml.
There’s a robust example of the latter in the "techproducts" configset.
In addition to further information below, more information can be found in the Solr javadocs.
Schema Options and Performance Considerations
Fundamental to the internals of highlighting is detecting the offsets of the individual words that match the query. Some of the highlighters can run the stored text through the analysis chain defined in the schema, some can look the offsets up from postings, and some can look them up from term vectors. These choices have different trade-offs:
-
Analysis: Supported by the Unified and Original Highlighters. If you don’t go out of your way to configure the other options below, the highlighter will analyze the stored text on the fly (during highlighting) to calculate offsets.
The benefit of this approach is that your index won’t grow larger with any extra data that isn’t strictly necessary for highlighting.
The down side is that highlighting speed is roughly linear with the amount of text to process, with a large factor being the complexity of your analysis chain.
For "short" text, this is a good choice. Or maybe it’s not short but you’re prioritizing a smaller index and indexing speed over highlighting performance.
-
Postings: Supported by the Unified Highlighter. Set storeOffsetsWithPositions to true. This adds a moderate amount of extra data to the index but it speeds up highlighting tremendously, especially compared to analysis with longer text fields.
However, wildcard queries will fall back to analysis unless "light" term vectors are added.
-
with Term Vectors (light): Supported only by the Unified Highlighter. To enable this mode, set termVectors to true but no other term vector related options on the field being highlighted.
This adds even more data to the index than just storeOffsetsWithPositions but not as much as enabling all the extra term vector options. Term Vectors are only accessed by the highlighter when a wildcard query is used and will prevent a fall back to analysis of the stored text.
This is definitely the fastest option for highlighting wildcard queries on large text fields.
-
Term Vectors (full): Supported by the Unified, FastVector, and Original Highlighters. Set termVectors, termPositions, and termOffsets to true, and potentially termPayloads for advanced use cases.
This adds substantial weight to the index, similar in size to the compressed stored text. If you are using the Unified Highlighter then this is not a recommended configuration since it’s slower and heavier than postings with light term vectors. However, this could make sense if full term vectors are already needed for another use-case.
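The field options for each offset source can be summarized in a short Python sketch. It builds a field definition in the shape accepted by the Schema API's add-field command without sending it anywhere. The function name `field_definition`, the field name `body_txt`, and the field type `text_general` are assumptions made for illustration; substitute whatever your schema actually defines.

```python
import json


def field_definition(name, offset_source, field_type="text_general"):
    """Return a field definition dict for one of the offset sources above."""
    # Highlighting needs the text to be stored (or otherwise retrievable).
    field = {"name": name, "type": field_type, "indexed": True, "stored": True}
    if offset_source == "analysis":
        pass  # nothing extra: offsets are computed by re-analyzing stored text
    elif offset_source == "postings":
        field["storeOffsetsWithPositions"] = True
    elif offset_source == "postings_light_vectors":
        # "Light" term vectors: termVectors only, no positions or offsets.
        field["storeOffsetsWithPositions"] = True
        field["termVectors"] = True
    elif offset_source == "full_vectors":
        field.update(termVectors=True, termPositions=True, termOffsets=True)
    else:
        raise ValueError(f"unknown offset source: {offset_source}")
    return field


payload = {"add-field": field_definition("body_txt", "postings_light_vectors")}
print(json.dumps(payload, indent=2))
```

Keeping the choice in one function makes it easy to compare the index-size trade-offs described above field by field before committing to a reindex.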
Unified Highlighter
The Unified Highlighter supports the following parameters in addition to the ones listed earlier:
hl.offsetSource
-
Optional
Default: see description
By default, the Unified Highlighter will usually pick the right offset source (see above). However, it may be ambiguous, such as during a migration from one offset source to another that hasn’t completed.
The offset source can be explicitly configured to one of: ANALYSIS, POSTINGS, POSTINGS_WITH_TERM_VECTORS, or TERM_VECTORS.
hl.fragAlignRatio
-
Optional
Default:
0.33
This parameter influences where the first match (i.e., highlighted text) in a passage is positioned.
The default value of 0.33 means to align the match to the left third. A value of 0.0 means to align the match to the left, while 1.0 means to align it to the right. This setting is a best-effort hint, as there are a variety of factors. When there’s lots of text to be highlighted, lowering this number can help performance a lot.
hl.fragsizeIsMinimum
-
Optional
Default:
true
When true, the hl.fragsize parameter is treated as a (soft) minimum fragment size; provided there is enough text, the fragment is at least this size. When false, it’s an optimal target: the highlighter will on average produce highlights of this length. A false setting is slower, particularly when there’s lots of text and hl.bs.type=SENTENCE.
hl.tag.ellipsis
-
Optional
Default: see description
By default, each snippet is returned as a separate value (as is done with the other highlighters). Set this parameter to instead return one string with this text as the delimiter. Note: this is likely to be removed in the future.
hl.defaultSummary
-
Optional
Default:
false
If true, use the leading portion of the text as a snippet if a proper highlighted snippet can’t otherwise be generated.
hl.score.k1
-
Optional
Default:
1.2
Specifies BM25 term frequency normalization parameter 'k1'. For example, it can be set to 0 to rank passages solely based on the number of query terms that match.
hl.score.b
-
Optional
Default:
0.75
Specifies BM25 length normalization parameter 'b'. For example, it can be set to "0" to ignore the length of passages entirely when ranking.
hl.score.pivot
-
Optional
Default:
87
Specifies BM25 average passage length in characters.
hl.bs.language
-
Optional
Default: none
Specifies the breakiterator language for dividing the document into passages.
hl.bs.country
-
Optional
Default: none
Specifies the breakiterator country for dividing the document into passages.
hl.bs.variant
-
Optional
Default: none
Specifies the breakiterator variant for dividing the document into passages.
hl.bs.type
-
Optional
Default:
SENTENCE
Specifies the breakiterator type for dividing the document into passages. Can be SEPARATOR, SENTENCE, WORD, CHARACTER, LINE, or WHOLE. SEPARATOR is a special value that splits text on a user-provided character in hl.bs.separator.
hl.bs.separator
-
Optional
Default: none
Indicates which character to break the text on. Use only if you have defined hl.bs.type=SEPARATOR.
This is useful when the text has already been manipulated in advance to have a special delineation character at desired highlight passage boundaries. This character will still appear in the text as the last character of a passage.
hl.weightMatches
-
Optional
Default:
true
Tells the UH to use Lucene’s "Weight Matches" API instead of doing SpanQuery conversion. This is the most accurate highlighting mode reflecting the query. Furthermore, phrases will be highlighted as a whole instead of word by word. Currently, this setting slows down the unified highlighter a lot when many fields are highlighted.
If either hl.usePhraseHighlighter or hl.highlightMultiTerm is set to false, then this setting is effectively false no matter what you set it to.
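To build some intuition for hl.score.k1, hl.score.b, and hl.score.pivot, the following Python sketch uses the textbook BM25 term-frequency saturation formula. This is an assumption made for illustration: the scorer inside Lucene may differ in its details, and the function name `passage_tf` is hypothetical.

```python
def passage_tf(freq, passage_len, k1=1.2, b=0.75, pivot=87.0):
    """BM25-style saturated term frequency of one term within a passage.

    freq        -- number of times the term occurs in the passage
    passage_len -- passage length in characters
    pivot       -- assumed average passage length (hl.score.pivot)
    """
    if freq <= 0:
        return 0.0
    # With b > 0, passages longer than the pivot get a larger norm
    # and therefore a lower score for the same frequency.
    norm = k1 * ((1.0 - b) + b * (passage_len / pivot))
    return freq / (freq + norm)


# Defaults: the same frequency scores lower in a longer passage.
print(passage_tf(2, 40), passage_tf(2, 400))
# k1=0: any matching term contributes the same amount, whatever its frequency.
print(passage_tf(1, 87, k1=0), passage_tf(5, 87, k1=0))
# b=0: passage length no longer influences the score.
print(passage_tf(2, 40, b=0) == passage_tf(2, 400, b=0))
```

Under this formula, setting k1 to 0 reduces each matching term to a constant contribution, which matches the description of ranking passages solely by the number of query terms that match.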
Original Highlighter
The Original Highlighter supports the following parameters in addition to the ones listed earlier:
hl.mergeContiguous
-
Optional
Default:
false
Instructs Solr to collapse contiguous fragments into a single fragment. A value of true indicates contiguous fragments will be collapsed into a single fragment.
hl.maxMultiValuedToExamine
-
Optional
Default:
Integer.MAX_VALUE
Specifies the maximum number of entries in a multi-valued field to examine before stopping. This can potentially return zero results if the limit is reached before any matches are found.
If used with hl.maxMultiValuedToMatch, whichever limit is reached first will determine when to stop looking.
hl.maxMultiValuedToMatch
-
Optional
Default:
Integer.MAX_VALUE
Specifies the maximum number of matches in a multi-valued field that are found before stopping.
If hl.maxMultiValuedToExamine is also defined, whichever limit is reached first will determine when to stop looking.
hl.alternateField
-
Optional
Default: none
Specifies a field to be used as a backup default summary if Solr cannot generate a snippet (i.e., because no terms match).
hl.maxAlternateFieldLength
-
Optional
Default:
0
Specifies the maximum number of characters of the field to return. Any value less than or equal to 0 means the field’s length is unlimited.
This parameter is only used in conjunction with the hl.alternateField parameter.
hl.highlightAlternate
-
Optional
Default:
true
If set to true and hl.alternateField is active, Solr will show the entire alternate field, with highlighting of occurrences. If hl.maxAlternateFieldLength=N is used, Solr returns at most N characters surrounding the best matching fragment.
If set to false, or if there is no match in the alternate field either, the alternate field will be shown without highlighting.
hl.formatter
-
Optional
Default:
simple
Selects a formatter for the highlighted output. Currently the only legal value is simple, which surrounds a highlighted term with a customizable pre- and post-text snippet.
hl.simple.pre, hl.simple.post
-
Optional
Default: see description
Specifies the text that should appear before (hl.simple.pre) and after (hl.simple.post) a highlighted term, when using the simple formatter. The default is <em> and </em>.
hl.fragmenter
-
Optional
Default:
gap
Specifies a text snippet generator for highlighted text. The standard fragmenter is gap, which creates fixed-sized fragments with gaps for multi-valued fields.
Another option is regex, which tries to create fragments that resemble a specified regular expression.
hl.regex.slop
-
Optional
Default:
0.6
When using the regex fragmenter (hl.fragmenter=regex), this parameter defines the factor by which the fragmenter can stray from the ideal fragment size (given by hl.fragsize) to accommodate a regular expression.
For instance, a slop of 0.2 with hl.fragsize=100 should yield fragments between 80 and 120 characters in length. It is usually good to provide a slightly smaller hl.fragsize value when using the regex fragmenter.
hl.regex.pattern
-
Optional
Default: none
Specifies the regular expression for fragmenting. This could be used to extract sentences.
hl.regex.maxAnalyzedChars
-
Optional
Default:
10000
Instructs Solr to analyze only this many characters from a field when using the regex fragmenter (after which, the fragmenter produces fixed-sized fragments).
Note, applying a complicated regex to a huge field is computationally expensive.
hl.preserveMulti
-
Optional
Default:
false
If true, multi-valued fields will return all values in the order they were saved in the index. If false, only values that match the highlight request will be returned.
hl.payloads
-
Optional
Default:
true
When hl.usePhraseHighlighter is true and the indexed field has payloads but not term vectors (generally quite rare), the index’s payloads will be read into the highlighter’s memory index along with the postings.
If this may happen and you know you don’t need them for highlighting (i.e., your queries don’t filter by payload), then you can save a little memory by setting this to false.
The Original Highlighter has a plugin architecture that enables new functionality to be registered in solrconfig.xml.
The "techproducts" configset shows most of these settings explicitly.
You can use it as a guide to provide your own components, including a SolrFormatter, SolrEncoder, and SolrFragmenter.
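The arithmetic behind hl.regex.slop is easy to check with a small Python sketch. It applies the rule implied by the example in the parameter description (slop 0.2 with hl.fragsize=100 gives 80 to 120 characters); the helper name `regex_fragment_bounds` is hypothetical, and the exact rounding used by the highlighter is an assumption.

```python
def regex_fragment_bounds(fragsize, slop):
    """Return the (min, max) fragment length allowed by the regex fragmenter."""
    low = round(fragsize * (1 - slop))
    high = round(fragsize * (1 + slop))
    return low, high


print(regex_fragment_bounds(100, 0.2))  # the example from the description
print(regex_fragment_bounds(100, 0.6))  # the default slop
```

With the default slop of 0.6 the allowed range becomes quite wide, which is one reason a slightly smaller hl.fragsize is suggested for this fragmenter.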
FastVector Highlighter
The FastVector Highlighter (FVH) can be used in conjunction with the Original Highlighter if not all fields should be highlighted with the FVH.
In such a mode, set hl.method=original and f.yourTermVecField.hl.method=fastVector for all fields that should use the FVH.
One annoyance to keep in mind is that the Original Highlighter uses hl.simple.pre whereas the FVH (and other highlighters) use hl.tag.pre.
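A request that mixes the two highlighters might be assembled as in the following Python sketch. The function name `mixed_highlighter_params` is hypothetical, and `yourTermVecField` is the placeholder field name used above. Both the hl.simple.* and hl.tag.* markers are set so that the output looks the same regardless of which highlighter handles a field.

```python
from urllib.parse import urlencode


def mixed_highlighter_params(query, original_fields, fvh_fields,
                             pre="<b>", post="</b>"):
    """Build request parameters that use the FVH for selected fields only."""
    params = {
        "q": query,
        "hl": "true",
        "hl.method": "original",  # default highlighter for this request
        "hl.fl": ",".join(original_fields + fvh_fields),
        # The Original Highlighter reads hl.simple.*, the FVH reads hl.tag.*.
        "hl.simple.pre": pre,
        "hl.simple.post": post,
        "hl.tag.pre": pre,
        "hl.tag.post": post,
    }
    for field in fvh_fields:
        # Per-field override: these fields need full term vectors.
        params[f"f.{field}.hl.method"] = "fastVector"
    return params


params = mixed_highlighter_params("apple", ["manu"], ["yourTermVecField"])
print(urlencode(params))
```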
In addition to the Common Highlighter Parameters above, the following parameters documented for the Original Highlighter above are also supported by the FVH:
- hl.alternateField
- hl.maxAlternateFieldLength
- hl.highlightAlternate
And here are additional parameters supported by the FVH:
hl.fragListBuilder
-
Optional
Default:
weighted
The snippet fragmenting algorithm. The weighted fragListBuilder uses IDF-weights to order fragments.
Other options are single, which returns the entire field contents as one snippet, or simple. You can select a fragListBuilder with this parameter, or modify an existing implementation in solrconfig.xml to be the default by adding "default=true".
hl.fragmentsBuilder
-
Optional
Default:
default
The fragments builder is responsible for formatting the fragments, which uses <em> and </em> markup by default (if hl.tag.pre and hl.tag.post are not defined).
Another pre-configured choice is colored, which is an example of how to use the fragments builder to insert HTML into the snippets for colored highlights if you choose. You can also implement your own if you’d like. You can select a fragments builder with this parameter, or modify an existing implementation in solrconfig.xml to be the default by adding "default=true".
hl.boundaryScanner
-
See Using Boundary Scanners with the FastVector Highlighter below.
hl.bs.*
-
See Using Boundary Scanners with the FastVector Highlighter below.
hl.phraseLimit
-
Optional
Default:
5000
The maximum number of phrases to analyze when searching for the highest-scoring phrase.
hl.multiValuedSeparatorChar
-
Optional
Default: space character
Text to use to separate one value from the next for a multi-valued field. The default is " " (a space).
Using Boundary Scanners with the FastVector Highlighter
The FastVector Highlighter will occasionally truncate highlighted words.
To prevent this, implement a boundary scanner in solrconfig.xml, then use the hl.boundaryScanner parameter to specify the boundary scanner for highlighting.
Solr supports two boundary scanners: breakIterator and simple.
The breakIterator Boundary Scanner
The breakIterator boundary scanner offers excellent performance right out of the box by taking locale and boundary type into account.
In most cases you will want to use the breakIterator boundary scanner.
To implement the breakIterator boundary scanner, add this code to the highlighting section of your solrconfig.xml file, adjusting the type, language, and country values as appropriate to your application:
<boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
  <lst name="defaults">
    <str name="hl.bs.type">WORD</str>
    <str name="hl.bs.language">en</str>
    <str name="hl.bs.country">US</str>
  </lst>
</boundaryScanner>
Possible values for the hl.bs.type parameter are WORD, LINE, SENTENCE, and CHARACTER.
The simple Boundary Scanner
The simple boundary scanner scans term boundaries for a specified maximum character value (hl.bs.maxScan) and for common delimiters such as punctuation marks (hl.bs.chars).
To implement the simple boundary scanner, add this code to the highlighting section of your solrconfig.xml file, adjusting the values as appropriate to your application:
<boundaryScanner name="simple" class="solr.highlight.SimpleBoundaryScanner" default="true">
  <lst name="defaults">
    <str name="hl.bs.maxScan">10</str>
    <str name="hl.bs.chars">.,!?\t\n</str>
  </lst>
</boundaryScanner>