Join Query Parser

The Join query parser allows users to run queries that normalize relationships between documents.

Solr runs a subquery of the user’s choosing (the v param), identifies all the values that matching documents have in a field of interest (the from param), and then returns documents where those values are contained in a second field of interest (the to param).

In practice, these semantics are much like "inner queries" in a SQL engine. As an example, consider the Solr query below:

/solr/techproducts/select?q={!join from=manu_id_s to=id}title:ipod

This query, which returns a document for each manufacturer that makes a product with "ipod" in the title, is roughly equivalent to the SQL query below:

SELECT *
FROM techproducts
WHERE id IN (
    SELECT manu_id_s
    FROM techproducts
    WHERE title='ipod'
  )

The join operation is done on a term basis, so the from and to fields must use compatible field types. For example, joining between a StrField and an IntPointField will not work. Likewise, joining between a StrField and a TextField that uses LowerCaseFilterFactory will only work for values that are already lowercased in the string field.
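The two-step semantics described above can be sketched in plain Python. This is an in-memory illustration only, not how Solr implements joins; the sample documents, the ref_s field, and the helper names are assumptions made for the example.

```python
from typing import Any, Callable, Iterable

Doc = dict[str, Any]


def as_values(value: Any) -> list[Any]:
    """Treat a missing, single, or multi-valued field uniformly as a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple, set)) else [value]


def join(docs: Iterable[Doc], subquery: Callable[[Doc], bool],
         from_field: str, to_field: str) -> list[Doc]:
    docs = list(docs)
    # Step 1: run the subquery (the v param) and gather the "from" values.
    from_values = {
        value
        for doc in docs if subquery(doc)
        for value in as_values(doc.get(from_field))
    }
    # Step 2: keep documents whose "to" field holds any gathered value.
    return [
        doc for doc in docs
        if any(value in from_values for value in as_values(doc.get(to_field)))
    ]


# Hypothetical sample data, loosely modeled on the techproducts example.
docs = [
    {"id": "apple", "name": "Apple Inc."},
    {"id": "belkin", "name": "Belkin"},
    {"id": "IPOD1", "title": "ipod video", "manu_id_s": "apple"},
    {"id": "CABLE1", "title": "ipod cable", "manu_id_s": "belkin"},
    {"id": "HD1", "title": "hard drive", "manu_id_s": "maxtor"},
]

result = join(docs, lambda d: "ipod" in d.get("title", ""), "manu_id_s", "id")
print([d["id"] for d in result])  # ['apple', 'belkin']

# Term-based matching: the string "7" and the integer 7 are different values,
# so a join between incompatible field types finds nothing.
mixed = [{"id": 7}, {"id": "x", "ref_s": "7"}]
print(join(mixed, lambda d: "ref_s" in d, "ref_s", "id"))  # []
```

The second call mirrors the field type caveat: values are compared as terms, so a string on one side never matches a number on the other.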

Parameters

This query parser takes the following parameters:

from

Required

Default: none

The name of a field which contains values to look for in the to field. Can be single or multi-valued, but must have a field type compatible with the field represented in the to field.

to

Required

Default: none

The name of a field whose value(s) will be checked against those found in the from field. Can be single or multi-valued, but must have a field type compatible with the from field.

fromIndex

Optional

Default: see description

The name of the index to run the "from" query (v parameter) on and where "from" values are gathered. Must be located on the same node as the core processing the request. If this parameter is not defined, it defaults to the value of the processing core. See Joining Across Single Shard Collections or Cross Collection Join below for more information.

score

Optional

Default: see description

Instructs Solr to return information about the "from" query scores. The value of this parameter controls what type of aggregation information is returned. Options include avg (average), max (maximum), min (minimum), total (total), or none.

If method is not specified but score is, then the dvWithScore method is used. If method is specified and is not dvWithScore, then the score value is ignored. See the method parameter documentation below for more details.

method

Optional

Default: see description

Determines which of several query implementations should be used by Solr. Options are restricted to: index, dvWithScore, and topLevelDV.

If unspecified, the default value is index, unless the score parameter is present, in which case the default becomes dvWithScore. Each implementation has its own performance characteristics, and users are encouraged to experiment to determine which implementation performs best for their use case. Details and performance heuristics are given below.

index

The default method unless the score parameter is specified. It uses the terms index structures to process the request. Performance scales with the cardinality and number of postings (term occurrences) in the "from" field. Consider this method when the "from" field has low cardinality, when the "to" side returns a large number of documents, or when sporadic post-commit slowdowns cannot be tolerated (this is a disadvantage of other methods that index avoids).

dvWithScore

Returns an optional "score" statistic alongside result documents. It uses docValues structures if available, but falls back to the field cache when necessary. The first access to the field cache slows down the initial requests following a commit and takes up additional space on the JVM heap, so docValues are recommended in most situations. Performance scales linearly with the number of values matched in the "from" field. This method must be used if score information is required, and should also be considered when the "from" query matches few documents, regardless of the number of "to" side documents returned.

dvWithScore and single value numerics

The dvWithScore method doesn’t support single value numeric fields. Users migrating from versions prior to 7.0 are encouraged to change field types to string and rebuild indexes during migration.

topLevelDV

Can only be used when to and from fields have docValues data, and does not currently support numeric fields. It uses top-level docValues data structures to find results. These data structures outperform other methods as the number of values matched in the from field grows high. But they are also expensive to build and need to be lazily populated after each commit, causing a sometimes-noticeable slowdown on the first query to use them after each commit. If you commit frequently and your use-case can tolerate a static warming query, consider adding one to solrconfig.xml so that this work is done as a part of the commit itself and not attached directly to user requests. Consider this method when the "from" query matches a large number of documents and the "to" result set is small to moderate in size, but only if sporadic post-commit slowness is tolerable.
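The way the method and score parameters interact can be summarized as a small decision function. This is a sketch of the rules as stated in the parameter descriptions above, not Solr's actual code; the function name is hypothetical, and any special handling of score=none is not modeled.

```python
from typing import Optional

VALID_METHODS = {"index", "dvWithScore", "topLevelDV"}


def effective_join_settings(method: Optional[str] = None,
                            score: Optional[str] = None
                            ) -> tuple[str, Optional[str]]:
    """Return the (method, score) pair that takes effect for a join query."""
    if method is not None and method not in VALID_METHODS:
        raise ValueError(f"unsupported method: {method}")
    if method is None:
        # No explicit method: the presence of score switches the default.
        method = "dvWithScore" if score is not None else "index"
    if method != "dvWithScore":
        # Every method other than dvWithScore ignores the score parameter.
        score = None
    return method, score


print(effective_join_settings())                       # ('index', None)
print(effective_join_settings(score="max"))            # ('dvWithScore', 'max')
print(effective_join_settings("topLevelDV", "max"))    # ('topLevelDV', None)
```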

Joining Across Single Shard Collections

You can also specify a fromIndex parameter to join with a field from another core or a single shard collection. If running in SolrCloud mode, then the collection specified in the fromIndex parameter must have a single shard and a replica on all Solr nodes where the collection you’re joining to has a replica.

Let’s consider an example where you want to use a Solr join query to filter movies by directors that have won an Oscar. Specifically, imagine we have two collections with the following fields:

movies: id, title, director_id, …​

movie_directors: id, name, has_oscar, …​

To filter movies by directors that have won an Oscar using a Solr join on the movie_directors collection, you can send the following filter query to the movies collection:

fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true

Notice that the query criteria of the filter (has_oscar:true) is based on a field in the collection specified using fromIndex. Keep in mind that you cannot return fields from the fromIndex collection using join queries; you can only use those fields for filtering results in the "to" collection (movies).
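Because the join filter contains braces, spaces, and an exclamation mark, it needs to be URL-encoded when sent over HTTP. The following sketch builds such a request URL with the Python standard library. The host, port, request handler, and the q and fl values are assumptions for illustration; only the fq value comes from the example above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed for illustration: a local Solr node on the default port.
base_url = "http://localhost:8983/solr/movies/select"

params = {
    "q": "*:*",          # assumed main query: match all movies
    "fq": "{!join from=id fromIndex=movie_directors to=director_id}has_oscar:true",
    "fl": "id,title",    # assumed field list
}

# urlencode percent-encodes the local params syntax so it survives transport.
url = f"{base_url}?{urlencode(params)}"
print(url)

# Decoding the query string recovers the original filter unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["fq"][0])
```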

Next, let’s understand how these collections need to be deployed in your cluster. Imagine the movies collection is deployed to a four node SolrCloud cluster and has two shards with a replication factor of two. Specifically, the movies collection has replicas on the following four nodes:

node 1: movies_shard1_replica1

node 2: movies_shard1_replica2

node 3: movies_shard2_replica1

node 4: movies_shard2_replica2

To use the movie_directors collection in Solr join queries with the movies collection, it needs to have a replica on each of the four nodes. In other words, movie_directors must have one shard and replication factor of four:

node 1: movie_directors_shard1_replica1

node 2: movie_directors_shard1_replica2

node 3: movie_directors_shard1_replica3

node 4: movie_directors_shard1_replica4

At query time, the JoinQParser will access the local replica of the movie_directors collection to perform the join. If a local replica is not available or active, then the query will fail. At this point, it should be clear that since you’re limited to a single shard and the data must be replicated across all nodes where it is needed, this approach works better with smaller data sets where there is a one-to-many relationship between the from collection and the to collection. Moreover, if you add a replica to the to collection, then you also need to add a replica for the from collection.
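The placement requirement can be checked mechanically. The sketch below takes two hypothetical node-to-replica maps (in practice you would derive them from your cluster state) and reports every node that hosts a replica of the "to" collection but lacks a replica of the "from" collection. The function name is an assumption made for this example.

```python
def nodes_missing_from_replica(to_replicas: dict[str, list[str]],
                               from_replicas: dict[str, list[str]]) -> list[str]:
    """Return nodes that host a "to" replica but no "from" replica."""
    return sorted(
        node for node, replicas in to_replicas.items()
        if replicas and not from_replicas.get(node)
    )


movies = {
    "node 1": ["movies_shard1_replica1"],
    "node 2": ["movies_shard1_replica2"],
    "node 3": ["movies_shard2_replica1"],
    "node 4": ["movies_shard2_replica2"],
}
movie_directors = {
    "node 1": ["movie_directors_shard1_replica1"],
    "node 2": ["movie_directors_shard1_replica2"],
    "node 3": ["movie_directors_shard1_replica3"],
    "node 4": ["movie_directors_shard1_replica4"],
}

print(nodes_missing_from_replica(movies, movie_directors))  # []

# Adding a movies replica on a fifth node without a matching
# movie_directors replica would make joins on that node fail.
movies["node 5"] = ["movies_shard1_replica3"]
print(nodes_missing_from_replica(movies, movie_directors))  # ['node 5']
```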

For more information, Erick Erickson has written a blog post about join performance titled Solr and Joins.

Joining Multiple Shard Collections

It’s also possible to join collections that have multiple shards, provided that:

  1. both collections have the same number of shards

  2. both collections use the same router.name

  3. router.field or uniqueKey corresponds to the fields used in the to and from parameters (see details and exceptions below)

  4. the collections' shards are collocated (see example below)

For simplicity, we use the example of joining from "many" to "one", though the same approach may be used in other cases as well.

          "to" collection             "from" collection
node 1    products_shard1_replica1    skus_shard1_replica2
node 2    products_shard1_replica2    skus_shard1_replica1
node 3    products_shard2_replica1    skus_shard2_replica1
node 4    products_shard2_replica2    skus_shard2_replica2

Notice how shardN corresponds to its counterpart in the other collection. Use AffinityPlacementPlugin.withCollectionShards to align collection shards.

Here are the supported options for collection routing:

router.name                      field constraint
implicit                         from, to match router.field
plain or compositeId             from, to match either uniqueKey or router.field
compositeId (the default one)    constraint can be disabled by checkRouterField=false

Note: the third case assumes that the indexer assigns sku.id=<product.id>!<sku_id>. In this case, the compositeId logic puts a product’s skus into the shard with the same name as that of their product. Alternatively, you can set checkRouterField=false and route every document manually via the _route_ pseudo field.

For example, if skus collection has router.field=product_id, we can find products of red colored skus via q={!join to=id from=product_id fromIndex=skus}color:red.
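For the compositeId case, an indexer might build sku ids along the following lines. This is a minimal sketch; the helper names and the sample field values are hypothetical, and the guard against "!" in the product id is a precaution added for this example rather than a documented Solr rule.

```python
def sku_doc(product_id: str, sku_id: str, color: str) -> dict[str, str]:
    """Build a sku whose id carries the product id as its route prefix."""
    if "!" in product_id:
        # Precaution for this sketch: a "!" here would change the prefix.
        raise ValueError("product id must not contain the '!' separator")
    return {
        "id": f"{product_id}!{sku_id}",   # <product.id>!<sku_id>
        "product_id": product_id,
        "color": color,
    }


def route_prefix(doc_id: str) -> str:
    """Return the part of a composite id before the first '!'."""
    return doc_id.split("!", 1)[0]


sku = sku_doc("P1", "S7", "red")
print(sku["id"])                 # P1!S7
print(route_prefix(sku["id"]))   # P1

# The join from the text, finding products that have red skus.
query = "{!join to=id from=product_id fromIndex=skus}color:red"
print(query)
```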

Cross Collection Join

The Cross Collection Join Filter is a method for the join parser that will execute a query against a remote Solr collection to get back a set of join keys that will be used as a filter query against the local Solr collection.

The cross collection join query will create a CrossCollectionQuery object. The CrossCollectionQuery will first query a remote Solr collection and get back a streaming expression result of the join keys. As the join keys are streamed to the node, a bitset of the matching documents in the local index is built up. This avoids keeping the full set of join keys in memory at any given time. This bitset is then inserted into the filter cache upon successful execution as with the normal behavior of the Solr filter cache.
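The streaming idea can be illustrated with a generator that yields join keys one at a time while a bitset of local document ids is filled in. This is a conceptual sketch only, not the CrossCollectionQuery implementation; the in-memory "remote" documents, the key-to-doc-id map standing in for the local index, and all function names are assumptions.

```python
from typing import Iterable, Iterator


def stream_join_keys(remote_docs: Iterable[dict], from_field: str) -> Iterator[str]:
    """Yield join keys one by one instead of collecting them first."""
    for doc in remote_docs:
        yield doc[from_field]


def build_bitset(keys: Iterable[str], local_index: dict[str, list[int]],
                 max_doc: int) -> bytearray:
    """Set one bit per matching local document as each key arrives."""
    bits = bytearray((max_doc + 7) // 8)
    for key in keys:
        for doc_id in local_index.get(key, ()):
            bits[doc_id // 8] |= 1 << (doc_id % 8)
    return bits


def matching_doc_ids(bits: bytearray, max_doc: int) -> list[int]:
    """Read the set bits back as a sorted list of document ids."""
    return [i for i in range(max_doc) if bits[i // 8] & (1 << (i % 8))]


# Hypothetical data: join key -> local document ids holding that key.
local_index = {"p1": [0, 3], "p2": [1], "p3": [2]}
remote_docs = [{"product_id": "p1"}, {"product_id": "p3"}, {"product_id": "p9"}]

bits = build_bitset(stream_join_keys(remote_docs, "product_id"), local_index, max_doc=4)
print(matching_doc_ids(bits, 4))  # [0, 2, 3]
```

Only the fixed-size bitset grows with the local index; the keys themselves are consumed and discarded as they arrive.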

If the local index is sharded according to the join key field, the cross collection join can leverage a secondary query parser called the Hash Range Query Parser. The hash range query parser is responsible for returning only the documents that hash to a given range of values. This allows the CrossCollectionQuery to query the remote Solr collection and return only the join keys that would match a specific shard in the local Solr collection. This has the benefit of making sure that network traffic doesn’t increase as the number of shards increases and allows for much greater scalability.
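The benefit of hash ranges can be seen in a small simulation: each local shard asks only for the join keys whose hash falls inside its own range, so every key crosses the network once rather than once per shard. Note that this sketch uses CRC32 as a stand-in hash and an evenly split unsigned 32-bit space; Solr's compositeId router uses MurmurHash3 and its own range layout, so the actual assignments will differ. All names here are hypothetical.

```python
import zlib


def hash32(key: str) -> int:
    # Stand-in hash for illustration only; NOT the hash Solr uses.
    return zlib.crc32(key.encode("utf-8"))


def shard_ranges(num_shards: int) -> list[tuple[int, int]]:
    """Split the unsigned 32-bit hash space into contiguous inclusive ranges."""
    size = 2**32 // num_shards
    ranges = []
    for i in range(num_shards):
        lo = i * size
        hi = 2**32 - 1 if i == num_shards - 1 else lo + size - 1
        ranges.append((lo, hi))
    return ranges


def keys_for_range(keys: list[str], lo: int, hi: int) -> list[str]:
    """Return only the join keys that hash into one shard's range."""
    return [key for key in keys if lo <= hash32(key) <= hi]


remote_keys = [f"product-{n}" for n in range(20)]
per_shard = [keys_for_range(remote_keys, lo, hi) for lo, hi in shard_ranges(4)]

# Every key is fetched by exactly one shard.
print(sum(len(keys) for keys in per_shard) == len(remote_keys))  # True
```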

The cross collection join query works with both string and point types of fields. The fields that are being used for the join key must be single-valued and have docValues enabled.

It’s advised to shard the local collection by the join key as this allows for the optimization mentioned above to be utilized.

Cross collection join queries should generally not be used as part of the q parameter. They are designed to be used as filter queries (fq parameter) to ensure proper caching.

The remote Solr collection that is being queried should have a single-valued field for the join key with docValues enabled.

The remote Solr collection does not have any specific sharding requirements.

Join Query Parser Definition in solrconfig.xml

The cross collection join has some configuration options that can be specified in solrconfig.xml.

routerField

Optional

Default: none

If the documents are routed to shards using the CompositeID router by the join field, then that field name should be specified in the configuration here. This will allow the parser to optimize the resulting HashRange query.

allowSolrUrls

Optional

Default: none

If specified, this array of strings specifies allow-listed Solr URLs that can be passed to the solrUrl query parameter. Without this configuration the solrUrl parameter cannot be used. This restriction is necessary to prevent an attacker from using Solr to explore the network.

  <queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin">
    <str name="routerField">product_id_s</str>
    <arr name="allowSolrUrls">
      <str>http://othersolr.example.com:8983/solr</str>
    </arr>
  </queryParser>

Cross Collection Join Query Parameters

fromIndex

Required

Default: none

The name of the external Solr collection to be queried to retrieve the set of join key values.

zkHost

Optional

Default: none

The connection string to be used to connect to ZooKeeper. zkHost and solrUrl are both optional parameters, and at most one of them should be specified. If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster will be used.

solrUrl

Optional

Default: none

The URL of the external Solr node to be queried. Must be a character-for-character exact match of an allow-listed URL that is listed in the allowSolrUrls parameter in solrconfig.xml. If the URL does not match, this parameter will be effectively disabled.
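The exact-match rule means that even small differences, such as a trailing slash or different letter case, prevent a URL from being accepted. A client-side pre-check could look like the sketch below. The function name is hypothetical, and this mirrors only the rule as described here, not Solr's internal check.

```python
from typing import Optional


def resolve_solr_url(requested: str, allow_solr_urls: list[str]) -> Optional[str]:
    """Return the URL only on an exact, character-for-character match."""
    return requested if requested in allow_solr_urls else None


# Value taken from the allowSolrUrls example configuration.
allowed = ["http://othersolr.example.com:8983/solr"]

print(resolve_solr_url("http://othersolr.example.com:8983/solr", allowed))
# http://othersolr.example.com:8983/solr

print(resolve_solr_url("http://othersolr.example.com:8983/solr/", allowed))
# None: the trailing slash makes it a different string
```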

from

Required

Default: none

The join key field name in the external collection.

to

Required

Default: none

The join key field name in the local collection.

v

Optional

Default: none

The query substituted in as a local param. This is the query string that will match documents in the remote collection.

routed

Optional

Default: false

If true, the cross collection join query will use each shard’s hash range to determine the set of join keys to retrieve for that shard. This parameter improves the performance of the cross-collection join, but it depends on the local collection being routed by the to field. If this parameter is not specified, the cross collection join query will try to determine the correct value automatically.

ttl

Optional

Default: 3600 seconds

The length of time that a cross collection join query in the cache will be considered valid, in seconds. The cross collection join query will not be aware of changes to the remote collection, so if the remote collection is updated, cached cross collection queries may give inaccurate results. After the ttl period has expired, the cross collection join query will re-execute the join against the remote collection.
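The ttl behavior resembles a time-bounded cache entry: the cached join result is reused until it is older than ttl seconds, then recomputed on the next request. The class below is a conceptual sketch, not Solr's filter cache; the class name, the callable standing in for the remote join, and the injectable clock are assumptions made so the example can run on its own.

```python
import time
from typing import Callable, Optional


class TtlCachedJoin:
    """Cache a join result and recompute it once the ttl has elapsed."""

    def __init__(self, run_remote_join: Callable[[], set],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run = run_remote_join
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[set] = None
        self._expires_at: Optional[float] = None

    def get(self) -> set:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Entry is missing or expired: re-execute against the remote side.
            self._value = self._run()
            self._expires_at = now + self._ttl
        return self._value


# Simulated clock and remote join, so the effect is visible without waiting.
now = [0.0]
executions = []


def fake_remote_join() -> set:
    executions.append(now[0])
    return {"doc-1", "doc-2"}


cache = TtlCachedJoin(fake_remote_join, ttl_seconds=3600, clock=lambda: now[0])
cache.get()
now[0] = 1800.0
cache.get()          # still within ttl: served from cache
now[0] = 3600.0
cache.get()          # ttl expired: join re-executed
print(len(executions))  # 2
```

Between refreshes, changes in the remote collection are not visible, which is the staleness trade-off the ttl description refers to.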

Other Parameters

Any normal Solr query parameter can also be specified/passed through as a local param.

Cross Collection Query Examples

http://localhost:8983/solr/localCollection/query?fl=id&q={!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField" v="*:*"}
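Since cross collection joins are intended for use in the fq parameter, a client might assemble the same local params as a filter query. The sketch below renders the local params from a dictionary and URL-encodes the request. The local_params helper and the q value are assumptions for illustration; the collection, field names, host, and port are the placeholders from the example above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def local_params(parser: str, params: dict[str, object]) -> str:
    """Render a {!parser key="value" ...} prefix with double-quoted values."""
    def quote(value: object) -> str:
        text = str(value).lower() if isinstance(value, bool) else str(value)
        # Escape backslashes and double quotes inside the quoted value.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    body = " ".join(f"{key}={quote(value)}" for key, value in params.items())
    return "{!" + parser + " " + body + "}"


fq = local_params("join", {
    "method": "crossCollection",
    "fromIndex": "otherCollection",
    "from": "fromField",
    "to": "toField",
    "v": "*:*",
})
print(fq)

# Assumed main query: match everything, then narrow with the join filter.
query_string = urlencode({"q": "*:*", "fl": "id", "fq": fq})
url = "http://localhost:8983/solr/localCollection/query?" + query_string
print(url)
```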