Solr on HDFS
The Solr HDFS Module has support for writing and reading Solr’s index and transaction log files to the HDFS distributed filesystem. It does not use Hadoop MapReduce to process Solr data.
To use HDFS rather than a local filesystem, you must be using Hadoop 2.x and you will need to instruct Solr to use the HdfsDirectoryFactory
.
There are also several additional parameters to define.
These can be set in one of three ways:
-
Pass JVM arguments to the
bin/solr
script. These would need to be passed every time you start Solr withbin/solr
. -
Modify
solr.in.sh
(orsolr.in.cmd
on Windows) to pass the JVM arguments automatically when usingbin/solr
without having to set them manually. -
Define the properties in
solrconfig.xml
. These configuration changes would need to be repeated for every collection, so is a good option if you only want some of your collections stored in HDFS.
Module
This is provided via the hdfs
Solr Module that needs to be enabled before use.
Starting Solr on HDFS
User-Managed Cluters and Single-Node Installations
For user-managed clusters or single-node Solr installations, there are a few parameters you should modify before starting Solr.
These can be set in solrconfig.xml
(more on that below), or passed to the bin/solr
script at startup.
-
You need to use an
HdfsDirectoryFactory
and a data directory in the formhdfs://host:port/path
-
You need to specify an
updateLog
location in the formhdfs://host:port/path
-
You should specify a lock factory type of
'hdfs'
or none.
If you do not modify solrconfig.xml
, you can instead start Solr on HDFS with the following command:
bin/solr start -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.data.dir=hdfs://host:port/path
-Dsolr.updatelog=hdfs://host:port/path
This example will start Solr using the defined JVM properties (explained in more detail below).
SolrCloud Instances
In SolrCloud mode, it’s best to leave the data and update log directories as the defaults Solr comes with and simply specify the solr.hdfs.home
.
All dynamically created collections will create the appropriate directories automatically under the solr.hdfs.home
root directory.
-
Set
solr.hdfs.home
in the formhdfs://host:port/path
-
You should specify a lock factory type of
'hdfs'
or none.
bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.hdfs.home=hdfs://host:port/path
This command starts Solr using the defined JVM properties.
Modifying solr.in.sh (*nix) or solr.in.cmd (Windows)
The examples above assume you will pass JVM arguments as part of the start command every time you use bin/solr
to start Solr.
However, bin/solr
looks for an include file named solr.in.sh
(solr.in.cmd
on Windows) to set environment variables.
By default, this file is found in the bin
directory, and you can modify it to permanently add the HdfsDirectoryFactory
settings and ensure they are used every time Solr is started.
For example, to set JVM arguments to always use HDFS when running in SolrCloud mode (as shown above), you would add a section such as this:
# Set HDFS DirectoryFactory & Settings
-Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs \
-Dsolr.hdfs.home=hdfs://host:port/path \
The Block Cache
For performance, the HdfsDirectoryFactory
uses a Directory that will cache HDFS blocks.
This caching mechanism replaces the standard file system cache that Solr utilizes.
By default, this cache is allocated off-heap.
This cache will often need to be quite large and you may need to raise the off-heap memory limit for the specific JVM you are running Solr in.
For the Oracle/OpenJDK JVMs, the following is an example command-line parameter that you can use to raise the limit when starting Solr:
-XX:MaxDirectMemorySize=20g
HdfsDirectoryFactory Parameters
The HdfsDirectoryFactory
has a number of settings defined as part of the directoryFactory
configuration.
Solr HDFS Settings
solr.hdfs.home
-
Required
Default: none
A root location in HDFS for Solr to write collection data to. Rather than specifying an HDFS location for the data directory or update log directory, use this to specify one root location and have everything automatically created within this HDFS location. The structure of this parameter is
hdfs://host:port/path/solr
.
Block Cache Settings
solr.hdfs.blockcache.enabled
-
Optional
Default:
true
Enable the blockcache.
solr.hdfs.blockcache.read.enabled
-
Optional
Default:
true
Enable the read cache.
solr.hdfs.blockcache.direct.memory.allocation
-
Optional
Default:
true
Enable direct memory allocation. If this is
false
, heap is used. solr.hdfs.blockcache.slab.count
-
Optional
Default:
1
Number of memory slabs to allocate. Each slab is 128 MB in size.
solr.hdfs.blockcache.global
-
Optional
Default:
true
Enable/Disable using one global cache for all SolrCores. The settings used will be from the first HdfsDirectoryFactory created.
NRTCachingDirectory Settings
solr.hdfs.nrtcachingdirectory.enable
-
Optional
Default:
true
Enable the use of NRTCachingDirectory.
solr.hdfs.nrtcachingdirectory.maxmergesizemb
-
Optional
Default:
16
NRTCachingDirectory max segment size for merges.
solr.hdfs.nrtcachingdirectory.maxcachedmb
-
Optional
Default:
192
NRTCachingDirectory max cache size.
HDFS Client Configuration Settings
solr.hdfs.confdir
-
Optional
Default: none
Pass the location of HDFS client configuration files - needed for HDFS HA for example.
Kerberos Authentication Settings
Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr’s HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos authentication from Solr, you need to set the following parameters:
solr.hdfs.security.kerberos.enabled
-
Optional
Default:
false
Set to
true
to enable Kerberos authentication. solr.hdfs.security.kerberos.keytabfile
-
Required
Default: none
A keytab file contains pairs of Kerberos principals and encrypted keys which allows for password-less authentication when Solr attempts to authenticate with secure Hadoop.
This file will need to be present on all Solr servers at the same path provided in this parameter.
solr.hdfs.security.kerberos.principal
-
Required
Default: none
The Kerberos principal that Solr should use to authenticate to secure Hadoop; the format of a typical Kerberos V5 principal is:
primary/instance@realm
.
Update Log settings
When using HDFS to store Solr indexes, it is recommended to also store the transaction logs on HDFS. This can be done by using the solr.HdfsUpdateLog
update log hander class.
The solrconfig.xml is often used to define an update log handler class name either using a variable reference or direct specification, for example:
<updateLog class="${solr.ulog:solr.UpdateLog}">
When specifying a class like this, it needs to be ensured that the correct class name is specified.
When no class name is specified, Solr automatically picks the correct update log handler class solr.HdfsUpdateLog
for collections which are configured to use the HdfsDirectory Factory.
Example solrconfig.xml for HDFS
Here is a sample solrconfig.xml
configuration for storing Solr indexes on HDFS:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://host:port/solr</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
If using Kerberos, you will need to add the three Kerberos related properties to the <directoryFactory>
element in solrconfig.xml
, such as:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
...
<bool name="solr.hdfs.security.kerberos.enabled">true</bool>
<str name="solr.hdfs.security.kerberos.keytabfile">/etc/krb5.keytab</str>
<str name="solr.hdfs.security.kerberos.principal">solr/admin@KERBEROS.COM</str>
</directoryFactory>