- Solr Cookbook (Third Edition)
- Rafał Kuć
Indexing data from a database using Data Import Handler
One of our clients has a problem. His database of users has grown to such a size that even a simple SQL SELECT takes too much time, and he is looking for a way to improve search times. Of course, he has heard about Solr, but he doesn't want to generate XML or any other data format and push it to Solr; he would like the data to be fetched. What can we do about it? Well, there is one thing: we can use one of the contrib modules of Solr, the Data Import Handler. This recipe will show you how to configure a basic setup of the Data Import Handler and how to use it.
How to do it...
Let's assume that we have a database table. To select users from our table, we use the following SQL query:
SELECT user_id, user_name FROM users
The response might look like this:
| user_id | user_name     |
|---------|---------------|
| 1       | John Kowalski |
| 2       | Amanda Looks  |
We also have a second table called `users_description`, where we store the descriptions of users. The SQL query to get data about a particular user looks like this (note that `desc` is a reserved word in PostgreSQL, so the column name has to be double-quoted):
SELECT "desc" FROM users_description WHERE user_id = 1
The response will look as follows:
| desc      |
|-----------|
| superuser |
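If you want to follow along, the following is a minimal PostgreSQL sketch that creates and populates the two tables. The table and column names come from the queries above; the column types and constraints are assumptions:

```sql
-- Sample schema matching the recipe's queries; types and constraints are assumptions.
CREATE TABLE users (
  user_id   INTEGER PRIMARY KEY,
  user_name VARCHAR(255)
);

CREATE TABLE users_description (
  user_id INTEGER REFERENCES users (user_id),
  -- "desc" must be quoted because DESC is a reserved word in PostgreSQL
  "desc"  VARCHAR(255)
);

INSERT INTO users VALUES (1, 'John Kowalski'), (2, 'Amanda Looks');
INSERT INTO users_description VALUES (1, 'superuser');
```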
Now, let's look at the steps we need to take to set up a Data Import Handler and let Solr connect to the database and start indexing the preceding data:
- First, we need to copy the appropriate libraries that are required to use the Data Import Handler. So, let's create the `dih` folder anywhere on the system (I created the folder in the directory where Solr is installed, on the same level as the `lib` directory of Solr) and place the `solr-dataimporthandler-4.10.0.jar` and `solr-dataimporthandler-extras-4.10.0.jar` files from the `dist` directory of the Solr distribution there. In addition to this, we need the following entry to be added to the `solrconfig.xml` file:

```xml
<lib dir="../../dih" regex=".*\.jar" />
```
- Next, we need to modify the `solrconfig.xml` file. You should add an entry like this:

```xml
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
```
- Now, we will create the `db-data-config.xml` file that is responsible for the Data Import Handler configuration. It should have contents like the following example (the inner query uses single quotes around the attribute value so that the reserved `desc` column can be double-quoted inside the SQL):

```xml
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/users"
              user="users"
              password="secret" />
  <document>
    <entity name="user" query="SELECT user_id, user_name FROM users">
      <field column="user_id" name="id" />
      <field column="user_name" name="name" />
      <entity name="user_desc"
              query='SELECT "desc" FROM users_description WHERE user_id=${user.user_id}'>
        <field column="desc" name="description" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

If you want to use other database engines, change the `driver`, `url`, `user`, and `password` attributes.
- Now, let's create a sample index structure. We just add the following section to our `schema.xml` file:

```xml
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="description" type="text" indexed="true" stored="true" />
```
- One more thing before indexing: you should copy an appropriate JDBC driver to the `lib` directory of your Solr installation or to the `dih` directory we created before. You can get the driver library for PostgreSQL at http://jdbc.postgresql.org/download.html.
- Now, we can start indexing. We run the following query to Solr:

http://localhost:8983/solr/cookbook/dataimport?command=full-import
- Note that the full-import command returns as soon as the import starts and the indexing itself runs in the background, so the response won't tell you how the indexing process is going. To check the status of the indexing process, you can run the `status` command, which looks like this:

http://localhost:8983/solr/cookbook/dataimport?command=status
This is how we configure the Data Import Handler.
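Once the status command reports that the import has finished, it's worth checking that the documents actually made it into the index, for example with a simple match-all query:

http://localhost:8983/solr/cookbook/select?q=*:*

Given the sample data, I'd expect an abridged response along these lines (this is an illustration based on the schema above, not captured output):

```xml
<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">1</str>
      <str name="name">John Kowalski</str>
      <str name="description">superuser</str>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="name">Amanda Looks</str>
    </doc>
  </result>
</response>
```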
How it works...
First, we have the `solrconfig.xml` part that actually defines a new request handler, the Data Import Handler, to be used by Solr. We specify the `class` attribute, telling Solr which handler to use, which is `solr.DataImportHandler` in our case. We also said that it will be available under the `/dataimport` address by using the `name` attribute. The `<str name="config">` XML tag specifies the name of the Data Import Handler configuration file.
The second listing is the actual configuration of the Data Import Handler. I used the JDBC source connection sample to illustrate how to configure the Data Import Handler. The contents of this configuration file start with the root tag named `dataConfig`, which is followed by a second tag defining a data source, named `dataSource`. In the example, I used the PostgreSQL (http://www.postgresql.org/) database, and thus the JDBC driver `org.postgresql.Driver`. We also define the database connection URL (the attribute named `url`) and the database credentials (the attributes `user` and `password`).
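As mentioned earlier, switching to another database engine only requires changing these connection attributes. For example, a `dataSource` pointed at a MySQL database might look like the following sketch (the host, port, database name, and credentials are assumptions, and you would also need the MySQL JDBC driver in the `lib` or `dih` directory):

```xml
<dataSource driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/users"
            user="users"
            password="secret" />
```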
Next, we have a document definition, a document that will be constructed by the Data Import Handler and indexed to Solr. To define this, we use the tag named `document`. The document definition is made of database queries, which are the entities.
The entity is defined by a name (the `name` attribute) and an SQL query (the `query` attribute). The entity name can be used to reference values in subqueries; you can see an example of such behavior in the second entity, named `user_desc`, where `${user.user_id}` refers to the `user_id` column of the parent `user` entity. As you might already have noticed, entities can be nested to handle subqueries. The SQL query is there to fetch the data from the database and use it to fill the entity variables that will be indexed.
After the entity comes the mapping definition. There is a single `field` tag for every column returned by a query, but this is not a must; the Data Import Handler can guess the mapping (for example, when the entity field name matches the column name), but I tend to use explicit mappings because I find them easier to maintain. However, let's get back to fields. A field is defined by two attributes: `column`, which is the column name returned by a query, and `name`, which is the field to which the data will be written.
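To illustrate the implicit mapping mentioned above, here is a hypothetical variant of the first entity that relies on column names matching field names instead of `<field>` tags; the SQL aliases do all the mapping work:

```xml
<entity name="user"
        query="SELECT user_id AS id, user_name AS name FROM users" />
```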
Next, we have a Solr query to start the indexing process. There are actually five commands that can be run:

- `/dataimport?command=status`: This command will return the actual status.
- `/dataimport?command=full-import`: This command will start the full import process. Remember that the default behavior is to delete the index contents at the beginning.
- `/dataimport?command=delta-import`: This command will start the incremental indexing process (which is explained in the Incremental imports with DIH recipe later in this chapter).
- `/dataimport?command=reload-config`: This command will force the configuration reload.
- `/dataimport?command=abort`: This command will stop the indexing process.
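For reference, a `status` response contains the handler's state plus a list of counters under `statusMessages`. The abridged sketch below shows the general shape only; the exact keys and values vary between Solr versions, so treat it as an illustration rather than exact output:

```xml
<response>
  <str name="status">busy</str>
  <lst name="statusMessages">
    <str name="Total Rows Fetched">2</str>
    <str name="Total Documents Processed">2</str>
    <str name="Full Dump Started">2014-01-01 12:00:00</str>
  </lst>
</response>
```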
There's more...
There is one more thing that I think you should know, which is explained in the following section.
If you don't want the index contents to be deleted at the start of full indexing using the Data Import Handler, add the `clean=false` parameter to your query. An example query should look like this:

http://localhost:8983/solr/cookbook/dataimport?command=full-import&clean=false
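The same pattern works for the other parameters the Data Import Handler accepts; for example, there is also a `commit` parameter (which defaults to `true`), so a full import that neither cleans the index nor commits at the end could be requested like this:

http://localhost:8983/solr/cookbook/dataimport?command=full-import&clean=false&commit=false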