
Indexing data from a database using Data Import Handler

One of our clients has a problem. His database of users has grown to such a size that even a simple SQL SELECT takes too much time, and he is looking for a way to improve search times. Of course, he has heard about Solr, but he doesn't want to generate XML or any other data format and push it to Solr; he would like the data to be fetched. What can we do about it? Well, there is one thing we can do: use one of the contrib modules of Solr, the Data Import Handler. This recipe will show you how to configure the basic setup of the Data Import Handler and how to use it.

How to do it...

Let's assume that we have a database table. To select users from our table, we use the following SQL query:

SELECT user_id, user_name FROM users

The response might look like this:

| user_id | user_name     |
| 1       | John Kowalski |
| 2       | Amanda Looks  |

We also have a second table called users_description, where we store the descriptions of users. The SQL query to get data about a particular user looks like this:

SELECT desc FROM users_description WHERE user_id = 1

The response will look as follows:

| desc     |
| superuser|
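
For reference, the two tables could have been created with statements like the ones below; the column types are assumptions based on the queries above, and note that desc is a reserved word in many databases (including PostgreSQL), so it may need to be quoted:

-- assumed layout of the users table
CREATE TABLE users (
  user_id   INTEGER PRIMARY KEY,
  user_name VARCHAR(255)
);

-- assumed layout of the users_description table
CREATE TABLE users_description (
  user_id INTEGER REFERENCES users (user_id),
  "desc"  VARCHAR(255)
);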

Now, let's look at the steps we need to take to set up a Data Import Handler and let Solr connect to the database and start indexing the preceding data:

  1. First, we need to copy the libraries that are required to use the Data Import Handler. So, let's create a dih folder anywhere on the system (I created the folder in the directory where Solr is installed, at the same level as Solr's lib directory) and place the solr-dataimporthandler-4.10.0.jar and solr-dataimporthandler-extras-4.10.0.jar files from the dist directory of the Solr distribution in it. In addition to this, we need to add the following entry to the solrconfig.xml file:
    <lib dir="../../dih" regex=".*\.jar" />
  2. Next, we need to modify the solrconfig.xml file. You should add an entry like this:
    <requestHandler name="/dataimport" class="solr.DataImportHandler">
     <lst name="defaults">
      <str name="config">db-data-config.xml</str>
     </lst>
    </requestHandler>
  3. Now, we will create the db-data-config.xml file that is responsible for the Data Import Handler configuration. It should have contents like the following example:
    <dataConfig>
     <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/users" user="users" password="secret" />
     <document>
      <entity name="user" query="SELECT user_id, user_name FROM users">
       <field column="user_id" name="id" />
       <field column="user_name" name="name" />
       <entity name="user_desc" query="SELECT desc FROM users_description WHERE user_id=${user.user_id}">
        <field column="desc" name="description" />
       </entity>
      </entity>
     </document>
    </dataConfig>

    If you want to use other database engines, change the driver, url, user, and password attributes.

  4. Now, let's create a sample index structure. We just add the following section to our schema.xml file (the string and text field types used here are assumed to be defined already; see the sketch after these steps):
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text" indexed="true" stored="true" />
    <field name="description" type="text" indexed="true" stored="true"/>
  5. One more thing before indexing: you should copy an appropriate JDBC driver to the lib directory of your Solr installation or to the dih directory we created before. You can get the driver library for PostgreSQL at http://jdbc.postgresql.org/download.html.
  6. Now, we can start indexing. We send the following request to Solr:
    http://localhost:8983/solr/cookbook/dataimport?command=full-import
  7. As you might know, the indexing process runs asynchronously, so the preceding request returns immediately and won't tell you how indexing is going. To check the status of the indexing process, you can run the status command, which looks like this:
    http://localhost:8983/solr/cookbook/dataimport?command=status

This is how we configure the Data Import Handler.
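
The field definitions in step 4 assume that the string and text field types are already defined in your schema.xml file. If they are not, a minimal sketch could look like the following; the analysis chain for the text type is an assumption, so adapt it to your own needs:

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <!-- a deliberately simple analyzed type; real schemas usually add more filters -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
    </fieldType>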

How it works...

First, we have the solrconfig.xml part that actually defines a new request handler, the Data Import Handler, to be used by Solr. We specified the class attribute, telling Solr which handler to use, which in our case is solr.DataImportHandler. We also specified that the handler will be available at the /dataimport address by using the name property. The <str name="config"> XML tag specifies the name of the Data Import Handler configuration file.

The second listing is the actual configuration of the Data Import Handler. I used a JDBC data source sample to illustrate how to configure it. The contents of this configuration file start with the root tag named dataConfig, which is followed by a tag defining a data source, named dataSource. In the example, I used the PostgreSQL (http://www.postgresql.org/) database, and thus the JDBC driver is org.postgresql.Driver. We also define the database connection URL (the attribute named url) and the database credentials (the user and password attributes).
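
For example, if you wanted to read from MySQL instead of PostgreSQL, only the dataSource definition would need to change; the database name and credentials below are assumptions, and the MySQL JDBC driver jar still has to be copied to the dih or lib directory:

    <!-- MySQL Connector/J data source; adjust host, database, and credentials -->
    <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/users" user="users" password="secret" />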

Next, we have the document definition: the document that will be constructed by the Data Import Handler and indexed into Solr. To define it, we use the tag named document. The document definition is made up of database queries, which are the entities.

An entity is defined by a name (the name attribute) and an SQL query (the query attribute). The entity name can be used to reference values in subqueries; you can see an example of this behavior in the second entity, named user_desc. As you might already have noticed, entities can be nested to handle subqueries. The SQL query is there to fetch the data from the database and use it to fill the entity variables that will be indexed.

After the entity comes the mapping definition. There is a single field tag for every column returned by the query, but this is not a must; the Data Import Handler can guess the mapping (for example, when the column name matches the field name), but I tend to use explicit mappings because I find them easier to maintain. However, let's get back to fields. A field is defined by two attributes: column, which is the name of the column returned by the query, and name, which is the field to which the data will be written.
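
To illustrate the implicit mapping, if the inner query aliased the desc column to description (which already matches the schema field name), the explicit field tag could be dropped altogether; this variant is only a sketch of that behavior:

    <!-- the description column matches the schema field name, so no <field> mapping is needed -->
    <entity name="user_desc" query="SELECT desc AS description FROM users_description WHERE user_id=${user.user_id}" />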

Next, we have the Solr request that starts the indexing process. There are actually five commands that can be run:

  • /dataimport?command=status: This command will return the current status of the import.
  • /dataimport?command=full-import: This command will start the full import process. Remember that the default behavior is to delete the index contents at the beginning.
  • /dataimport?command=delta-import: This command will start the incremental indexing process (which is explained in the Incremental imports with DIH recipe later in this chapter).
  • /dataimport?command=reload-config: This command will force the configuration reload.
  • /dataimport?command=abort: This command will stop the indexing process.

There's more...

There is one more thing that I think you should know, which is explained in the following section.

How to change the default behavior of deleting index contents at the beginning of a full import

If you don't want to delete the index contents at the start of full indexing using the Data Import Handler, add the clean=false parameter to your query. An example query should look like this:

http://localhost:8983/solr/cookbook/dataimport?command=full-import&clean=false