
Incremental imports with DIH

In most use cases, indexing the data from scratch during every indexing run doesn't make sense. Why index all 100,000 documents when only 1,000 were modified or added? This is where the Solr Data Import Handler delta queries come in handy. Using them, we can index our data incrementally. This recipe will show you how to set up the Data Import Handler to use delta queries and index data in an incremental way.

Getting ready

Refer to the Indexing data from a database using Data Import Handler recipe in this chapter to get to know the basics of the Data Import Handler configuration. I assume that Solr is set up according to the description given in the mentioned recipe.

How to do it...

We will reuse parts of the configuration shown in the Indexing data from a database using Data Import Handler recipe in this chapter, and we will modify it. Execute the following steps:

  1. The first thing you should do is add a column to the tables you use, a column that will hold the last modification date of each record. So, in our case, let's assume that we added a column named last_modified (which should be a timestamp-based column). Now, our db-data-config.xml will look like this:
    <dataConfig>
     <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/users" user="users" password="secret" />
     <document>
      <entity name="user" query="SELECT user_id, user_name FROM users" deltaImportQuery="SELECT user_id, user_name FROM users WHERE user_id = '${dih.delta.user_id}'" deltaQuery="SELECT user_id FROM users WHERE last_modified &gt; '${dih.last_index_time}'">
       <field column="user_id" name="id" />
       <field column="user_name" name="name" />
        <entity name="user_desc" query="SELECT desc FROM users_description WHERE user_id=${user.user_id}">
        <field column="desc" name="description" />
       </entity>
      </entity>
     </document>
    </dataConfig>
  2. After this, we run a new kind of query to start the delta import:
    http://localhost:8983/solr/cookbook/dataimport?command=delta-import

How it works...

First, we modified our database table to include a column named last_modified. We need to ensure that this column always contains the date the record was last modified. Solr will not modify the database, so your application has to keep it up to date.
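One way to keep last_modified current is to let the database do it. The following PostgreSQL sketch adds the column and a trigger that refreshes it on every update; the function and trigger names are my own choices, so adjust them to your schema:

```sql
-- Illustrative sketch for PostgreSQL; names are hypothetical.
ALTER TABLE users ADD COLUMN last_modified TIMESTAMP NOT NULL DEFAULT now();

-- Trigger function that stamps the row with the current time on every update
CREATE OR REPLACE FUNCTION set_last_modified() RETURNS trigger AS $$
BEGIN
  NEW.last_modified := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_last_modified
  BEFORE UPDATE ON users
  FOR EACH ROW EXECUTE PROCEDURE set_last_modified();
```

With this in place, inserts get a timestamp from the column default and updates are stamped by the trigger, so the application code doesn't need to remember to set the column itself.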

When running a delta import, the Data Import Handler starts by reading a file named dataimport.properties from the Solr configuration directory. If it is not present, the Data Import Handler assumes that no indexing has ever been done. Solr uses this file to store information about the last indexing time, and the file is updated (or created) after indexing finishes. The last index time is stored as a timestamp. As you can guess, the Data Import Handler uses this timestamp to determine which data has changed; it can be referenced in a query through the special ${dih.last_index_time} variable.
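For illustration, a dataimport.properties file written after a successful import might look like the following; the timestamps here are of course made-up examples, and the colons are backslash-escaped because it is a Java properties file:

```properties
#Wed Jul 22 10:15:00 UTC 2015
last_index_time=2015-07-22 10\:15\:00
user.last_index_time=2015-07-22 10\:15\:00
```

Note the per-entity key (user.last_index_time), which lets entities track their own last import time in addition to the global one.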

You might already have noticed the two differences: the user entity now defines two additional attributes, deltaQuery and deltaImportQuery. The deltaQuery attribute is responsible for getting the information about the users that were modified since the last import. Actually, it only fetches the users' unique identifiers, using the last_modified column we added to determine which users changed since the last import. The deltaImportQuery attribute then fetches the user with the appropriate unique identifier (the one returned by deltaQuery) to get all the needed information about that user. One thing worth noticing is the way the user identifier is referenced in the deltaImportQuery attribute: ${dih.delta.user_id}, that is, the dih.delta variable with its user_id property (which is the same as the table column name).

You might notice that I left the query attribute in the entity definition. This is on purpose: you might need to index the full data once again, and with the query attribute in place the same configuration works for full as well as incremental imports.

Finally, we have the command used to run the delta import. Note that, compared to the full import, we didn't use the full-import command; we sent the delta-import command instead.
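As a quick sketch, the delta import can be triggered from the command line with curl. The clean, commit, and optimize parameters are standard Data Import Handler request options; the host and core name are the ones assumed throughout this recipe:

```shell
# Build the delta-import URL; host and core name follow this recipe's setup.
SOLR_URL="http://localhost:8983/solr/cookbook/dataimport"
# clean=false keeps the documents already in the index;
# commit=true commits the changes once the import finishes.
DELTA_URL="${SOLR_URL}?command=delta-import&clean=false&commit=true"
echo "${DELTA_URL}"
# With Solr running, the import would be started with:
# curl "${DELTA_URL}"
```

The same pattern works for a full rebuild: swap delta-import for full-import (where clean usually defaults to true, so the index is wiped first).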

The statuses that are returned by Solr are the same as those for the full import, so refer to the appropriate recipe to see what information they carry.

One more thing: delta queries are supported only by the default SqlEntityProcessor, which means that you can use them only with JDBC data sources.

See also
