- Solr Cookbook (Third Edition)
- Rafał Kuć
- 2021-08-06 19:39:23
Using parsing update processors to parse data
Let's assume that we are running a bookstore and want to sort our books by publication date and run faceting on the number of likes each book gets. However, we receive all our data as XML, and the values are not in the format Solr expects for date and numeric fields. The good thing is that we can tell Solr to parse our data properly so that we don't have to change what we already have. This recipe will show you how to do this.
Getting ready
Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to updating the request processor configuration.
How to do it...
Let's look at the steps we need to take to make data parsing work.
- First, we need to prepare our index structure, so we add the following section to the schema.xml file:
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="text_general" indexed="true" stored="true" />
  <field name="published" type="date" indexed="true" stored="true" />
  <field name="likes" type="long" indexed="true" stored="true" />
- In addition to this, we need a custom update request processor chain defined. To do this, we add the following section to the solrconfig.xml file:
  <updateRequestProcessorChain name="parse">
    <processor class="solr.ParseLongFieldUpdateProcessorFactory">
      <str name="fieldName">likes</str>
    </processor>
    <processor class="solr.ParseDateFieldUpdateProcessorFactory">
      <str name="fieldName">published</str>
      <arr name="format">
        <str>yyyy-MM-dd</str>
      </arr>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
- The third step is to alter the /update request handler configuration by adding the following section to our solrconfig.xml file:
  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">parse</str>
    </lst>
  </requestHandler>
- Now, we can index our data, which looks like this:
  <add>
    <doc>
      <field name="id">1</field>
      <field name="title">Solr Cookbook 4</field>
      <field name="published">2013-01-10</field>
      <field name="likes">10</field>
    </doc>
  </add>
- After we send our data, we can check a simple query like this:
http://localhost:8983/solr/cookbook/select?q=*:*&sort=published+desc&facet=true&facet.field=likes
The response from Solr looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">106</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="facet.field">likes</str>
      <str name="sort">published desc</str>
      <str name="facet">true</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="title">Solr Cookbook 4</str>
      <date name="published">2013-01-10T00:00:00Z</date>
      <long name="likes">10</long>
      <long name="_version_">1468068127952601088</long>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="likes">
        <int name="10">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
As you can see, the data was properly parsed, the sorting works, and faceting also works, so let's see how it was possible.
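The indexing and query steps above can also be prepared from a script. The following is a minimal Python sketch, not part of the original recipe: it only builds the `<add>` message and the query URL as strings and does not contact Solr. The host and core name are assumptions copied from the query URL shown earlier, and the helper names `build_add_xml` and `build_select_url` are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: host and core name are taken from the recipe's query URL.
SOLR_CORE_URL = "http://localhost:8983/solr/cookbook"


def build_add_xml(doc):
    """Serialize one document (field name -> value) as a Solr <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)  # values are sent as plain strings
    return ET.tostring(add, encoding="unicode")


def build_select_url(params):
    """Build a /select URL with URL-encoded query parameters."""
    return SOLR_CORE_URL + "/select?" + urlencode(params)


book = {"id": "1", "title": "Solr Cookbook 4",
        "published": "2013-01-10", "likes": "10"}
payload = build_add_xml(book)
query_url = build_select_url({
    "q": "*:*",
    "sort": "published desc",
    "facet": "true",
    "facet.field": "likes",
})
```

Note that `urlencode` percent-encodes `*:*`, so the resulting URL is equivalent to the one in the recipe but not character-for-character identical.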
How it works...
Our data is very simple. Each book is described with its identifier (the id field), the title (the title field), the publication date (the published field), and the number of likes (the likes field). The published field is of the date type for proper date-based sorting, and the likes field is of the long type.
Our update request processor chain contains two processors that we have not used before. The first processor, solr.ParseLongFieldUpdateProcessorFactory, is responsible for parsing the data to a long type. It takes the field named in the fieldName property from the document sent for indexing and parses its value. The second processor is solr.ParseDateFieldUpdateProcessorFactory, which we already talked about in the Using Solr in a schemaless mode recipe in Chapter 1, Apache Solr Configuration, but let's recap. It takes the field named in the fieldName property from the document sent for indexing and tries to parse its value using the date formats listed in the format array. We only defined a single format, but you can list multiple formats if you need to.
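To make the idea of trying several formats more concrete, here is a rough Python analogy. It is a sketch under stated assumptions, not Solr's implementation: Python's `strptime` patterns (`%Y-%m-%d`) stand in for the Joda-Time patterns (`yyyy-MM-dd`) that Solr uses, the second format in the list is hypothetical, and returning `None` when nothing matches is a choice made for this sketch only.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the "format" array. Solr uses Joda-Time
# patterns such as yyyy-MM-dd; Python uses strptime patterns instead.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]


def parse_date(value, formats=FORMATS):
    """Try each format in order and return the date in UTC ISO form."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        utc = parsed.replace(tzinfo=timezone.utc)
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return None  # sketch-only choice: no format matched


def parse_long(value):
    """Return an int if the string is a whole number, otherwise None."""
    try:
        return int(value)
    except ValueError:
        return None
```

With the recipe's sample document, `parse_date("2013-01-10")` yields the same `2013-01-10T00:00:00Z` value that appears in the response shown earlier.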
Note
For a description of the possible formats, refer to http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html.
We also defined the solr.UpdateRequestHandler configuration, and then altered the default configuration by adding the defaults section and setting the update.chain property to parse (our update request processor chain name). This means that our update request processor chain will be used with every indexing request sent to the /update handler.
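Solr also accepts update.chain as a request parameter, which can be useful when only some indexing requests should pass through the chain. The sketch below is an illustration rather than part of the recipe: it only builds the URL, the host and core name are assumed from the query URL used earlier, and `build_update_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: host and core name are copied from the recipe's query URL.
UPDATE_URL = "http://localhost:8983/solr/cookbook/update"


def build_update_url(chain=None, commit=True):
    """Build an /update URL, optionally naming the chain for this request."""
    params = {}
    if chain is not None:
        params["update.chain"] = chain  # overrides the handler default
    if commit:
        params["commit"] = "true"
    query = urlencode(params)
    return UPDATE_URL + ("?" + query if query else "")
```

For example, `build_update_url("parse")` produces a URL that selects the parse chain explicitly, while `build_update_url()` leaves the choice to the handler defaults.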
After indexing our data and running a query, we can see that our data has the proper field types. We can also see that sorting works on the published field, whose value was parsed into a date, although the original content of the published field was not in a format understandable by Solr.
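One way to confirm the field types from a script is to look at the element names in the XML response, since Solr returns a date field as a `<date>` element and a long field as a `<long>` element. The following Python sketch works on a trimmed copy of the response shown earlier; the helper name `field_types` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown in the recipe.
RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="title">Solr Cookbook 4</str>
      <date name="published">2013-01-10T00:00:00Z</date>
      <long name="likes">10</long>
    </doc>
  </result>
</response>"""


def field_types(response_xml):
    """Map each field of the first returned document to its element type."""
    root = ET.fromstring(response_xml)
    doc = root.find("./result/doc")
    return {el.get("name"): el.tag for el in doc}


types = field_types(RESPONSE)
```

Here `types` maps published to date and likes to long, which matches the schema.xml definitions.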
See also
- If you want to see all the possibilities of parsing different field types, refer to the Javadoc of solr.FieldMutatingUpdateProcessorFactory available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html. The classes extending this class provide a nice description of the additional possibilities.