- Solr Cookbook(Third Edition)
- Rafał Kuć
- 2021-08-06 19:39:23
Using scripting update processors to modify documents
Sometimes we need to modify documents during indexing, and we don't want to do this on the indexing application side. For example, suppose we have documents describing Internet sites, and we want to filter the sites by the protocol they use, for example, http or https. We don't have this information as a separate value; we only have the whole URL. Let's see how we can achieve this with Solr.
Getting ready
Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get familiar with update request processor configuration.
How to do it...
The following steps will take you through the process of achieving our goal:
- First, we start with the index structure, putting the following section in the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="url" type="text_general" indexed="true" stored="true" />
    <field name="protocol" type="string" indexed="true" stored="true" />
- The next step is configuring Solr by adding a new update request processor chain called script. We do this by adding the following section to our solrconfig.xml file:
    <updateRequestProcessorChain name="script">
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">script.js</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
- The third step is to alter the /update request handler configuration by adding the following section to our solrconfig.xml file:
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">script</str>
      </lst>
    </requestHandler>
- Finally, we need the script mentioned in the update request processor chain configuration, which we called script.js and stored in the conf directory (the same directory where the schema.xml file is placed). The content of the script.js file looks as follows:
    function processAdd(cmd) {
      // The document that is being added to the index
      var doc = cmd.solrDoc;
      var url = doc.getFieldValue("url");
      if (url != null) {
        // Everything before the first ":" is treated as the protocol
        var parts = url.split(":");
        if (parts != null && parts.length > 0) {
          doc.setField("protocol", parts[0]);
        }
      }
    }

    function processDelete(cmd) { }

    function processMergeIndexes(cmd) { }

    function processCommit(cmd) { }

    function processRollback(cmd) { }

    function finish() { }
Our example data looks as follows:
    <add>
      <doc>
        <field name="id">1</field>
        <field name="url">http://solr.pl/</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="url">https://drive.google.com/</field>
      </doc>
    </add>
- After indexing our data, we can try our script out by running the following query:
http://localhost:8983/solr/cookbook/select?q=*:*&fq=protocol:http
The response from Solr should be similar to the following:
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
          <str name="q">*:*</str>
          <str name="fq">protocol:http</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="id">1</str>
          <str name="url">http://solr.pl/</str>
          <str name="protocol">http</str>
          <long name="_version_">1468022030035058688</long>
        </doc>
      </result>
    </response>
As you can see, everything works as it should, so now let's see how it works.
How it works...
Our data is very simple. Each document is described with its identifier (the id field), the URL (the url field), and the field holding the protocol (the protocol field). The first two fields will be passed in the data; the protocol field will be filled automatically by our update request processor chain.
The next thing is to configure our update request processor chain. We already described most of the configuration details in the Counting the number of fields recipe of this chapter. The new thing is the solr.StatelessScriptUpdateProcessorFactory processor. It allows us to define a script (using the script property) that will be used to process our documents. In our case, this script is called script.js. Solr will load this script and use it for each document passed through the update request processor chain.
We also defined the solr.UpdateRequestHandler configuration and altered its default configuration by adding the defaults section and setting the update.chain property to script (our update request processor chain name). This means that our update request processor chain will be used with every indexing request sent to the /update handler.
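Because update.chain is set in the defaults section, it acts as a default value, and a request can also name the chain explicitly as a request parameter. The following is a minimal sketch of building such an update URL; the updateUrl helper is hypothetical and introduced only for illustration, and the host, port, and cookbook core name are taken from the query used in this recipe.

```javascript
// Hypothetical helper (not part of Solr): builds an update URL that
// names the update request processor chain explicitly.
function updateUrl(baseUrl, core, chain) {
  var url = new URL(baseUrl + "/" + encodeURIComponent(core) + "/update");
  // update.chain selects the chain for this request
  url.searchParams.set("update.chain", chain);
  // commit=true asks Solr to commit after the update is processed
  url.searchParams.set("commit", "true");
  return url.toString();
}

var target = updateUrl("http://localhost:8983/solr", "cookbook", "script");
console.log(target);
// prints "http://localhost:8983/solr/cookbook/update?update.chain=script&commit=true"
```

With the defaults section from this recipe in place, the explicit parameter is not needed; it is mainly useful when several chains are defined and only some requests should go through the script.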
Finally, we come to the juicy part of the recipe, the script.js script. The solr.StatelessScriptUpdateProcessorFactory processor allows us to alter Solr behavior using the following script functions:
- processAdd: This function is executed when a document is added to the index. In our case, we will put our code in this function.
- processDelete: This function is executed when a delete operation is sent to Solr.
- processMergeIndexes: This function is executed when the index merge command is sent to Solr.
- processCommit: This function is executed when the commit command is sent to Solr.
- processRollback: This function is executed when the rollback command is sent to Solr.
- finish: Any code that should run after the script has finished processing is put in this function.
Apart from the finish function, all the other functions have a single argument that represents the command sent to Solr.
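To make the role of these functions more concrete, here is a minimal sketch that counts the documents seen by processAdd and reports the total in finish. It rests on assumptions: the logger object is a stand-in for the logger that Solr is assumed to make available to the script, the counter relies on script state not carrying over between requests (as the "stateless" name suggests), and the driver calls at the bottom only imitate one update request so that the sketch can run outside Solr.

```javascript
// Stand-in for the logger assumed to be provided by Solr; outside Solr
// we simply collect the messages in an array.
var messages = [];
var logger = {
  info: function (msg) { messages.push(String(msg)); }
};

// Number of documents added during the current request
var added = 0;

function processAdd(cmd) {
  added++;
}

function processDelete(cmd) { }

function processMergeIndexes(cmd) { }

function processCommit(cmd) { }

function processRollback(cmd) { }

function finish() {
  logger.info("documents added in this request: " + added);
}

// Hypothetical driver: imitates a request that adds two documents,
// commits, and then finishes.
processAdd({ solrDoc: {} });
processAdd({ solrDoc: {} });
processCommit({});
finish();

console.log(messages[0]);
// prints "documents added in this request: 2"
```

Inside Solr, the stand-in logger and the driver calls at the bottom would be dropped, since Solr itself invokes the functions.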
As already mentioned, we only need to provide logic in the processAdd function. We start by retrieving the Solr document from the command (the cmd object) and storing it in the doc variable (doc = cmd.solrDoc;). Next, we get the url field of the document (url = doc.getFieldValue("url");). We check whether the field is defined (if (url != null)); if it is, we split the URL using the : character. This means that for the http://solr.pl URL, we should get an array containing two parts: http and //solr.pl. We are interested in the first value. We check whether the parts variable, which was returned by the split function, is defined and whether it has any elements (if (parts != null && parts.length > 0)). If the condition is true, we set a new field using the first element of the parts array, which will contain the protocol.
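The logic described above can be tried outside Solr with a small stand-in for the document object. The following is a minimal sketch rather than the recipe's code: makeDoc, protocolBySplit, and protocolByPattern are hypothetical helpers introduced only for illustration, and the real cmd.solrDoc is a Solr input document, not a plain JavaScript object. The sketch also shows one property of splitting on ":" that is worth knowing: when the value has no scheme, the whole value is returned as the first part, so a stricter pattern (an assumption, not part of the recipe) may be preferable.

```javascript
// Hypothetical stand-in for the Solr document; only the two methods
// used by the recipe's script are imitated.
function makeDoc(fields) {
  return {
    getFieldValue: function (name) {
      return Object.prototype.hasOwnProperty.call(fields, name) ? fields[name] : null;
    },
    setField: function (name, value) {
      fields[name] = value;
    }
  };
}

// Same approach as the recipe: take everything before the first ":".
function protocolBySplit(url) {
  var parts = String(url).split(":");
  return parts.length > 0 ? parts[0] : null;
}

// Stricter variant: accept only values that start with "<scheme>://".
function protocolByPattern(url) {
  var match = /^([A-Za-z][A-Za-z0-9+.-]*):\/\//.exec(String(url));
  return match ? match[1].toLowerCase() : null;
}

function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var url = doc.getFieldValue("url");
  if (url != null) {
    var protocol = protocolByPattern(url);
    if (protocol != null) {
      doc.setField("protocol", protocol);
    }
  }
}

var doc = makeDoc({ id: "1", url: "http://solr.pl/" });
processAdd({ solrDoc: doc });
console.log(doc.getFieldValue("protocol"));            // prints "http"
console.log(protocolBySplit("solr.pl/index.html"));    // prints "solr.pl/index.html"
console.log(protocolByPattern("solr.pl/index.html"));  // prints null
```

For the example data in this recipe both variants give the same result; they differ only for values that do not start with a scheme.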
After indexing our data and running a query that filters the documents to only those that have the http protocol, we see that we did the job right.
See also
- If you want to read more about solr.StatelessScriptUpdateProcessorFactory, refer to the Solr Javadoc available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html