
Using scripting update processors to modify documents

Sometimes, we need to modify documents during indexing, and we don't want to do this on the indexing application side. For example, say we have documents describing Internet sites and we want to filter them by the protocol used, for example, http or https. We don't have this information as a separate field; we only have the whole URL. Let's see how we can achieve this with Solr.

Getting ready

Before continuing with the following recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to the update request processor configuration.

How to do it...

The following steps will take you through the process of achieving our goal:

  1. First, we start with the index structure, putting the following section in the schema.xml file:
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="url" type="text_general" indexed="true" stored="true"/>
    <field name="protocol" type="string" indexed="true" stored="true" />
  2. The next step is configuring Solr by adding a new update request processor chain called script. We do this by adding the following section to our solrconfig.xml file:
    <updateRequestProcessorChain name="script">
     <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">script.js</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
  3. The third step is to alter the /update request handler configuration by adding the following section to our solrconfig.xml file:
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
      <str name="update.chain">script</str>
     </lst>
    </requestHandler>
  4. Finally, we need the script mentioned in the update request processor chain configuration, which we called script.js and stored in the conf directory (the same directory where the schema.xml file is placed). The content of the script.js file looks as follows:
    function processAdd(cmd) {
      // The document that is being added to the index
      doc = cmd.solrDoc;
      url = doc.getFieldValue("url");
      if (url != null) {
        // Everything before the first colon is treated as the protocol
        parts = url.split(":");
        if (parts != null && parts.length > 0) {
          doc.setField("protocol", parts[0]);
        }
      }
    }
    
    function processDelete(cmd) {
    }
    
    function processMergeIndexes(cmd) {
    }
    
    function processCommit(cmd) {
    }
    
    function processRollback(cmd) {
    }
    
    function finish() {
    }

    Our example data looks as follows:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="url">http://solr.pl/</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="url">https://drive.google.com/</field>
     </doc>
    </add>
  5. After indexing our data, we can try our script out by running the following query:
    http://localhost:8983/solr/cookbook/select?q=*:*&fq=protocol:http

    The response from Solr should be similar to the following:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
       <str name="fq">protocol:http</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <str name="id">1</str>
       <str name="url">http://solr.pl/</str>
       <str name="protocol">http</str>
       <long name="_version_">1468022030035058688</long>
      </doc>
     </result>
    </response>

As you can see, everything works as it should, so now let's see how it works.

How it works...

Our data is very simple. Each document is described with its identifier (the id field), the URL (the url field), and the field holding the protocol (the protocol field). The first two fields will be passed in the data; the protocol field will be filled automatically by our update request processor chain.

The next thing is to configure our update request processor chain. We already described most of the configuration details in the Counting the number of fields recipe of this chapter. The new thing is the solr.StatelessScriptUpdateProcessorFactory processor. It allows us to define a script (using the script property) that will be used to process our documents. In our case, this script is called script.js. Solr will load this script and use it for each document passed through the update request processor chain.

We also defined the solr.UpdateRequestHandler configuration and altered its default behavior by adding the defaults section and setting the update.chain property to script (the name of our update request processor chain). This means that our update request processor chain will be used with every indexing request sent to the /update handler.

Finally, we come to the juicy part of the recipe, the script.js script. The solr.StatelessScriptUpdateProcessorFactory processor allows us to alter Solr behavior using the following script functions:

  • processAdd: This function is executed when a document is added to the index. In our case, we will put our code in this function.
  • processDelete: This function is executed when a delete operation is sent to Solr.
  • processMergeIndexes: This function is executed when the index merge command is sent to Solr.
  • processCommit: This function is executed when the commit command is sent to Solr.
  • processRollback: This function is executed when the rollback command is sent to Solr.
  • finish: Any code that should run after the script has finished executing is put in this function.

Apart from the finish function, all the other functions have a single argument that represents the command sent to Solr.

As already mentioned, we only need to provide logic in the processAdd function. We start by retrieving the Solr document from the command (the cmd object) and storing it in the doc variable (doc = cmd.solrDoc;). Next, we get the url field of the document (url = doc.getFieldValue("url");). We check whether the field is defined (if (url != null)); if it is, we split the URL on the : character. This means that for the http://solr.pl/ URL, we get an array containing two parts: http and //solr.pl/. We are interested in the first value. We check whether the parts variable, which was returned by the split function, is defined and whether it has any elements (if (parts != null && parts.length > 0)). If the condition is true, we set a new field, protocol, to the first element of the parts array, which contains the protocol.
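The extraction logic can also be tried outside Solr. The following is only a minimal sketch, assuming a plain JavaScript runtime such as Node.js. The makeDoc helper and the objects it returns are hypothetical stand-ins for the Solr document; they mimic nothing more than the getFieldValue and setField calls that the script relies on.

```javascript
// Hypothetical stand-in for the Solr document: it only mimics the
// getFieldValue and setField calls used by the recipe's script.
function makeDoc(fields) {
  return {
    getFieldValue: function (name) {
      return Object.prototype.hasOwnProperty.call(fields, name) ? fields[name] : null;
    },
    setField: function (name, value) {
      fields[name] = value;
    }
  };
}

// Same logic as the processAdd function in script.js.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var url = doc.getFieldValue("url");
  if (url != null) {
    var parts = url.split(":");
    if (parts != null && parts.length > 0) {
      doc.setField("protocol", parts[0]);
    }
  }
}

// The two documents from the example data, plus two edge cases.
var first = makeDoc({ id: "1", url: "http://solr.pl/" });
var second = makeDoc({ id: "2", url: "https://drive.google.com/" });
var withoutUrl = makeDoc({ id: "3" });
var withoutColon = makeDoc({ id: "4", url: "solr.pl" });

[first, second, withoutUrl, withoutColon].forEach(function (doc) {
  // The command object is reduced to the single property the script reads.
  processAdd({ solrDoc: doc });
});

console.log(first.getFieldValue("protocol"));        // http
console.log(second.getFieldValue("protocol"));       // https
console.log(withoutUrl.getFieldValue("protocol"));   // null
console.log(withoutColon.getFieldValue("protocol")); // solr.pl
```

The last case is worth noting: when the value contains no colon, split returns a single-element array, so in this sketch the whole value ends up in the protocol field. If such values can appear in the data, the script would need an additional check.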

After indexing our data and running a query that filters the documents to only those that have the http protocol, we see that we did the job right.
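Why the https document is left out can be illustrated outside Solr as well. The snippet below is only a sketch, assuming that the string type maps to an untokenized field (as solr.StrField does in the stock schema), so that the filter behaves like an exact comparison. The docs array and the filterByProtocol helper are hypothetical; they simply mirror the example data after the script has run.

```javascript
// Hypothetical in-memory copy of the two example documents, with the
// protocol field already filled in by the update script.
var docs = [
  { id: "1", url: "http://solr.pl/", protocol: "http" },
  { id: "2", url: "https://drive.google.com/", protocol: "https" }
];

// Exact comparison, standing in for fq=protocol:http on an untokenized field.
function filterByProtocol(allDocs, protocol) {
  return allDocs.filter(function (doc) {
    return doc.protocol === protocol;
  });
}

var matches = filterByProtocol(docs, "http");
console.log(matches.map(function (doc) { return doc.id; }));
```

Under that assumption, the value https is not matched by a filter on http even though it starts with the same characters, which is consistent with the single result in the response.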

See also
