
Indexing PDF files

The library on the corner, the one we used to go to, wants to expand its collection and make it available to the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that they can be shared with online users. With all the samples provided by the suppliers comes a problem—how to extract data for the search box from more than 900,000 PDF files. Solr can do this with the use of Apache Tika (http://tika.apache.org/). This recipe will show you how to handle such a task.

How to do it...

To index PDF files, we will need to set up Solr to use extracting request handlers. To do this, we will take the following steps:

  1. First, let's edit our Solr instance, solrconfig.xml, and add the following configuration:
    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
     <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>
      <str name="captureAttr">true</str>
     </lst>
    </requestHandler>
  2. Next, create the extract folder anywhere on your system (I created the folder in the directory where Solr is installed, on the same level as the lib directory of Solr) and place the solr-cell-4.10.0.jar file from the dist directory in it (you can find the file in the Solr distribution archive). After this, copy all the libraries from the contrib/extraction/lib/ directory to the extract directory you created before.
  3. In addition to this, we need the following entry added to the solrconfig.xml file (adjust the path to the one matching your system):
    <lib dir="../../extract" regex=".*\.jar" />

    This is actually all that you need to do in terms of configuration.

  4. The next step is the index structure. To simplify the example, I decided to choose the following index structure (place it in your schema.xml file):
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="text" type="text_general" indexed="true" stored="true"/>"/>
    <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true""/>"/>
  5. To test the indexing process, I created a PDF file, book.pdf, using Bullzip PDF Printer (www.bullzip.com), which contains only the text This is an updated version of Solr cookbook. To index this file, I used the following command:
    curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"
    

    You should see the following response:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">1383</int></lst>
    </response>
  6. To see what was indexed, I ran the following within a web browser:
    http://localhost:8983/solr/cookbook/select/?q=text:solr&fl=attr_creator,attr_modified

    In return, I got the following response:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">text:solr</str>
       <str name="fl">attr_creator,attr_modified</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <arr name="attr_creator">
        <str>Rafał Kuć</str>
       </arr>
       <arr name="attr_modified">
        <str>2014-05-07T11:30:09Z</str>
       </arr>
      </doc>
     </result>
    </response>

How it works...

Binary file parsing is implemented in Solr using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files but also HTML and XML files.

Solr has a dedicated handler that uses Apache Tika. To be able to use it, we need to add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the preceding example.

In addition to the handler definition, we need to specify where Solr should look for the additional libraries we placed in the extract directory we created. The dir attribute of the lib tag should point to the created directory. The regex attribute is a regular expression telling Solr which files to load. The base directory is the Solr home directory, so keep this in mind if you use relative paths.
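
If relative paths cause trouble in your deployment, the lib directive also accepts an absolute path. The following is a minimal sketch, assuming a hypothetical /opt/solr/extract location for the extracted libraries:

    <lib dir="/opt/solr/extract" regex=".*\.jar" />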

Now, let's discuss the default configuration parameters. The fmap.content parameter tells Solr which field the content of the parsed document should be put in. In our case, the parsed content will go to the field named text. The next parameter, lowernames, set to true, tells Solr to lowercase all the field names that come from Tika. The next parameter, uprefix, is very important. It tells Solr how to handle fields that are not defined in the schema.xml file. The name of the field returned by Tika will be prefixed with the value of this parameter and sent to Solr. For example, if Tika returns a field named creator, and we don't have such a field in our index, then Solr will try to index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index attributes of the Tika XHTML elements into separate fields named after those elements. Remember that Tika can return multiple attributes of the same name; this is why we defined the dynamic field as a multivalued one.
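
The same fmap.* mechanism works for any field Tika returns, not just the document content. As a sketch, assuming Tika reports an author attribute for your PDF files (whether it does depends on the files themselves), you could redirect it to a field of your choice by adding one more line to the defaults section shown earlier:

    <!-- hypothetical mapping; assumes Tika returns an "author" attribute -->
    <str name="fmap.author">attr_creator</str>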

Next, we have the command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. First, we define a unique identifier. It's useful to be able to do this while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier, we use the literal.id parameter. The second parameter tells Solr to perform a commit immediately after document processing.
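
Two variations on this command can be handy; a quick sketch against the same cookbook core (the document identifier 2 is just an example):

    # preview what Tika extracts from book.pdf without touching the index
    curl "http://localhost:8983/solr/cookbook/update/extract?extractOnly=true" -F "myfile=@book.pdf"

    # index the file, but let Solr commit within 10 seconds instead of immediately
    curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=2&commitWithin=10000" -F "myfile=@book.pdf"

The extractOnly parameter tells the handler to run Tika and return the extracted content instead of indexing it, which is useful to preview what Solr would receive. When indexing many files, commitWithin is usually a better choice than committing after every single document.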

The test file I created for the purpose of the recipe contained the simple sentence This is an updated version of Solr cookbook. Of course, Tika extracts far more information from the PDF, such as the creation time, creator, and many other attributes. We queried Solr with a simple query, and to keep the response simple, we limited the returned fields to only attr_creator and attr_modified. In response, I got one document that matched the given query. As you can see, Solr was able to extract both the creator and the file modification date. If you want to see all the information extracted by Solr, just remove the fl parameter.
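
For example, running the earlier query without the field list returns every stored field of the matched document, including all the dynamic attr_* fields created during extraction:

    http://localhost:8983/solr/cookbook/select/?q=text:solr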
