
Indexing PDF files

The library on the corner that we used to go to wants to expand its collection and become available to the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that they can be shared with online users. With all the samples provided by the suppliers comes a problem: how to extract data for the search box from more than 900,000 PDF files. Solr can do this using Apache Tika (http://tika.apache.org/). This recipe will show you how to handle such a task.

How to do it...

To index PDF files, we will need to set up Solr to use the extracting request handler. To do this, we will take the following steps:

  1. First, let's edit our Solr instance, solrconfig.xml, and add the following configuration:
    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
     <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>
      <str name="captureAttr">true</str>
     </lst>
    </requestHandler>
  2. Next, create the extract folder anywhere on your system (I created the folder in the directory where Solr is installed, on the same level as the lib directory of Solr) and place the solr-cell-4.10.0.jar file from the dist directory in it (you can find the file in the Solr distribution archive). After this, copy all the libraries from the contrib/extraction/lib/ directory to the extract directory you created before (see the example commands right after these steps).
  3. In addition to this, we need the following entries added to the solrconfig.xml file (adjust the path to the one matching your system):
    <lib dir="../../extract" regex=".*\.jar" />

    This is actually all that you need to do in terms of configuration.

  4. The next step is the index structure. To simplify the example, I decided to choose the following index structure (place it in your schema.xml file):
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="text" type="text_general" indexed="true" stored="true"/>"/>
    <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true""/>"/>
  5. To test the indexing process, I created a PDF file, book.pdf, using Bullzip PDF Printer (www.bullzip.com), which contains only the text This is an updated version of Solr cookbook. To index this file, I used the following command:
    curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"
    

    You should see the following response:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">1383</int></lst>
    </response>
  6. To see what was indexed, I ran the following within a web browser:
    http://localhost:8983/solr/cookbook/select/?q=text:solr&fl=attr_creator,attr_modified

    In return, I got the following response:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">text:solr</str>
       <str name="fl">attr_creator,attr_modified</str>
      </lst>
     </lst>
     <result name="response" numFound="1" start="0">
      <doc>
       <arr name="attr_creator">
     <str>Rafał Kuć</str>
       </arr>
       <arr name="attr_modified">
        <str>2014-05-07T11:30:09Z</str>
       </arr>
      </doc>
     </result>
    </response>
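
Referring back to step 2, the copy operations could look like the following commands, assuming they are run from the directory where Solr is installed and that the directory layout matches the one described earlier:

    mkdir extract
    cp dist/solr-cell-4.10.0.jar extract/
    cp contrib/extraction/lib/*.jar extract/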

How it works...

Binary file parsing is implemented in Solr using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files but also HTML and XML files.

Solr has a dedicated handler that uses Apache Tika. To be able to use it, we need to add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the preceding example.

In addition to the handler definition, we need to tell Solr where to look for the additional libraries we placed in the extract directory we created. The dir attribute of the lib tag should point to the created directory, and the regex attribute is a regular expression telling Solr which files to load. Relative paths are resolved against the Solr home directory, so remember this when using them.

Now, let's discuss the default configuration parameters. The fmap.content parameter tells Solr to which field the content of the parsed document should go; in our case, the parsed content will go to the field named text. The next parameter, lowernames, set to true, tells Solr to lowercase all the field names that come from Tika. The next parameter, uprefix, is very important. It tells Solr how to handle fields that are not defined in the schema.xml file. The value of this parameter is prepended to the name of the field returned from Tika before it is sent to Solr. For example, if Tika returns a field named creator, and we don't have such a field in our index, Solr will try to index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index attributes of the Tika XHTML elements into separate fields named after these elements. Remember that Tika can return multiple attributes of the same name; this is why we defined the dynamic field as a multivalued one.
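If you want to see exactly which fields and attributes Tika returns for a given file before the mapping and prefixing take place, the extracting request handler supports the extractOnly parameter, which returns the extracted content and metadata without indexing anything. Here is a minimal sketch, reusing the book.pdf file from the example:

    curl "http://localhost:8983/solr/cookbook/update/extract?extractOnly=true" -F "myfile=@book.pdf"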

Next, we have the command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. First, we define a unique identifier. It's useful to be able to do this while sending the document because most binary documents won't have an identifier in their contents. To pass the identifier, we use the literal.id parameter. The second parameter we send to Solr tells it to perform a commit immediately after document processing.
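More than one literal value can be passed in the same way. For example, the following hypothetical variation of the command also stores a label in a field handled by the attr_* dynamic field (the attr_source name and its value are my own illustration, not part of the recipe):

    curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=2&literal.attr_source=sample_chapters&commit=true" -F "myfile=@book.pdf"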

The test file I created for the purpose of this recipe contained the simple sentence This is an updated version of Solr cookbook. Of course, Tika will extract much more information from the PDF, such as the creation time, the creator, and many more attributes. We queried Solr with a simple query, and to keep the response simple, we limited the returned fields to attr_creator and attr_modified only. In response, I got one document that matched the given query. As you can see, Solr was able to extract both the creator and the file modification date. If you want to see all the information extracted by Solr, just remove the fl parameter.
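For example, the same query without the field limit returns the matched document with all of its stored fields, including every attr_* field that Tika produced:

    http://localhost:8983/solr/cookbook/select/?q=text:solr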
