Indexing PDF files
The library on the corner that we used to go to wants to expand its collection and make it available to the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that it can share them with online users. With all the samples provided by the suppliers comes a problem: how to extract data for the search box from more than 900,000 PDF files. Solr can do it with the use of Apache Tika (http://tika.apache.org/). This recipe will show you how to handle such a task.
How to do it...
To index PDF files, we will need to set up Solr to use extracting request handlers. To do this, we will take the following steps:
- First, let's edit our Solr instance's `solrconfig.xml` file and add the following configuration:

```xml
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
```
- Next, create the `extract` folder anywhere on your system (I created the folder in the directory where Solr is installed, on the same level as the `lib` directory of Solr) and place the `solr-cell-4.10.0.jar` file from the `dist` directory in it (you can find the file in the Solr distribution archive). After this, you have to copy all the libraries from the `contrib/extraction/lib/` directory to the `extract` directory you created before.
- In addition to this, we need the following entry added to the `solrconfig.xml` file (adjust the path to the one matching your system):

```xml
<lib dir="../../extract" regex=".*\.jar" />
```
This is actually all that you need to do in terms of configuration.
- The next step is the index structure. To simplify the example, I decided to choose the following index structure (place it in your `schema.xml` file):

```xml
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true" />
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true" />
```
- To test the indexing process, I created a PDF file, `book.pdf`, using Bullzip PDF Printer (www.bullzip.com), which contains only the text `This is an updated version of Solr cookbook`. To index this file, I used the following command:

```
curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"
```
You should see the following response:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1383</int>
  </lst>
</response>
```
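The same handler also accepts `literal.*` parameters for fields other than the identifier, which is handy when you want to attach metadata that Tika can't extract from the file itself. A small sketch of this, assuming a second sample file named `chapter2.pdf` and relying on the multivalued `attr_*` dynamic field we defined earlier:

```
# chapter2.pdf is an assumed filename; literal.attr_source sets a field
# value directly on the document, alongside the content Tika extracts
curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=2&literal.attr_source=supplier&commit=true" -F "myfile=@chapter2.pdf"
```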
- To see what was indexed, I ran the following within a web browser:

```
http://localhost:8983/solr/cookbook/select/?q=text:solr&fl=attr_creator,attr_modified
```
In return, I got the following response:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">text:solr</str>
      <str name="fl">attr_creator,attr_modified</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="attr_creator">
        <str>Rafał Kuć</str>
      </arr>
      <arr name="attr_modified">
        <str>2014-05-07T11:30:09Z</str>
      </arr>
    </doc>
  </result>
</response>
```
How it works...
Binary file parsing in Solr is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents; it handles not only binary files but also HTML and XML files.
Solr has a dedicated handler that uses Apache Tika. To be able to use it, we need to add a handler based on the `solr.extraction.ExtractingRequestHandler` class to our `solrconfig.xml` file, as shown in the preceding example.
In addition to the handler definition, we need to specify where Solr should look for the additional libraries we placed in the `extract` directory we created. The `dir` attribute of the `lib` tag should point to the created directory, and the `regex` attribute is a regular expression telling Solr which files to load. The base directory is the Solr home directory, so keep this in mind if you use relative paths.
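If the relative path proves fragile in your deployment, the `lib` directive works just as well with an absolute path. A minimal sketch, assuming the `extract` directory was created under `/opt/solr` (adjust to your system):

```xml
<!-- /opt/solr/extract is an assumed location; the regex still limits loading to JAR files -->
<lib dir="/opt/solr/extract" regex=".*\.jar" />
```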
Now, let's discuss the default configuration parameters. The `fmap.content` parameter tells Solr which field the content of the parsed document should be put into; in our case, the parsed content will go to the field named `text`. The next parameter, `lowernames`, set to `true`, tells Solr to lowercase all the field names that come from Tika. The next parameter, `uprefix`, is very important: it tells Solr how to handle fields that are not defined in the `schema.xml` file. The value of this parameter is prepended to the name of the field returned from Tika before the field is sent to Solr. For example, if Tika returns a field named `creator` and we don't have such a field in our index, Solr will try to index it under the field named `attr_creator`, which matches our dynamic field. The last parameter, `captureAttr`, tells Solr to index the attributes of the Tika XHTML elements into separate fields named after these attributes. Remember that Tika can return multiple attributes of the same name; this is why we defined the dynamic field as a multivalued one.
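If you want to check what Tika will return for a given file before putting anything into the index, the handler supports the `extractOnly` parameter, which returns the extracted content and metadata without indexing the document. For example, reusing the `book.pdf` file from earlier:

```
# Returns Tika's extracted content and metadata without indexing anything
curl "http://localhost:8983/solr/cookbook/update/extract?extractOnly=true" -F "myfile=@book.pdf"
```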
Next, we have the command that sends a PDF file to Solr. We send the file to the `/update/extract` handler with two parameters. First, we define a unique identifier; it's useful to be able to pass it during document sending because most binary documents won't carry an identifier in their contents. To pass the identifier, we use the `literal.id` parameter. The second parameter tells Solr to perform a commit immediately after the document is processed.
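Forcing a commit after every file is expensive when you have 900,000 PDFs to index. As an alternative, the standard `commitWithin` update parameter (not specific to this handler) asks Solr to commit within the given number of milliseconds, letting it batch several documents into one commit. A sketch of the same command using it:

```
# Ask Solr to commit within 10 seconds instead of committing per document
curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=1&commitWithin=10000" -F "myfile=@book.pdf"
```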
The test file I created for the purpose of this recipe contained the simple sentence `This is an updated version of Solr cookbook`. Of course, Tika will extract far more information from the PDF, such as the creation time, the creator, and many other attributes. We queried Solr with a simple query, and to keep the response short, we limited the returned fields to only `attr_creator` and `attr_modified`. In response, I got one document that matched the given query. As you can see, Solr was able to extract both the creator and the file modification date. If you want to see all the information extracted by Solr, just remove the `fl` parameter.
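For example, the same query without the field list returns the complete document, including every `attr_*` field that Tika produced:

```
http://localhost:8983/solr/cookbook/select/?q=text:solr
```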