- Solr Cookbook (Third Edition)
- Rafał Kuć
Indexing PDF files
The library on the corner that we used to go to wants to expand its collection and make it available to the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that it can share them with online users. With all the samples provided by the suppliers comes a problem: how to extract data for the search box from more than 900,000 PDF files. Solr can do it with the use of Apache Tika (http://tika.apache.org/). This recipe will show you how to handle such a task.
How to do it...
To index PDF files, we will need to set up Solr to use extracting request handlers. To do this, we will take the following steps:
- First, let's edit our Solr instance's solrconfig.xml file and add the following configuration:

  ```xml
  <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>
      <str name="captureAttr">true</str>
    </lst>
  </requestHandler>
  ```
- Next, create the extract folder anywhere on your system (I created the folder in the directory where Solr is installed, at the same level as Solr's lib directory) and place the solr-cell-4.10.0.jar file from the dist directory in it (you can find it in the Solr distribution archive). After this, you have to copy all the libraries from the contrib/extraction/lib/ directory to the extract directory you created before.
- In addition to this, we need the following entry added to the solrconfig.xml file (adjust the path to the one matching your system):

  ```xml
  <lib dir="../../extract" regex=".*\.jar" />
  ```
This is actually all that you need to do in terms of configuration.
- The next step is the index structure. To simplify the example, I decided to choose the following index structure (place it in your schema.xml file):

  ```xml
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="text" type="text_general" indexed="true" stored="true" />
  <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true" />
  ```
- To test the indexing process, I created a PDF file, book.pdf, using Bullzip PDF Printer (www.bullzip.com), which contains only the text This is an updated version of Solr cookbook. To index this file, I used the following command:

  ```bash
  curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"
  ```
You should see the following response:
<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">1383</int></lst> </response>
- To see what was indexed, I ran the following within a web browser:

  ```
  http://localhost:8983/solr/cookbook/select/?q=text:solr&fl=attr_creator,attr_modified
  ```
In return, I got the following response:
<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">1</int> <lst name="params"> <str name="q">text:solr</str> <str name="fl">attr_creator,attr_modified</str> </lst> </lst> <result name="response" numFound="1" start="0"> <doc> <arr name="attr_creator"> <str>Rafa? Ku?</str> </arr> <arr name="attr_modified"> <str>2014-05-07T11:30:09Z</str> </arr> </doc> </result> </response>
How it works...
Binary file parsing in Solr is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files but also HTML and XML files.
Solr has a dedicated handler that uses Apache Tika. To be able to use it, we need to add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the preceding example.
In addition to the handler definition, we need to specify where Solr should look for the additional libraries we placed in the extract directory we created. The dir attribute of the lib tag should point to the created directory, and the regex attribute is a regular expression telling Solr which files to load. The base directory is the Solr home directory, so if you use relative paths, you should remember this.
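If relative paths prove fragile in your deployment, the same directive also accepts an absolute path. A minimal sketch, where /opt/solr/extract is purely an illustrative path and has to be adjusted to where you actually created the directory:

```xml
<!-- /opt/solr/extract is an illustrative path; point it at your own extract directory -->
<lib dir="/opt/solr/extract" regex=".*\.jar" />
```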
Now, let's discuss the default configuration parameters. The fmap.content parameter tells Solr which field the content of the parsed document should be put into; in our case, the parsed content will go to the field named text. The next parameter, lowernames, set to true, tells Solr to lowercase all the field names that come from Tika. The next parameter, uprefix, is very important: it tells Solr how to handle fields that are not defined in the schema.xml file. The name of the field returned from Tika will be prefixed with the value of this parameter and sent to Solr. For example, if Tika returns a field named creator, and we don't have such a field in our index, then Solr will index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index the attributes of the Tika XHTML elements into separate fields named after these attributes. Remember that Tika can return multiple attributes of the same name; this is why we defined the dynamic field as a multivalued one.
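If you are not sure which fields Tika will return for your files, the extracting handler can be run in a preview mode: the standard extractOnly parameter makes Solr return the extracted content and metadata in the response instead of indexing anything. A minimal sketch, reusing the book.pdf file from this recipe:

```bash
# Preview mode: nothing is indexed; the response contains the text
# and metadata that Tika extracted from the file
curl "http://localhost:8983/solr/cookbook/update/extract?extractOnly=true" -F "myfile=@book.pdf"
```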
Next, we have a command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. First, we define a unique identifier. It's useful to be able to pass it while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier, we use the literal.id parameter. The second parameter, commit=true, tells Solr to perform a commit immediately after document processing.
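The same mechanism is not limited to the identifier: any additional literal.* parameter is indexed as a field value alongside the extracted content, and the standard commitWithin parameter can replace the explicit commit when you index many files in a row. A hedged sketch, where attr_source is an assumed, illustrative field name that matches the attr_* dynamic field from our schema:

```bash
# literal.attr_source lands in the attr_* dynamic field defined in our schema;
# commitWithin=10000 asks Solr to commit within 10 seconds instead of immediately
curl "http://localhost:8983/solr/cookbook/update/extract?literal.id=2&literal.attr_source=supplier&commitWithin=10000" -F "myfile=@book.pdf"
```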
The test file I created for the purpose of this recipe contained the simple sentence This is an updated version of Solr cookbook. Of course, Tika will extract far more information from the PDF, such as the creation time, the creator, and many other attributes. We queried Solr with a simple query, and to keep the response simple, we limited the returned fields to attr_creator and attr_modified only. In response, I got one document that matched the given query. As you can see, Solr was able to extract both the creator and the file modification date. If you want to see all the information extracted by Solr, just remove the fl parameter, as shown below.
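For example, the same query without the fl parameter returns every stored field of the matching document, including the full extracted text in the text field and all the attr_* metadata:

```
http://localhost:8983/solr/cookbook/select/?q=text:solr
```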