- Apache Solr for Indexing Data
- Sachin Handiekar Anshul Johri
Introducing analyzers
To enable effective and efficient search, Solr splits text into tokens both during indexing and at search (query) time. It does this with the help of three main components: analyzers, tokenizers, and filters. Analyzers are used during both indexing and searching: an analyzer examines the text of a field and generates a token stream with the help of a tokenizer. Filters then examine the stream of tokens and may keep them, discard them, or create new tokens. Tokenizers and filters can be combined into pipelines, or chains, in which the output of one is the input of the next. Such a sequence of a tokenizer and filters is called an analyzer, and the resulting output of the analyzer is used to build the index and to match search queries. Let's see how we can use and configure these components in Solr.
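As a sketch of how such a chain behaves, consider a hypothetical field type and how a sample input might flow through it (the field name, sample text, and stopword assumption below are illustrative, not from the book):

    <!-- Sample input text: "The QUICK Brown-Fox" -->
    <fieldType name="text_example" class="solr.TextField">
      <analyzer>
        <!-- StandardTokenizerFactory splits on whitespace and punctuation:
             [The] [QUICK] [Brown] [Fox] -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- LowerCaseFilterFactory lowercases each token:
             [the] [quick] [brown] [fox] -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- StopFilterFactory discards words from its stopword list
             (assuming the list contains "the"):
             [quick] [brown] [fox] -->
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
    </fieldType>

The final tokens, [quick] [brown] [fox], are what gets stored in the index or matched against at query time.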
Analyzers are core components that preprocess input text at indexing and search time. It is recommended that you use the same, or at least compatible, analyzers at query and index time so that text is preprocessed in a consistent manner. In simple terms, the role of an analyzer is to examine the input text and generate a token stream. An analyzer is specified as a child of a <fieldType> element in the schema.xml configuration file.
In normal usage, only fields of the solr.TextField type specify an analyzer. There are two ways to specify how text fields are analyzed in Solr with the help of analyzers in schema.xml:
- One way is to specify the class name of an analyzer in the class attribute of a single <analyzer> element, using a fully qualified Java class name. This is the simplest way of configuring an analyzer. The class must be derived from org.apache.lucene.analysis.Analyzer. The following is an example:

    <fieldType name="nametext" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
    </fieldType>

In this case, a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the text field and emitting the corresponding tokens. This approach can be used in simple cases where only plain English input text is present, but in general, field content requires more complex analysis.
- The second way is to specify a TokenizerFactory followed by an optional list of TokenFilterFactory classes, which are applied in the listed order. This is the way to perform complex analysis of input text content, because you can decompose the analysis into discrete, relatively simple steps. Here is an example:

    <fieldType name="nametext" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
    </fieldType>
In the preceding case, we set up an analysis chain by specifying the <analyzer> element with no class attribute, and child elements that name the factory classes for the tokenizer and filters in the order in which you want them to run. No analyzer class is defined in the <analyzer> element; rather, a sequence of more specialized classes is chained together to act as the analyzer for the field being analyzed. You will soon discover that a Solr distribution comes with a large selection of tokenizers and filters that cover most of the scenarios you are likely to encounter.
Note
Note that classes in the org.apache.solr.analysis package may be referred to here with the short solr. alias prefix.
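For example, following the note above, the two declarations below refer to the same tokenizer factory; the second simply spells out the fully qualified class name instead of using the solr. alias:

    <tokenizer class="solr.StandardTokenizerFactory"/>
    <tokenizer class="org.apache.solr.analysis.StandardTokenizerFactory"/>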
We will cover tokenizers and filters in detail in upcoming topics.
Analysis phases
We read earlier that analysis happens in two contexts. At index time, when a field is being created, the token stream that results from the analysis is added to the index and defines the set of terms (along with attributes such as position and offsets) for that field. At query time, the search query is analyzed, and the resulting terms are matched against those stored in the field's index. In many cases, the same analysis is used at index and query time, but there are cases in which you may want to use different analysis steps during indexing and searching. Here is an example of this:
    <fieldType name="nametext" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveWordFilterFactory" words="removewords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
In the preceding example, you can see that we have used two <analyzer> definitions, distinguished by the type attribute. Based on this attribute, Solr applies the appropriate analyzer to the field at index or query time. At index time, we told Solr to tokenize the text using the solr.StandardTokenizerFactory class, after which we used the solr.LowerCaseFilterFactory filter to lowercase the tokens. We then used another filter, solr.RemoveWordFilterFactory, which removes tokens according to the words defined in removewords.txt. The final filter, solr.SynonymFilterFactory, maps tokens to alternate values using the synonyms.txt file. At query time, however, we asked the analyzer to apply only the lowercase filter to convert the query terms to lowercase; the other filters that were applied at index time were not applied at query time.
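To make the index-time chain above concrete, the two referenced files might look as follows. The entries are purely illustrative; the synonyms format shown (comma-separated groups of equivalent terms, one group per line) is the standard format consumed by SynonymFilterFactory:

    # removewords.txt - one word per line; matching tokens are removed
    a
    an
    the

    # synonyms.txt - comma-separated groups of equivalent terms
    TV, television
    USA, United States of America

With this setup, a document containing "TV" is indexed with "television" as well, so a query for "television" (lowercased by the query-time analyzer) still matches it even though the query chain itself performs no synonym expansion.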