Obtaining a common analyzer
Lucene provides a set of default analyzers in the lucene-analyzers-common
package. Let's take a look at them in detail.
Getting ready
The following are five common analyzers Lucene provides in the lucene-analyzers-common
module:
WhitespaceAnalyzer: Splits text at whitespace, just as the name indicates. In fact, this is the only thing this analyzer does.
SimpleAnalyzer: Splits text at non-letter characters and lowercases the resulting tokens.
StopAnalyzer: Splits text at non-letter characters, lowercases the resulting tokens, and removes stopwords. This analyzer is useful for pure text content, but it is not ideal if the content contains words with special characters such as product model numbers. It comes with a default set of stopwords, but you can always provide your own.
StandardAnalyzer: Splits text using grammar-based tokenization, normalizes and lowercases tokens, removes stopwords, and discards punctuation. It can handle company names, e-mail addresses, model numbers, and so on, which makes it great for general usage.
SnowballAnalyzer: Similar to StandardAnalyzer, with an additional SnowballFilter for stemming. This provides even more flexibility than StandardAnalyzer; however, SnowballFilter is very aggressive in its stemming, so false positives are possible. Lucene is deprecating this analyzer in the upcoming version, 5.0, and recommends that you use a language-specific analyzer instead (for example, org.apache.lucene.analysis.en.*).
Obtaining a default analyzer is very simple. Note that we don't get to see the analyzer's actual output, the TokenStream, just yet; we will show you how to obtain it as we progress.
Tip
Make sure the lucene-analyzers-common.jar library is added to your classpath, or that the corresponding dependency is declared in your pom.xml.
How to do it...
Here is how you instantiate an analyzer:
Analyzer analyzer = new WhitespaceAnalyzer();
You can instantiate any analyzer in the lucene-analyzers-common package in a similar fashion. As you can see, it is simple to get the default analyzers to work.
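For reference, here is a sketch that instantiates each of the five analyzers covered in this recipe. It assumes Lucene 4.10.x, where the no-argument constructors are available; on earlier 4.x releases, pass a Version constant to each constructor instead, and adjust the constant below to match your release:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

Analyzer whitespace = new WhitespaceAnalyzer();
Analyzer simple = new SimpleAnalyzer();
Analyzer stop = new StopAnalyzer();        // uses the default English stopword set
Analyzer standard = new StandardAnalyzer();
// SnowballAnalyzer still requires a Version constant and a language name
Analyzer snowball = new SnowballAnalyzer(Version.LUCENE_4_10_4, "English");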
How it works...
Let's look at some examples to see how each of these analyzers differs. We will use the following sample text: "Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org." In the forthcoming sections, we will learn more about customizing analyzers. For now, we shall concern ourselves with the output only and review each analyzer's behavior.
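The bracketed token listings that follow were produced with a small helper along these lines. This is a minimal sketch using the standard TokenStream API (which a later recipe covers properly); the printTokens name and the "content" field name are purely illustrative:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public static void printTokens(Analyzer analyzer, String text) throws IOException {
    // The field name is arbitrary here; these analyzers do not vary by field
    TokenStream stream = analyzer.tokenStream("content", text);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();                        // required before the first incrementToken()
    while (stream.incrementToken()) {
        System.out.print("[" + term + "] ");
    }
    stream.end();                          // finalize end-of-stream attributes
    stream.close();
    System.out.println();
}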
First, let's look at WhitespaceAnalyzer
. As we already learned, a WhitespaceAnalyzer
splits text at whitespaces. The following would be the output of a WhitespaceAnalyzer
:
[Lucene] [is] [mainly] [used] [for] [information] [retrieval] [and] [you] [can] [read] [more] [about] [it] [at] [lucene.apache.org.]
Each token is enclosed in a pair of square brackets so you can see the boundaries clearly. It is quite evident that no normalization has been applied to the text; the split tokens are left as-is. If this analyzer is used exclusively for both indexing and searching, matches will have to be exact (including case) to be found.
Now let's see how SimpleAnalyzer
analyzes the same piece of text. Here is what we get as output:
[lucene] [is] [mainly] [used] [for] [information] [retrieval] [and] [you] [can] [read] [more] [about] [it] [at] [lucene] [apache] [org]
Tokens are split at non-letter characters and lowercased in this example. Note that the web address is split up because "." is treated as a delimiter. This analyzer expands search capability a little by allowing case-insensitive searches (assuming it is used for both indexing and searching, so both indexed and search terms are lowercased prior to processing).
The next one is StopAnalyzer
:
[lucene] [mainly] [used] [information] [retrieval] [you] [can] [read] [more] [about] [lucene] [apache] [org]
This analyzer builds on SimpleAnalyzer's tokenization and filtering, adding a StopFilter: common English stopwords are removed, and tokens are lowercased just as with SimpleAnalyzer.
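As mentioned in the Getting ready section, you can supply your own stopword set instead of the default English one. Here is a minimal sketch, again assuming Lucene 4.10.x (earlier 4.x releases require a Version constant in both constructors); the stopwords chosen here are arbitrary:
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;

// Build a custom stopword set; 'true' makes matching case-insensitive
CharArraySet stopWords = new CharArraySet(Arrays.asList("is", "and", "at"), true);
Analyzer stopAnalyzer = new StopAnalyzer(stopWords);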
Now, let's look at a more sophisticated, general-purpose built-in analyzer, StandardAnalyzer. Here is what it will output:
[lucene] [mainly] [used] [information] [retrieval] [you] [can] [read] [more] [about] [lucene.apache.org]
Note
Note how StandardAnalyzer treated the web address: lucene.apache.org is kept as a single token, with only the trailing period discarded.
This analyzer continues to build on the features we have reviewed so far. It uses a different tokenizer and filter, StandardTokenizer and StandardFilter, tokenizing text by grammar and removing punctuation. This analyzer is suitable for most implementations, as it is able to handle special wording such as product model numbers and web addresses (by not breaking them up into separate tokens).
Last but not least, the SnowballAnalyzer. Although this analyzer is being replaced by the language-specific analyzers in the org.apache.lucene.analysis.<language code> packages, it is powerful nonetheless, because it handles stemming quite effectively. Here is what the output would be:
[lucen] [is] [main] [use] [for] [inform] [retriev] [and] [you] [can] [read] [more] [about] [it] [at] [lucene.apache.org]
Note that several words are reduced to their root form (for example, mainly to main) as defined by the stemming filter. One of the reasons this analyzer is being deprecated is that its performance is not as good as that of its alternative, another stemmer based on the PorterStemmer class. However, some users prefer this implementation because its word reduction is more accurate. The newly recommended per-language analyzers (for example, EnglishAnalyzer) use PorterStemmer (Snowball is also based on Porter) and should give you very good indexing performance, with results comparable to SnowballFilter.
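If you want to follow the recommended path, swapping in the per-language analyzer is straightforward. A minimal sketch, assuming Lucene 4.10.x (earlier 4.x releases take a Version constant):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

// Recommended replacement for SnowballAnalyzer with the "English" stemmer:
// Porter-based stemming plus English stopword removal
Analyzer english = new EnglishAnalyzer();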
There's more…
We have seen how the various built-in analyzers behave and how each may suit your application. In real life, however, we often find use cases that differ from the standard offering. In search applications, it is very common for people to need a lot of customization to make a search engine fulfil business requirements. Luckily, Lucene provides such flexibility: you can create custom analyzers to suit your needs. We will continue to dive deeper and show you how it's done.