
Obtaining a common analyzer

Lucene provides a set of default analyzers in the lucene-analyzers-common package. Let's take a look at them in detail.

Getting ready

The following are five common analyzers Lucene provides in the lucene-analyzers-common module:

  • WhitespaceAnalyzer: Splits text at whitespaces, just as the name indicates. In fact, this is the only thing this analyzer does.
  • SimpleAnalyzer: Splits text at non-letter characters and lowercases resulting tokens.
  • StopAnalyzer: Splits text at non-letter characters, lowercases the resulting tokens, and removes stopwords. This analyzer is useful for pure text content, but is not ideal if the content contains words with special characters, such as product model numbers. It comes with a default set of English stopwords, but you can always provide your own set.
  • StandardAnalyzer: Splits text using grammar-based tokenization, normalizes and lowercases tokens, removes stopwords, and discards punctuation. It can preserve constructs such as company names, e-mail addresses, and model numbers as single tokens. This analyzer is great for general usage.
  • SnowballAnalyzer: This analyzer is similar to StandardAnalyzer with an additional SnowballFilter for stemming. This provides even more flexibility than StandardAnalyzer. However, SnowballFilter is very aggressive in stemming, so false positives are possible. Lucene is deprecating this analyzer in the upcoming version, 5.0, and recommends that you use a language-specific analyzer instead (for example, the analyzers in org.apache.lucene.analysis.en).

Obtaining a default analyzer is very simple. Note that we don't get to see the analyzer's actual output, the TokenStream, just yet. As we progress, we will show you how it's done.

Tip

Make sure the lucene-analyzers-common.jar library is also added to the classpath, or that the corresponding dependency is declared in your pom.xml.
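If you use Maven, the dependency would look similar to the following; the version shown here is only an example, so match it to the Lucene release you are working with:

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>4.10.4</version>
</dependency>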

How to do it...

Here is how you instantiate an analyzer:

Analyzer analyzer = new WhitespaceAnalyzer();

You may instantiate any analyzer in the analyzers-common package in a similar fashion. As you can see, it is simple to get the default analyzers to work.
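As an illustration of the stopword provision mentioned in the StopAnalyzer description, here is a minimal sketch of supplying your own stopword set. The words chosen are arbitrary, and note that in older 4.x releases these constructors require an additional Version argument (and CharArraySet lives in org.apache.lucene.analysis.util):

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;

// The boolean flag makes stopword matching case-insensitive
CharArraySet stopWords = new CharArraySet(Arrays.asList("is", "and", "at"), true);
Analyzer analyzer = new StopAnalyzer(stopWords);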

How it works...

Let's look at some examples to see how each of these analyzers differs. We will use the following sample text: "Lucene is mainly used for information retrieval and you can read more about it at http://lucene.apache.org." In the forthcoming sections, we will learn more about customizing analyzers. For now, we shall concern ourselves with the output only and review each analyzer's behavior.
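Although the recipe for consuming a TokenStream comes later, here is a minimal sketch of how listings like the ones below can be produced. The field name "content" is arbitrary, and the calls throw IOException, so wrap them accordingly:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer analyzer = new WhitespaceAnalyzer();
TokenStream stream = analyzer.tokenStream("content",
    "Lucene is mainly used for information retrieval and you can read more about it at http://lucene.apache.org.");
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();                       // mandatory before the first incrementToken() call
while (stream.incrementToken()) {
    System.out.print("[" + term + "] ");
}
stream.end();                         // finalize the stream state
stream.close();

Swapping in a different analyzer on the first line reproduces each of the outputs that follow.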

First, let's look at WhitespaceAnalyzer. As we already learned, a WhitespaceAnalyzer splits text at whitespaces. The following would be the output of a WhitespaceAnalyzer:

[Lucene] [is] [mainly] [used] [for] [information] [retrieval] [and] [you] [can] [read] [more] [about] [it] [at] [lucene.apache.org.]

Each token is enclosed in a pair of brackets so that you can see the boundaries clearly. It is quite evident that no normalization has been applied to the text: the split tokens are left as-is. If this analyzer is used exclusively for both indexing and searching, matches will have to be exact (including matching case) to be found.

Now let's see how SimpleAnalyzer analyzes the same piece of text. Here is what we get as output:

[lucene] [is] [mainly] [used] [for] [information] [retrieval] [and] [you] [can] [read] [more] [about] [it] [at] [lucene] [apache] [org]

Tokens are split at non-letter characters and lowercased in this example. Note that the web address is broken up because "." is treated as a delimiter. This analyzer expands search capability a little by allowing case-insensitive matching (assuming it is used for both indexing and searching, so that both index and search terms are lowercased prior to comparison).

The next one is StopAnalyzer:

[lucene] [mainly] [used] [information] [retrieval] [you] [can] [read] [more] [about] [lucene] [apache] [org]

This analyzer builds on top of SimpleAnalyzer's tokenization and lowercasing, adding a StopFilter. Common English stopwords are removed, while tokens are lowercased and normalized just as with SimpleAnalyzer.

Now, let's look at a more sophisticated, general-purpose built-in analyzer, StandardAnalyzer. Here is what it will output:

[lucene] [mainly] [used] [information] [retrieval] [you] [can] [read] [more] [about] [lucene.apache.org]

Note

Note how StandardAnalyzer treated the web address http://lucene.apache.org.

This analyzer continues to build on top of the features we have reviewed so far. It uses its own tokenizer and filter, StandardTokenizer and StandardFilter, tokenizing text by grammar and removing punctuation. This analyzer is suitable for most implementations, as it is able to handle special wording such as product model numbers and web addresses (by not breaking them up into separate tokens).

Last but not least, the SnowballAnalyzer. Although this analyzer is being replaced by the language-specific analyzers in the org.apache.lucene.analysis.<language code> packages, it is powerful nonetheless, because it handles stemming quite effectively. Here is what the output would be:

[lucen] [is] [main] [use] [for] [inform] [retriev] [and] [you] [can] [read] [more] [about] [it] [at] [lucene.apache.org]

Note that several words are reduced to their root form (for example, mainly to main) by the stemming filter. One of the reasons this analyzer is being deprecated is that its performance is not as good as that of its alternative, the stemmer based on the PorterStemmer class. However, some users prefer this implementation because its word reduction is more accurate. The recommended per-language analyzers (for example, EnglishAnalyzer) use PorterStemmer (the Snowball algorithm is itself derived from Porter's work) and should give you very good indexing performance with results comparable to SnowballFilter's.
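If you want to follow that recommendation today, the replacement is a one-liner. Here is a minimal sketch using EnglishAnalyzer; as before, older 4.x releases require a Version argument in the constructor:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

// Grammar-based tokenization, lowercasing, English stopword removal, and Porter stemming
Analyzer analyzer = new EnglishAnalyzer();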

There's more…

We have seen how the various built-in analyzers behave and how each may be suitable for your application. In real life, however, we often encounter use cases that differ from the standard offering. In search applications, it is very common to need a lot of customization to make the search engine fulfill business requirements. Luckily, Lucene provides the flexibility to create custom analyzers to suit your needs. We will continue to dive deeper and will show you how it's done.
