
Obtaining a common analyzer

Lucene provides a set of default analyzers in the lucene-analyzers-common package. Let's take a look at them in detail.

Getting ready

The following are five common analyzers Lucene provides in the lucene-analyzers-common module:

  • WhitespaceAnalyzer: Splits text at whitespace, just as the name indicates. In fact, this is the only thing this analyzer does.
  • SimpleAnalyzer: Splits text at non-letter characters and lowercases the resulting tokens.
  • StopAnalyzer: Splits text at non-letter characters, lowercases the resulting tokens, and removes stopwords. This analyzer is useful for pure text content, but it is not ideal if the content contains words with special characters, such as product model numbers. It comes with a default set of stopwords, but you can also provide your own set (see the sketch after this list).
  • StandardAnalyzer: Splits text using grammar-based tokenization, normalizes and lowercases tokens, removes stopwords, and discards punctuation. It can be used to extract company names, e-mail addresses, model numbers, and so on. This analyzer is great for general usage.
  • SnowballAnalyzer: This analyzer is similar to StandardAnalyzer, with an additional SnowballFilter for stemming. This provides even more flexibility than StandardAnalyzer. However, SnowballFilter is very aggressive in stemming, so false positives are possible. Lucene is deprecating this analyzer in the upcoming version, 5.0, and recommends that you use a language-specific analyzer instead (for example, org.apache.lucene.analysis.en.*).
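Here is a minimal sketch of the custom-stopword case mentioned above. It assumes Lucene 4.10's version-less constructors (earlier 4.x releases require a Version argument), and the stopword choices are purely illustrative:

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;

// Build a custom stopword set; true enables case-insensitive matching.
CharArraySet stopWords = new CharArraySet(Arrays.asList("lucene", "is", "and"), true);

// Use the custom set instead of the default English stopwords.
Analyzer analyzer = new StopAnalyzer(stopWords);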

Obtaining a default analyzer is very simple. Note that we don't examine the actual output from the analyzer, the TokenStream, just yet. As we progress, we will show you how it's done.

Tip

Make sure the lucene-analyzers-common.jar library is added to the classpath, or that the corresponding dependency is declared in your pom.xml.

How to do it...

Here is how you instantiate an analyzer:

Analyzer analyzer = new WhitespaceAnalyzer();

You may instantiate any analyzer in the lucene-analyzers-common package in a similar fashion; a sketch follows. As you can see, it is simple to get the default analyzers to work.
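For instance, here are the remaining instantiations, again assuming Lucene 4.10's version-less constructors:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

Analyzer simple = new SimpleAnalyzer();      // non-letter splits, lowercasing
Analyzer stop = new StopAnalyzer();          // adds the default English stopword set
Analyzer standard = new StandardAnalyzer();  // grammar-based tokenization

SnowballAnalyzer is the odd one out; a construction sketch for it appears later in this recipe.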

How it works...

Let's look at some examples to see how each of these analyzers differs. We will use the following sample text: "Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org." In the forthcoming sections, we will learn more about customizing analyzers. For now, we shall concern ourselves with the output only and review each analyzer's behavior.
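The bracketed outputs that follow can be reproduced by draining each analyzer's token stream. Here is a minimal, hypothetical helper for doing so; the field name content is arbitrary (these analyzers do not vary their behavior by field), and it previews the TokenStream API that we cover properly later:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {

    // Prints every token the analyzer emits, in the [token] format used below.
    public static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();                          // mandatory before incrementToken()
        while (stream.incrementToken()) {
            System.out.print("[" + term.toString() + "] ");
        }
        stream.end();                            // finalize stream state
        stream.close();
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String text = "Lucene is mainly used for information retrieval and "
                + "you can read more about it at lucene.apache.org.";
        printTokens(new WhitespaceAnalyzer(), text);
    }
}

Swap in SimpleAnalyzer, StopAnalyzer, StandardAnalyzer, or SnowballAnalyzer to produce each of the outputs that follow.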

First, let's look at WhitespaceAnalyzer. As we already learned, a WhitespaceAnalyzer splits text at whitespaces. The following would be the output of a WhitespaceAnalyzer:

[Lucene] [is] [mainly] [used] [for] [information] [retrieval] [and] [you] [can] [read] [more] [about] [it] [at] [lucene.apache.org.]

Each token is enclosed in a pair of brackets so that you can see the boundaries clearly. It is quite evident that no normalization has been applied to the text; the split tokens are left as-is. If this analyzer is used exclusively for both indexing and searching, matches will have to be exact (including case) to be found.

Now let's see how SimpleAnalyzer analyzes the same piece of text. Here is what we get as output:

[lucene] [is] [mainly] [used] [for] [information] [retrieval] [and] [you] [can] [read] [more] [about] [it] [at] [lucene] [apache] [org]

In this example, tokens are split at non-letter characters and lowercased. Note that the web address is split up because "." is considered a delimiter. This analyzer expands search capability a little by allowing case-insensitive searches: assuming it is used for both indexing and searching, both index and search terms are lowercased prior to processing.

The next one is StopAnalyzer:

[lucene] [mainly] [used] [information] [retrieval] [you] [can] [read] [more] [about] [lucene] [apache] [org]

This analyzer builds on SimpleAnalyzer's tokenization and filtering, adding a StopFilter. Common English stopwords are removed, while tokens are lowercased and normalized just as in SimpleAnalyzer.

Now, let's look at a more sophisticated general purpose built-in analyzer, StandardAnalyzer. Here is what it will output:

[lucene] [mainly] [used] [information] [retrieval] [you] [can] [read] [more] [about] [lucene.apache.org]

Note

Note how StandardAnalyzer treated the web address lucene.apache.org.

This analyzer continues to build on the features we have reviewed so far. It uses a different tokenizer and filter, StandardTokenizer and StandardFilter, tokenizing text by grammar and removing punctuation. This analyzer is suitable for most implementations, as it is able to handle special wording such as product model numbers and web addresses (by not breaking them up into separate tokens).

Last but not least is SnowballAnalyzer. Although this analyzer is being replaced by the language-specific analyzers in the org.apache.lucene.analysis.<language code> packages, it is powerful nonetheless, because it handles stemming quite effectively. Here is what the output would be:

[lucen] [is] [main] [use] [for] [inform] [retriev] [and] [you] [can] [read] [more] [about] [it] [at] [lucene.apache.org]

Note that several words are reduced to their root form (for example, mainly to main), as defined by the stemming filter. One of the reasons this analyzer is being deprecated is that its performance is not as good as that of its alternative, another stemmer based on the PorterStemmer class. However, some users prefer this implementation because its word reduction is more accurate. The recommended per-language analyzers (for example, EnglishAnalyzer) use PorterStemmer (Snowball is also based on Porter) and should give you very good indexing performance, with results comparable to SnowballFilter.
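Switching to the recommended analyzer is a one-line change. Here is a sketch assuming Lucene 4.10, where SnowballAnalyzer is deprecated but still present and still requires a Version constant (adjust the constant to your release):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

// Deprecated: takes a Version and the Snowball stemmer name.
Analyzer snowball = new SnowballAnalyzer(Version.LUCENE_4_10_1, "English");

// Recommended replacement: the per-language analyzer with Porter stemming.
Analyzer english = new EnglishAnalyzer();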

There's more…

We have seen how the various built-in analyzers behave and how each may suit your application. In real life, however, we generally find use cases that differ from the standard offering. In search applications, it is very common to need a lot of customization to make a search engine fulfil business requirements. Luckily, Lucene provides the flexibility to create custom analyzers that suit your needs. We will dig deeper and show you how it's done in the recipes that follow; the sketch below offers a first taste.
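A custom analyzer is just an Analyzer subclass that wires a tokenizer to a chain of filters in createComponents. This is an illustrative sketch only, using the Lucene 4.x createComponents(String, Reader) signature and 4.10's version-less constructors; later recipes treat the topic properly:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Tokenize by grammar, then lowercase: a minimal two-step chain.
        Tokenizer tokenizer = new StandardTokenizer(reader);
        TokenStream filters = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filters);
    }
}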
