
Introduction

Before we begin, let's review Lucene's analysis process. In the previous chapter, we learned about the various components involved in creating and searching an index using IndexWriter and IndexSearcher. We also looked at the analyzer and how it's leveraged to tokenize and cleanse data, and at Lucene's internal index structure, the inverted index, which enables high-performance lookups. We touched on Term and how it's used in querying.

A term is a fundamental unit of data in a Lucene index. It is associated with a Document and has two attributes: field (analogous to a column name in a table) and value. So how does Lucene extract terms from text? You may have already guessed: the analyzer. It's correct that an analyzer is responsible for generating these terms. An analyzer is a container for the tokenization and filtering processes. Tokenization, as discussed, is a process that breaks up text at word boundaries defined by a specific tokenizer component. After tokenization, filtering kicks in to massage the data before it is output to IndexWriter for indexing. This is when tokens are transformed into terms and stored. What the analyzer produces has a significant effect on search, so it's important to understand the analysis process and know how to choose or build your own analyzer in order to create a good search experience. The following figure illustrates an analyzer facilitating the analysis process:

In this illustration, a tokenizer uses a reader object to consume text. It produces a sequential stream of tokens called a TokenStream. A TokenFilter accepts the TokenStream, applies its filtering process, and emits the filtered data as a TokenStream in return. TokenFilters can be chained together to attain the desired results. A character filter can also be used to preprocess data before tokenization. One example use case for character filters is stripping out HTML tags.
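To make this flow concrete, here is a minimal sketch of consuming a TokenStream. It assumes Lucene 5+ (where StandardAnalyzer no longer requires a Version argument); the reset/incrementToken/end sequence is the standard TokenStream contract:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // StandardAnalyzer chains StandardTokenizer with lowercasing
            // (and, in many versions, English stopword removal).
            Analyzer analyzer = new StandardAnalyzer();
            try (TokenStream stream =
                     analyzer.tokenStream("content", "Lucene is an Information Retrieval library.")) {
                // CharTermAttribute exposes the text of the current token.
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();                    // required before the first incrementToken()
                while (stream.incrementToken()) {
                    System.out.println(term);      // prints one term per line, e.g. "lucene"
                }
                stream.end();                      // required after the last token
            }
            analyzer.close();
        }
    }

Each term printed here is exactly what IndexWriter would receive for indexing if this analyzer were applied to a field.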

Now that we have a fair idea of what an analyzer is, let's see how this necessary evil can be put to good use:

  • Stopword filtering: Analyzers can help remove stopwords from text so that they are not indexed. Think of words such as a, and, the, on, of, if, and so on. These are words that do not convey any specific meaning, and the chance of users searching on them is very low. Considering that such words usually have high occurrence counts, it is advisable not to index them. The process of filtering out such terms from text is called stopword removal; a sketch of an analyzer that performs it appears after this list.
  • Text normalization: This can be thought of as changing text to conform to a certain standard format, such as lowercasing and removing special characters like the grave accent used in many languages. It is a way to standardize text before it is indexed. This technique helps to improve relevancy in matching search results and also makes comparisons easy and fast.
  • Stemming: This is another important task that helps to improve accuracy and performance. Stemming in Lucene is a reduction of words to their most basic (root) form, and this process is language-specific. For example, think of the word run in the English language. There are different forms of the word depending on how it is used—runs, running, ran, and so on—across documents. From an information retrieval point of view, we are less concerned with these differences.

    Stemming, by itself, is an algorithmic approach that processes terms individually without context, so false positives can be prevalent on words that have similar spellings but very different meanings. However, because the technique can still produce highly relevant results in a majority of searches and can significantly reduce dictionary size by reducing words to their root forms, the benefits can outweigh the negatives in some implementations.

    To increase accuracy and reduce false positives, there is a more advanced technique called lemmatization that can be employed to provide stemming that's more context- and language-sensitive. Lemmatization is a linguistic approach to stemming that takes word meanings, and potentially grammatical rules, into consideration. It will improve matching accuracy, but at the cost of computational resources and possibly money, as mature lemmatization solutions tend to be commercial offerings rather than freely available ones.

    Lucene provides several stemmer implementations—Snowball, PorterStem, and KStem—that can be leveraged to handle stemming. The best way to choose the right stemmer is usually empirical, as search result quality depends heavily on the type of content being searched. The first sketch after this list shows PorterStemFilter in an analyzer chain.

  • Synonym expansion: Words can be expanded with their synonyms to further improve search quality. As the name suggests, this technique expands a word into additional words with similar meanings for matching, for example, beautiful and pretty or unhappy and sad. Matching on synonyms can help bring back more relevant results when users search for general ideas; the second sketch after this list demonstrates this technique.
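
The first three techniques above can be wired into a single custom analyzer. The following is a minimal sketch, assuming the Lucene 5.x–7.x package layout (some filter classes move between major versions); it chains a StandardTokenizer into LowerCaseFilter for normalization, StopFilter for stopword removal, and PorterStemFilter for stemming:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class StemmingAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream filtered = new LowerCaseFilter(source);       // text normalization
            filtered = new StopFilter(filtered,
                    EnglishAnalyzer.getDefaultStopSet());             // stopword removal
            filtered = new PorterStemFilter(filtered);                // reduce words to root forms
            return new TokenStreamComponents(source, filtered);
        }
    }

With this analyzer, runs, running, and run all index to the same root term, and words such as the and of never reach the index.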
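Synonym expansion plugs into the same kind of chain. Here is a sketch, assuming Lucene 6.4+ where SynonymGraphFilter is available; the word pairs are the examples from the bullet above:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public class SynonymAnalyzer extends Analyzer {
        private final SynonymMap synonyms;

        public SynonymAnalyzer() throws IOException {
            SynonymMap.Builder builder = new SynonymMap.Builder(true);               // true = deduplicate
            builder.add(new CharsRef("beautiful"), new CharsRef("pretty"), true);    // true = keep original
            builder.add(new CharsRef("unhappy"), new CharsRef("sad"), true);
            this.synonyms = builder.build();
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            // Emits each original token plus its synonyms at the same position, so a
            // search for "pretty" also matches documents containing "beautiful".
            // Note: the Lucene Javadoc recommends following this filter with
            // FlattenGraphFilter when the analyzer is used at index time.
            TokenStream filtered = new SynonymGraphFilter(source, synonyms, true);   // true = ignore case
            return new TokenStreamComponents(source, filtered);
        }
    }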