- Lucene 4 Cookbook
- Edwood Ng Vineeth Mohan
- 2021-07-16 14:07:50
Using PositionIncrementAttribute
The PositionIncrementAttribute class shows the position of the current token relative to the previous token. The default value is 1. Any value greater than 1 implies that the previous token and the current token are not consecutive: there is a gap between the two tokens where some tokens (for example, stopwords) were omitted. This attribute is useful in phrase matching, where the position and order of words matter. For example, say you want to execute an exact phrase match. As we step through the TokenStream, the position increment on each matching token after the first should be 1, so we know the text matches the search phrase word for word, in the same order.
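The arithmetic behind this can be sketched without Lucene at all. The following is a minimal, library-free illustration in plain Java, not Lucene's API: the class name PositionDemo and both methods are hypothetical. It assumes the position counter starts at -1, so a first token with the default increment of 1 lands on position 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class PositionDemo {

    // Turn per-token position increments into absolute positions.
    // The counter starts at -1 so that a first increment of 1 yields position 0.
    static List<Integer> absolutePositions(int[] increments) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            positions.add(position);
        }
        return positions;
    }

    // True when every token after the first has an increment of exactly 1,
    // meaning no token was dropped between two neighbours.
    static boolean isConsecutive(int[] increments) {
        for (int i = 1; i < increments.length; i++) {
            if (increments[i] != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "lucene is great" with the stopword "is" removed:
        // "lucene" keeps increment 1, "great" gets increment 2.
        int[] increments = {1, 2};
        System.out.println(absolutePositions(increments)); // prints [0, 2]
        System.out.println(isConsecutive(increments));     // prints false
    }
}
```

The gap at position 1 is what keeps an exact phrase query for "lucene great" from matching text that originally read "lucene is great".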
Another use of this attribute is synonym matching in a phrase query. Synonyms can be inserted into the TokenStream following the term that's being expanded. The position increment for each synonym would be set to 0, which indicates that the synonym term is at the same position as the source term (the previous token). That way, the phrase "Lucene is great for search" would match "Lucene is excellent for search" (assuming great is synonymous with excellent in the chosen synonym filter).
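To see why an increment of 0 makes the synonym interchangeable with the source term, here is a small library-free model in plain Java. It is a sketch of the idea only, not how Lucene's phrase queries are implemented; SynonymPhraseDemo and its methods are hypothetical names, and the first token is assumed to have an increment of at least 1.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration only; not part of Lucene.
final class SynonymPhraseDemo {

    // Group terms by absolute position. An increment of 0 stacks a term
    // on the same position as the previous token.
    static List<Set<String>> termsByPosition(String[] terms, int[] increments) {
        List<Set<String>> slots = new ArrayList<>();
        int position = -1;
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];
            if (position < 0) {
                throw new IllegalArgumentException("first increment must be at least 1");
            }
            while (slots.size() <= position) {
                slots.add(new HashSet<>()); // empty slots represent gaps
            }
            slots.get(position).add(terms[i]);
        }
        return slots;
    }

    // True if the phrase terms occur at consecutive positions, in order.
    static boolean matchesPhrase(List<Set<String>> slots, String[] phrase) {
        for (int start = 0; start + phrase.length <= slots.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < phrase.length; i++) {
                if (!slots.get(start + i).contains(phrase[i])) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "excellent" is injected after "great" with increment 0.
        String[] terms = {"lucene", "is", "great", "excellent", "for", "search"};
        int[] increments = {1, 1, 1, 0, 1, 1};
        List<Set<String>> slots = termsByPosition(terms, increments);

        String[] query = {"lucene", "is", "excellent", "for", "search"};
        System.out.println(matchesPhrase(slots, query)); // prints true
    }
}
```

Because "great" and "excellent" share position 2, either word satisfies the third slot of the phrase, while "for" still follows at position 3.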
Getting ready
PositionIncrementAttribute can be retrieved by calling addAttribute(PositionIncrementAttribute.class) on the TokenStream object. As we already learned, the attribute is updated each time we call incrementToken to iterate through the tokens. To illustrate how this attribute is used, we are going to write a simple filter that skips stopwords and sets position increments accordingly.
How to do it...
Here is a sample code snippet:
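    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Declared final: Lucene asserts that TokenStream subclasses
    // (or their incrementToken() method) are final.
    public final class MyStopWordFilter extends TokenFilter {

        private final CharTermAttribute charTermAtt;
        private final PositionIncrementAttribute posIncrAtt;

        public MyStopWordFilter(TokenStream input) {
            super(input);
            charTermAtt = addAttribute(CharTermAttribute.class);
            posIncrAtt = addAttribute(PositionIncrementAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            int extraIncrement = 0; // number of stopwords skipped so far
            boolean returnValue = false;
            while (input.incrementToken()) {
                if (StopAnalyzer.ENGLISH_STOP_WORDS_SET.contains(charTermAtt.toString())) {
                    extraIncrement++; // filter this word
                    continue;
                }
                returnValue = true; // found a non-stopword token
                break;
            }
            if (extraIncrement > 0) {
                // Widen the gap to cover the skipped stopwords.
                posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + extraIncrement);
            }
            return returnValue;
        }
    }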
How it works…
In this example, we obtain two attributes: CharTermAttribute (for text value retrieval) and PositionIncrementAttribute (to set the position increment value). Then we call input.incrementToken() to iterate through the TokenStream (input is a variable in TokenFilter that points to the incoming TokenStream). In each iteration, we check if the current token is a stopword. If it is, we increment extraIncrement by 1 to account for the filtered stopword. The while loop exits either when we find a non-stopword or when we exhaust the list of tokens. The position increment of the next non-stopword token is then raised by extraIncrement. The amount added tells you how many stopwords this filter removed immediately before that token.
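The effect of the loop can be traced with a small stand-alone simulation. This is a sketch in plain Java rather than a real TokenFilter: StopWordGapDemo, its tiny STOP_WORDS set, and the whitespace tokenization are all assumptions made for illustration, and every incoming token is assumed to start with the default increment of 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical simulation for illustration only; not part of Lucene.
final class StopWordGapDemo {

    // Tiny stand-in list; Lucene's ENGLISH_STOP_WORDS_SET is larger.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "for", "is", "the"));

    // Returns "term:increment" for every token that survives the filter.
    static List<String> filter(String text) {
        List<String> output = new ArrayList<>();
        int extraIncrement = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            if (STOP_WORDS.contains(term)) {
                extraIncrement++; // remember the skipped stopword
                continue;
            }
            // Default increment of 1 plus one per skipped stopword.
            output.add(term + ":" + (1 + extraIncrement));
            extraIncrement = 0;
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(filter("Lucene is great for the search"));
        // prints [lucene:1, great:2, search:3]
    }
}
```

Here "great" carries an increment of 2 because one stopword ("is") was dropped before it, and "search" carries 3 because two stopwords ("for", "the") were dropped.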