Obtaining TokenAttribute values
With a TokenStream in hand, we can look at how token values are retrieved. At a high level, a TokenStream is an enumeration of tokens. To access the values, we provide the TokenStream with one or more attribute objects. Note that only one instance of each attribute exists per stream; this is for performance reasons, so that we are not creating new objects in each iteration. Instead, the same attribute instances are updated as we increment through the tokens.
Getting ready
There are several types of attributes; each type exposes a different aspect, or piece of metadata, of a token. Here are the attribute interfaces we will review in this section:
- CharTermAttribute: This exposes a token's actual textual value, equivalent to a term's value.
- PositionIncrementAttribute: This returns the position of the current token relative to the previous token. This attribute is useful in phrase matching, as the keyword order and positions are important. If there is no gap between the current token and the previous token (for example, no stop words in between), it is set to its default value, 1.
- OffsetAttribute: This gives you information about the start and end positions of the corresponding term in the source text.
- TypeAttribute: This is available if it is used in the implementation; it is usually used to identify the token's data type.
- FlagsAttribute: This is somewhat similar to TypeAttribute, but it serves a different purpose. Suppose you need to attach specific information to a token and make it available down the analyzer chain; you can pass it as flags. TokenFilters can perform specific actions based on the flags of the token.
- PayloadAttribute: This stores a payload at each index position and is generally useful in scoring when used with payload-based queries. Because it is stored at each position, it is best to keep the number of bytes per term to a minimum, so the index is not overloaded with a massive amount of data.
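To make the position and type attributes concrete, here is a minimal sketch of our own (not from the recipe) that registers PositionIncrementAttribute and TypeAttribute alongside CharTermAttribute. It assumes the same Lucene 4.x setup as the main sample later in this section, plus imports for the two extra attribute classes from org.apache.lucene.analysis.tokenattributes; error handling is elided. FlagsAttribute and PayloadAttribute are registered in exactly the same way:

StandardAnalyzer analyzer = new StandardAnalyzer();
try (TokenStream stream = analyzer.tokenStream("field", new StringReader("a quick test"))) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = stream.addAttribute(PositionIncrementAttribute.class);
    TypeAttribute type = stream.addAttribute(TypeAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        // "a" is a stop word, so "quick" reports a position increment of 2;
        // StandardTokenizer labels plain words with the type <ALPHANUM>
        System.out.println(term + " posInc=" + posInc.getPositionIncrement()
                + " type=" + type.type());
    }
    stream.end();
}
analyzer.close();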
How to do it…
Now we will see attribute retrieval in action. In this sample, we will use StandardAnalyzer to process the input text, and OffsetAttribute and CharTermAttribute to return each token's value and its offsets. Here is the sample code:
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

StringReader reader = new StringReader("Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org.");
StandardAnalyzer wa = new StandardAnalyzer();
TokenStream ts = null;
try {
    ts = wa.tokenStream("field", reader);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String token = termAtt.toString();
        System.out.println("[" + token + "]");
        System.out.println("Token starting offset: " + offsetAtt.startOffset());
        System.out.println(" Token ending offset: " + offsetAtt.endOffset());
        System.out.println("");
    }
    ts.end();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (ts != null) {
        try {
            ts.close(); // TokenStream.close() throws IOException, so guard it
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    wa.close();
}
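Running the sample should produce output roughly like the following. StandardAnalyzer lowercases tokens and drops English stop words (is, for, and, it, and at in this sentence), and lucene.apache.org should come through as a single token:

[lucene]
Token starting offset: 0
 Token ending offset: 6

[mainly]
Token starting offset: 10
 Token ending offset: 16
...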
Note
Keep in mind that attribute objects are reused in each iteration as we increment tokens; this is done for performance and efficient memory management.
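A practical consequence of this reuse: if you need token values after the loop, copy them out rather than holding on to the attribute object. Here is a minimal sketch of the idea, continuing from the sample above (the tokens list and the java.util.ArrayList import are our own additions, not part of the recipe):

List<String> tokens = new ArrayList<>();
while (ts.incrementToken()) {
    // termAtt is mutated on every incrementToken() call, so we copy its
    // current value; storing termAtt itself would leave only the last token
    tokens.add(termAtt.toString());
}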
How it works…
In this sample, we are breaking down the text "Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org." using StandardAnalyzer. Note that we put a try-catch block around the TokenStream retrieval and its iteration; this is so we can handle IOException and use the finally block to cleanly close the TokenStream and Analyzer. The following is a step-by-step guide to what's happening in the sample code:
- To start processing text, we wrap the input text in a StringReader and pass it to the Analyzer's tokenStream method.
- Then we instantiate two attribute objects, OffsetAttribute and CharTermAttribute.
- The attribute objects are registered with the TokenStream by calling its addAttribute method.
- Note that we call ts.reset() to rewind the TokenStream to the beginning. This call is necessary before every iteration routine to ensure we always iterate from the first token.
- We iterate through the TokenStream in a while loop by calling ts.incrementToken(). The loop exits when incrementToken() returns false.
- We call termAtt.toString() to return the current token's value, and we call the startOffset() and endOffset() methods of offsetAtt to get the offsets. Note that the variables termAtt and offsetAtt are reused in every iteration.
- Then we call ts.end() to end the TokenStream. This call signals the current TokenStream handler to execute any end-of-stream operations.
- Lastly, we call the close() method to close the TokenStream and Analyzer, releasing any resources used during the analysis process.
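Since both TokenStream and Analyzer implement Closeable, the cleanup in this recipe can also be written with try-with-resources on Java 7 and later. Here is a sketch of the same loop in that style (our own variant, not how the recipe presents it); note that end() still needs to be called explicitly before the stream is auto-closed:

try (StandardAnalyzer analyzer = new StandardAnalyzer();
     TokenStream stream = analyzer.tokenStream("field", new StringReader("some text"))) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(term.toString());
    }
    stream.end(); // end-of-stream bookkeeping, before the stream is auto-closed
} catch (IOException e) {
    e.printStackTrace();
}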