Obtaining TokenAttribute values
With a TokenStream, we can look at how token values are retrieved. At a high level, a TokenStream is an enumeration of tokens. To access the values, we provide the TokenStream with one or more attribute objects. Note that only one instance exists per attribute type. This is done for performance reasons, so that we are not creating new objects in each iteration; instead, the same attribute instances are updated as we increment the token.
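As a minimal illustration of this single-instance contract (the analyzer, field name, and sample text here are our own, not from the recipe), addAttribute() hands back the already-registered instance when it is called a second time for the same attribute class:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SingleInstanceCheck {
  public static void main(String[] args) throws IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("field", new StringReader("some text"))) {
      // Both calls return the same object; a TokenStream never holds
      // more than one instance of a given attribute class.
      CharTermAttribute first = ts.addAttribute(CharTermAttribute.class);
      CharTermAttribute second = ts.addAttribute(CharTermAttribute.class);
      System.out.println(first == second); // prints: true
    }
    analyzer.close();
  }
}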
Getting ready
There are several types of attributes; each type exposes a different aspect, or piece of metadata, of a token. Here are the attributes we will review in this section (a short sketch after the list shows some of them in action):
- CharTermAttribute: This exposes a token's actual textual value, equivalent to a term's value.
- PositionIncrementAttribute: This returns the position of the current token relative to the previous token. This attribute is useful in phrase matching, as the keyword order and positions are important. If there is no gap between the current token and the previous token (for example, no stop words in between), it is set to its default value, 1.
- OffsetAttribute: This gives you information about the start and end positions of the corresponding term in the source text.
- TypeAttribute: This is available if it is used in the implementation. It is usually used to identify the data type.
- FlagsAttribute: This is somewhat similar to TypeAttribute, but it serves a different purpose. Suppose you need to add specific information about a token, and that information should be available down the analyzer chain; you can pass it as flags. TokenFilters can perform any specific action based on the flags of the token.
- PayloadAttribute: This stores a payload at each index position and is generally useful in scoring when used with payload-based queries. Because it is stored at each position, it is best to keep the number of bytes per term to a minimum, to avoid bloating the index with a massive amount of data.
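The recipe that follows exercises only OffsetAttribute and CharTermAttribute. As a hedged sketch of two of the other attributes (the field name and sample text are our own), the snippet below also prints each token's position increment and type; with StandardAnalyzer, the stop word "is" is removed, so the token after it reports an increment of 2:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class PositionAndTypeDemo {
  public static void main(String[] args) throws IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("field", new StringReader("Lucene is fast"))) {
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
      TypeAttribute typeAtt = ts.addAttribute(TypeAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // "fast" follows the removed stop word "is", so its increment is 2, not 1
        System.out.println(termAtt + " posIncr=" + posIncrAtt.getPositionIncrement()
            + " type=" + typeAtt.type());
      }
      ts.end();
    }
    analyzer.close();
  }
}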
How to do it…
Now we will see attribute retrieval in action. In this sample, we will use StandardAnalyzer to process the input text, and OffsetAttribute and CharTermAttribute to return each token's value and its offsets. Here is the sample code:
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

StringReader reader = new StringReader("Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org.");
StandardAnalyzer wa = new StandardAnalyzer();
TokenStream ts = null;
try {
  ts = wa.tokenStream("field", reader);
  OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    String token = termAtt.toString();
    System.out.println("[" + token + "]");
    System.out.println("Token starting offset: " + offsetAtt.startOffset());
    System.out.println(" Token ending offset: " + offsetAtt.endOffset());
    System.out.println("");
  }
  ts.end();
} catch (IOException e) {
  e.printStackTrace();
} finally {
  try {
    if (ts != null) {
      ts.close(); // close() declares IOException, so it needs its own guard here
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  wa.close();
}
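For reference (not part of the original listing): with StandardAnalyzer's default English stop set removing words such as "is", "for", and "at", the output should begin along these lines; note that offsets still point into the original text, which is why the second token starts at 10 rather than 7:

[lucene]
Token starting offset: 0
 Token ending offset: 6

[mainly]
Token starting offset: 10
 Token ending offset: 16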
Note
Keep in mind that attribute objects are reused in each iteration as we increment the token, for performance and efficient memory management.
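A practical consequence of this reuse (a sketch; the collectTerms helper is hypothetical, not part of the recipe): to keep token values beyond the loop, copy them out with toString() on each iteration, because the next incrementToken() call overwrites the attribute's contents:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper: collects all term values from a TokenStream.
static List<String> collectTerms(TokenStream ts) throws IOException {
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  List<String> terms = new ArrayList<String>();
  ts.reset();
  while (ts.incrementToken()) {
    // toString() takes a snapshot of the current buffer; adding termAtt itself
    // would leave the list holding N references to one mutating object
    terms.add(termAtt.toString());
  }
  ts.end();
  return terms;
}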
How it works…
In this sample, we are breaking down the following text using StandardAnalyzer: "Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org." Note that we put a try-catch block around the TokenStream retrieval and its iteration, so that we can handle IOException, and use the finally block to cleanly close the TokenStream and Analyzer. The following is a step-by-step guide to what's happening in the sample code:
- To start processing text, we wrap our input text in a StringReader to pass into the Analyzer's tokenStream method.
- Then we instantiate two attribute objects, OffsetAttribute and CharTermAttribute.
- The attribute objects are then registered in the TokenStream by calling its addAttribute method.
- Note that we call ts.reset() to reset the TokenStream to the beginning. This call is necessary prior to every iteration routine to ensure that we always iterate from the beginning.
- We iterate the TokenStream in a while loop by calling ts.incrementToken(). The loop exits when incrementToken() returns false.
- We call termAtt.toString() to return the current token's value, and call the startOffset() and endOffset() methods of offsetAtt to get the offsets. Note that the variables termAtt and offsetAtt are reused in every iteration.
- Now we call ts.end() to end the TokenStream. This call signals the current TokenStream handler to execute any end-of-stream operations.
- And lastly, we call the close() method to close out the TokenStream and the Analyzer, releasing any resources used during the analysis process (an equivalent try-with-resources form is sketched below).
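Since both Analyzer and TokenStream implement Closeable, the last two close() calls can also be handled automatically with Java 7 try-with-resources. Here is a sketch of that equivalent structure (not the book's listing; the input text is shortened for brevity):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

try (StandardAnalyzer wa = new StandardAnalyzer();
     TokenStream ts = wa.tokenStream("field",
         new StringReader("Lucene is mainly used for information retrieval."))) {
  OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println("[" + termAtt + "] "
        + offsetAtt.startOffset() + "-" + offsetAtt.endOffset());
  }
  ts.end(); // end-of-stream housekeeping still runs before the implicit close()
} catch (IOException e) {
  e.printStackTrace();
}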