
Obtaining TokenAttribute values

With a TokenStream in hand, we can look at how token values are retrieved. At a high level, a TokenStream is an enumeration of tokens. To access the token values, we register one or more attribute objects with the TokenStream. Note that only one instance exists per attribute type. This is for performance reasons: instead of creating new objects on each iteration, the same attribute instances are updated in place as we increment the token.
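
For example, here is a minimal sketch (the field name and input text are illustrative, not from the recipe) showing that addAttribute returns the same shared instance on repeated calls:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

try (StandardAnalyzer analyzer = new StandardAnalyzer();
     TokenStream ts = analyzer.tokenStream("field", new StringReader("hello world"))) {
    CharTermAttribute first = ts.addAttribute(CharTermAttribute.class);
    CharTermAttribute second = ts.addAttribute(CharTermAttribute.class);
    System.out.println(first == second); // prints true: one shared instance per attribute type
} catch (IOException e) {
    e.printStackTrace();
}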

Getting ready

There are several types of attributes; each type exposes a different aspect, or piece of metadata, of a token. Here are the token attribute interfaces we will review in this section:

  • CharTermAttribute: This exposes a token's actual textual value, equivalent to a term's value.
  • PositionIncrementAttribute: This returns the position of the current token relative to the previous token. The attribute is useful in phrase matching, where keyword order and positions are important. If there is no gap between the current token and the previous one (for example, no stop words were removed in between), it is set to its default value, 1 (this attribute is read in the sketch after this list).
  • OffsetAttribute: This gives you information about the start and end positions of the corresponding term in the source text.
  • TypeAttribute: This is available if the tokenizer or filter implementation sets it; it is usually used to identify the token's data type.
  • FlagsAttribute: This is somewhat similar to TypeAttribute, but it serves a different purpose. Suppose you need to attach specific information to a token and make that information available further down the analyzer chain; you can pass it as flags. TokenFilters can then perform specific actions based on a token's flags.
  • PayloadAttribute: This stores a payload at each index position and is generally useful for scoring when used with payload-based queries. Because the payload is stored at every position, it is best to keep it to a minimal number of bytes per term so that the index is not bloated with a massive amount of data.
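
Here is a minimal sketch (the field name and input text are illustrative) that reads PositionIncrementAttribute and TypeAttribute alongside the term text, using the same addAttribute registration pattern shown later in this recipe:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

try (StandardAnalyzer analyzer = new StandardAnalyzer();
     TokenStream ts = analyzer.tokenStream("field", new StringReader("information retrieval"))) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
    TypeAttribute typeAtt = ts.addAttribute(TypeAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
        // StandardTokenizer tags tokens with a type such as <ALPHANUM> or <NUM>
        System.out.println(termAtt.toString()
                + " posIncr=" + posIncrAtt.getPositionIncrement()
                + " type=" + typeAtt.type());
    }
    ts.end();
} catch (IOException e) {
    e.printStackTrace();
}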

How to do it…

Now we will see attribute retrieval in action. In this sample, we will use StandardAnalyzer to process the input text, and OffsetAttribute and CharTermAttribute to return each token's value and its offsets. Here is the sample code:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

StringReader reader = new StringReader("Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org.");
StandardAnalyzer wa = new StandardAnalyzer();
TokenStream ts = null;

try {
    ts = wa.tokenStream("field", reader);

    // Register the attributes we want to read; the same instances
    // are updated in place on every call to incrementToken()
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);

    ts.reset(); // mandatory before the first incrementToken() call

    while (ts.incrementToken()) {
        String token = termAtt.toString();
        System.out.println("[" + token + "]");
        System.out.println("Token starting offset: " + offsetAtt.startOffset());
        System.out.println(" Token ending offset: " + offsetAtt.endOffset());
        System.out.println("");
    }

    ts.end(); // perform any end-of-stream operations
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (ts != null) {
        try {
            ts.close(); // close() declares IOException, so guard it here
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    wa.close();
}
Note

Keep in mind that, for performance and efficient memory management, attribute objects are reused in each iteration as we increment tokens.
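
As a practical consequence, here is a minimal sketch (reusing the ts and termAtt variables from the sample above, plus java.util.List and java.util.ArrayList): if you need token values after the loop, copy them out of the shared attribute rather than storing the attribute object itself:

List<String> tokens = new ArrayList<>();
while (ts.incrementToken()) {
    // Storing termAtt itself would leave every list entry pointing at the
    // same mutating instance; toString() copies the current value out.
    tokens.add(termAtt.toString());
}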

How it works…

In this sample, we break down the text "Lucene is mainly used for information retrieval and you can read more about it at lucene.apache.org." using StandardAnalyzer. Note that we put a try-catch block around the TokenStream retrieval and its iteration. This lets us handle IOException and use the finally block to cleanly close the TokenStream and the Analyzer. The following is a step-by-step guide to what happens in the sample code:

  1. To start processing the text, we wrap our input string in a StringReader to pass into the Analyzer's tokenStream method.
  2. Then we obtain two attribute objects, OffsetAttribute and CharTermAttribute.
  3. The attribute objects are registered with TokenStream by calling its addAttribute method, which returns the shared instance for each attribute type, creating it if it does not exist yet.
  4. Note that we call ts.reset() to reset TokenStream to the beginning. This call is necessary before every iteration pass to ensure that we always iterate from the beginning.
  5. We iterate TokenStream in a while loop by calling ts.incrementToken(). The loop exits when incrementToken() returns false.
  6. We call termAtt.toString() to return the current token's value, and we call the startOffset() and endOffset() methods of offsetAtt to get the offsets. Note that the variables termAtt and offsetAtt are reused in every iteration.
  7. Then we call ts.end() to end the TokenStream. This call signals the current TokenStream handler to execute any end-of-stream operations.
  8. Lastly, we call the close() method to close the TokenStream and the Analyzer, releasing any resources used during the analysis process.
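
As an aside, the cleanup in steps 7 and 8 can also be expressed with try-with-resources, since both TokenStream and Analyzer implement Closeable. Here is a minimal equivalent sketch (the input text is illustrative):

try (StandardAnalyzer analyzer = new StandardAnalyzer();
     TokenStream ts = analyzer.tokenStream("field", new StringReader("Lucene in action"))) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println("[" + termAtt.toString() + "]");
    }
    ts.end();
} catch (IOException e) {
    e.printStackTrace();
}

The resources are closed in the reverse order of declaration, so the TokenStream is closed before the Analyzer, matching the manual finally block in the sample above.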