- Lucene 4 Cookbook
- Edwood Ng Vineeth Mohan
- 2021-07-16 14:07:51
Defining custom tokenizers
Lucene ships with several excellent built-in tokenizers, but you may still need one that behaves slightly differently. In that case, you have to build a custom Tokenizer. Lucene provides a character-based tokenizer called CharTokenizer that should suit most tokenization needs. You can override its isTokenChar method to decide which characters are part of a token and which are delimiters. Note that both LetterTokenizer and WhitespaceTokenizer extend CharTokenizer.
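To make the role of isTokenChar concrete, here is a minimal sketch in plain Java that mimics the idea without depending on Lucene. The class CharSplitSketch and its split method are hypothetical names made up for this illustration, and the code is a simplified stand-in rather than Lucene's actual implementation. It collects runs of accepted characters into tokens and treats every rejected character as a delimiter. The two predicates in main resemble the rules used by LetterTokenizer and WhitespaceTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Simplified stand-in for the idea behind CharTokenizer; this is not Lucene code.
class CharSplitSketch {

    // Collects maximal runs of code points accepted by isTokenChar.
    // Any rejected code point acts as a delimiter and is dropped.
    static List<String> split(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar.test(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // A delimiter closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene 4, in action!";
        // Letter-style rule: only letters are token characters.
        System.out.println(split(text, Character::isLetter));
        // Whitespace-style rule: everything except whitespace is a token character.
        System.out.println(split(text, c -> !Character.isWhitespace(c)));
    }
}
```

With the letter-style rule, digits and punctuation act as delimiters, so "4," disappears from the output. With the whitespace-style rule, punctuation stays attached to the neighboring word. Changing the predicate is the only difference between the two behaviors, which is the design that CharTokenizer exposes through isTokenChar.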
How to do it…
In this example, we will create our own tokenizer that splits text on space characters only. It is similar to WhitespaceTokenizer, but simpler. Here is the sample code:
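// Imports assume Lucene 4.10.x, where these packages and constructors are available.
import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class MyTokenizer extends CharTokenizer {

    public MyTokenizer(Reader input) {
        super(input);
    }

    public MyTokenizer(AttributeFactory factory, Reader input) {
        super(factory, input);
    }

    // A character belongs to a token unless it is a Unicode space character.
    @Override
    protected boolean isTokenChar(int c) {
        return !Character.isSpaceChar(c);
    }
}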
How it works…
In this example, we extend an abstract class called CharTokenizer. As described earlier, this is a character-based tokenizer. To use CharTokenizer, you need to override the isTokenChar method. In this method, you examine the input (read from the Reader) one character at a time and decide whether each character is a token character or a delimiter. CharTokenizer handles the complexity of extracting tokens from the Reader, so you can focus on the logic of how text should be tokenized. We want a tokenizer that splits text on spaces only, so we use the isSpaceChar method from the Character class to test whether the character is a space. If it is a space, isTokenChar returns false, which marks the character as a delimiter and ends the current token. Otherwise, it returns true and the character becomes part of the current token.
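It can help to check which characters isSpaceChar actually accepts, because the set differs from the one accepted by Character.isWhitespace. The following sketch uses only the Java standard library. SpaceCharCheck is a hypothetical helper class written for this illustration, and its isTokenChar method simply repeats the rule from the recipe.

```java
// Hypothetical helper for illustration; it repeats the rule used in MyTokenizer.
class SpaceCharCheck {

    // Same decision as MyTokenizer.isTokenChar in the recipe.
    static boolean isTokenChar(int c) {
        return !Character.isSpaceChar(c);
    }

    public static void main(String[] args) {
        // Space, tab, newline, no-break space (U+00A0), and a letter.
        int[] samples = {' ', '\t', '\n', 0x00A0, 'a'};
        for (int c : samples) {
            System.out.printf("U+%04X isSpaceChar=%b isWhitespace=%b tokenChar=%b%n",
                    c, Character.isSpaceChar(c), Character.isWhitespace(c), isTokenChar(c));
        }
    }
}
```

Tabs and newlines are control characters rather than Unicode space separators, so isSpaceChar returns false for them and this rule keeps them inside tokens. The no-break space is the opposite case: isSpaceChar accepts it while isWhitespace does not. If tabs and line breaks should also split text, a rule based on Character.isWhitespace is likely the better fit.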