- Lucene 4 Cookbook
- Edwood Ng, Vineeth Mohan
- 2021-07-16 14:07:51
Defining custom TokenFilters
Sometimes a search behavior is so specific that we need a custom TokenFilter to achieve it. To create one, we extend the TokenFilter class and override the incrementToken() method.
We will create a simple word-expanding TokenFilter that expands courtesy titles from the short form to the full word. For example, Dr expands to doctor.
How to do it…
Here is the sample code:
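import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Declared final: Lucene expects a TokenStream subclass (or its incrementToken()) to be final.
public final class CourtesyTitleFilter extends TokenFilter {
    // Short form -> full word
    private final Map<String, String> courtesyTitleMap = new HashMap<String, String>();
    private final CharTermAttribute termAttr;

    public CourtesyTitleFilter(TokenStream input) {
        super(input);
        termAttr = addAttribute(CharTermAttribute.class);
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "missus");
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Stop once the upstream TokenStream is exhausted
        if (!input.incrementToken()) {
            return false;
        }
        String small = termAttr.toString();
        if (courtesyTitleMap.containsKey(small)) {
            // Replace the token text with its expanded form
            termAttr.setEmpty().append(courtesyTitleMap.get(small));
        }
        return true;
    }
}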
How it works…
We create the CourtesyTitleFilter class by extending TokenFilter. In its constructor, we register a CharTermAttribute instance for reading and rewriting the token text, and populate courtesyTitleMap with the short-form-to-word mappings. In the overridden incrementToken() method, we first advance the input (the upstream TokenStream); if it has no more tokens, the method returns false. Otherwise, it looks up the current token in courtesyTitleMap. If a mapping is found, it replaces the token text through CharTermAttribute: setEmpty() clears the attribute and append() writes the new value from courtesyTitleMap. The method then returns true, whether or not the token was changed.
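The pull-style contract described above can be tried without Lucene on the classpath. The following is a minimal plain-Java sketch, not Lucene's API: the names PullModelSketch, Tokens, WordSource, and TitleFilter are hypothetical, and a shared StringBuilder stands in for the CharTermAttribute that the source and the filter both see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the pull model; none of these types are Lucene classes.
final class PullModelSketch {

    // Hypothetical counterpart of TokenStream: advance, then read the shared term buffer.
    interface Tokens {
        boolean incrementToken();
    }

    // Writes pre-split words into the shared buffer, one per call.
    static final class WordSource implements Tokens {
        private final Iterator<String> words;
        private final StringBuilder term;

        WordSource(List<String> words, StringBuilder term) {
            this.words = words.iterator();
            this.term = term;
        }

        @Override
        public boolean incrementToken() {
            if (!words.hasNext()) {
                return false; // stream exhausted
            }
            term.setLength(0);
            term.append(words.next());
            return true;
        }
    }

    // Mirrors the recipe's filter: pull from upstream, then rewrite the term in place.
    static final class TitleFilter implements Tokens {
        private final Tokens input;
        private final StringBuilder term;
        private final Map<String, String> titles = new HashMap<>();

        TitleFilter(Tokens input, StringBuilder term) {
            this.input = input;
            this.term = term;
            titles.put("Dr", "doctor");
            titles.put("Mr", "mister");
        }

        @Override
        public boolean incrementToken() {
            if (!input.incrementToken()) {
                return false;
            }
            String expanded = titles.get(term.toString());
            if (expanded != null) {
                term.setLength(0);     // plays the role of setEmpty()
                term.append(expanded); // plays the role of append(...)
            }
            return true;
        }
    }

    // Drains the chain and collects every term a consumer would see.
    static List<String> run(String... words) {
        StringBuilder term = new StringBuilder();
        Tokens chain = new TitleFilter(new WordSource(Arrays.asList(words), term), term);
        List<String> out = new ArrayList<>();
        while (chain.incrementToken()) {
            out.add(term.toString());
        }
        return out;
    }
}
```

Calling PullModelSketch.run("Dr", "Watson") drains the chain one token at a time. The filter never stores tokens of its own; it only rewrites the shared buffer after each upstream advance, which is the same division of labor the recipe relies on.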
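When you run this filter as part of an analysis chain that splits text on whitespace and applies a lowercase filter at the end, the string Dr Watson becomes [doctor] [watson] in the output. The lowercase filter has to come after CourtesyTitleFilter because the keys in courtesyTitleMap are capitalized.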
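The effect of filter order in such a chain can be checked with a small plain-Java sketch. It does not use Lucene: FilterOrderSketch and its two methods are hypothetical names, whitespace splitting is approximated with String.split, and lowercasing with String.toLowerCase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Compares the two possible orderings of title expansion and lowercasing.
final class FilterOrderSketch {
    private static final Map<String, String> TITLES = new HashMap<>();
    static {
        TITLES.put("Dr", "doctor");
        TITLES.put("Mr", "mister");
    }

    // Order used in the recipe: expand titles first, lowercase last.
    static List<String> expandThenLowercase(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String expanded = TITLES.getOrDefault(token, token);
            out.add(expanded.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Reversed order: "Dr" is already "dr" at lookup time, so no key matches.
    static List<String> lowercaseThenExpand(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String lowered = token.toLowerCase(Locale.ROOT);
            out.add(TITLES.getOrDefault(lowered, lowered));
        }
        return out;
    }
}
```

For the input Dr Watson, expandThenLowercase yields [doctor, watson], while lowercaseThenExpand yields [dr, watson]. If lowercasing had to run first, one option would be to store lowercase keys in the map instead.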