Sometimes a search behavior is so specific that none of the built-in filters can provide it, and we need to write a custom TokenFilter. To create one, we extend the TokenFilter class and override the incrementToken() method.
We will create a simple word-expanding TokenFilter that expands courtesy titles from the short form to the full word. For example, Dr expands to doctor.
How to do it…
Here is the sample code:
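import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Lucene expects TokenStream subclasses (or their incrementToken()) to be final;
// this is checked when assertions are enabled.
public final class CourtesyTitleFilter extends TokenFilter {
    // Short form -> full word. Keys are case-sensitive.
    private final Map<String,String> courtesyTitleMap = new HashMap<String,String>();
    private final CharTermAttribute termAttr;

    public CourtesyTitleFilter(TokenStream input) {
        super(input);
        termAttr = addAttribute(CharTermAttribute.class);
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "missus");
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Advance the wrapped stream; stop when it has no more tokens.
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAttr.toString();
        if (courtesyTitleMap.containsKey(term)) {
            // Overwrite the current token text with the expanded word.
            termAttr.setEmpty().append(courtesyTitleMap.get(term));
        }
        return true;
    }
}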
How it works…
We create the CourtesyTitleFilter class by extending TokenFilter. In its constructor, we call addAttribute() to obtain the CharTermAttribute used to read and modify the token text, and we populate courtesyTitleMap with the short form to full word mappings. In the overridden incrementToken() method, we first advance the wrapped input TokenStream. If it has no more tokens, the method returns false. Otherwise, we look up the current token text in courtesyTitleMap. If a mapping is found, we replace the token text by calling setEmpty() on the CharTermAttribute and appending the new value from courtesyTitleMap. The method then returns true to signal that a token is available.
When you run this filter as part of an analysis chain that splits text on whitespace and applies a lowercase filter at the end, the string Dr Watson becomes [doctor] [watson] in the output. The order matters: the keys in courtesyTitleMap are capitalized, so the lookup is case-sensitive and a lowercase filter placed before CourtesyTitleFilter would prevent any match.
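To make the ordering of the analysis steps concrete, here is a minimal sketch that imitates the chain in plain Java, without Lucene. The class CourtesyTitleSketch and its methods analyze(), analyzeLowercaseFirst(), and expand() are hypothetical names introduced only for this illustration, and the map holds just two of the titles. It is not how Lucene wires an Analyzer; it only mirrors the lookup-and-replace logic of incrementToken() on a list of strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java imitation of: whitespace split -> courtesy title expansion -> lowercase.
// Hypothetical helper for illustration only; it does not use the Lucene API.
final class CourtesyTitleSketch {
    private static final Map<String, String> TITLES = new HashMap<>();
    static {
        TITLES.put("Dr", "doctor");
        TITLES.put("Mr", "mister");
    }

    // Same decision as incrementToken(): replace the token only if a mapping exists.
    static String expand(String token) {
        return TITLES.containsKey(token) ? TITLES.get(token) : token;
    }

    // Expansion runs first, lowercasing runs last.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty string from split()
            }
            tokens.add(expand(token).toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Lowercasing runs first, so the capitalized keys never match.
    static List<String> analyzeLowercaseFirst(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            tokens.add(expand(token.toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Dr Watson"));               // [doctor, watson]
        System.out.println(analyzeLowercaseFirst("Dr Watson")); // [dr, watson]
    }
}
```

The second call shows why the lowercase step belongs after the expansion step: once Dr has become dr, the case-sensitive lookup no longer finds it.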