
  • Lucene 4 Cookbook
  • Edwood Ng Vineeth Mohan
  • 2021-07-16 14:07:51

Defining custom TokenFilters

Sometimes the search behavior we need is so specific that no built-in filter provides it, and we have to write a custom TokenFilter. To create one, we extend the TokenFilter class and override its incrementToken() method.

We will create a simple word-expanding TokenFilter that expands courtesy titles from the short form to the full word. For example, Dr expands to doctor.

How to do it…

Here is the sample code:

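import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CourtesyTitleFilter extends TokenFilter {
    // Maps each short courtesy title to its expanded word.
    private final Map<String,String> courtesyTitleMap = new HashMap<String,String>();
    private final CharTermAttribute termAttr;

    public CourtesyTitleFilter(TokenStream input) {
        super(input);
        termAttr = addAttribute(CharTermAttribute.class);
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "missus"); // "Mrs" is not short for "miss"
    }

    // Lucene asserts that incrementToken() (or the whole class) is final.
    @Override
    public final boolean incrementToken() throws IOException {
        // Stop when the upstream TokenStream has no more tokens.
        if (!input.incrementToken())
            return false;
        String small = termAttr.toString();
        // The lookup is case-sensitive: "Dr" matches, "dr" does not.
        if (courtesyTitleMap.containsKey(small)) {
            termAttr.setEmpty().append(courtesyTitleMap.get(small));
        }
        return true;
    }
}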

How it works…

We create the CourtesyTitleFilter class by extending TokenFilter. In its constructor, we register a CharTermAttribute, which gives us read and write access to the current token's text, and we populate courtesyTitleMap with the mappings from each short form to its full word. In the overridden incrementToken() method, we first advance the input (the upstream TokenStream). If it has no more tokens, we return false. Otherwise, we look up the current token in courtesyTitleMap. If a mapping is found, we replace the token text through the CharTermAttribute: setEmpty() clears the current value and append() writes the new value from courtesyTitleMap. In either case we return true to signal that a token is available.

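When this filter runs in an analysis chain that splits text on whitespace and applies a lowercase filter after it, the string Dr Watson becomes [doctor] [watson] in the output. The lowercase filter has to come last because the map keys are capitalized: a token that has already been lowercased to dr would not match any key.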
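The effect of filter order can be illustrated without Lucene on the classpath. The following is a minimal plain-Java sketch, not Lucene code: the class name CourtesyTitlePipelineSketch and both method names are hypothetical, String.split stands in for a whitespace tokenizer, toLowerCase stands in for a lowercase filter, and only two of the recipe's mappings are included. It compares expanding titles before lowercasing with lowercasing first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java stand-in for the analysis chain; no Lucene classes are used.
public class CourtesyTitlePipelineSketch {
    // A subset of the recipe's mappings, keyed by the capitalized short form.
    private static final Map<String, String> TITLES = new HashMap<>();
    static {
        TITLES.put("Dr", "doctor");
        TITLES.put("Mr", "mister");
    }

    // Order described in the recipe: split, expand titles, lowercase last.
    static List<String> expandThenLowercase(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue; // blank input yields no tokens
            String expanded = TITLES.getOrDefault(token, token);
            out.add(expanded.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Reversed order: lowercasing first means "dr" no longer matches "Dr".
    static List<String> lowercaseThenExpand(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            String lowered = token.toLowerCase(Locale.ROOT);
            out.add(TITLES.getOrDefault(lowered, lowered));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expandThenLowercase("Dr Watson")); // [doctor, watson]
        System.out.println(lowercaseThenExpand("Dr Watson")); // [dr, watson]
    }
}
```

In a real Lucene chain, the same reasoning suggests wrapping the tokenizer with CourtesyTitleFilter first and the lowercase filter second. Lowercasing the map keys instead would make the filter independent of order, at the cost of also expanding ordinary lowercase words that happen to match a key.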
