
Defining custom TokenFilters

Sometimes a search requirement is so specific that no built-in filter covers it, and we need a custom TokenFilter to implement it. To create a custom filter, we extend the TokenFilter class and override the incrementToken() method.

We will create a simple word-expanding TokenFilter that expands courtesy titles from the short form to the full word. For example, Dr expands to doctor.

How to do it…

Here is the sample code:

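import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Declared final: with assertions enabled, Lucene checks that a TokenStream subclass (or its incrementToken()) is final.
public final class CourtesyTitleFilter extends TokenFilter {
    // Short form -> full word. Keys are case-sensitive.
    private final Map<String,String> courtesyTitleMap = new HashMap<String,String>();
    // Gives read and write access to the text of the current token.
    private final CharTermAttribute termAttr;

    public CourtesyTitleFilter(TokenStream input) {
        super(input);
        termAttr = addAttribute(CharTermAttribute.class);
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "missus"); // "miss" is a different title
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Advance the wrapped stream; false means there are no more tokens.
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAttr.toString();
        // Replace the token text in place when it is a known short form.
        if (courtesyTitleMap.containsKey(term)) {
            termAttr.setEmpty().append(courtesyTitleMap.get(term));
        }
        return true;
    }
}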

How it works…

We create the CourtesyTitleFilter class by extending TokenFilter. In its constructor, we call addAttribute() to obtain the CharTermAttribute, which lets us read and rewrite the text of the current token, and we populate courtesyTitleMap with the short-form-to-word mappings used for the conversion. In the overridden incrementToken() method, we first call incrementToken() on input (the wrapped TokenStream) to advance to the next token. If there are no more tokens, the method returns false. Otherwise, it looks up the token text in courtesyTitleMap. If a mapping is found, it replaces the token text through the CharTermAttribute: setEmpty() clears the current value and append() writes the new value from courtesyTitleMap. The method then returns true to signal that a token is available.

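When you run this code as part of an analysis chain that splits text on whitespace and applies a lowercase filter at the end, the string Dr Watson becomes [doctor] [watson] in the output. The lowercase filter has to come after CourtesyTitleFilter because the map keys are capitalized; a token that has already been lowercased to dr would not match.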
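The effect of filter order can be checked without Lucene on the classpath. The following is a minimal plain-Java sketch that only models the chain described above; the class CourtesyTitleChainSketch and its helper methods are hypothetical names introduced for illustration and are not part of the Lucene API. It assumes the same capitalized keys as the recipe's map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the analysis chain; no Lucene classes are used here.
final class CourtesyTitleChainSketch {
    // Same idea as courtesyTitleMap in the recipe; keys are capitalized.
    private static final Map<String, String> TITLES = new HashMap<>();
    static {
        TITLES.put("Dr", "doctor");
        TITLES.put("Mr", "mister");
    }

    // Stand-in for a whitespace tokenizer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for CourtesyTitleFilter: replace known short forms, keep the rest.
    static List<String> expandTitles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(TITLES.getOrDefault(token, token));
        }
        return out;
    }

    // Stand-in for a lowercase filter.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Order used in the recipe: expand first, lowercase last.
    static List<String> expandThenLowercase(String text) {
        return lowercase(expandTitles(tokenize(text)));
    }

    // Reversed order: "dr" no longer matches the capitalized key "Dr".
    static List<String> lowercaseThenExpand(String text) {
        return expandTitles(lowercase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(expandThenLowercase("Dr Watson")); // [doctor, watson]
        System.out.println(lowercaseThenExpand("Dr Watson")); // [dr, watson]
    }
}
```

If the lowercase step must run first in a real chain, one option is to store lowercase keys in the map instead, at the cost of also expanding ordinary lowercase tokens that happen to match.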
