Defining custom TokenFilters

Sometimes a search behavior is so specific that we need a custom TokenFilter to achieve it. To create a custom filter, we extend the TokenFilter class and override the incrementToken() method.

We will create a simple word-expanding TokenFilter that expands courtesy titles from the short form to the full word. For example, Dr expands to doctor.

How to do it…

Here is the sample code:

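import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Expands courtesy titles such as "Dr" to their full word.
// Lucene expects a TokenFilter subclass to be final (or to declare incrementToken() final).
public final class CourtesyTitleFilter extends TokenFilter {
    private final Map<String, String> courtesyTitleMap = new HashMap<String, String>();
    private final CharTermAttribute termAttr;

    public CourtesyTitleFilter(TokenStream input) {
        super(input);
        termAttr = addAttribute(CharTermAttribute.class);
        // Keys are case-sensitive, so this filter must run before any lowercase filter.
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "missus");
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Stop when the upstream TokenStream has no more tokens.
        if (!input.incrementToken()) {
            return false;
        }
        String fullTitle = courtesyTitleMap.get(termAttr.toString());
        if (fullTitle != null) {
            // Replace the current term with its expanded form.
            termAttr.setEmpty().append(fullTitle);
        }
        return true;
    }
}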

How it works…

We create the CourtesyTitleFilter class by extending TokenFilter. In its constructor, we register a CharTermAttribute, which gives us access to the text of the current token, and populate courtesyTitleMap with the mappings from each short form to its full word.

In the overridden incrementToken() method, we first advance the input (the upstream TokenStream that this filter wraps). If it has no more tokens, the method returns false. Otherwise, we look up the current token in courtesyTitleMap. If a mapping is found, we replace the token value through the CharTermAttribute: setEmpty() clears the current text and append() writes the new value from courtesyTitleMap. The method then returns true to signal that a token is available.

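When you run this filter as part of an analysis chain that splits text on whitespace and applies a lowercase filter at the end, the string Dr Watson becomes [doctor] [watson] in the output. The lowercase filter has to come after this filter because the keys in courtesyTitleMap are capitalized; a token that has already been lowercased to dr would not match.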
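To see why the position of the lowercase step matters, the chain described above can be imitated in plain Java. This is only a sketch under stated assumptions: it uses no Lucene classes, the class name CourtesyTitlePipelineSketch and its methods are hypothetical, and it carries just two of the mappings. It mimics the order of the steps, not Lucene's TokenStream API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java stand-in for the analysis chain; hypothetical, not part of Lucene.
class CourtesyTitlePipelineSketch {
    // A subset of the filter's mappings, with the same capitalized keys.
    private static final Map<String, String> COURTESY_TITLES = new HashMap<>();
    static {
        COURTESY_TITLES.put("Dr", "doctor");
        COURTESY_TITLES.put("Mr", "mister");
    }

    // Mirrors the filter's lookup: replace the token only when it is a known key.
    static String expandTitle(String token) {
        return COURTESY_TITLES.getOrDefault(token, token);
    }

    // Whitespace split -> title expansion -> lowercase (the order described above).
    static List<String> expandThenLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields a single empty string from split()
            }
            tokens.add(expandTitle(token).toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Whitespace split -> lowercase -> title expansion: the lookup no longer matches.
    static List<String> lowercaseThenExpand(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            tokens.add(expandTitle(token.toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }
}
```

Calling expandThenLowercase("Dr Watson") yields the tokens doctor and watson, while lowercaseThenExpand("Dr Watson") yields dr and watson, because the lowercased token no longer matches the key Dr. An alternative design would be to store lowercase keys and place the filter after the lowercase step.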
