
Analysis

We mentioned earlier that all of Apache Lucene's data is stored in an inverted index. Raw data must be transformed into this structure for Elasticsearch to respond to search requests successfully. This transformation process is called analysis.

Elasticsearch has an index analysis module that maps to the Lucene Analyzer. In general, an analyzer is composed of a single Tokenizer and zero or more TokenFilters.
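To see this composition in action, you can combine a tokenizer and token filters on the fly through the _analyze API, without defining an analyzer first (a minimal sketch; the parameter names below follow the 1.x-era REST API used throughout this chapter):

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&pretty' -d 'Joe Jeffers'

Here, the keyword tokenizer emits the whole input as a single token and the lowercase token filter then transforms it, so the response should contain the single token joe jeffers.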

Note

Analysis modules and analyzers will be discussed in depth in Chapter 4, Analysis and Analyzers.

Elasticsearch provides many character filters, tokenizers, and token filters. For example, a character filter may be used to strip out HTML markup and a token filter may be used to modify tokens (for example, by lowercasing them). You can combine them to create custom analyzers, or you can use one of the built-in analyzers.
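For example, the following sketch (the index name myindex and analyzer name my_html_analyzer are made up for illustration) defines a custom analyzer that strips HTML markup with the html_strip character filter, splits the text with the standard tokenizer, and lowercases each token:

curl -XPUT localhost:9200/myindex -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

curl -XGET 'localhost:9200/myindex/_analyze?analyzer=my_html_analyzer&pretty' -d '<b>Joe</b> Jeffers'

The second command should return the tokens joe and jeffers; the markup is removed before tokenization.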

A good understanding of the analysis process is very important for improving the user's search experience and the relevance of search results, because Elasticsearch (actually Lucene) uses analyzers both at indexing time and at query time.

Tip

It is crucial to remember that not all Elasticsearch queries are analyzed.

Now let's examine the importance of the analyzer in terms of relevant search results with a simple scenario:

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GIEQeR7spPlxvqlud","_version":1,"created":true}

We indexed an employee named Joe Jeffers Hoffman, 30 years old. Now let's search the company index for employees named Joe:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 68,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.19178301,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GIEQeR7spPlxvqlud",
            "_score": 0.19178301,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

All string type fields in the company index will be analyzed by the standard analyzer because the employee type was created with dynamic mapping.

The standard analyzer is the default analyzer that Elasticsearch uses. It removes most punctuation and splits the text on word boundaries, as defined by the Unicode Consortium.

Note

If you want to have more information about the Unicode Consortium, please refer to http://www.unicode.org/reports/tr29/.
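To see the punctuation handling, consider the following quick sketch (the input text is made up for illustration):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Joe, Jeffers & Hoffman Co.'

This should produce the tokens joe, jeffers, hoffman, and co; the comma, ampersand, and period are all stripped out.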

In this case, Joe Jeffers is indexed as two tokens (joe and jeffers). To see how the standard analyzer works, run the following command:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Joe Jeffers'
{
  "tokens" : [ {
    "token" : "joe",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "jeffers",
    "start_offset" : 4,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

We searched for joe, and the document containing Joe Jeffers was returned to us because the standard analyzer splits the text on word boundaries and converts tokens to lowercase. The standard analyzer is built using the Lower Case Token Filter along with other filters (for example, the Standard Token Filter and the Stop Token Filter).
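This is also a good place to see the earlier tip in practice. Unlike match, a term query is not analyzed, so its text is compared against the index exactly as given (a sketch against the dynamically mapped index above):

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "term": {
      "firstname": "Joe"
    }
  }
}'

This returns no hits, because only the lowercase tokens joe and jeffers exist in the index; a term query for joe would match.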

Now let's examine the following example:

curl -XDELETE localhost:9200/company
{"acknowledged":true}

curl -XPUT localhost:9200/company -d '{
  "mappings": {
    "employee": {
      "properties": {
        "firstname": {"type": "string", "index": "not_analyzed"}
      }
    }
  }
}'
{"acknowledged":true}

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GOF2wR7spPlxvqmHY","_version":1,"created":true}

We deleted the company index created by dynamic mapping and recreated it with explicit mapping. This time, we set the index option of the firstname field in the employee type to not_analyzed, which means the field is not analyzed at indexing time.
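We can confirm this by retrieving the mapping:

curl -XGET localhost:9200/company/_mapping?pretty

The response should show "index": "not_analyzed" on the firstname field. Now let's repeat the match query for joe: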

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 12,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

As you can see, Elasticsearch did not return a result for the match query because the firstname field is configured as not_analyzed. Elasticsearch therefore did not use an analyzer during indexing; the indexed value was exactly as specified. In other words, Joe Jeffers was indexed as a single token. Unless otherwise indicated, the match query uses the default search analyzer, so if we want the document to be returned by a match query without changing the analyzer, we need to specify the exact value (paying attention to uppercase and lowercase):

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match" : {
        "firstname": "Joe Jeffers"
    }
  }
}'

The preceding command returns the document we searched for. Now let's examine the following example:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe"
    }
  }
}'
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,****
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GOF2wR7spPlxvqmHY",
            "_score": 0.30685282,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

As you can see, our document was returned even though we did not specify the exact value (please note that we still used an uppercase letter), because the match_phrase_prefix query analyzes the text and creates a phrase query out of the analyzed text, allowing prefix matches on the last term.
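Because the firstname field is not_analyzed, the query text is left untouched during this analysis step, so case still matters here. A lowercase prefix should therefore find nothing (a sketch; we would expect zero hits):

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "joe"
    }
  }
}'

joe is not a prefix of the single indexed token Joe Jeffers, so no document is returned.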
