Analysis

We mentioned earlier that all of Apache Lucene's data is stored in an inverted index. Elasticsearch must transform documents into this form before it can respond to search requests. The process of transforming this data is called analysis.

Elasticsearch has an index analysis module that maps to the Lucene Analyzer. In general, analyzers are composed of a single Tokenizer and zero or more TokenFilters.

Note

Analysis modules and analyzers will be discussed in depth in Chapter 4, Analysis and Analyzers.

Elasticsearch provides many character filters, tokenizers, and token filters. For example, a character filter may be used to strip out HTML markup, and a token filter may be used to modify tokens (for example, by lowercasing them). You can combine them to create custom analyzers, or you can use one of the built-in analyzers.
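
For instance, a custom analyzer combining a character filter, a tokenizer, and a token filter can be defined through the index settings. The following is a minimal sketch, assuming hypothetical names (a test index and a my_analyzer analyzer):

curl -XPUT localhost:9200/test -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

Here, the html_strip character filter removes HTML markup before tokenization, the standard tokenizer splits the text into terms, and the lowercase token filter modifies the resulting tokens.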

A good understanding of the analysis process is very important for improving the user's search experience and the relevance of search results, because Elasticsearch (actually, Lucene) uses analyzers both at indexing time and at query time.

Tip

It is crucial to remember that not all Elasticsearch queries are analyzed.
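
For example, the term query is not analyzed, whereas the match query is. The following is a minimal sketch, assuming a hypothetical test index whose title field was indexed with the standard analyzer and contains the text Joe:

curl -XGET localhost:9200/test/_search?pretty -d '{
  "query": {
    "term": {
      "title": "Joe"
    }
  }
}'

This query returns no hits because the text was lowercased to joe at indexing time and the term query looks for the exact term Joe. The same value searched with a match query would be analyzed (lowercased) first and would match.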

Now let's examine the importance of the analyzer for relevant search results with a simple scenario:

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GIEQeR7spPlxvqlud","_version":1,"created":true}

We indexed an employee named Joe Jeffers Hoffman, 30 years old. Now let's search for the employees named Joe in the company index:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 68,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.19178301,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GIEQeR7spPlxvqlud",
            "_score": 0.19178301,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

All string type fields in the company index will be analyzed by the standard analyzer because the employee type was created with dynamic mapping.

The standard analyzer is the default analyzer that Elasticsearch uses. It removes most punctuation and splits the text on word boundaries, as defined by the Unicode Consortium.

Note

If you want more information about the Unicode text segmentation rules, please refer to http://www.unicode.org/reports/tr29/.

In this case, Joe Jeffers would be split into two tokens (joe and jeffers). To see how the standard analyzer works, run the following command:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Joe Jeffers'
{
  "tokens" : [ {
    "token" : "joe",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "jeffers",
    "start_offset" : 4,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

We searched for joe, and the document containing Joe Jeffers was returned because the standard analyzer had split the text on word boundaries and converted it to lowercase. The standard analyzer is built using the Lower Case Token Filter along with other filters (for example, the Standard Token Filter and the Stop Token Filter).
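
You can reproduce this composition directly with the _analyze API. The following is a minimal sketch, assuming a version in which _analyze accepts the tokenizer and filters parameters:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase&pretty' -d 'Joe Jeffers'

This should produce the same joe and jeffers tokens as the standard analyzer, showing a tokenizer and a token filter working together.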

Now let's examine the following example:

curl -XDELETE localhost:9200/company
{"acknowledged":true}

curl -XPUT localhost:9200/company -d '{
  "mappings": {
    "employee": {
      "properties": {
        "firstname": {"type": "string", "index": "not_analyzed"}
      }
    }
  }
}'
{"acknowledged":true}

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GOF2wR7spPlxvqmHY","_version":1,"created":true}

We deleted the company index created by dynamic mapping and recreated it with explicit mapping. This time, we used the not_analyzed value of the index option on the firstname field of the employee type. This means that the field is not analyzed at indexing time; its value is indexed exactly as specified.
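
To verify that the explicit mapping was applied, you can retrieve it with the _mapping endpoint:

curl -XGET localhost:9200/company/_mapping?pretty

The response should show the firstname field with the not_analyzed setting. Now let's run the same match query again: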

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 12,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

As you can see, Elasticsearch did not return a result for the match query because the firstname field is configured with the not_analyzed value. Therefore, Elasticsearch did not use an analyzer during indexing; the indexed value was exactly as specified. In other words, Joe Jeffers was indexed as a single token. Unless otherwise indicated, the match query uses the default search analyzer. Therefore, if we want the match query to return the document without changing the analyzer in this example, we need to specify the exact value (paying attention to uppercase/lowercase):

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match" : {
        "firstname": "Joe Jeffers"
    }
  }
}'

The preceding command will return the document we searched for.
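
Since the indexed value is a single token, a term query with the exact value behaves in the same way. The following is a minimal sketch:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "term": {
      "firstname": "Joe Jeffers"
    }
  }
}'

Now let's examine the following example: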

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe"
    }
  }
}'
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,****
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GOF2wR7spPlxvqmHY",
            "_score": 0.30685282,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

As you can see, the document we searched for was returned although we did not specify the exact value (note that we still used uppercase letters). This is because the match_phrase_prefix query analyzes the text and creates a phrase query out of the analyzed text, and it allows prefix matches on the last term; here, Joe is a prefix of the single indexed term Joe Jeffers.
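
If the prefix could expand to many terms, the expansion can be limited with the max_expansions parameter of the match_phrase_prefix query. The following is a minimal sketch:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": {
        "query": "Joe",
        "max_expansions": 10
      }
    }
  }
}'

Keeping max_expansions low is a common way to control the cost of prefix expansion on large indices.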
