Elasticsearch mapping
We have seen in the previous chapter how an index can have one or more types and each type has its own mapping.
Mappings are like database schemas that describe the fields or properties that documents of a type may have: for example, the data type of each field, such as string, integer, or date, and how these fields should be indexed and stored by Lucene.
One more thing to consider is that, unlike a database, you cannot have two fields with the same name but different data types in the same index; otherwise, you will break doc_values, and sorting and searching on that field will also be broken. For example, create myIndex and index a document inside the type1 document type with a valid field that contains an integer value:
curl -XPOST 'localhost:9200/myIndex/type1/1' -d '{"valid":5}'
Now, index another document inside type2 in the same index with the valid field. This time, the valid field contains a string value:
curl -XPOST 'localhost:9200/myIndex/type2/1' -d '{"valid":"40"}'
In this scenario, sorts and aggregations on the valid field are broken, because both documents are indexed into the same field name, valid, in the same index, but with conflicting types.
Document metadata fields
When a document is indexed into Elasticsearch, there are several metadata fields maintained by Elasticsearch for that document. The following are the most important metadata fields you need to know in order to control your index structure:
- _id: This is a unique identifier for the document. It can be auto-generated, set explicitly while indexing, or configured in the mapping to be parsed automatically from a field.
- _source: This is a special field generated by Elasticsearch that contains the actual JSON data of the document. Whenever we execute a search request, the _source field is returned by default. It is enabled by default, but it can be disabled using the following configuration while creating a mapping:
PUT index_name/_mapping/doc_type {"_source": {"enabled": false}}
Note
Be careful while disabling the _source field, as there are lots of features you can't use with it disabled. For example, highlighting is dependent on the _source field. With _source disabled, documents can only be searched, not returned, and they can be neither re-indexed nor updated.
- _all: When a document is indexed, the values from all of its fields are indexed separately and also together in a special field called _all. Elasticsearch does this by default so that a search request can match the content of a document without specifying a field name. It comes with an extra storage cost and should be disabled if searches are always made against specific field names. To disable it completely, use the following configuration in your mapping file:
PUT index_name/_mapping/doc_type {"_all": {"enabled": false}}
However, there are some cases where you do not want all the fields to be included in _all, but only certain ones. You can achieve this by setting the include_in_all parameter to false for the fields to be excluded:
PUT index_name/_mapping/doc_type { "_all": { "enabled": true }, "properties": { "first_name": { "type": "string", "include_in_all": false }, "last_name": { "type": "string" } } }
In the preceding example, only the last name will be included inside the _all field.
- _ttl: There are some cases, such as logs, when you want documents to be deleted from the index automatically. The _ttl (time to live) field provides the option to set when documents should be deleted automatically. By default, it is disabled and can be enabled using the following configuration:
PUT index_name/_mapping/doc_type { "_ttl": { "enabled": true, "default": "1w" } }
Inside the default field, you can use time units such as m (minutes), d (days), w (weeks), M (months), and ms (milliseconds). The default unit is milliseconds.
Note
Please note that the _ttl field has been deprecated since the Elasticsearch 2.0.0-beta2 release and might be removed in upcoming versions. Elasticsearch will provide a replacement for this field in future versions.
- dynamic: There are some scenarios in which you want to restrict dynamic fields from being indexed and only allow the fields that you have defined in the mapping. This can be done by setting the dynamic property to strict, in the following way:
PUT index_name/_mapping/doc_type { "dynamic": "strict", "properties": { "first_name": { "type": "string" }, "last_name": { "type": "string" } } }
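To see the strict mode in action, the following is a minimal sketch (the index and type names are hypothetical). Indexing a document that contains a field not declared in the mapping is rejected with a StrictDynamicMappingException instead of the mapping being silently extended:
curl -XPOST 'localhost:9200/index_name/doc_type/1' -d '{"first_name": "John", "middle_name": "K"}'
Since middle_name is not defined in the mapping, Elasticsearch refuses to index this document.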
Data types and index analysis options
Lucene provides several options to configure each field separately, depending on the use case. These options differ slightly based on the data type of the field.
Configuring data types
Data types in Elasticsearch are segregated in two forms:
- Core types: These include string, number, date, boolean, and binary
- Complex data types: These include arrays, objects, multi fields, geo points, geo shapes, nested, attachment, and IP
Note
Since Elasticsearch understands JSON, all the data types supported by JSON are also supported in Elasticsearch, along with some extra data types such as geopoint and attachment.
The following are the common attributes for the core data types:
- index: The values can be analyzed, no, or not_analyzed. If set to analyzed, the text for that field is analyzed using the specified analyzer. If set to no, the values for that field are not indexed and thus are not searchable. If set to not_analyzed, the values are indexed as is; for example, Elasticsearch Essentials will be indexed as a single term, and thus only exact matches can be done while querying.
- store: This takes the value yes or no (the default is no, but _source is an exception). Apart from indexing the values, Lucene has an option to store the original data, which comes in handy when you want to extract the data of just that field. However, since Elasticsearch already has an option to store all the data inside the _source field, it is usually not required to store individual fields in Lucene.
- boost: This defaults to 1. It specifies the importance of the field inside the document.
- null_value: Using this attribute, you can set a default value to be indexed if a document contains a null value for that field. The default behavior is to omit fields that contain null.
Note
One should be careful while configuring default values for null. The default value should always be of the type corresponding to the data type configured for that field, and it should not be a real value that might appear in some other document.
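As a quick illustration, the attributes above can be combined in a single field definition. The following is a minimal sketch with a hypothetical status field, using the pre-2.x string type that this chapter is based on:
{ "status": { "type": "string", "index": "not_analyzed", "store": "yes", "boost": 2, "null_value": "unknown" } }
Here, a document with a null status is indexed with the term unknown, which, as the preceding note advises, must not be a value that can legitimately appear in the data.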
Let's start with the configuration of the core as well as complex data types.
String
In addition to the common attributes, the following attributes can also be set for string-based fields:
- term_vector: This property defines whether the Lucene term vectors should be calculated for the field or not. The values can be no (the default), yes, with_offsets, with_positions, and with_positions_offsets.
Note
A term vector is the list of terms in a document along with their number of occurrences in that document. Term vectors are mainly used for highlighting and more_like_this (searching for similar documents) queries. A very nice blog post on term vectors has been written by Adrien Grand, which can be read here: http://blog.jpountz.net/post/41301889664/putting-term-vectors-on-a-diet.
- omit_norms: This takes the value true or false. The default is false. When this attribute is set to true, it disables the Lucene norms calculation for the field (and thus you can't use index-time boosting).
- analyzer: A globally defined analyzer name for the index that is used for both indexing and searching. It defaults to the standard analyzer, but can also be controlled per field, which we will see in the upcoming section.
- index_analyzer: The name of the analyzer used at indexing time. This is not required if the analyzer attribute is set.
- search_analyzer: The name of the analyzer used at search time. This is not required if the analyzer attribute is set.
- ignore_above: This specifies the maximum number of characters for the field. If the character count exceeds the specified limit, the field won't be indexed. This setting is mainly used for not_analyzed fields. Lucene has a term byte-length limit of 32,766 bytes, which means a single term cannot contain more than 10,922 characters (one UTF-8 character can take up to 3 bytes).
An example mapping for two string fields, contents and author_name, is as follows:
{ "contents": { "type": "string", "store": "yes", "index": "analyzed", "include_in_all": false, "analyzer": "simple" }, "author_name": { "type": "string", "index": "not_analyzed", "ignore_above": 50 } }
Number
The number data types are byte, short, integer, long, float, and double. Fields that contain numeric values need to be configured with the appropriate data type. Please go through the storage requirements of each numeric type before deciding which one you should actually use; if a field will never hold large values, choosing long instead of integer is a waste of space.
An example of configuring numeric fields is shown here:
{"price":{"type":"float"},"age":{"type":"integer"}}
Date
Working with dates usually comes with some extra challenges because there are so many date formats available, and you need to decide on the correct format while creating a mapping. Date fields usually take two parameters: type and format. However, you can use other analysis options too.
Elasticsearch provides a list of formats to choose from depending on the date format of your data. You can visit the following URL to learn more about it: http://www.elasticsearch.org/guide/reference/mapping/date-format.html.
The following is an example of configuring date fields:
{ "creation_time": { "type": "date", "format": "YYYY-MM-dd" }, "updation_time": { "type": "date", "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd" }, "indexing_time": { "type": "date", "format": "date_optional_time" } }
Please note the different date formats used for the different date fields in the preceding mapping. The updation_time field contains a special format with the || operator, which specifies that it will work for both the yyyy/MM/dd HH:mm:ss and yyyy/MM/dd date formats. Elasticsearch uses date_optional_time as the default date parsing format, which is an ISO datetime parser.
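The || fallback is easy to verify. Against the preceding mapping, both of the following documents are indexed successfully because the updation_time format declares both patterns (the index and type names are hypothetical):
curl -XPOST 'localhost:9200/index_name/doc_type/1' -d '{"updation_time": "2015/11/20 13:05:00"}'
curl -XPOST 'localhost:9200/index_name/doc_type/2' -d '{"updation_time": "2015/11/20"}'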
Boolean
While indexing data, a Boolean type field can contain only two values, true or false, and it can be configured in a mapping in the following way:
{"is_verified":{"type":"boolean"}}
Arrays
By default, all the fields in Lucene, and thus in Elasticsearch, are multivalued, which means that they can store multiple values. In order to send such fields for indexing to Elasticsearch, we use the JSON array type, which is enclosed within opening and closing square brackets []. Some considerations need to be taken care of while working with array data types:
- All the values of an array must be of the same data type.
- If the data type of a field is not explicitly defined in a mapping, then the data type of the first value inside the array is used as the type of that field.
- The order of the elements is not maintained inside the index, so do not get upset if you do not find the desired results while querying. However, this order is maintained inside the _source field, so when you return the data after querying, you get the same JSON as you had indexed.
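For example, a hypothetical tags field that holds multiple values needs no special mapping; it is declared as a plain string field and simply receives a JSON array at indexing time:
{"tags": {"type": "string"}}
curl -XPOST 'localhost:9200/index_name/doc_type/1' -d '{"tags": ["search", "analytics", "elasticsearch"]}'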
Objects
JSON documents are hierarchical in nature, which allows them to define inner objects. Elasticsearch completely understands the nature of these inner objects and can map them easily by providing query support for their inner fields.
Note
Once a field is declared as an object type, you can't put any other type of data into it. If you try to do so, Elasticsearch will throw an exception.
{ "features": { "type": "object", "properties": { "name": { "type": "string" }, "sub_features": { "dynamic": false, "type": "object", "properties": { "name": { "type": "string" }, "description": { "type": "string" } } } } } }
If you look carefully at the previous mapping, there is a features root object field that contains two properties: name and sub_features. Further, sub_features, which is an inner object, also contains two properties, name and description, but it has an extra setting: dynamic: false. Setting this property to false for an object changes the dynamic behavior of Elasticsearch, and you cannot index any fields inside that object apart from the ones declared inside the mapping. Therefore, you can index more fields in the future inside the features object, but not inside the sub_features object.
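A document matching the preceding mapping could be indexed as follows (the index and type names are hypothetical). Note that with dynamic set to false, an undeclared field inside sub_features would not cause an error; it would simply be ignored at indexing time rather than being mapped and made searchable:
curl -XPOST 'localhost:9200/index_name/doc_type/1' -d '{"features": {"name": "mapping", "sub_features": {"name": "dynamic mapping", "description": "controls how new fields are added"}}}'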
Indexing the same field in different ways
If you need to index the same field in different ways, the following is the way to create a mapping for it. You can define as many fields with the fields parameter as you want:
{ "name": { "type": "string", "fields": { "raw": { "type": "string", "index": "not_analyzed" } } } }
With the preceding mapping, you just need to index data into the name field. Elasticsearch will index the data into the name field using the standard analyzer, which can be used for a full-text search, and into the name.raw field without analyzing the tokens, which can be used for exact term matching. You do not have to send data to the name.raw field explicitly.
Note
Please note that this option is only available for the core data types, not for objects.
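To see the difference between the two variants, consider the following sketch (the index name is hypothetical): the first query performs a full-text match against the analyzed name field, while the second requires an exact, unanalyzed match against name.raw:
curl -XGET 'localhost:9200/index_name/_search' -d '{"query": {"match": {"name": "elasticsearch essentials"}}}'
curl -XGET 'localhost:9200/index_name/_search' -d '{"query": {"term": {"name.raw": "Elasticsearch Essentials"}}}'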
Putting mappings in an index
There are two ways of putting mappings inside an index:
- Using a POST request at the time of index creation:
curl -XPOST 'localhost:9200/index_name' -d '{ "settings": { "number_of_shards": 1, "number_of_replicas": 0 }, "mappings": { "type1": { "_all": { "enabled": false }, "properties": { "field1": { "type": "string", "index": "not_analyzed" } } }, "type2": { "properties": { "field2": { "type": "string", "index": "analyzed", "analyzer": "keyword" } } } } }'
- Using a PUT request with the _mapping API. The index must exist before creating a mapping in this way:
curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d '{ "_all": { "enabled": false }, "properties": { "field1": { "type": "integer" } } }'
The mappings for the fields are enclosed inside the properties object, while all the metadata fields appear outside the properties object.
Note
It is highly recommended to use the same configuration for the same field names across different types and indexes in a cluster. For instance, the data types and analysis options must be the same; otherwise, you will get unexpected results.
Viewing mappings
Mappings can be viewed easily with the _mapping API:
- To view the mapping of all the types in an index, use the following request:
curl -XGET localhost:9200/index_name/_mapping?pretty
- To view the mapping of a single type, use the following request:
curl -XGET localhost:9200/index_name/type_name/_mapping?pretty
Updating mappings
If you want to add mapping for some new fields in the mapping of an existing type, or create a mapping for a new type, you can do it later using the same _mapping API.
For example, to add a new field in our existing type, we only need to specify the mapping for the new field in the following way:
curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d '{ "properties": { "new_field_name": { "type": "integer" } } }'
Please note that the mapping definition of an existing field cannot be changed.
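For example, since new_field_name was just mapped as an integer, an attempt to re-map it as a string like the following is rejected (in the 1.x releases, typically with a MergeMappingException); your only options are to add new fields or to re-index all the data into a new index with the desired mapping:
curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d '{ "properties": { "new_field_name": { "type": "string" } } }'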
Tip
Dealing with long JSON data to be sent in the request body
While creating indexes with settings, custom analyzers, and mappings, you must have noticed that all the JSON configurations are passed using -d, which stands for data and is used to send a request body. While creating settings and mappings, it often happens that the JSON data becomes so large that it is difficult to handle on the command line using curl. The easy solution is to create a file with the .json extension and provide the path of that file while working with those settings or mappings. The following is an example command:
curl -XPUT 'localhost:9200/index_name/_settings' -d @path/setting.json
curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d @path/mapping.json