- Elasticsearch Blueprints
- Vineeth Mohan
- 2021-07-16 13:39:32
Data modeling in Elasticsearch
Data is modeled in documents in Elasticsearch. This means that a single item, whether it is an entity such as a company, a country, or a product, has to be modeled as a single document. A document can contain any number of fields and values associated with it. This information is modeled around the JSON format, which lets us express the structure of our data using arrays, objects, or simple values.
Note
Elasticsearch is schemaless, which means that you can index a document to Elasticsearch before you create its index or mapping. Elasticsearch guesses the best type for each field value.
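To make the idea of type guessing concrete, here is a minimal Python sketch that imitates it for a single document. It is only an approximation under stated assumptions: the function name guess_field_type is hypothetical, the inStock field is invented for illustration, only plain YYYY-MM-DD strings are treated as dates, and arrays are not handled. It is not Elasticsearch's actual detection logic.

```python
import json
from datetime import datetime


def guess_field_type(value):
    """Return a rough guess of the field type for one JSON value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            # Simplification: only plain YYYY-MM-DD strings count as dates.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    # Arrays and null are deliberately left out of this sketch.
    return "unknown"


# "inStock" is a hypothetical field added only to show the boolean case.
doc = json.loads(
    '{"name": "Lenovo A1000L Tablet", "totalBuy": 320, '
    '"dateOfManufactoring": "2014-01-01", "inStock": true}'
)
guessed = {field: guess_field_type(value) for field, value in doc.items()}
print(guessed)
```

A guess like this works for simple cases, but it cannot know that a string should stay unanalyzed or that a date uses an unusual format, which is why an explicit mapping is usually the safer choice.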
Inherently, JSON supports formats such as string, number, object, array, and even Boolean. Elasticsearch adds various other formats on top of these, such as date, IP, and so on. You will find the list of such supported types in the following table. Using the date type to store date values lets us do effective date-specific operations, such as date range queries and aggregations. This can't be done if we use the default string type. The same goes for other custom types, such as IP, geo_point, and so on.
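The following Python sketch illustrates why a string type is a poor fit for dates. It assumes, purely for illustration, that dates arrive in a dd-MM-yyyy format; the sample values and the helper name to_epoch_millis are made up. Sorting the raw strings orders them character by character, while sorting by the parsed epoch value gives the chronological order.

```python
from datetime import datetime, timezone

# Hypothetical sample values in dd-MM-yyyy format.
raw_dates = ["01-02-2014", "15-01-2014", "02-12-2013"]


def to_epoch_millis(text, fmt="%d-%m-%Y"):
    """Parse a date string with the given format into epoch milliseconds (UTC)."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


# Treated as strings: compared character by character.
as_strings = sorted(raw_dates)

# Treated as dates: compared by the parsed epoch value.
as_dates = sorted(raw_dates, key=to_epoch_millis)

print(as_strings)
print(as_dates)
```

In the string ordering, 2 December 2013 lands between two dates from 2014, so a range filter built on string comparison would include or exclude the wrong documents.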
You need to let Elasticsearch know what type of data each field will hold. We saw how to pass the type configuration to an Elasticsearch instance. Besides that, you can use various other configurations to fine-tune your overall search performance. We will see only a few of these configurations in due course. However, learning all of these configuration parameters is worthwhile and will be useful when you try to fine-tune your search performance.
Imagine that you need to build a shopping application. The first step in building such an application is to get your product information indexed. Here, it would be best to model a document around a single product. Hence, a single document represents all the data associated with a product, such as its name, description, date of manufacture, and so on.
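As a rough sketch of what "one document per product" means, the Python snippet below builds a single product as a dictionary and serializes it to JSON. The field names follow the mapping and sample search result given later in this section; the variable names are illustrative only.

```python
import json

# One product, modeled as one document. All data about the product
# lives together in this single JSON object.
product = {
    "name": "Lenovo A1000L Tablet",
    "description": "Lenovo Ideatab A1000 Tablet (4GB, WiFi, Voice Calling), Black",
    "dateOfManufactoring": "2014-01-01",
    "price": 6699,
    "totalBuy": 320,
    "productType": "Tablet",
    "imageURL": "www.imageDB.com/urlTolenovoTablet.jpg",
}

# This JSON text is what would be sent as the body of an index request.
body = json.dumps(product, indent=2)
print(body)
```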
First, let's create the index:
curl -X PUT "http://localhost:9200/products" -d '{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
Here, we assume that the Elasticsearch instance runs on the local machine, that is, on localhost. We create an index called products with one shard and one replica. This means that our data won't be partitioned across shards; instead, a single shard will handle it. As a result, it won't be possible to scale this index out across new machines added to the cluster in the future. A replica count of one makes sure that a copy of the shard is maintained elsewhere too.
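The effect of the shard count can be sketched in a few lines of Python. Elasticsearch picks a shard by hashing a routing value (by default the document ID) and taking it modulo the number of primary shards. The sketch below uses zlib.crc32 as a stand-in hash, which is an assumption for illustration and not the hash function Elasticsearch uses; the document IDs and the function name pick_shard are also made up.

```python
import zlib


def pick_shard(doc_id, number_of_shards):
    """Map a document ID to a shard number.

    zlib.crc32 is only a stand-in hash for this sketch; it is not
    the hash function Elasticsearch itself uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# Hypothetical document IDs.
doc_ids = ["p1", "p2", "p3", "p4", "p5", "p6"]

# With a single shard, every document lands on shard 0.
one_shard = {pick_shard(doc_id, 1) for doc_id in doc_ids}

# With three shards, documents are spread over shards 0, 1, and 2.
three_shards = [pick_shard(doc_id, 3) for doc_id in doc_ids]

print(one_shard)
print(three_shards)
```

Because the shard number depends on the shard count, changing the count later would send existing IDs to different shards, which is why the number of shards is fixed when the index is created.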
Note
More shards, when distributed across different hardware, increase the index/write throughput. More replicas increase the search/read throughput.
Now, let's make the mapping. Here, products is the index and product is the type:
# productType: by setting the attribute index to not_analyzed, we are asking
# Elasticsearch not to analyze the string. This is required to do aggregation
# effectively.
# imageURL: as we won't normally search URLs, we are setting the index to no.
# This means that this field is not searchable but retrievable.
curl -X PUT "http://localhost:9200/products/product/_mapping" -d '{
  "product": {
    "properties": {
      "name": { "type": "string" },
      "description": { "type": "string" },
      "dateOfManufactoring": { "type": "date", "format": "YYYY-MM-dd" },
      "price": { "type": "long" },
      "productType": {
        "type": "string",
        "include_in_all": "false",
        "index": "not_analyzed"
      },
      "totalBuy": { "type": "long", "include_in_all": "false" },
      "imageURL": {
        "type": "string",
        "include_in_all": "false",
        "index": "no"
      }
    }
  }
}'
Here, we modeled the information on a single product as a document and created various fields to hold that information. Let's see what these fields are and how we treat them:
name: This field stores the name of our product. We should be able to search by this name even if we provide a single word from it. So, if the name is Lenovo laptops, this document should match even if the user gives only the word Lenovo. Hence, it has to go through a process called analysis, where tokens qualified to represent this string are selected. We will talk about this in detail later. However, we need to understand that this process happens by default, unless you configure otherwise.
description: This field holds the description of the product and should be treated the same as the name field.
dateOfManufactoring: This is the date on which this product was manufactured. Here, if we don't declare this field as a date, it would be assumed to be a string. The problem with this approach is that when we try to do range selection on this field, rather than looking at the date value, Elasticsearch looks at its lexicographical value (that is, its alphabetical or dictionary order), which will give us a wrong result. This means that a search between two dates won't give accurate results in the case of a string type. Hence, we need to declare this field as a date; Elasticsearch then stores its value in the Unix epoch format. But wait! There are numerous date formats. How will Elasticsearch understand the right format and parse out the right date value? For that, you need to provide the format as a format attribute. Using this format, the date string is parsed and the epoch value is computed. Furthermore, all queries and aggregations are resolved against this parsed date value, and hence we get accurate results.
price: This field has the price value as a number.
productType: This field stores the product type, such as Laptop, Tab, and so on, as a string. However, this string is not broken into tokens, so that aggregation results make sense. It has to be noted here that when we make this field not_analyzed, it's not searchable on a token level. What this means is that if the product type is Large Laptop, a search query for the word Laptop won't give you a match; only the exact value Large Laptop will give you a match. However, through this approach, aggregation works neatly.
totalBuy: This is a field maintained by us to track the number of items bought for this product.
imageURL: We store the image of this product in an external image database and provide the URL to access it. As we are not going to search or aggregate on this field, it's safe to disable the index for this field. This means that this field won't be searchable, but will be retrievable.
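The difference between an analyzed field such as name and a not_analyzed field such as productType can be sketched in Python. The analyze function below is a deliberately crude stand-in (lowercase, then split on non-alphanumeric characters) and not Elasticsearch's real analysis chain; the helper names and the sample product types are assumptions for illustration.

```python
import re
from collections import Counter


def analyze(text):
    """Crude stand-in for analysis: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def index_terms(value, analyzed):
    """Terms stored for a field value: tokens if analyzed, else the whole value."""
    return analyze(value) if analyzed else [value]


def matches(query, value, analyzed):
    """Check whether a single-term query hits the stored terms."""
    # The query goes through the same analysis only for analyzed fields.
    term = query.lower() if analyzed else query
    return term in index_terms(value, analyzed)


# name is analyzed: a single word from the name is enough to match.
name_hit = matches("Lenovo", "Lenovo laptops", analyzed=True)

# productType is not_analyzed: only the exact value matches.
partial_hit = matches("Laptop", "Large Laptop", analyzed=False)
exact_hit = matches("Large Laptop", "Large Laptop", analyzed=False)

# Counting terms per field, as a terms aggregation would.
product_types = ["Large Laptop", "Large Laptop", "Tab"]
buckets_not_analyzed = Counter(
    term for value in product_types for term in index_terms(value, False)
)
buckets_analyzed = Counter(
    term for value in product_types for term in index_terms(value, True)
)

print(name_hit, partial_hit, exact_hit)
print(dict(buckets_not_analyzed))
print(dict(buckets_analyzed))
```

With the not_analyzed terms, the buckets are the product types themselves. With analyzed terms, Large Laptop would be split into separate large and laptop buckets, which is rarely what a product-type breakdown should show.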
We have already learned how to index data in Elasticsearch. Assume that you have indexed the information you wished to. If you would like to see an overview of your index, you can install Elasticsearch-head as a simple frontend to your index. This is what a sample search result looks like after indexing the information:
{
  "took": 87,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      {
        "_index": "products",
        "_type": "product",
        "_id": "CD5BR19RQ3mD3MdNhtCq9Q",
        "_score": 1,
        "_source": {
          "name": "Lenovo A1000L Tablet",
          "description": "Lenovo Ideatab A1000 Tablet (4GB, WiFi, Voice Calling), Black",
          "dateOfManufactoring": "2014-01-01",
          "price": 6699,
          "totalBuy": 320,
          "productType": "Tablet",
          "imageURL": "www.imageDB.com/urlTolenovoTablet.jpg"
        }
      }
    ]
  }
}
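A client application would typically read such a response by walking the hits array and pulling fields out of each _source. The Python sketch below does this for a trimmed copy of the sample response; the variable names are illustrative, and only a few of the fields are kept to stay short.

```python
import json

# Trimmed copy of the sample search response shown above.
response_text = """
{
  "took": 87,
  "timed_out": false,
  "hits": {
    "total": 4,
    "hits": [
      {
        "_index": "products",
        "_type": "product",
        "_id": "CD5BR19RQ3mD3MdNhtCq9Q",
        "_source": {
          "name": "Lenovo A1000L Tablet",
          "totalBuy": 320,
          "productType": "Tablet"
        }
      }
    ]
  }
}
"""

response = json.loads(response_text)

# hits.total is the number of matching documents in the index;
# hits.hits holds only the documents returned in this response.
total_matches = response["hits"]["total"]
returned = response["hits"]["hits"]

# The original document is available under _source for each hit.
summaries = [
    (hit["_id"], hit["_source"]["name"], hit["_source"]["productType"])
    for hit in returned
]

print(total_matches, len(returned))
print(summaries)
```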
The greatest advantage of using Elasticsearch is the level of control it gives you over your data. The flexible schema lets you define your own ways to deal with your information. So, you have complete freedom to define the fields and the types that your document (in our case, a particular product) will hold.