  • Elasticsearch Blueprints
  • Vineeth Mohan
  • 1232 words
  • 2021-07-16 13:39:32

Data modeling in Elasticsearch

Data is modeled in documents in Elasticsearch. This means that a single item, whether it is an entity such as a company, a country, or a product, has to be modeled as a single document. A document can contain any number of fields and values. This information is modeled around the JSON format, which lets us express the structure of our data using arrays, objects, or simple values.

Note

Elasticsearch is schemaless, which means that you can index a document before you create its index or mapping. Elasticsearch then guesses the best type for each field from its value.

Inherently, JSON supports types such as string, number, object, array, and Boolean. Elasticsearch adds various other types on top of these, such as date, IP, and so on. You will find the list of such supported types in the following table. Storing date values in a date type lets us run date-specific operations, such as date range queries and date aggregations. This can't be done reliably if we use the default string type. The same goes for other custom types, such as IP, geo_point, and so on.
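To see why a string type falls short for date ranges, here is a minimal shell sketch that runs without Elasticsearch. It assumes three made-up dates written as dd-MM-yyyy (a format picked only for this illustration) and orders them once as plain strings and once by year, month, and day. The two orderings disagree, and a range check on the string values would go wrong in the same way. With a format such as yyyy-MM-dd the two orderings happen to coincide, which is why the problem is easy to miss.

```shell
#!/usr/bin/env bash
# Three manufacture dates written as dd-MM-yyyy (hypothetical sample data).
dates='15-01-2014
02-03-2014
01-12-2013'

# Lexicographic ordering: compares character by character, as a string field would.
string_order=$(printf '%s\n' "$dates" | LC_ALL=C sort | paste -sd' ' -)

# Chronological ordering: compares year, then month, then day as numbers,
# which is what comparing parsed date values amounts to.
date_order=$(printf '%s\n' "$dates" | LC_ALL=C sort -t- -k3,3n -k2,2n -k1,1n | paste -sd' ' -)

echo "string order: $string_order"
echo "date order:   $date_order"
```

In the string ordering, 02-03-2014 comes before 15-01-2014 even though January precedes March.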

For fields like these, you need to let Elasticsearch know what type of data the field will hold. We saw how to pass the type configuration to an Elasticsearch instance. Besides that, you can use various other configuration parameters to fine-tune your overall search performance. We will see a few of them in due course. Learning the rest is worthwhile too, as they are useful when you tune your search performance.

Imagine that you want to build a shopping application. The first step in building such an application is to get your product information indexed. Here, it is best to model a document around a single product. Hence, a single document represents all the data associated with a product, such as its name, description, date of manufacture, and so on.

First, let's create the index:

curl -X PUT "http://localhost:9200/products" -d '{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1
        }
    }
}'

Here, we assume that the Elasticsearch instance runs on the local machine, that is, localhost. We create an index called products with one shard and one replica. With one shard, our data won't be partitioned; a single shard holds all of it. This also means that, in the future, the index can't be scaled out across new machines added to the cluster. One replica makes sure that a copy of the shard is maintained elsewhere too.
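The partitioning idea can be sketched in a few lines of shell. Elasticsearch picks the shard for a document by hashing a routing value (the document ID by default) and taking it modulo the number of primary shards. The sketch below imitates that with cksum, which is only a stand-in for Elasticsearch's own hash function; the shard_for helper and the product IDs are made up for this illustration.

```shell
#!/usr/bin/env bash
# shard_for ID SHARDS: pick a shard as hash(ID) modulo the number of primary shards.
# cksum is only a stand-in hash here; Elasticsearch uses its own hash function.
shard_for() {
    local id=$1 shards=$2 hash
    hash=$(printf '%s' "$id" | cksum | cut -d' ' -f1)
    echo $(( hash % shards ))
}

# With a single primary shard, every document lands on shard 0.
# With three shards, the same IDs would be spread over shards 0 to 2.
for id in product-1 product-2 product-3; do
    echo "$id -> shard $(shard_for "$id" 1) of 1, shard $(shard_for "$id" 3) of 3"
done
```

Because the shard number depends on the shard count, changing the count would send existing documents to different shards. This is one reason the number of shards has to be chosen carefully when the index is created.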

Note

More shards, when distributed across different hardware, increase the index/write throughput. More replicas increase the search/read throughput.

Now, let's make the mapping.

Here, products is the index and product is the type:

# productType: setting "index" to "not_analyzed" asks Elasticsearch not to
# analyze the string. This is required to do aggregation effectively.
# imageURL: as we won't normally search URLs, "index" is set to "no". This
# means that the field is not searchable but retrievable.
curl -X PUT "http://localhost:9200/products/product/_mapping" -d '{
    "product":{
        "properties":{
            "name":{
                "type":"string"
            },
            "description":{
                "type":"string"
            },
            "dateOfManufactoring":{
                "type":"date",
                "format":"yyyy-MM-dd"
            },
            "price":{
                "type":"long"
            },
            "productType":{
                "type":"string",
                "include_in_all":"false",
                "index":"not_analyzed"
            },
            "totalBuy":{
                "type":"long",
                "include_in_all":"false"
            },
            "imageURL":{
                "type":"string",
                "include_in_all":"false",
                "index":"no"
            }
        }
    }
}'

Here, we modeled the information on a single product as a document and created various fields to hold that information. Let's see what these fields are and how we treat them:

  • name: This field stores the name of our product. We should be able to search by this name even if we provide only a single word from it. So, if the name is Lenovo laptops, this document should match even if the user gives only the word Lenovo. Hence, the value has to go through a process called analysis, in which the tokens that represent the string are extracted. We will talk about this in detail later. For now, note that this process happens by default unless you configure otherwise.
  • description: This field holds the description of the product and should be treated the same as the name field.
  • dateOfManufactoring: This is the date on which the product was manufactured. If we don't declare this field as a date, it may be treated as a string. The problem with that is that a range selection on a string field compares lexicographical values (that is, values in alphabetical or dictionary order) rather than date values, which gives wrong results. This means that a search for values between two dates won't give accurate results on a string type. Hence, we need to declare this field as a date, and Elasticsearch then stores its value in the Unix epoch format. But wait! There are numerous date formats. How will Elasticsearch know the right format and parse out the right date value? For that, you provide the format in the format attribute. Using this format, the date string is parsed and the epoch value is computed. All queries and aggregations then run against this parsed date value, and hence we get the correct results.
  • price: This field holds the price as a number.
  • productType: This field stores the product type, such as Laptop, Tab, and so on, as a string. However, this string is not broken into tokens, so that aggregation results make sense. Note that when we make this field not_analyzed, it's not searchable on a token level. That is, if the product type is Large Laptop, a search for the word Laptop won't give you a match; only the exact value Large Laptop will. However, with this approach, aggregation works neatly.
  • totalBuy: This is a field we maintain to track the number of items bought for this product.
  • imageURL: We store the image of this product in an external image database and provide the URL to access it. As we are not going to search or aggregate on this field, it's safe to disable indexing for it. This means that this field won't be searchable, but will be retrievable.
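The difference between the analyzed name field and the not_analyzed productType field can be imitated in plain shell. This is only a rough sketch: the analyze function below lowercases the text and splits it on whitespace, which is a crude stand-in for real analysis, and all three helper names are made up for this illustration.

```shell
#!/usr/bin/env bash
# analyze TEXT: crude stand-in for analysis; lowercases the text and
# splits it on whitespace, printing one token per line.
analyze() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n'
}

# matches_analyzed FIELD TERM: succeeds if the lowercased TERM is one of FIELD's tokens.
matches_analyzed() {
    analyze "$1" | grep -qxF -- "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
}

# matches_exact FIELD TERM: succeeds only if TERM equals the whole stored value,
# which is how a not_analyzed field behaves.
matches_exact() {
    [ "$1" = "$2" ]
}

matches_analyzed "Lenovo laptops" "Lenovo" && echo "name: 'Lenovo' matches"
matches_exact "Large Laptop" "Laptop" || echo "productType: 'Laptop' does not match"
matches_exact "Large Laptop" "Large Laptop" && echo "productType: 'Large Laptop' matches"
```

Keeping the whole value as one term is also what makes aggregation tidy: Large Laptop is counted as a single bucket instead of being split into Large and Laptop.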

We have already learned how to index data in Elasticsearch. Assume that you indexed the information you wished to. If you would like an overview of your index, you can install Elasticsearch-head as a simple frontend to it. This is what a sample search result looks like after indexing the information:

{

    "took": 87,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 1,
        "hits": [
            {
                "_index": "products",
                "_type": "product",
                "_id": "CD5BR19RQ3mD3MdNhtCq9Q",
                "_score": 1,
                "_source": {
                    "name": "Lenovo A1000L Tablet",
                    "description": "Lenovo Ideatab A1000 Tablet (4GB, WiFi, Voice Calling), Black",
                    "dateOfManufactoring": "2014-01-01",
                    "price": 6699,
                    "totalBuy": 320,
                    "productType": "Tablet",
                    "imageURL": "www.imageDB.com/urlTolenovoTablet.jpg"
                }
            }
        ]
    }

}

The greatest advantage of using Elasticsearch is the level of control it gives you over your data. The flexible schema lets you define your own ways to deal with your information. So, you have complete freedom to define the fields and types that a document (in our case, a particular product) holds.
