- Elasticsearch Essentials
- Bharvi Dixit
- 2021-07-16 09:33:17
Text search
Searching is broadly divided into two types: exact term search and full text search. An exact term search looks for exact values, for example a named entity such as the name of a person, location, or organization, or a date. These searches are easier to perform because the search engine only has to decide, yes or no, whether a document contains the term, and then return the documents that do.
Full text search is different and more challenging. It refers to searching within text fields, where the text can be structured or unstructured. The text is written in a human (natural) language, which is very hard for a machine to understand well enough to return relevant results. The following are some examples of full text searches:
- Find all the documents with the term "search" in the title or content fields, and return the results so that matches in the title get a higher score
- Find all the tweets in which people are talking about terrorism and killing and return the results sorted by the tweet creation time
While doing these kinds of searches, we not only want relevant results but also expect that the search for a keyword matches all of its synonyms, root words, and spelling mistakes. For example, terrorism should match terorism and terror, while killing should match kills, kill, and killed.
To serve all these queries, text-based fields go through an analysis phase before indexing, and the inverted indexes are built from the output of this analysis. At query time, the same analysis process is applied to the terms sent within the query, so that they match the terms stored in the inverted indexes.
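The idea of running the same analysis at index time and at query time can be sketched in a few lines of Python. This is only a toy illustration, not Elasticsearch's analyzer: the `analyze` function and its suffix list are assumptions made for this example, and it handles simple root forms only, not synonyms or spelling mistakes.

```python
import re

def analyze(text):
    """Toy analyzer (hypothetical): lowercase, tokenize, strip a few suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    analyzed = []
    for token in tokens:
        for suffix in ("ing", "ed", "s"):
            # Strip a suffix only when at least three letters remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        analyzed.append(token)
    return analyzed

# Index time: the field text is analyzed and the resulting terms are stored.
indexed_terms = set(analyze("The frost killed the crops"))

# Query time: the query goes through the same analyzer before matching.
query_terms = analyze("Killing")
matches = [term for term in query_terms if term in indexed_terms]
print(matches)
```

Because both sides are reduced to the same root form, a query for "Killing" finds a document that contains "killed", which a comparison of the raw strings would miss.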
TF-IDF
TF-IDF stands for term frequency-inverse document frequency, and it is an important parameter used inside Lucene's standard similarity algorithm, the Vector Space Model (VSM). The weight calculated by TF-IDF is a statistical measure of how important a word is to a document in a collection of documents.
Let's see how a TF-IDF weight is calculated to find our term's relevancy:
- TF(term) = (The number of times the term appears in a document) / (The total number of terms in the document)
- IDF(term) = log_e(The total number of documents / The number of documents with the term in it)
Note: While calculating IDF, the log is taken because terms such as "the", "that", and "is" may appear too many times, and we need to weigh down these frequently appearing terms while increasing the importance of rare terms.
The TF-IDF weight of a term is the product TF(term) * IDF(term).
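The two formulas translate almost directly into code. The following is a minimal sketch under stated assumptions: documents are already tokenized into lists of terms, the three-document corpus is invented for illustration, and returning 0.0 for a term that appears in no document is a choice made here to avoid a division by zero.

```python
import math

def tf(term, doc_tokens):
    # Occurrences of the term divided by the total number of terms in the document.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Natural log of (total documents / documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: an unseen term contributes no weight
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Invented example corpus of three tokenized documents.
corpus = [
    ["the", "spider", "sat", "on", "the", "wall"],
    ["the", "wall", "is", "tall"],
    ["the", "rent", "is", "due"],
]

print(tf_idf("the", corpus[0], corpus))     # "the" is in every document: log(3/3) = 0
print(tf_idf("spider", corpus[0], corpus))  # (1/6) * log(3/1), about 0.183
```

The term "the" is frequent in the first document but appears in every document, so its weight drops to zero, while the rarer "spider" gets a positive weight.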
In information retrieval, one of the simplest relevancy ranking functions is implemented by summing the TF-IDF weight for each query term. Based on the combined weights for all the terms appearing in a single query, a score is calculated that is used to return the results in a sorted order.
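As a rough sketch of such a ranking function, the code below sums the TF-IDF weight of every query term for each document and sorts the documents by that sum. It is not Lucene's actual scoring formula; the `rank` and `score` helpers, the whitespace tokenization, and the four sample documents are assumptions made for this example.

```python
import math

def score(query_terms, doc_tokens, all_docs):
    """Sum of TF-IDF weights of the query terms for one document."""
    total = 0.0
    for term in query_terms:
        doc_freq = sum(1 for tokens in all_docs if term in tokens)
        if doc_freq == 0:
            continue  # the term occurs nowhere, so it adds nothing
        term_freq = doc_tokens.count(term) / len(doc_tokens)
        total += term_freq * math.log(len(all_docs) / doc_freq)
    return total

def rank(query, docs):
    """Return (document index, score) pairs with a positive score, best first."""
    query_terms = query.lower().split()
    tokenized = [doc.lower().split() for doc in docs]
    scored = [(i, score(query_terms, tokens, tokenized))
              for i, tokens in enumerate(tokenized)]
    hits = [(i, s) for i, s in scored if s > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Invented sample documents.
docs = [
    "spiders sit on the wall",
    "the wall is tall",
    "spiders pay rent on time",
    "the rent is due",
]

for doc_index, doc_score in rank("spiders rent", docs):
    print(doc_index, round(doc_score, 3), docs[doc_index])
```

The third document contains both query terms and ranks first; the second contains neither and is not returned at all.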
Inverted indexes
The inverted index is the heart of a search engine. The primary goal of a search engine is to find, as quickly as possible, the documents in which the search terms occur. Relevancy comes second.
Let's see with an example how inverted indexes are created and why they are so fast. In this example, we have two documents with each content field containing the following texts:
- I hate when spiders sit on the wall and act like they pay rent
- I hate when spider just sit there
While indexing, these texts are tokenized into separate terms, and every unique term is stored in the index together with information such as the documents in which the term appears and its position within each document.
The inverted index built with the preceding document texts looks like this:

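The construction can also be sketched in Python using the two example texts. The assumptions here are illustrative only: terms are split on whitespace with no lowercasing or stemming, the documents are numbered 1 and 2, and positions start at 1. The `build_inverted_index` helper and the `document:position` print layout are hypothetical names and formatting, not Elasticsearch internals.

```python
from collections import defaultdict

docs = {
    1: "I hate when spiders sit on the wall and act like they pay rent",
    2: "I hate when spider just sit there",
}

def build_inverted_index(docs):
    """Map each unique term to {document id: [positions of the term]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term][doc_id].append(position)
    return index

index = build_inverted_index(docs)

# Print one line per term as "term  document:position, ...".
for term in sorted(index):
    postings = ", ".join(
        f"{doc_id}:{position}"
        for doc_id, positions in index[term].items()
        for position in positions
    )
    print(f"{term:<8} {postings}")
```

A lookup such as `index["sit"]` returns the matching documents directly (document 1 at position 5, document 2 at position 6) without scanning any text, which is why term lookups are fast.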
When you search for spider OR spiders, the query is executed against the inverted index, the terms are looked up, and the documents in which they appear are quickly identified. If you search for spider AND spiders, you will not get any results, because an AND query requires both terms to be present in the same document, and to the search engine spiders and spider are different terms unless they are normalized into their root forms. For all these term normalizations, Elasticsearch has a document analysis phase that we will see in the upcoming sections.
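The OR and AND behavior described above can be reproduced with plain set operations on the postings. This is a sketch under assumptions, not how Elasticsearch executes queries: the helper names are made up for this example, and `naive_singular` is a deliberately crude stand-in for real analysis that simply drops a trailing "s".

```python
docs = {
    1: "I hate when spiders sit on the wall and act like they pay rent",
    2: "I hate when spider just sit there",
}

def build_postings(docs, normalize=lambda term: term):
    """Map each (optionally normalized) term to the set of document ids containing it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in text.split():
            postings.setdefault(normalize(term), set()).add(doc_id)
    return postings

def search_or(postings, terms):
    # A document matches if it contains at least one of the terms.
    return set().union(*(postings.get(term, set()) for term in terms))

def search_and(postings, terms):
    # A document matches only if it contains every term.
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def naive_singular(term):
    # Crude stand-in for analysis: drop a trailing "s" from longer words.
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

query = ["spider", "spiders"]

raw = build_postings(docs)
print(search_or(raw, query))   # both documents
print(search_and(raw, query))  # empty: no document holds both raw terms

normalized = build_postings(docs, naive_singular)
normalized_query = [naive_singular(term) for term in query]
print(search_and(normalized, normalized_query))  # both documents
```

Once the index and the query are normalized in the same way, both forms collapse to the single term spider, and the AND query matches both documents.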