- Elasticsearch Essentials
- Bharvi Dixit
- 725 words
- 2021-07-16 09:33:17
Text search
Searching is broadly divided into two types: exact term search and full text search. An exact term search looks for exact terms, for example a named entity such as the name of a person, location, or organization, or a date. These searches are easier to perform because the search engine only has to decide whether a document contains the term (yes or no) and return the matching documents.
However, full text search is different and more challenging. Full text search refers to searching within text fields, where the text can be structured or unstructured. The text can be in any human language, and natural languages are very hard for a machine to understand well enough to return relevant results. The following are some examples of full text searches:
- Find all the documents with the word search in the title or content fields, and give a higher score to results with matches in the title
- Find all the tweets in which people are talking about terrorism and killing and return the results sorted by the tweet creation time
While doing these kinds of searches, we not only want relevant results but also expect that the search for a keyword matches all of its synonyms, root words, and spelling mistakes. For example, terrorism should match terorism and terror, while killing should match kills, kill, and killed.
To serve all these queries, the text-based fields go through an analysis phase before indexing, and based on this analysis, inverted indexes are built. At the time of querying, the same analysis process is applied to the terms that are sent within the queries to match those terms stored in the inverted indexes.
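The idea of applying the same analysis at index time and at query time can be sketched in a few lines of Python. This is only a toy illustration: the `analyze` and `stem` functions, the suffix list, and the sample documents are assumptions made for this example and are not how Elasticsearch's analyzers are actually implemented.

```python
import re

def stem(token):
    # Very crude suffix stripping, for illustration only (hypothetical rules).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # Lowercase, split on non-letters, then reduce each token to a rough root form.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens]

# Hypothetical sample documents.
docs = {
    1: "He kills time reading",
    2: "Killing time is easy",
    3: "Time flies",
}

# Index time: analyzed terms go into the inverted index.
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(query):
    # Query time: the same analyze() is applied before the index lookup.
    results = set()
    for term in analyze(query):
        results |= index.get(term, set())
    return sorted(results)

print(search("killed"))
```

Because both sides pass through `analyze`, a query for killed finds documents that contain kills and Killing, even though the literal string killed appears in neither.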
TF-IDF
TF-IDF stands for term frequency-inverse document frequency, and it is an important parameter used inside Lucene's standard similarity algorithm, the Vector Space Model (VSM). The weight calculated by TF-IDF is a statistical measure of how important a word is to a document in a collection of documents.
Let's see how a TF-IDF weight is calculated to find our term's relevancy:
- TF (term): (The number of times a term appears in a document) / (The total number of terms in the document)
- IDF (term): log_e((The total number of documents) / (The number of documents with the term in it))
Note: While calculating IDF, the log is taken because terms such as the, that, and is may appear too many times, and we need to weigh down these frequently appearing terms while increasing the importance of rare terms.
The TF-IDF weight of a term is the product TF(term) * IDF(term).
In information retrieval, one of the simplest relevancy ranking functions is implemented by summing the TF-IDF weight for each query term. Based on the combined weights for all the terms appearing in a single query, a score is calculated that is used to return the results in a sorted order.
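The TF, IDF, and summed-score definitions above can be written out as a short Python sketch. It follows the textbook formulas given in this section, not Lucene's actual scoring code; the function names and the three-document corpus are assumptions made for illustration.

```python
import math

def tf(term, doc_tokens):
    # (times the term appears in the document) / (total terms in the document)
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log_e(total documents / documents containing the term)
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term that occurs nowhere contributes nothing
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

def score(query_terms, doc_tokens, corpus):
    # Simplest ranking function: sum the TF-IDF weight of each query term.
    return sum(tf_idf(t, doc_tokens, corpus) for t in query_terms)

# Hypothetical, already tokenized corpus.
corpus = [
    ["the", "spider", "sat", "on", "the", "wall"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "spider", "ate", "the", "fly"],
]

query = ["the", "spider"]
scores = [score(query, doc, corpus) for doc in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)
```

In this corpus the appears in every document, so its IDF is log_e(3/3) = 0 and it adds nothing to any score, while spider, which appears in only two of the three documents, decides the ranking.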
Inverted indexes
The inverted index is the heart of a search engine. The primary goal of a search engine is to provide speedy searches while finding the documents in which our search terms occur. Relevancy comes second.
Let's see with an example how inverted indexes are created and why they are so fast. In this example, we have two documents whose content fields contain the following texts:
- I hate when spiders sit on the wall and act like they pay rent
- I hate when spider just sit there
While indexing, these texts are tokenized into separate terms, and all the unique terms are stored in the index along with information such as which documents each term appears in and the term's position in each document.
The inverted index built with the preceding document texts looks like this:
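A minimal Python sketch of how such an index could be built from the two texts follows. It assumes plain whitespace tokenization with no lowercasing or normalization, 0-based term positions, and document IDs 1 and 2; these choices, and the `build_inverted_index` name, are illustrative assumptions rather than Elasticsearch's real data structures.

```python
from collections import defaultdict

docs = {
    1: "I hate when spiders sit on the wall and act like they pay rent",
    2: "I hate when spider just sit there",
}

def build_inverted_index(docs):
    # term -> {doc_id: [positions of the term in that document]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split()):
            index[term][doc_id].append(position)
    # Convert to plain dicts so lookups of missing terms do not create entries.
    return {term: dict(postings) for term, postings in index.items()}

index = build_inverted_index(docs)

# Print each unique term, the number of documents it occurs in, and its postings.
for term in sorted(index):
    print(f"{term:8} docs={len(index[term])} postings={index[term]}")
```

Looking up a term is a single dictionary access that returns its document list directly, which is why the search does not need to scan the document texts.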

When you search for the term spider OR spiders, the query is executed against the inverted index: the terms are looked up, and the documents in which they appear are quickly identified. If you search for spider AND spiders, you will not get any results, because with AND queries both terms must be present in the same document. Without normalization, spiders and spider are different terms for the search engine, so neither document contains both; they match only once they are reduced to their root forms. For all these term normalizations, Elasticsearch has a document analysis phase that we will see in the upcoming sections.
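The OR and AND behavior described above corresponds to a set union and a set intersection over the posting lists. The sketch below assumes the same two documents, whitespace tokenization without normalization, and hypothetical helper names `search_or` and `search_and`; it is not the Elasticsearch query API.

```python
docs = {
    1: "I hate when spiders sit on the wall and act like they pay rent",
    2: "I hate when spider just sit there",
}

# term -> set of document IDs containing that exact term
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def search_or(*terms):
    # A document matches if it contains at least one of the terms (union).
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return sorted(result)

def search_and(*terms):
    # A document matches only if it contains every term (intersection).
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_or("spider", "spiders"))
print(search_and("spider", "spiders"))
print(search_and("hate", "sit"))
```

The AND query for spider and spiders comes back empty because the two unnormalized terms never share a document, whereas hate and sit both occur in each text.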