Text search

Searching is broadly divided into two types: exact term search and full text search. An exact term search looks for the exact terms, for example, a named entity such as the name of a person, location, or organization, or a date. These searches are easier to serve because the search engine only has to decide whether a document contains the term (a yes or no) and return the matching documents.

However, full text search is different and more challenging. Full text search refers to searching within text fields, where the text can be structured or unstructured. The text can be written in any human language, and natural language is very hard for a machine to understand well enough to return relevant results. The following are some examples of full text searches:

  • Find all the documents with search in the title or content fields, and return the results with matches in the title scored higher
  • Find all the tweets in which people are talking about terrorism and killing, and return the results sorted by the tweet creation time

While doing these kinds of searches, we not only want relevant results but also expect a search for a keyword to match its synonyms, root words, and common misspellings. For example, terrorism should match terorism and terror, while killing should match kills, kill, and killed.

To serve all these queries, the text-based fields go through an analysis phase before indexing, and based on this analysis, inverted indexes are built. At the time of querying, the same analysis process is applied to the terms that are sent within the queries to match those terms stored in the inverted indexes.
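As a rough illustration of why the same analysis must run at both index time and query time, here is a minimal Python sketch. The analyze function and its suffix list are hypothetical and deliberately naive; they are not Elasticsearch's analyzers, which use proper tokenizers and stemmers.

```python
import re

# Hypothetical suffix list for illustration only; real analyzers use proper stemmers.
SUFFIXES = ("ing", "ed", "s")


def analyze(text):
    """Lowercase the text, split it into letter-only tokens, and strip one common suffix."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = []
    for token in tokens:
        for suffix in SUFFIXES:
            # Keep at least three characters so short words are left alone.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms


# Index time: the field value is analyzed before its terms are stored.
indexed_terms = analyze("He kills, she killed, they are killing")

# Query time: the query text goes through the same function.
query_terms = analyze("Kill")

print(indexed_terms)
print(query_terms)
print(query_terms[0] in indexed_terms)
```

Because both sides are reduced to the same root form, a query for Kill finds text that only contains kills, killed, or killing. Handling misspellings and synonyms would need further steps that this sketch leaves out.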

TF-IDF

TF-IDF stands for term frequency-inverse document frequency, and it is an important parameter used inside Lucene's standard similarity algorithm, the Vector Space Model (VSM). The TF-IDF weight is a statistical measure of how important a word is to a document in a collection of documents.

Let's see how a TF-IDF weight is calculated to find our term's relevancy:

  • TF (term): (The number of times the term appears in a document) / (The total number of terms in the document)
  • IDF (term): log_e (The total number of documents / The number of documents containing the term)

    Note

    While calculating IDF, the logarithm is taken to dampen the ratio. Terms such as the, that, and is appear in almost every document, so their IDF falls toward zero. This weighs down these frequently appearing terms while increasing the importance of rare terms.

The TF-IDF weight of a term is the product TF(term) * IDF(term).

In information retrieval, one of the simplest relevancy ranking functions is implemented by summing the TF-IDF weight for each query term. Based on the combined weights for all the terms appearing in a single query, a score is calculated that is used to return the results in a sorted order.
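The formulas above can be sketched in a few lines of Python. This is a minimal illustration of the plain TF-IDF sum described here, not Lucene's actual scoring formula; the three-document corpus and the function names are made up for the example.

```python
import math


def tf(term, doc_terms):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc_terms.count(term) / len(doc_terms)


def idf(term, corpus):
    """Inverse document frequency: natural log of (total docs / docs containing the term)."""
    containing = sum(1 for doc_terms in corpus if term in doc_terms)
    if containing == 0:
        return 0.0  # Assumption: a term absent from every document contributes nothing.
    return math.log(len(corpus) / containing)


def score(query_terms, doc_terms, corpus):
    """Simplest ranking function: sum the TF-IDF weight of each query term."""
    return sum(tf(t, doc_terms) * idf(t, corpus) for t in query_terms)


# Made-up corpus, already split into terms.
corpus = [
    "the spider sat on the wall".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
query = ["spider", "wall"]

# Rank document indexes by descending score.
ranked = sorted(
    range(len(corpus)),
    key=lambda i: score(query, corpus[i], corpus),
    reverse=True,
)

for i in ranked:
    print(i, round(score(query, corpus[i], corpus), 4))
print("idf of 'the':", idf("the", corpus))
```

Here the appears in all three documents, so its IDF is log_e(3/3) = 0 and it adds nothing to any score, while spider and wall each appear in one document and push the first document to the top.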

Inverted indexes

The inverted index is the heart of a search engine. The primary goal of a search engine is to provide speedy searches while finding the documents in which our search terms occur. Relevancy comes second.

Let's see with an example how inverted indexes are created and why they are so fast. In this example, we have two documents whose content fields contain the following texts:

  • I hate when spiders sit on the wall and act like they pay rent
  • I hate when spider just sit there

While indexing, these texts are tokenized into separate terms, and all the unique terms are stored in the index along with information such as which documents each term appears in and its position in each of those documents.

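The inverted index built from the preceding document texts maps each unique term to the documents, and the positions within them, where it occurs. For example, hate appears at position 2 in both documents, while spiders appears only in the first document, at position 4.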

When you search for spider OR spiders, the query is executed against the inverted index: the terms are looked up, and the documents in which they appear are quickly identified. If you search for spider AND spiders, you will not get any results, because with AND queries both terms must be present in the same document. The search engine treats spiders and spider as different terms unless they are normalized into their root forms. For all these term normalizations, Elasticsearch has a document analysis phase that we will see in the upcoming sections.
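To make the lookup concrete, here is a minimal Python sketch of an inverted index over the two example documents. It assumes lowercasing and whitespace tokenization only, with no stemming, and the function names are illustrative; Lucene's real index structures are far more elaborate.

```python
from collections import defaultdict

docs = {
    1: "I hate when spiders sit on the wall and act like they pay rent",
    2: "I hate when spider just sit there",
}


def build_index(docs):
    """Map each term to {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term][doc_id].append(position)
    return index


def search_or(index, terms):
    """Documents that contain at least one of the terms."""
    result = set()
    for term in terms:
        result |= set(index.get(term, {}))
    return result


def search_and(index, terms):
    """Documents that contain every one of the terms."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_index(docs)

print(dict(index["sit"]))                       # postings for one term
print(search_or(index, ["spider", "spiders"]))  # both documents match
print(search_and(index, ["spider", "spiders"])) # no document has both terms
print(search_and(index, ["hate", "sit"]))       # both documents match
```

Each query is answered by dictionary lookups and set operations on the stored postings rather than by scanning the document texts, which is where the speed comes from. Because the terms are stored unnormalized in this sketch, spider and spiders remain separate entries.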