
Text search

Searching is broadly divided into two types: exact term search and full text search. An exact term search looks for exact terms, for example, a named entity such as the name of a person, location, or organization, or a date. These searches are easier to perform, since the search engine only has to decide whether a document matches (yes or no) and return the matching documents.

Full text search, however, is different and more challenging. It refers to searching within text fields, where the text can be structured or unstructured. The text can be in any human language, and natural languages are very hard for a machine to understand well enough to return relevant results. The following are some examples of full text searches:

  • Find all the documents with the word search in the title or content fields, and give matches in the title a higher score
  • Find all the tweets in which people are talking about terrorism and killing, and return the results sorted by tweet creation time

When doing these kinds of searches, we not only want relevant results but also expect a search for a keyword to match its synonyms, root words, and common misspellings. For example, terrorism should match terorism and terror, while killing should match kills, kill, and killed.

To serve all these queries, text-based fields go through an analysis phase before indexing, and inverted indexes are built from the output of this analysis. At query time, the same analysis process is applied to the terms sent in the query so that they match the terms stored in the inverted indexes.
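
The idea of applying the same analysis at index time and at query time can be illustrated with a minimal Python sketch. The `analyze` function and its suffix list are hypothetical and deliberately naive; they are not Elasticsearch's analyzers, which are covered later.

```python
import re

# Toy suffix list, assumed for illustration only; real analyzers use proper stemmers.
SUFFIXES = ("ing", "ed", "s")

def analyze(text):
    """Lowercase the text, split it into tokens, and strip a few common suffixes."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for token in tokens:
        for suffix in SUFFIXES:
            # Only strip when at least three characters remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

# Index time: the analyzed terms are what gets stored.
indexed_terms = set(analyze("Three people were killed"))

# Query time: the query goes through the same analysis before matching.
query_terms = analyze("Killing")
matches = all(term in indexed_terms for term in query_terms)
print(query_terms, matches)
```

Because both sides are reduced to the same root form, a query for Killing matches a document that only contains killed.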

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It is an important parameter in Lucene's standard similarity algorithm, which is based on the Vector Space Model (VSM). The TF-IDF weight is a statistical measure of how important a word is to a document within a collection of documents.

Let's see how a TF-IDF weight is calculated to find our term's relevancy:

  • TF(term): (The number of times the term appears in a document) / (The total number of terms in the document)
  • IDF(term): log_e (The total number of documents / The number of documents containing the term)

    Note

    Terms such as the, that, and is appear in almost every document, so IDF weighs down these frequently appearing terms while increasing the importance of rare terms. The log is taken to dampen this ratio so that very rare terms do not dominate the weight.

The TF-IDF weight of a term is the product TF(term) * IDF(term).
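
As a worked example of the formulas above, the following Python sketch computes a TF-IDF weight. All the counts are hypothetical numbers chosen for illustration: a term that appears 3 times in a 100-term document, in a collection of 10,000,000 documents of which 1,000 contain the term.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: occurrences of the term / total terms in the document."""
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    """Inverse document frequency: natural log of (total docs / docs containing the term)."""
    return math.log(total_docs / docs_with_term)

# Hypothetical counts, assumed for illustration only.
term_tf = tf(3, 100)                # 0.03
term_idf = idf(10_000_000, 1_000)   # log_e(10000), roughly 9.21
weight = term_tf * term_idf         # roughly 0.276

print(round(term_tf, 4), round(term_idf, 4), round(weight, 4))
```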

In information retrieval, one of the simplest relevancy ranking functions is implemented by summing the TF-IDF weight for each query term. Based on the combined weights for all the terms appearing in a single query, a score is calculated that is used to return the results in a sorted order.
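
The summing approach just described can be sketched in a few lines of Python. This is a simplified illustration, not Lucene's actual scoring formula; the three documents, the whitespace tokenization, and the function names are assumptions made for the example.

```python
import math

# Hypothetical mini-collection, tokenized on whitespace.
docs = {
    "doc1": "spider on the wall",
    "doc2": "the spider bit the other spider",
    "doc3": "the wall is tall",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

def tf(term, tokens):
    return tokens.count(term) / len(tokens)

def idf(term):
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    if containing == 0:
        return 0.0  # the term occurs nowhere, so it contributes nothing
    return math.log(len(tokenized) / containing)

def score(query, tokens):
    """Sum the TF-IDF weight of every query term for one document."""
    return sum(tf(term, tokens) * idf(term) for term in query.split())

def search(query):
    """Return (doc_id, score) pairs with a non-zero score, best match first."""
    scored = ((doc_id, score(query, tokens)) for doc_id, tokens in tokenized.items())
    return sorted((pair for pair in scored if pair[1] > 0),
                  key=lambda pair: pair[1], reverse=True)

for doc_id, doc_score in search("spider wall"):
    print(doc_id, round(doc_score, 4))
```

In this collection, the term the occurs in every document, so its IDF is log_e(1) = 0 and a query for it alone returns no scored results.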

Inverted indexes

The inverted index is the heart of a search engine. The primary goal of a search engine is to provide speedy searches while finding the documents in which our search terms occur. Relevancy comes second.

Let's see with an example how inverted indexes are created and why they are so fast. In this example, we have two documents with each content field containing the following texts:

  • I hate when spiders sit on the wall and act like they pay rent
  • I hate when spider just sit there

During indexing, these texts are tokenized into separate terms, and every unique term is stored in the index together with information such as which documents the term appears in and its position in each of those documents.

The inverted index built with the preceding document texts looks like this:

When you search for spider OR spiders, the query is executed against the inverted index: the terms are looked up, and the documents in which they appear are quickly identified. If you search for spider AND spiders, you will not get any results, because an AND query requires both terms to be present in the same document, and neither document contains both. To the search engine, spiders and spider are different terms unless they are normalized into their root form. Elasticsearch handles such term normalization in a document analysis phase, which we will see in the upcoming sections.
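
The OR and AND behavior described here can be sketched with Python sets. The `search_or` and `search_and` helpers are hypothetical names, and the index keeps only document IDs (no positions) and applies no normalization beyond lowercasing.

```python
documents = {
    1: "I hate when spiders sit on the wall and act like they pay rent",
    2: "I hate when spider just sit there",
}

# Term -> set of IDs of the documents that contain it.
postings = {}
for doc_id, text in documents.items():
    for term in text.lower().split():
        postings.setdefault(term, set()).add(doc_id)

def search_or(*terms):
    """Documents containing at least one of the terms (set union)."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

def search_and(*terms):
    """Documents containing every one of the terms (set intersection)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

print(search_or("spider", "spiders"))   # both documents
print(search_and("spider", "spiders"))  # empty: no document holds both terms
print(search_and("hate", "sit"))        # both documents
```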
