官术网_书友最值得收藏!

Building an in-memory search index using a hash map

Dictionaries can be used to quickly search for a word in a list of documents, similar to a search engine. In this subsection, we will learn how to build an inverted index based on a dictionary of lists. Let's say we have a collection of four documents:

    docs = ["the cat is under the table",
"the dog is under the table",
"cats and dogs smell roses",
"Carla eats an apple"]

A simple way to retrieve all the documents that match a query is to scan each document and test for the presence of a word. For example, if we want to look up the documents where the word table appears, we can employ the following filtering operation:

    matches = [doc for doc in docs if "table" in doc]

This approach is simple and works well when we have one-off queries; however, if we need to query the collection very often, it can be beneficial to optimize querying time. Since the per-query cost of the linear scan is O(N), you can imagine that a better scaling will allow us to handle much larger document collections.

A better strategy is to spend some time preprocessing the documents so that they are easier to find at query time. We can build a structure, called the inverted index, that associates each word in our collection with the list of documents where that word is present. In our earlier example, the word "table" will be associated to the "the cat is under the table" and "the dog is under the table" documents; they correspond to indices 0 and 1.

Such a mapping can be implemented by going over our collection of documents and storing in a dictionary the index of the documents where that term appears. The implementation is similar to the counter_dict function, except that, instead of accumulating a counter, we are growing the list of documents that match the current term:

    # Building an index
index = {}
for i, doc in enumerate(docs):
# We iterate over each term in the document
for word in doc.split():
# We build a list containing the indices
# where the term appears
if word not in index:
index[word] = [i]
else:
index[word].append(i)

Once we have built our index, doing a query involves a simple dictionary lookup. For example, if we want to return all the documents containing the term table, we can simply query the index, and retrieve the corresponding documents:

    results = index["table"]
result_documents = [docs[i] for i in results]

Since all it takes to query our collection is a dictionary access, the index can handle queries with time complexity O(1)! Thanks to the inverted index, we are now able to query any number of documents (as long as they fit in memory) in constant time. Needless to say, indexing is a technique widely used to quickly retrieve data not only in search engines, but also in databases and any system that requires fast searches.

Note that building an inverted index is an expensive operation and requires you to encode every possible query. This is a substantial drawback, but the benefits are great and it may be worthwhile to pay the price in terms of decreased flexibility.

主站蜘蛛池模板: 抚宁县| 华坪县| 巧家县| 阿城市| 肥东县| 株洲县| 甘德县| 革吉县| 宁明县| 南川市| 拜泉县| 蛟河市| 高淳县| 达尔| 油尖旺区| 渭南市| 衡阳县| 时尚| 衡阳市| 南安市| 黔西县| 泉州市| 镇安县| 双城市| 来凤县| 新建县| 双江| 蚌埠市| 蒲城县| 高雄县| 嫩江县| 申扎县| 合作市| 集安市| 于都县| 威信县| 镇沅| 文安县| 长治市| 沁源县| 石首市|