- Python High Performance(Second Edition)
- Gabriele Lanaro
- 548字
- 2021-07-09 21:01:55
Building an in-memory search index using a hash map
Dictionaries can be used to quickly search for a word in a list of documents, similar to a search engine. In this subsection, we will learn how to build an inverted index based on a dictionary of lists. Let's say we have a collection of four documents:
docs = ["the cat is under the table",
"the dog is under the table",
"cats and dogs smell roses",
"Carla eats an apple"]
A simple way to retrieve all the documents that match a query is to scan each document and test for the presence of a word. For example, if we want to look up the documents where the word table appears, we can employ the following filtering operation:
matches = [doc for doc in docs if "table" in doc]
This approach is simple and works well when we have one-off queries; however, if we need to query the collection very often, it can be beneficial to optimize querying time. Since the per-query cost of the linear scan is O(N), you can imagine that a better scaling will allow us to handle much larger document collections.
A better strategy is to spend some time preprocessing the documents so that they are easier to find at query time. We can build a structure, called the inverted index, that associates each word in our collection with the list of documents where that word is present. In our earlier example, the word "table" will be associated to the "the cat is under the table" and "the dog is under the table" documents; they correspond to indices 0 and 1.
Such a mapping can be implemented by going over our collection of documents and storing in a dictionary the index of the documents where that term appears. The implementation is similar to the counter_dict function, except that, instead of accumulating a counter, we are growing the list of documents that match the current term:
# Building an index
index = {}
for i, doc in enumerate(docs):
# We iterate over each term in the document
for word in doc.split():
# We build a list containing the indices
# where the term appears
if word not in index:
index[word] = [i]
else:
index[word].append(i)
Once we have built our index, doing a query involves a simple dictionary lookup. For example, if we want to return all the documents containing the term table, we can simply query the index, and retrieve the corresponding documents:
results = index["table"]
result_documents = [docs[i] for i in results]
Since all it takes to query our collection is a dictionary access, the index can handle queries with time complexity O(1)! Thanks to the inverted index, we are now able to query any number of documents (as long as they fit in memory) in constant time. Needless to say, indexing is a technique widely used to quickly retrieve data not only in search engines, but also in databases and any system that requires fast searches.
Note that building an inverted index is an expensive operation and requires you to encode every possible query. This is a substantial drawback, but the benefits are great and it may be worthwhile to pay the price in terms of decreased flexibility.
- AngularJS Testing Cookbook
- The Android Game Developer's Handbook
- Learning Data Mining with Python
- Django Design Patterns and Best Practices
- Java持續交付
- Full-Stack Vue.js 2 and Laravel 5
- 精通Scrapy網絡爬蟲
- Learning Apache Mahout Classification
- Illustrator CC平面設計實戰從入門到精通(視頻自學全彩版)
- Cocos2d-x Game Development Blueprints
- HTML+CSS+JavaScript編程入門指南(全2冊)
- Vue.js光速入門及企業項目開發實戰
- C#程序設計基礎入門教程
- SAP Web Dynpro for ABAP開發技術詳解:基礎應用
- Android 游戲開發大全(第二版)