Building an in-memory search index using a hash map

Dictionaries can be used to quickly search for a word in a list of documents, similar to a search engine. In this subsection, we will learn how to build an inverted index based on a dictionary of lists. Let's say we have a collection of four documents:

    docs = ["the cat is under the table",
            "the dog is under the table",
            "cats and dogs smell roses",
            "Carla eats an apple"]

A simple way to retrieve all the documents that match a query is to scan each document and test for the presence of a word. For example, if we want to look up the documents where the word table appears, we can employ the following filtering operation:

    matches = [doc for doc in docs if "table" in doc]

This approach is simple and works well when we have one-off queries; however, if we need to query the collection very often, it can be beneficial to optimize querying time. Since the per-query cost of the linear scan is O(N), you can imagine that a better scaling will allow us to handle much larger document collections.

A better strategy is to spend some time preprocessing the documents so that they are easier to find at query time. We can build a structure, called the inverted index, that associates each word in our collection with the list of documents where that word is present. In our earlier example, the word "table" will be associated to the "the cat is under the table" and "the dog is under the table" documents; they correspond to indices 0 and 1.

Such a mapping can be implemented by going over our collection of documents and storing in a dictionary the index of the documents where that term appears. The implementation is similar to the counter_dict function, except that, instead of accumulating a counter, we are growing the list of documents that match the current term:

    # Building an index
    index = {}
    for i, doc in enumerate(docs):
        # We iterate over each term in the document
        for word in doc.split():
            # We build a list containing the indices 
            # where the term appears
            if word not in index:
                index[word] = [i]
            else:
                index[word].append(i)

Once we have built our index, doing a query involves a simple dictionary lookup. For example, if we want to return all the documents containing the term table, we can simply query the index, and retrieve the corresponding documents:

    results = index["table"]
    result_documents = [docs[i] for i in results]

Since all it takes to query our collection is a dictionary access, the index can handle queries with time complexity O(1)! Thanks to the inverted index, we are now able to query any number of documents (as long as they fit in memory) in constant time. Needless to say, indexing is a technique widely used to quickly retrieve data not only in search engines, but also in databases and any system that requires fast searches.

Note that building an inverted index is an expensive operation and requires you to encode every possible query. This is a substantial drawback, but the benefits are great and it may be worthwhile to pay the price in terms of decreased flexibility.

官术网_书友最值得收藏!

Python High Performance（Second Edition）

Building an in-memory search index using a hash map