Tries

A perhaps less popular data structure, very useful in practice, is the trie (sometimes called prefix tree). Tries are extremely fast at matching a list of strings against a prefix. This is especially useful when implementing features such as search-as-you type and autocompletion, where the list of available completions is very large and short response times are required.

Unfortunately, Python does not include a trie implementation in its standard library; however, many efficient implementations are readily available through PyPI. The one we will use in this subsection is patricia-trie, a single-file, pure Python implementation of trie. As an example, we will use patricia-trie to perform the task of finding the longest prefix in a set of strings (just like autocompletion).

As an example, we can demonstrate how fast a trie is able to search through a list of strings. In order to generate a large amount of unique random strings, we can define a function, random_string. The random_string function will return a string composed of random uppercase characters and, while there is a chance to get duplicates, we can greatly reduce the probability of duplicates to the point of being negligible if we make the string long enough. The implementation of the random_string function is shown as follows:

    from random import choice
    from string import ascii_uppercase

    def random_string(length):
     """Produce a random string made of *length* uppercase ascii 
     characters"""
     return ''.join(choice(ascii_uppercase) for i in range(length))

We can build a list of random strings and time how fast it searches for a prefix (in our case, the "AA" string) using the str.startswith function:

    strings = [random_string(32) for i in range(10000)]
    matches = [s for s in strings if s.startswith('AA')]

List comprehension and str.startwith are already very optimized operations and, on this small dataset, the search takes only a millisecond or so:

    %timeit [s for s in strings if s.startswith('AA')]

    1000 loops, best of 3: 1.76 ms per loop

Now, let's try using a trie for the same operation. In this example, we will use the patricia-trie library that is installable through pip. The patricia.trie class implements a variant of the trie data structure with an interface similar to a dictionary. We can initialize our trie by creating a dictionary from our list of strings, as follows:

    from patricia import trie
    strings_dict = {s:0 for s in strings} 
    # A dictionary where all values are 0
    strings_trie = trie(**strings_dict)

To query patricia-trie for a matching prefix, we can use the trie.iter method, which returns an iterator over the matching strings:

    matches = list(strings_trie.iter('AA'))

Now that we know how to initialize and query a trie, we can time the operation:

    %timeit list(strings_trie.iter('AA'))
    10000 loops, best of 3: 60.1 μs per loop

If you look closely, the timing for this input size is 60.1 μs, which is about 30 times faster (1.76 ms = 1760 μs) than linear search! The speed up is so impressive because of the better computational complexity of the trie prefix search. Querying a trie has a time complexity O(S), where S is the length of the longest string in the collection, while the time complexity of a simple linear scan is O(N), where N is the size of the collection.

Note that if we want to return all the prefixes that match, the running time will be proportional to the number of results that match the prefix. Therefore, when designing timing benchmarks, care must be taken to ensure that we are always returning the same number of results.

The scaling properties of a trie versus a linear scan for datasets of different sizes that contains ten prefix matches are shown in the following table:

An interesting fact is that the implementation of patricia-trie is actually a single Python file; this clearly shows how simple and powerful a clever algorithm can be. For extra features and performance, other C-optimized trie libraries are also available, such as datrie and marisa-trie.

官术网_书友最值得收藏!

Python High Performance（Second Edition）

Tries