官术网_书友最值得收藏!

Tries

A perhaps less popular data structure, very useful in practice, is the trie (sometimes called prefix tree). Tries are extremely fast at matching a list of strings against a prefix. This is especially useful when implementing features such as search-as-you type and autocompletion, where the list of available completions is very large and short response times are required.

Unfortunately, Python does not include a trie implementation in its standard library; however, many efficient implementations are readily available through PyPI. The one we will use in this subsection is patricia-trie, a single-file, pure Python implementation of trie. As an example, we will use patricia-trie to perform the task of finding the longest prefix in a set of strings (just like autocompletion).

As an example, we can demonstrate how fast a trie is able to search through a list of strings. In order to generate a large amount of unique random strings, we can define a function, random_string. The random_string function will return a string composed of random uppercase characters and, while there is a chance to get duplicates, we can greatly reduce the probability of duplicates to the point of being negligible if we make the string long enough. The implementation of the random_string function is shown as follows:

    from random import choice
from string import ascii_uppercase

def random_string(length):
"""Produce a random string made of *length* uppercase ascii
characters"""
return ''.join(choice(ascii_uppercase) for i in range(length))

We can build a list of random strings and time how fast it searches for a prefix (in our case, the "AA" string) using the str.startswith function:

    strings = [random_string(32) for i in range(10000)]
matches = [s for s in strings if s.startswith('AA')]

List comprehension and str.startwith are already very optimized operations and, on this small dataset, the search takes only a millisecond or so:

    %timeit [s for s in strings if s.startswith('AA')]

1000 loops, best of 3: 1.76 ms per loop

Now, let's try using a trie for the same operation. In this example, we will use the patricia-trie library that is installable through pip. The patricia.trie class implements a variant of the trie data structure with an interface similar to a dictionary. We can initialize our trie by creating a dictionary from our list of strings, as follows:

    from patricia import trie
strings_dict = {s:0 for s in strings}
# A dictionary where all values are 0
strings_trie = trie(**strings_dict)

To query patricia-trie for a matching prefix, we can use the trie.iter method, which returns an iterator over the matching strings:

    matches = list(strings_trie.iter('AA'))

Now that we know how to initialize and query a trie, we can time the operation:

    %timeit list(strings_trie.iter('AA'))
10000 loops, best of 3: 60.1 μs per loop

If you look closely, the timing for this input size is 60.1 μs, which is about 30 times faster (1.76 ms = 1760 μs) than linear search! The speed up is so impressive because of the better computational complexity of the trie prefix search. Querying a trie has a time complexity O(S), where S is the length of the longest string in the collection, while the time complexity of a simple linear scan is O(N), where N is the size of the collection. 

Note that if we want to return all the prefixes that match, the running time will be proportional to the number of results that match the prefix. Therefore, when designing timing benchmarks, care must be taken to ensure that we are always returning the same number of results.

The scaling properties of a trie versus a linear scan for datasets of different sizes that contains ten prefix matches are shown in the following table:

An interesting fact is that the implementation of patricia-trie is actually a single Python file; this clearly shows how simple and powerful a clever algorithm can be. For extra features and performance, other C-optimized trie libraries are also available, such as datrie and marisa-trie.

主站蜘蛛池模板: 桐城市| 临城县| 锡林浩特市| 琼海市| 铁岭县| 绵竹市| 四平市| 延津县| 青海省| 霞浦县| 广元市| 惠水县| 富蕴县| 广灵县| 鄂托克旗| 弋阳县| 凉城县| 彭水| 凉城县| 拜城县| 柏乡县| 吴旗县| 赤水市| 罗田县| 汝州市| 永吉县| 陆川县| 政和县| 穆棱市| 内丘县| 靖边县| 永新县| 东山县| 高淳县| 沙湾县| 苍山县| 大同市| 彝良县| 凌源市| 象州县| 庆城县|