
Advanced collections

The following collections are mostly just extensions of base collections, some of them fairly simple and others a bit more advanced. For all of them though, it is important to know the characteristics of the underlying structures. Without understanding them, it will be difficult to comprehend the characteristics of these collections.

There are a few collections that are implemented in native C code for performance reasons, but all of them can easily be implemented in pure Python as well.

ChainMap – the list of dictionaries

Introduced in Python 3.3, ChainMap allows you to combine multiple mappings (dictionaries for example) into one. This is especially useful when combining multiple contexts. For example, when looking for a variable in your current scope, by default, Python will search in locals(), globals(), and lastly builtins.

Normally, you would do something like this:

import builtins

builtin_vars = vars(builtins)
if key in locals():
    value = locals()[key]
elif key in globals():
    value = globals()[key]
elif key in builtin_vars:
    value = builtin_vars[key]
else:
    raise NameError('name %r is not defined' % key)

This works, but it's ugly to say the least. We can make it prettier, of course:

import builtins

mappings = globals(), locals(), vars(builtins)
for mapping in mappings:
    if key in mapping:
        value = mapping[key]
        break
else:
    raise NameError('name %r is not defined' % key)

A lot better! Moreover, this can actually be considered a nice solution. But since Python 3.3, it's even easier. Now we can simply use the following code:

import builtins
import collections

mappings = collections.ChainMap(globals(), locals(), vars(builtins))
value = mappings[key]

The ChainMap collection is very useful for command-line applications. The most important configuration happens through command-line arguments, followed by directory local config files, followed by global config files, followed by defaults:

import argparse
import collections

defaults = {
    'spam': 'default spam value',
    'eggs': 'default eggs value',
}

parser = argparse.ArgumentParser()
parser.add_argument('--spam')
parser.add_argument('--eggs')

args = vars(parser.parse_args())
# We can't pass vars(args) to ChainMap directly, because argparse
# fills unsupplied arguments with None, which would shadow the defaults
filtered_args = {k: v for k, v in args.items() if v}

combined = collections.ChainMap(filtered_args, defaults)

print(combined['spam'])

Note that accessing specific mappings is still possible:

print(combined.maps[1]['spam'])

for map_ in combined.maps:
    print(map_.get('spam'))
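One subtlety worth knowing: while lookups search all maps from left to right, writes and deletions only ever affect the first map. A small sketch to illustrate (the dictionaries here are made-up examples):

```python
import collections

defaults = {'spam': 'default', 'eggs': 'default'}
overrides = {'spam': 'override'}
combined = collections.ChainMap(overrides, defaults)

# Lookups search the maps from left to right:
print(combined['spam'])  # 'override' comes from the first map
print(combined['eggs'])  # 'default' falls through to the second map

# Writes and deletions only ever touch the first map:
combined['eggs'] = 'changed'
print(defaults['eggs'])   # still 'default'
print(overrides['eggs'])  # 'changed'

# new_child() pushes a fresh dict in front without copying anything:
child = combined.new_child()
child['spam'] = 'child value'
print(combined['spam'])   # unchanged: 'override'
```

The new_child() method is what makes ChainMap convenient for nested scopes; each child is a cheap view on top of the existing maps, not a copy.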

Counter – keeping track of the most frequent elements

Counter is a class for keeping track of the number of occurrences of an element. Its basic usage is as you would expect:

>>> import collections

>>> counter = collections.Counter('eggs')
>>> for k in 'eggs':
...     print('Count for %s: %d' % (k, counter[k]))
Count for e: 1
Count for g: 2
Count for g: 2
Count for s: 1

However, Counter can do more than simply return counts. It also has a few very useful and fast (it uses heapq) methods for getting the most common elements. Even with a hundred thousand elements added to the counter, it still executes within a second:

>>> import math
>>> import collections

>>> counter = collections.Counter()
>>> for i in range(0, 100000):
...     counter[math.sqrt(i) // 25] += 1

>>> for key, count in counter.most_common(5):
...     print('%s: %d' % (key, count))
11.0: 14375
10.0: 13125
9.0: 11875
8.0: 10625
12.0: 10000

But wait, there's more! In addition to getting the most frequent elements, it's also possible to add, subtract, intersect, and "union" counters very similarly to the set operations that we saw earlier. So what is the difference between adding two counters and making a union of them? As you would expect, they are similar, but there is a small difference. Let's look at its workings:

>>> import collections


>>> def print_counter(expression, counter):
...     sorted_characters = sorted(counter.elements())
...     print(expression, ''.join(sorted_characters))

>>> eggs = collections.Counter('eggs')
>>> spam = collections.Counter('spam')
>>> print_counter('eggs:', eggs)
eggs: eggs
>>> print_counter('spam:', spam)
spam: amps
>>> print_counter('eggs & spam:', eggs & spam)
eggs & spam: s
>>> print_counter('spam & eggs:', spam & eggs)
spam & eggs: s
>>> print_counter('eggs - spam:', eggs - spam)
eggs - spam: egg
>>> print_counter('spam - eggs:', spam - eggs)
spam - eggs: amp
>>> print_counter('eggs + spam:', eggs + spam)
eggs + spam: aeggmpss
>>> print_counter('spam + eggs:', spam + eggs)
spam + eggs: aeggmpss
>>> print_counter('eggs | spam:', eggs | spam)
eggs | spam: aeggmps
>>> print_counter('spam | eggs:', spam | eggs)
spam | eggs: aeggmps

The first two are obvious. The eggs string is just a sequence of characters with two "g"s, one "s", and one "e", and spam is almost the same but with different letters.

The result of spam & eggs (and the reverse) is also quite predictable. The only letter that's shared between spam and eggs is s, so that's the result. For the counts, the intersection simply takes min(count_a, count_b) for every element that appears in both counters.

When subtracting the letters s, p, a, and m from eggs, you are left with e and g. Similarly, when removing e, g, and s from spam, you are left with p, a, and m.

Now, adding is as you would expect—just an element-by-element addition of both counters.

So how is the union (OR) any different? It takes max(count_a, count_b) for every element that occurs in either counter, instead of summing the counts as the addition does.

Lastly, as is demonstrated in the preceding code, the elements method returns an expanded list of all elements repeated by the count.

Note

The Counter object will automatically remove elements that are zero or less during the execution of mathematical operations.
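To see this pruning in action, compare the binary operators with the in-place subtract() method, which does keep zero and negative counts. A small sketch using the same eggs/spam counters as above:

```python
import collections

eggs = collections.Counter('eggs')
spam = collections.Counter('spam')

# The binary operators return new counters and prune any
# element whose count drops to zero or below:
print(eggs - spam)   # only 'e' and 'g' survive

# subtract() modifies the counter in place and keeps
# zero and negative counts around:
counts = collections.Counter('eggs')
counts.subtract(spam)
print(counts['s'])   # 0: 's' occurred once in each counter
print(counts['p'])   # -1: 'p' only occurred in spam

# The unary + operator prunes the non-positive counts again:
print(+counts)
```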

deque – the double ended queue

The deque (short for Double Ended Queue) object is one of the oldest collections. It was introduced in Python 2.4, so it has been available for over 10 years by now. Generally, this object will be too low-level for most purposes these days, as many operations that would otherwise use it have well-supported libraries available, but that doesn't make it less useful.

Internally, deque is created as a doubly linked list, which means that every item points to the next and the previous item. Since deque is double-ended, the list itself points to both the first and the last element. This makes adding and removing items at either the beginning or the end a very light O(1) operation: only the pointer to the beginning/end of the list needs to change, and a pointer needs to be added to the first/last item, depending on where the new item is added.

For simple stack/queue purposes, it seems wasteful to use a double-ended queue, but the performance is good enough for us not to care about the overhead incurred. The deque class is fully implemented in C (with CPython).
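The difference with a plain list is easy to measure. A rough sketch using timeit (the absolute numbers will vary per machine; only the ratio matters):

```python
import collections
import timeit

# list.pop(0) shifts every remaining element, so emptying a
# list from the front costs O(n) per pop; deque.popleft() is O(1).
list_time = timeit.timeit(
    'while items: items.pop(0)',
    'items = list(range(10000))',
    number=1)
deque_time = timeit.timeit(
    'while items: items.popleft()',
    'import collections; items = collections.deque(range(10000))',
    number=1)

print('list.pop(0):     %.5f seconds' % list_time)
print('deque.popleft(): %.5f seconds' % deque_time)
```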

Its usage as a queue is very straightforward:

>>> import collections

>>> queue = collections.deque()
>>> queue.append(1)
>>> queue.append(2)
>>> queue
deque([1, 2])
>>> queue.popleft()
1
>>> queue.popleft()
2
>>> queue.popleft()
Traceback (most recent call last):
 ...
IndexError: pop from an empty deque

As expected, the two items are returned, followed by an IndexError when we try to pop a third item from the now empty deque.

The usage as a stack is almost identical, but we have to use pop instead of popleft (or appendleft instead of append):

>>> import collections

>>> queue = collections.deque()
>>> queue.append(1)
>>> queue.append(2)
>>> queue
deque([1, 2])
>>> queue.pop()
2
>>> queue.pop()
1
>>> queue.pop()
Traceback (most recent call last):
 ...
IndexError: pop from an empty deque

Another very useful feature is that deque can function as a circular queue through the maxlen parameter, which makes it easy to keep, for example, the last n status messages:

>>> import collections

>>> circular = collections.deque(maxlen=2)
>>> for i in range(5):
...     circular.append(i)
...     circular
deque([0], maxlen=2)
deque([0, 1], maxlen=2)
deque([1, 2], maxlen=2)
deque([2, 3], maxlen=2)
deque([3, 4], maxlen=2)
>>> circular
deque([3, 4], maxlen=2)

Whenever you require a queue or stack class within a single-threaded application, deque is a very convenient option. If you require the object to be synchronized for multithreading operations, then the queue.Queue class would be better suited. Internally, it wraps deque, but it's a thread-safe alternative. In the same category, there is also an asyncio.Queue for asynchronous operations and multiprocessing.Queue for multiprocessing operations. Examples of asyncio and multiprocessing can be found in Chapter 7, Async IO – Multithreading without Threads and Chapter 13, Multiprocessing – When a Single CPU Core Is Not Enough respectively.

defaultdict – dictionary with a default value

The defaultdict is by far my favorite object in the collections package. I still remember writing my own versions of it before it was added to the core. While it's a fairly simple object, it is extremely useful for all sorts of design patterns. Instead of having to check for the existence of a key and adding a value every time, you can just declare the default from the beginning, and there is no need to worry about the rest.

For example, let's say we are building a very basic graph structure from a list of connected nodes.

This is our list of connected nodes (one way):

nodes = [
    ('a', 'b'),
    ('a', 'c'),
    ('b', 'a'),
    ('b', 'd'),
    ('c', 'a'),
    ('d', 'a'),
    ('d', 'b'),
    ('d', 'c'),
]

Now let's put this graph into a normal dictionary:

>>> graph = dict()
>>> for from_, to in nodes:
...     if from_ not in graph:
...         graph[from_] = []
...     graph[from_].append(to)

>>> import pprint
>>> pprint.pprint(graph)
{'a': ['b', 'c'],
 'b': ['a', 'd'],
 'c': ['a'],
 'd': ['a', 'b', 'c']}

Some variations are possible, of course, using setdefault for example. But they remain more complex than they need to be.

The truly Pythonic version uses defaultdict instead:

>>> import collections

>>> graph = collections.defaultdict(list)
>>> for from_, to in nodes:
...     graph[from_].append(to)

>>> import pprint
>>> pprint.pprint(graph)
defaultdict(<class 'list'>,
            {'a': ['b', 'c'],
             'b': ['a', 'd'],
             'c': ['a'],
             'd': ['a', 'b', 'c']})

Isn't that a beautiful bit of code? The defaultdict can actually be seen as the precursor of the counter object. It's not as fancy and doesn't have all the bells and whistles that counter has, but it does the job in many cases:

>>> counter = collections.defaultdict(int)
>>> counter['spam'] += 5
>>> counter
defaultdict(<class 'int'>, {'spam': 5})

The default value for defaultdict needs to be a callable object. In the previous cases, these were int and list, but you can easily define your own function to use as a default value. That's what the following example does, although I wouldn't recommend it for production use since it sacrifices some readability. I do believe, however, that it is a beautiful example of the power of Python.

This is how we create a tree in a single line of Python:

import collections
def tree(): return collections.defaultdict(tree)

Brilliant, isn't it? Here's how we can actually use it:

>>> import json
>>> import collections


>>> def tree():
...     return collections.defaultdict(tree)

>>> colours = tree()
>>> colours['other']['black'] = 0x000000
>>> colours['other']['white'] = 0xFFFFFF
>>> colours['primary']['red'] = 0xFF0000
>>> colours['primary']['green'] = 0x00FF00
>>> colours['primary']['blue'] = 0x0000FF
>>> colours['secondary']['yellow'] = 0xFFFF00
>>> colours['secondary']['aqua'] = 0x00FFFF
>>> colours['secondary']['fuchsia'] = 0xFF00FF

>>> print(json.dumps(colours, sort_keys=True, indent=4))
{
    "other": {
        "black": 0,
        "white": 16777215
    },
    "primary": {
        "blue": 255,
        "green": 65280,
        "red": 16711680
    },
    "secondary": {
        "aqua": 65535,
        "fuchsia": 16711935,
        "yellow": 16776960
    }
}

The nice thing is that you can make it go as deep as you'd like. Because of the defaultdict base, it generates itself recursively.
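One caveat to be aware of: with defaultdict, merely reading a missing key inserts it. A quick sketch:

```python
import collections

counts = collections.defaultdict(int)

# Even a plain lookup triggers the default factory and
# stores the result:
print(counts['missing'])    # 0
print('missing' in counts)  # True: the lookup created the key

# Use `in` or .get() when you only want to test for a key;
# neither of them calls the default factory:
print(counts.get('other'))  # None
print('other' in counts)    # False
```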

namedtuple – tuples with field names

The namedtuple object is exactly what the name implies—a tuple with a name. It has a few useful use cases, though I must admit that I haven't found too many in the wild, except for some Python modules such as inspect and urllib.parse. Points in 2D or 3D space are a nice example of where it is definitely useful:

>>> import collections

>>> Point = collections.namedtuple('Point', ['x', 'y', 'z'])
>>> point_a = Point(1, 2, 3)
>>> point_a
Point(x=1, y=2, z=3)

>>> point_b = Point(x=4, z=5, y=6)
>>> point_b
Point(x=4, y=6, z=5)

Not too much can be said about namedtuple; it does what you would expect, and the greatest advantage is that the fields can be accessed both by name and by index, which makes tuple unpacking quite easy:

>>> x, y, z = point_a
>>> print('X: %d, Y: %d, Z: %d' % (x, y, z))
X: 1, Y: 2, Z: 3
>>> print('X: %d, Y: %d, Z: %d' % point_b)
X: 4, Y: 6, Z: 5
>>> print('X: %d' % point_a.x)
X: 1
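Beyond the tuple behavior, namedtuple also offers a few helper methods; _asdict() and _replace() are the most useful ones (the leading underscore is there to avoid clashes with field names, not to mark them private):

```python
import collections

Point = collections.namedtuple('Point', ['x', 'y', 'z'])
point = Point(1, 2, 3)

# _asdict() gives a dict view of the fields:
print(point._asdict())

# _replace() returns a new tuple with the given fields changed;
# the original is immutable and stays untouched:
moved = point._replace(x=10)
print(moved)   # Point(x=10, y=2, z=3)
print(point)   # Point(x=1, y=2, z=3)
```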

enum – a group of constants

The enum package is quite similar to namedtuple but has a completely different goal and interface. The basic enum object makes it really easy to have constants in your modules while still avoiding magic numbers. This is a basic example:

>>> import enum


>>> class Color(enum.Enum):
...     red = 1
...     green = 2
...     blue = 3

>>> Color.red
<Color.red: 1>
>>> Color['red']
<Color.red: 1>
>>> Color(1)
<Color.red: 1>
>>> Color.red.name
'red'
>>> Color.red.value
1
>>> isinstance(Color.red, Color)
True
>>> Color.red is Color['red']
True
>>> Color.red is Color(1)
True

A few of the handy features of the enum package are that the objects are iterable, accessible through both numeric and textual representation of the values, and, with proper inheritance, even comparable to other classes.

The following code shows the usage of a basic API:

>>> for color in Color:
...     color
<Color.red: 1>
<Color.green: 2>
<Color.blue: 3>

>>> colors = dict()
>>> colors[Color.green] = 0x00FF00
>>> colors
{<Color.green: 2>: 65280}

There is more though. One of the lesser known possibilities from the enum package is that you can make value comparisons work through inheritance of specific types, and this works for every type—not just integers but (your own) custom types as well.

This is the regular enum:

>>> import enum


>>> class Spam(enum.Enum):
...     EGGS = 'eggs'

>>> Spam.EGGS == 'eggs'
False

The following is enum with str inheritance:

>>> import enum


>>> class Spam(str, enum.Enum):
...     EGGS = 'eggs'

>>> Spam.EGGS == 'eggs'
True
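For numeric constants, the enum package ships a ready-made version of this pattern: IntEnum, which inherits from int so that the members compare and sort like plain integers:

```python
import enum


class Color(enum.IntEnum):
    red = 1
    green = 2
    blue = 3


# IntEnum members behave like regular integers:
print(Color.red == 1)  # True
print(Color.red + 1)   # 2
print(sorted([Color.blue, Color.red, Color.green]))
```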

OrderedDict – a dictionary where the insertion order matters

OrderedDict is a dict that keeps track of the order in which the items were inserted. Whereas a normal dict returns keys in an arbitrary, hash-based order (at least up to CPython 3.5; since CPython 3.6/3.7, regular dicts preserve insertion order as well), OrderedDict returns keys in insertion order. So, it's not ordered by key or value, but that is easily possible too:

>>> import collections


>>> spam = collections.OrderedDict()
>>> spam['b'] = 2
>>> spam['c'] = 3
>>> spam['a'] = 1
>>> spam
OrderedDict([('b', 2), ('c', 3), ('a', 1)])

>>> for key, value in spam.items():
...     key, value
('b', 2)
('c', 3)
('a', 1)

>>> eggs = collections.OrderedDict(sorted(spam.items()))
>>> eggs
OrderedDict([('a', 1), ('b', 2), ('c', 3)])

While you can probably guess how this works, the internals might surprise you a little. I know I was expecting a different implementation.

Internally, OrderedDict uses a normal dict for key/value storage, and in addition to that, it uses a doubly linked list to keep track of the next/previous items. To keep track of the reverse relation (from the doubly linked list back to the keys), there is an extra dict stored internally.

Put simply, OrderedDict can be a very handy tool for keeping your dict sorted, but it does come at a cost. The system is structured in such a way that set and get are really fast O(1), but the object is still a lot heavier (double or more memory usage) when compared to a regular dict. In many cases, the memory usage of the objects inside will outweigh the memory usage of the dict itself, of course, but this is something to keep in mind.
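Two OrderedDict methods use that internal linked list directly: move_to_end() and popitem(), which can pop from either end. A small sketch:

```python
import collections

spam = collections.OrderedDict([('a', 1), ('b', 2), ('c', 3)])

# move_to_end() reorders a key without touching its value:
spam.move_to_end('a')               # to the last position
print(list(spam))                   # ['b', 'c', 'a']
spam.move_to_end('c', last=False)   # to the first position
print(list(spam))                   # ['c', 'b', 'a']

# popitem() pops from either end, which is what makes
# OrderedDict usable as a simple LRU structure:
print(spam.popitem(last=False))     # ('c', 3)
```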

heapq – the ordered list

The heapq module is a great little module that makes it very easy to create a priority queue in Python: a structure that always makes the smallest (or largest, depending on the implementation) item available with minimum effort. The API is quite simple, and one of the best examples of its usage can be seen in the OrderedDict object. You probably don't want to use heapq directly, but understanding its inner workings is important in order to analyze how classes such as OrderedDict work.

Tip

If you are looking for a structure to keep your list always sorted, try the bisect module instead.

The basic usage is quite simple though:

>>> import heapq


>>> heap = [1, 3, 5, 7, 2, 4, 3]
>>> heapq.heapify(heap)
>>> heap
[1, 2, 3, 7, 3, 4, 5]

>>> while heap:
...     heapq.heappop(heap), heap
(1, [2, 3, 3, 7, 5, 4])
(2, [3, 3, 4, 7, 5])
(3, [3, 5, 4, 7])
(3, [4, 5, 7])
(4, [5, 7])
(5, [7])
(7, [])

One important thing to note here—something that you have probably already understood from the preceding example—is that the heapq module does not create a special object. It is simply a bunch of methods for treating a regular list as a heap. That doesn't make it less useful, but it is something to take into consideration. You may also wonder why the heap isn't sorted. Actually, it is sorted but not the way you expect it to be. If you view the heap as a tree, it becomes much more obvious:

   1
 2   3
7 3 4 5

The smallest number is always at the top and the biggest numbers are always at the bottom of the tree. Because of that, it's really easy to find the smallest number, but finding the largest is not so easy. To get the sorted version of the heap, we simply need to keep removing the top of the tree until all items are gone.
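That removal loop is essentially the heapsort algorithm. The module also provides nsmallest() and nlargest(), which avoid emptying the whole heap when you only need a few items. A sketch:

```python
import heapq


def heapsort(iterable):
    # heapify is O(n); each of the n pops is O(log(n)),
    # so the whole sort is O(n * log(n))
    heap = list(iterable)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


print(heapsort([1, 3, 5, 7, 2, 4, 3]))            # [1, 2, 3, 3, 4, 5, 7]

# nsmallest/nlargest only need a small auxiliary heap:
print(heapq.nsmallest(3, [1, 3, 5, 7, 2, 4, 3]))  # [1, 2, 3]
print(heapq.nlargest(2, [1, 3, 5, 7, 2, 4, 3]))   # [7, 5]
```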

bisect – the sorted list

We saw the heapq module in the previous section, which makes it really simple to always get the smallest number from a list, and therefore easy to sort a list of objects. Whereas the heapq module arranges items into a tree-like structure, the bisect module inserts items in such a way that the list stays sorted. A big difference is that adding and removing items is very cheap with the heapq module, whereas finding items is very cheap with the bisect module. If your primary purpose is searching, then bisect should be your choice.

As is the case with heapq, bisect does not really create a special data structure. It just works on a standard list and expects that list to always be sorted. It is important to understand the performance implications of this; simply adding items to the list using the bisect algorithm can be very slow because an insert on a list takes O(n). Effectively, creating a sorted list using bisect takes O(n*n), which is quite slow, especially because creating the same sorted list using heapq or sorted takes O(n * log(n)) instead.

Note

The log(n) refers to the base 2 logarithm function. To calculate this value, the math.log2() function can be used. This results in an increase of 1 every time the number doubles in size. For n=2, the value of log(n) is 1, and consequently for n=4 and n=8, the log values are 2 and 3, respectively.

This means that a 32-bit number, which is 2**32 = 4294967296, has a log of 32.
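A quick check with math.log2 confirms this:

```python
import math

# log2 increases by exactly 1 every time n doubles:
print(math.log2(2), math.log2(4), math.log2(8))  # 1.0 2.0 3.0

# A 32-bit number has a log2 of 32:
print(math.log2(2 ** 32))  # 32.0
```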

If you have a sorted structure and you only need to add a single item, then the bisect algorithm can be used for insertion. Otherwise, it's generally faster to simply append the items and call a .sort() afterwards.

To illustrate, we have these lines:

>>> import bisect

Using the regular sort:
>>> sorted_list = []
>>> sorted_list.append(5) # O(1)
>>> sorted_list.append(3) # O(1)
>>> sorted_list.append(1) # O(1)
>>> sorted_list.append(2) # O(1)
>>> sorted_list.sort() # O(n * log(n)) = O(4 * log(4)) = O(8)
>>> sorted_list
[1, 2, 3, 5]

Using bisect:
>>> sorted_list = []
>>> bisect.insort(sorted_list, 5) # O(n) = O(1)
>>> bisect.insort(sorted_list, 3) # O(n) = O(2)
>>> bisect.insort(sorted_list, 1) # O(n) = O(3)
>>> bisect.insort(sorted_list, 2) # O(n) = O(4)
>>> sorted_list
[1, 2, 3, 5]

For a small number of items, the difference is negligible, but it quickly grows to a point where the difference is large. For n=4, the difference is just between 4 * 1 + 8 = 12 and 1 + 2 + 3 + 4 = 10, making the bisect solution faster. But if we were to insert 1,000 items, it would be 1000 + 1000 * log(1000) = 10966 versus 1 + 2 + … + 1000 = 1000 * (1000 + 1) / 2 = 500500. So, be very careful when inserting many items.

Searching within the list is very fast though; because it is sorted, we can use a very simple binary search algorithm. For example, what if we want to check whether a few numbers exist within the list?

>>> import bisect


>>> sorted_list = [1, 2, 3, 5]
>>> def contains(sorted_list, value):
...     i = bisect.bisect_left(sorted_list, value)
...     return i < len(sorted_list) and sorted_list[i] == value

>>> contains(sorted_list, 2)
True
>>> contains(sorted_list, 4)
False
>>> contains(sorted_list, 6)
False

As you can see, the bisect_left function finds the position at which the number is supposed to be. This is actually what the insort function does as well; it inserts the number at the correct position by searching for the location of the number.

So how is this different from a regular value in sorted_list? The biggest difference is that bisect does a binary search internally, which means that it starts in the middle and jumps left or right depending on whether the value is bigger or smaller than the value. To illustrate, we will search for 4 in a list of numbers from 0 to 14:

sorted_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Step 1: 4 < 7                       ^
Step 2: 4 > 3           ^
Step 3: 4 < 5                 ^
Step 4: 4 == 4             ^

As you can see, after only four steps, we have found the number we searched for. Depending on the number (7, for example, would be found in a single step), it may go faster, but it will never take more than O(log(n)) steps to find a number.

With a regular list, a search would simply walk through all items until it finds the desired item. If you're lucky, it could be the first number you encounter, but if you're unlucky, it could be the last item. In the case of 1,000 items, that would be the difference between 1,000 steps and log(1000) = 10 steps.
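The same position-finding trick makes bisect handy for range lookups as well. A classic sketch (adapted from the standard library documentation for the bisect module) maps a numeric score onto a grade through a sorted tuple of breakpoints:

```python
import bisect


def grade(score, breakpoints=(60, 70, 80, 90), grades='FDCBA'):
    # bisect() returns the insertion point, which doubles as
    # an index into the grades string
    i = bisect.bisect(breakpoints, score)
    return grades[i]


print([grade(score) for score in (33, 99, 77, 70)])  # ['F', 'A', 'C', 'C']
```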
