
Core collections

Before we can look at the more advanced combined collections later in this chapter, you need to understand the workings of the core Python collections. This is not just about the usage, however; it is also about the time complexities involved, which can strongly affect how your application will behave as it grows. If you are well versed with the time complexities of these objects and know the possibilities of Python 3's tuple packing and unpacking by heart, then feel free to jump to the Advanced collections section.

list – a mutable list of items

The list is most likely the container structure that you've used most in Python. It is simple in its usage, and for most cases, it exhibits great performance.

While you may already be well versed in the usage of list, you might not be aware of the time complexities of the list object. Luckily, many of the time complexities of list are very low; append, get, set, and len all take O(1) time, the best possible. However, you might not be aware that remove and insert have O(n) time complexity. So, to delete a single item from 1,000 items, Python has to walk through up to all 1,000 items. Internally, the remove and insert operations execute something along these lines:

>>> def remove(items, value):
...     new_items = []
...     found = False
...     for item in items:
...         # Skip the first item which is equal to value
...         if not found and item == value:
...             found = True
...             continue
...         new_items.append(item)
...
...     if not found:
...         raise ValueError('list.remove(x): x not in list')
...
...     return new_items


>>> def insert(items, index, value):
...     new_items = []
...     for i, item in enumerate(items):
...         if i == index:
...             new_items.append(value)
...         new_items.append(item)
...     return new_items

>>> items = list(range(10))
>>> items
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> items = remove(items, 5)
>>> items
[0, 1, 2, 3, 4, 6, 7, 8, 9]

>>> items = insert(items, 2, 5)
>>> items
[0, 1, 5, 2, 3, 4, 6, 7, 8, 9]

To remove or insert a single item from/into the list, Python needs to copy the entire list, which is especially heavy with larger lists. When executing this only once, it is of course not all that bad. But when executing a large number of deletions, a filter or list comprehension is a much faster solution because, if properly structured, it needs to copy the list only once. For example, suppose we wish to remove a specific set of numbers from the list. We have quite a few options for this. The first is a solution using remove, followed by a list comprehension, and then comes a filter statement. Chapter 4, Functional Programming – Readability Versus Brevity, will explain list comprehensions and the filter statement in more detail. But first, let's check out the example:

>>> primes = set((1, 2, 3, 5, 7))

# Classic solution
>>> items = list(range(10))
>>> for prime in primes:
...     items.remove(prime)
>>> items
[0, 4, 6, 8, 9]

# List comprehension
>>> items = list(range(10))
>>> [item for item in items if item not in primes]
[0, 4, 6, 8, 9]

# Filter
>>> items = list(range(10))
>>> list(filter(lambda item: item not in primes, items))
[0, 4, 6, 8, 9]

The latter two are much faster for large lists of items because, besides copying the list only once, each membership test against the set takes O(1). To compare, with n=len(items) and m=len(primes), the first takes O(m*n) = 5*10 = 50 operations, whereas the latter two take O(n*1) = 10*1 = 10 operations.

Note

The first method is actually slightly better than that since n decreases during the loop. So, it's effectively 10+9+8+7+6=40, but this is an effect that is negligible enough to ignore. In the case of n=1000, that would be the difference between 1000+999+998+997+996=4990 and 5*1000=5000, which is negligible in most cases.
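To see the difference for yourself, a micro-benchmark along these lines can compare the two approaches. This is a hedged sketch using the standard timeit module; the exact timings (and even which variant wins for small m) depend on your machine, the list size, and where the removed items sit in the list.

```python
import timeit

# Shared setup: a set of values to remove and a list of 1,000 items.
setup = 'primes = {1, 2, 3, 5, 7}; items = list(range(1000))'

# Repeated remove() calls; we copy the list first so every timed run
# starts from the same state.
remove_loop = '''
copy = list(items)
for prime in primes:
    copy.remove(prime)
'''

# A single list comprehension with O(1) set membership tests.
comprehension = '[item for item in items if item not in primes]'

t_remove = timeit.timeit(remove_loop, setup=setup, number=100)
t_comprehension = timeit.timeit(comprehension, setup=setup, number=100)

print('remove loop:   %.6f seconds' % t_remove)
print('comprehension: %.6f seconds' % t_comprehension)
```

For larger sets of values to remove, the comprehension's advantage grows, since its cost stays O(n) while the remove loop scales with O(m*n).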

Of course, min, max, and in all take O(n) as well, but that is expected for a structure that is not optimized for these types of lookups.

They can be implemented like this:

>>> def in_(items, value):
...     for item in items:
...         if item == value:
...             return True
...     return False

>>> def min_(items):
...     current_min = items[0]
...     for item in items[1:]:
...         if current_min > item:
...             current_min = item
...     return current_min

>>> def max_(items):
...     current_max = items[0]
...     for item in items[1:]:
...         if current_max < item:
...             current_max = item
...     return current_max

>>> items = range(5)
>>> in_(items, 3)
True
>>> min_(items)
0
>>> max_(items)
4

These examples also make it clear that the in operator can return in O(1) time if you are lucky, but we count it as O(n) because the value might not be present, in which case every value needs to be checked.
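As a side note, if you can afford to keep the list sorted, membership tests need not be O(n). A minimal sketch using the standard-library bisect module, which performs a binary search in O(log n):

```python
import bisect

def sorted_in(items, value):
    # bisect_left returns the leftmost position where value could be
    # inserted while keeping the list sorted; if the item at that
    # position equals value, the value is present.
    index = bisect.bisect_left(items, value)
    return index < len(items) and items[index] == value

items = list(range(0, 100, 2))  # a sorted list of even numbers
print(sorted_in(items, 42))  # True
print(sorted_in(items, 43))  # False
```

The catch, of course, is that every insert must keep the list sorted, which still costs O(n) per insertion.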

dict – unsorted but a fast map of items

The dict has to be at least among the top three container structures you use in Python. It's fast, simple to use, and very effective. The average time complexity is exactly as you would expect—O(1) for get, set, and del—but there are cases where this is not true. A dict works by converting the key to a hash using the hash() function (which calls the __hash__ method of the object) and storing the item in a hash table. There are two problems with hash tables, however. The first and most obvious is that the items are stored in hash order, which appears essentially random in most cases (note that since CPython 3.6 dict happens to preserve insertion order, and since Python 3.7 this is a language guarantee, but the keys are still not sorted by value). The second problem with hash tables is that they can have hash collisions, and the result of a hash collision is that, in the worst case, all of the former operations can take O(n) instead. Hash collisions are not all that likely to occur, but they can occur, and if a large dict performs below par, that's the place to look.

Let's see how this actually works in practice. For the sake of this example, I will use the simplest hashing algorithm I can think of, which is the most significant digit of a number. So, for the case of 12345, it will return 1, and for 56789, it will return 5:

>>> def most_significant(value):
...     while value >= 10:
...         value //= 10
...     return value

>>> most_significant(12345)
1
>>> most_significant(99)
9
>>> most_significant(0)
0

Now we will emulate a dict using a list of lists with this hashing method. We know that our hashing method can only return numbers from 0 to 9, so we need only 10 buckets in our list. Now we will add a few values and show how a lookup such as key in collection might work:

>>> def add(collection, key, value):
...     index = most_significant(key)
...     collection[index].append((key, value))

>>> def contains(collection, key):
...     index = most_significant(key)
...     for k, v in collection[index]:
...         if k == key:
...             return True
...     return False

# Create the collection of 10 lists
>>> collection = [[], [], [], [], [], [], [], [], [], []]

# Add some items, using key/value pairs
>>> add(collection, 123, 'a')
>>> add(collection, 456, 'b')
>>> add(collection, 789, 'c')
>>> add(collection, 101, 'c')

# Look at the collection
>>> collection
[[], [(123, 'a'), (101, 'c')], [], [],
 [(456, 'b')], [], [], [(789, 'c')], [], []]

# Check if the contains works correctly
>>> contains(collection, 123)
True
>>> contains(collection, 1)
False

This code is obviously not identical to the real dict implementation, but it is actually quite similar internally. Because we can simply index bucket 1 for the key 123, we have only O(1) lookup costs in the general case. However, since both keys, 123 and 101, end up in bucket 1, the runtime can actually increase to O(n) in the worst case, where all keys have the same hash. That is what we call a hash collision.

Tip

To debug hash collisions, you can use the hash() function paired with the Counter collection, discussed in the Counter – keeping track of the most occurring elements section.
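A minimal sketch of that tip, under the assumption of a hypothetical table size of 8: reduce each key's hash modulo the table size and count how many keys land in each bucket; any count above 1 indicates colliding keys.

```python
from collections import Counter

# Example keys; string hashes are randomized per interpreter run
# (PYTHONHASHSEED), so the actual bucket distribution will vary.
keys = ['spam', 'eggs', 'bacon', 'ham', 'sausage']

# Count keys per bucket for an assumed table size of 8
buckets = Counter(hash(key) % 8 for key in keys)

# Buckets holding more than one key contain colliding keys
collisions = {bucket: count for bucket, count in buckets.items()
              if count > 1}
print(collisions)
```

A heavily skewed distribution here is a strong hint that your objects' __hash__ implementation is the cause of subpar dict performance.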

In addition to the hash collision performance problem, there is another behavior that might surprise you. When you delete items from a dictionary, it is not immediately resized in memory. The result is that both copying and iterating over the entire dictionary take O(m) time, where m is the maximum size the dictionary has reached; n, the current number of items, is not what matters. So, if you add 1,000 items to a dict and remove 999, iterating and copying will still take 1,000 steps. The only way to work around this issue is by recreating the dictionary, which is something that both the copy and insert operations may perform internally. Note that a resize during an insert operation is not guaranteed and depends on the number of free slots available internally.
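A short sketch of the workaround, relying on CPython behavior (sys.getsizeof reports the dict's internal table size, which deletions alone do not shrink):

```python
import sys

# Build a large dict, then delete almost all of its items; the
# internal hash table keeps its full size after the deletions.
data = {i: i for i in range(1000)}
for i in range(999):
    del data[i]

size_before = sys.getsizeof(data)
data = dict(data)  # recreate the dict to reclaim the unused slots
size_after = sys.getsizeof(data)

print('before: %d bytes, after: %d bytes' % (size_before, size_after))
```

After the recreation, iteration and copying once again scale with the single remaining item instead of the thousand slots the old table held.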

set – like a dict without values

A set is a structure that uses the hash method to get a unique collection of values. Internally, it is very similar to a dict, with the same hash collision problem, but there are a few handy features of set that need to be shown:

# All output in the table below is generated using this function
>>> def print_set(expression, set_):
...     'Print set as a string sorted by letters'
...     print(expression, ''.join(sorted(set_)))

>>> spam = set('spam')
>>> print_set('spam:', spam)
spam: amps

>>> eggs = set('eggs')
>>> print_set('eggs:', eggs)
eggs: egs

So far, the output is pretty much as expected. Things get interesting once we reach the set operators.

One useful example for set operations is calculating the differences between two objects. For example, let's assume we have two lists:

  • current_users: The current users in a group
  • new_users: The new list of users in a group

In permission systems, this is a very common scenario—mass adding and/or removing users from a group. Within many permission databases, it's not easily possible to set the entire list at once, so you need a list to insert and a list to delete. This is where set comes in really handy:

# The set function takes a sequence as argument, so the double
# parentheses are required.
>>> current_users = set((
... 'a',
... 'b',
... 'd',
... ))

>>> new_users = set((
... 'b',
... 'c',
... 'd',
... 'e',
... ))

>>> to_insert = new_users - current_users
>>> sorted(to_insert)
['c', 'e']
>>> to_delete = current_users - new_users
>>> sorted(to_delete)
['a']
>>> unchanged = new_users & current_users
>>> sorted(unchanged)
['b', 'd']

Now we have lists of all users who were added, removed, and unchanged. Note that sorted is only needed for consistent output, since a set, similar to a dict, has no predefined sort order.
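A related operator worth knowing in this scenario is the symmetric difference (^), which yields the users appearing in exactly one of the two sets, that is, everyone whose membership changed. A short sketch reusing the sets from above:

```python
current_users = {'a', 'b', 'd'}
new_users = {'b', 'c', 'd', 'e'}

# Symmetric difference: users added or removed, but not unchanged
changed = current_users ^ new_users

# sorted is only for consistent output; sets have no defined order
print(sorted(changed))  # ['a', 'c', 'e']
```

This is equivalent to (to_insert | to_delete), computed in a single expression.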

tuple – the immutable list

A tuple is an object that you use very often without even noticing it. When you look at it initially, it seems like a useless data structure. It's like a list that you can't modify, so why not just use a list? There are a few cases where a tuple offers some really useful functionalities that a list does not.

Firstly, they are hashable. This means that you can use a tuple as a key in a dict, which is something a list can't do:

>>> spam = 1, 2, 3
>>> eggs = 4, 5, 6

>>> data = dict()
>>> data[spam] = 'spam'
>>> data[eggs] = 'eggs'

>>> import pprint # Using pprint for consistent and sorted output
>>> pprint.pprint(data)
{(1, 2, 3): 'spam', (4, 5, 6): 'eggs'}

However, it can actually be more than simple numbers. As long as all elements of a tuple are hashable, it will work. This means that you can use nested tuples, strings, numbers, and anything else for which the hash() function returns a consistent result:

>>> spam = 1, 'abc', (2, 3, (4, 5)), 'def'
>>> eggs = 4, (spam, 5), 6

>>> data = dict()
>>> data[spam] = 'spam'
>>> data[eggs] = 'eggs'
>>> import pprint # Using pprint for consistent and sorted output
>>> pprint.pprint(data)
{(1, 'abc', (2, 3, (4, 5)), 'def'): 'spam',
 (4, ((1, 'abc', (2, 3, (4, 5)), 'def'), 5), 6): 'eggs'}

You can make these as complex as you need. As long as all the parts are hashable, it will function as expected.
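Conversely, a tuple stops being hashable as soon as it contains an unhashable element such as a list; trying to use it as a dict key then raises a TypeError. A small sketch demonstrating this:

```python
data = dict()
bad_key = (1, [2, 3])  # the nested list makes the tuple unhashable

try:
    data[bad_key] = 'value'
    outcome = 'stored'
except TypeError:
    # hash() fails on the nested list, so the key is rejected
    outcome = 'unhashable'

print(outcome)  # unhashable
```

So hashability is a property of the tuple's contents, not of the tuple type itself.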

Perhaps even more useful is the fact that tuples also support tuple packing and unpacking:

# Assign using tuples on both sides
>>> a, b, c = 1, 2, 3
>>> a
1

# Assign a tuple to a single variable
>>> spam = a, (b, c)
>>> spam
(1, (2, 3))

# Unpack a tuple to two variables
>>> a, b = spam
>>> a
1
>>> b
(2, 3)

In addition to regular packing and unpacking, from Python 3 onwards, we can actually pack and unpack objects with a variable number of items:

# Unpack with variable length objects which actually assigns as a
# list, not a tuple
>>> spam, *eggs = 1, 2, 3, 4
>>> spam
1
>>> eggs
[2, 3, 4]

# Which can be unpacked as well of course
>>> a, b, c = eggs
>>> c
4

# This works for ranges as well
>>> spam, *eggs = range(10)
>>> spam
0
>>> eggs
[1, 2, 3, 4, 5, 6, 7, 8, 9]

# Which works both ways
>>> a
2
>>> a, b, *c = a, *eggs
>>> a, b
(2, 1)
>>> c
[2, 3, 4, 5, 6, 7, 8, 9]

This packing and unpacking can be applied in many cases, even for function arguments:

>>> def eggs(*args):
... print('args:', args)

>>> eggs(1, 2, 3)
args: (1, 2, 3)
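The star syntax also works in the call direction: a sequence can be unpacked into separate positional arguments. A short sketch (this eggs function mirrors the one above, but returns args so the result is visible):

```python
def eggs(*args):
    # Pack all positional arguments into a tuple and return it
    return args

numbers = [1, 2, 3]
result = eggs(*numbers)  # equivalent to eggs(1, 2, 3)
print(result)  # (1, 2, 3)
```

This makes it trivial to forward a collected list of arguments to another function unchanged.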

And it's equally useful for returning multiple values from a function:

>>> def spam_eggs():
... return 'spam', 'eggs'

>>> spam, eggs = spam_eggs()
>>> print('spam: %s, eggs: %s' % (spam, eggs))
spam: spam, eggs: eggs