- Mastering Python
- Rick van Hattem
Core collections
Before we can look at the more advanced combined collections later in this chapter, you need to understand the workings of the core Python collections. This is not just about the usage, however; it is also about the time complexities involved, which can strongly affect how your application will behave as it grows. If you are well versed with the time complexities of these objects and know the possibilities of Python 3's tuple packing and unpacking by heart, then feel free to jump to the Advanced collections section.
list – a mutable list of items
The list is most likely the container structure that you've used most in Python. It is simple in its usage and, for most cases, it exhibits great performance.
While you may already be well versed in the usage of list, you might not be aware of the time complexities of the list object. Luckily, many of the operations on list are very cheap: append, get, set, and len all take O(1) time, the best possible. However, you might not be aware that remove and insert have O(n) time complexity. So, to delete a single item out of 1,000 items, Python may have to walk through 1,000 items. Internally, the remove and insert operations execute something along these lines:
>>> def remove(items, value):
...     new_items = []
...     found = False
...     for item in items:
...         # Skip the first item which is equal to value
...         if not found and item == value:
...             found = True
...             continue
...         new_items.append(item)
...
...     if not found:
...         raise ValueError('list.remove(x): x not in list')
...
...     return new_items

>>> def insert(items, index, value):
...     new_items = []
...     for i, item in enumerate(items):
...         if i == index:
...             new_items.append(value)
...         new_items.append(item)
...     if index >= len(items):
...         # Mimic list.insert, which appends when the index
...         # is beyond the end of the list
...         new_items.append(value)
...     return new_items

>>> items = list(range(10))
>>> items
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> items = remove(items, 5)
>>> items
[0, 1, 2, 3, 4, 6, 7, 8, 9]
>>> items = insert(items, 2, 5)
>>> items
[0, 1, 5, 2, 3, 4, 6, 7, 8, 9]
To remove or insert a single item from/into the list, Python needs to shift every item after the affected index, which is especially heavy with larger lists. When executing this only once, it is of course not all that bad. But when executing a large number of deletions, a filter or a list comprehension is a much faster solution because, if properly structured, it needs to copy the list only once. For example, suppose we wish to remove a specific set of numbers from the list. We have quite a few options for this: the first is a solution using remove, followed by a list comprehension, and then a filter call. Chapter 4, Functional Programming – Readability Versus Brevity, explains list comprehensions and the filter function in more detail. But first, let's check out the example:
>>> primes = set((1, 2, 3, 5, 7))

# Classic solution
>>> items = list(range(10))
>>> for prime in primes:
...     items.remove(prime)
>>> items
[0, 4, 6, 8, 9]

# List comprehension
>>> items = list(range(10))
>>> [item for item in items if item not in primes]
[0, 4, 6, 8, 9]

# Filter
>>> items = list(range(10))
>>> list(filter(lambda item: item not in primes, items))
[0, 4, 6, 8, 9]
The latter two are much faster for large lists of items because each item is visited only once, and the membership test against a set is O(1). To compare, using n=len(items) and m=len(primes), the first takes O(m*n)=5*10=50 operations, whereas the latter two take O(n*1)=10*1=10 operations.
Note
The first method is actually slightly better than that, since n decreases during the loop. So, it's effectively 10+9+8+7+6=40, but this effect is negligible enough to ignore. In the case of n=1000, that would be the difference between 1000+999+998+997+996=4990 and 5*1000=5000, which is negligible in most cases.
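To make the O(m*n) versus O(n) difference concrete, here is a quick (and admittedly rough) benchmark using the timeit module; the list size and the number of repetitions are arbitrary choices for this sketch:

```python
import timeit

# Shared setup: 10,000 items and roughly 1,400 values to remove
setup = ('items = list(range(10000));'
         'primes = set(range(0, 10000, 7))')

# Classic solution: repeated list.remove, O(m * n)
classic = timeit.timeit(
    'tmp = items[:]\n'
    'for prime in primes:\n'
    '    tmp.remove(prime)',
    setup=setup, number=10)

# List comprehension: a single O(n) pass over the items
comprehension = timeit.timeit(
    '[item for item in items if item not in primes]',
    setup=setup, number=10)

print('classic: %.4fs, comprehension: %.4fs' % (classic, comprehension))
```

The absolute timings depend on your machine, but the comprehension should finish orders of magnitude faster than the repeated remove calls.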
Of course, min, max, and in all take O(n) as well, but that is expected for a structure that is not optimized for these types of lookups.
They can be implemented like this:
>>> def in_(items, value):
...     for item in items:
...         if item == value:
...             return True
...     return False

>>> def min_(items):
...     current_min = items[0]
...     for item in items[1:]:
...         if current_min > item:
...             current_min = item
...     return current_min

>>> def max_(items):
...     current_max = items[0]
...     for item in items[1:]:
...         if current_max < item:
...             current_max = item
...     return current_max

>>> items = range(5)
>>> in_(items, 3)
True
>>> min_(items)
0
>>> max_(items)
4
With these examples, it's also obvious that the in operator could finish in O(1) if you're lucky, but we count it as O(n) because the item might not exist, in which case all values need to be checked.
dict – unsorted but a fast map of items
The dict has to be at least among the top three container structures you use in Python. It's fast, simple to use, and very effective. The average time complexity is exactly as you would expect: O(1) for get, set, and del. But there are cases where this is not true. The way a dict works is by converting the key to a hash using the hash function (which calls the __hash__ method of the object) and storing it in a hash table. There are two problems with hash tables, however. The first and most obvious is that the items are ordered by hash, which appears random in most cases. (Note that since CPython 3.6, and as a language guarantee since Python 3.7, dict preserves insertion order; even so, you should not treat it as a sorted structure.) The second problem with hash tables is that they can have hash collisions; the result of a hash collision is that, in the worst case, all the former operations can take O(n) instead. Hash collisions are not all that likely to occur, but they can occur, and if a large dict performs subpar, that's the place to look.
Let's see how this actually works in practice. For the sake of this example, I will use the simplest hashing algorithm I can think of: the most significant digit of a number. So, for 12345 it will return 1, and for 56789 it will return 5:
>>> def most_significant(value):
...     while value >= 10:
...         value //= 10
...     return value

>>> most_significant(12345)
1
>>> most_significant(99)
9
>>> most_significant(0)
0
Now we will emulate a dict using a list of lists with this hashing method. We know that our hashing method can only return numbers from 0 to 9, so we need only 10 buckets in our list. Now we will add a few values and show how a membership test such as spam in eggs might work:
>>> def add(collection, key, value):
...     index = most_significant(key)
...     collection[index].append((key, value))

>>> def contains(collection, key):
...     index = most_significant(key)
...     for k, v in collection[index]:
...         if k == key:
...             return True
...     return False

# Create the collection of 10 lists
>>> collection = [[], [], [], [], [], [], [], [], [], []]

# Add some items, using key/value pairs
>>> add(collection, 123, 'a')
>>> add(collection, 456, 'b')
>>> add(collection, 789, 'c')
>>> add(collection, 101, 'c')

# Look at the collection
>>> collection
[[], [(123, 'a'), (101, 'c')], [], [], [(456, 'b')],
 [], [], [(789, 'c')], [], []]

# Check if contains works correctly
>>> contains(collection, 123)
True
>>> contains(collection, 1)
False
This code is obviously not identical to the dict implementation, but it is actually quite similar internally. Because we can get bucket 1 for a key of 123 by simple indexing, we have only O(1) lookup costs in the general case. However, since both keys, 123 and 101, end up in bucket 1, the runtime can actually increase to O(n) in the worst case, where all keys have the same hash. That is what we call a hash collision.
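To trigger a collision in real Python code, we can force one ourselves. The BadHash class below is purely hypothetical, written for this demonstration only: it hashes every instance to the same value, so every dict operation on these keys has to fall back to comparing entries with __eq__ one by one:

```python
class BadHash:
    """Hypothetical key type that forces hash collisions."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # Every instance hashes to the same bucket
        return 1

    def __eq__(self, other):
        return self.value == other.value


# All 100 keys collide, so lookups degrade towards O(n)
collisions = {BadHash(i): i for i in range(100)}
print(collisions[BadHash(42)])  # 42
```

With a proper __hash__ (for example, hash(self.value)), the same dict would go back to O(1) lookups in the average case.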
In addition to the hash collision performance problem, there is another behavior that might surprise you. When deleting items from a dictionary, the dictionary is not immediately resized in memory. The result is that both copying and iterating the entire dictionary take O(m) time, where m is the maximum size the dictionary has reached; n, the current number of items, is not what matters. So, if you add 1,000 items to a dict and remove 999, iterating and copying will still take 1,000 steps. The only way to work around this issue is by recreating the dictionary, which is something that both the copy and insert operations may perform internally. Note that recreation during an insert operation is not guaranteed and depends on the number of free slots available internally.
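On CPython, this effect can be observed with sys.getsizeof, which includes the memory used by the hash table itself. The exact byte counts are an implementation detail and vary between versions, but the pattern below should hold:

```python
import sys

# Fill a dict with 1,000 items, then delete all but one
data = {index: index for index in range(1000)}
size_full = sys.getsizeof(data)

for index in range(999):
    del data[index]
size_after_delete = sys.getsizeof(data)

# Deleting did not shrink the hash table...
print(size_full, size_after_delete)

# ...but recreating the dictionary compacts it
compacted = dict(data)
print(sys.getsizeof(compacted))
```

The recreated dictionary is dramatically smaller, even though data and compacted hold the same single item.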
set – like a dict without values
A set is a structure that uses the hash method to store a unique collection of values. Internally, it is very similar to a dict, with the same hash collision problem, but there are a few handy features of set that need to be shown:
# All output below is generated using this function
>>> def print_set(expression, set_):
...     'Print set as a string sorted by letters'
...     print(expression, ''.join(sorted(set_)))

>>> spam = set('spam')
>>> print_set('spam:', spam)
spam: amps

>>> eggs = set('eggs')
>>> print_set('eggs:', eggs)
eggs: egs
The first few results are pretty much as expected. It gets more interesting with the set operators.
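The common set operators can be demonstrated directly with the same spam and eggs values; this sketch prints the results sorted, since a set has no defined order:

```python
spam = set('spam')   # {'s', 'p', 'a', 'm'}
eggs = set('eggs')   # {'e', 'g', 's'}

print(''.join(sorted(spam & eggs)))  # intersection: s
print(''.join(sorted(spam | eggs)))  # union: aegmps
print(''.join(sorted(spam - eggs)))  # in spam, not eggs: amp
print(''.join(sorted(eggs - spam)))  # in eggs, not spam: eg
print(''.join(sorted(spam ^ eggs)))  # in exactly one: aegmp
```

All of these operators run in time proportional to the sizes of the sets involved, far faster than the equivalent nested loops over lists.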

One useful example of set operations is calculating the differences between two objects. For example, let's assume we have two lists:

- current_users: the current users in a group
- new_users: the new list of users in a group

In permission systems, this is a very common scenario: mass adding and/or removing users from a group. Within many permission databases, it's not easily possible to set the entire list at once, so you need a list to insert and a list to delete. This is where set comes in really handy:
# The set function takes an iterable as its argument, so the
# double parentheses are needed (the inner pair creates a tuple)
>>> current_users = set((
...     'a',
...     'b',
...     'd',
... ))
>>> new_users = set((
...     'b',
...     'c',
...     'd',
...     'e',
... ))

>>> to_insert = new_users - current_users
>>> sorted(to_insert)
['c', 'e']
>>> to_delete = current_users - new_users
>>> sorted(to_delete)
['a']
>>> unchanged = new_users & current_users
>>> sorted(unchanged)
['b', 'd']
Now we have lists of all users who were added, removed, and unchanged. Note that sorted is only needed for consistent output, since a set, similar to a dict, has no predefined sort order.
tuple – the immutable list
A tuple is an object that you use very often without even noticing it. When you look at it initially, it seems like a useless data structure. It's like a list that you can't modify, so why not just use a list? There are a few cases where a tuple offers some really useful functionality that a list does not.
Firstly, tuples are hashable. This means that you can use a tuple as a key in a dict, which is something a list can't do:
>>> spam = 1, 2, 3
>>> eggs = 4, 5, 6
>>> data = dict()
>>> data[spam] = 'spam'
>>> data[eggs] = 'eggs'

# Using pprint for consistent and sorted output
>>> import pprint
>>> pprint.pprint(data)
{(1, 2, 3): 'spam', (4, 5, 6): 'eggs'}
However, keys can actually be more than simple numbers. As long as all elements of a tuple are hashable, it will work. This means that you can use nested tuples, strings, numbers, and anything else for which the hash() function returns a consistent result:
>>> spam = 1, 'abc', (2, 3, (4, 5)), 'def'
>>> eggs = 4, (spam, 5), 6
>>> data = dict()
>>> data[spam] = 'spam'
>>> data[eggs] = 'eggs'

# Using pprint for consistent and sorted output
>>> import pprint
>>> pprint.pprint(data)
{(1, 'abc', (2, 3, (4, 5)), 'def'): 'spam',
 (4, ((1, 'abc', (2, 3, (4, 5)), 'def'), 5), 6): 'eggs'}
You can make these as complex as you need. As long as all the parts are hashable, it will function as expected.
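Conversely, a tuple that contains a mutable element, such as a list, is not hashable, so it cannot be used as a dict key. A quick sketch of what happens:

```python
spam = (1, 2, [3, 4])  # the list inside makes the tuple unhashable

try:
    data = {spam: 'spam'}
except TypeError as error:
    # Raises: unhashable type: 'list'
    print('error:', error)
```

The same rule applies recursively: every element, at every nesting level, must be hashable for the tuple as a whole to be hashable.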
Perhaps even more useful is the fact that tuples also support packing and unpacking:
# Assign using tuples on both sides
>>> a, b, c = 1, 2, 3
>>> a
1

# Assign a tuple to a single variable
>>> spam = a, (b, c)
>>> spam
(1, (2, 3))

# Unpack a tuple to two variables
>>> a, b = spam
>>> a
1
>>> b
(2, 3)
In addition to regular packing and unpacking, from Python 3 onwards, we can actually pack and unpack objects with a variable number of items:
# Unpack with variable length objects, which actually assigns
# as a list, not a tuple
>>> spam, *eggs = 1, 2, 3, 4
>>> spam
1
>>> eggs
[2, 3, 4]

# Which can be unpacked as well, of course
>>> a, b, c = eggs
>>> c
4

# This works for ranges as well
>>> spam, *eggs = range(10)
>>> spam
0
>>> eggs
[1, 2, 3, 4, 5, 6, 7, 8, 9]

# Which works both ways
>>> a
2
>>> a, b, *c = a, *eggs
>>> a, b
(2, 1)
>>> c
[2, 3, 4, 5, 6, 7, 8, 9]
This very method can be applied in many cases, even for function arguments:
>>> def eggs(*args):
...     print('args:', args)

>>> eggs(1, 2, 3)
args: (1, 2, 3)
And it's equally useful for returning multiple values from a function:
>>> def spam_eggs():
...     return 'spam', 'eggs'

>>> spam, eggs = spam_eggs()
>>> print('spam: %s, eggs: %s' % (spam, eggs))
spam: spam, eggs: eggs
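Tuple unpacking also shows up in everyday code that never mentions tuples explicitly; for example, in a for loop header. The values here are made up purely for illustration:

```python
points = [(1, 2), (3, 4), (5, 6)]

total = 0
for x, y in points:  # each tuple is unpacked into x and y
    total += x * y

print(total)  # 1*2 + 3*4 + 5*6 = 44
```

The same pattern is what makes idioms such as for key, value in data.items() work when iterating over a dict.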