- Data Wrangling with Python
- Dr. Tirthajyoti Sarkar Shubhadeep Roychowdhury
- 3155字
- 2021-06-11 13:40:26
Advanced Data Structures
We will start this chapter by discussing advanced data structures. We will do that by revisiting lists. We will construct a stack and a queue, explore multiple element membership checking, and throw a bit of functional programming in for good measure. If all of this sounds intimidating, then do not worry. We will get to things step by step, like in the previous chapter, and you will feel confident once you have finished this chapter.
To start this chapter, you have to open an empty notebook. To do that, you can simply input the following command in a shell. It is advised that you first navigate to an empty directory using cd before you enter the command:
docker run -p 8888:8888 -v 'pwd':/notebooks -it rcshubhadeep/packt-data-wrangling-base:latest
Once the Docker container is running, point your browser to http://localhost:8888 and use dw_4_all as the passcode to access the notebook interface.
Iterator
We will start off this topic with lists. However, before we get into lists, we will introduce the concept of an iterator. An iterator is an object that implements the next method, meaning an iterator is an object that can iterate over a collection (lists, tuples, dicts, and so on). It is stateful, which means that each time we call the next method, it gives us the next element from the collection. And if there is no further element, then it raises a StopIteration exception.
Note
A StopIteration exception occurs with the iterator's next method when there are no further values to iterate.
If you are familiar with a programming language like C, C++, Java, JavaScript, or PHP, you may have noticed the difference between the for loop implementation in those languages, which consists of three distinct parts, precisely the initiation, the increment, and the termination condition, and the for loop in Python. In Python, we do not use that kind of for loop. What we use in Python is more like a foreach loop: for i in list_1. This is because, under the hood, the for loop is using an iterator, and thus we do not need to do all the extra steps. The iterator does this for us.
Exercise 15: Introduction to the Iterator
To generate lists of numbers, we can use different methods:
- Generate a list that will contain 10000000: ones:
big_list_of_numbers = [1 for x in range(0, 10000000)]
- Check the size of this variable:
from sys import getsizeof
getsizeof(big_list_of_numbers)
The value it will show you will be something around 81528056 (it is in bytes). This is a lot of memory! And the big_list_of_numbers variable is only available once the list comprehension is over. It can also overflow the available system memory if you try too big a number.
- Use an iterator to reduce memory utilization:
from itertools import repeat
small_list_of_numbers = repeat(1, times=10000000)
getsizeof(small_list_of_numbers)
The last line shows that our small_list_of_numbers is only 56 bytes in size. Also, it is a lazy method, as it did not generate all the elements. It will generate them one by one when asked, thus saving us time. In fact, if you omit the times keyword argument, then you can practically generate an infinite number of 1s.
- Loop over the newly generated iterator:
for i, x in enumerate(small_list_of_numbers):
print(x)
if i > 10:
break
We use the enumerate function so that we get the loop counter, along with the values. This will help us break once we reach a certain number of the counter (10 for example).
The output will be a list of 10 ones.
- To look up the definition of any function, type the function name, followed by a ? and press Shift + Enter in a Jupyter notebook. Run the following code to understand how we can use permutations and combinations with itertools:
from itertools import (permutations, combinations, dropwhile, repeat,
zip_longest)
permutations?
combinations?
dropwhile?
repeat?
zip_longest?
Stacks
A stack is a very useful data structure. If you know a bit about CPU internals and how a program gets executed, then you have an idea that a stack is present in many such cases. It is simply a list with one restriction, Last In First Out (LIFO), meaning an element that comes in last goes out first when a value is read from a stack. The following illustration will make this a bit clearer:

Figure 2.1: A stack with two insert elements and one pop operation
As you can see, we have a LIFO strategy to read values from a stack. We will implement a stack using a Python list. Python's lists have a method called pop, which does the exact same pop operation that you can see in the preceding illustration. We will use that to implement a stack.
Exercise 16: Implementing a Stack in Python
- First, define an empty stack:
stack = []
- Use the append method to add an element in the stack. Thanks to append, the element will be always appended at the end of the list:
stack.append(25)
stack
The output is as follows:
[25]
- Append another value to the stack:
stack.append(-12)
stack
The output is as follows:
[25, -12]
- Read a value from our stack using the pop method. This method reads at the current last index of the list and returns it to us. It also deletes the index once the read is done:
tos = stack.pop()tos
The output is as follows:
-12
After we execute the preceding code, we will have -12 in tos and the stack will have only one element in it, 25.
- Append hello to the stack:
stack.append("Hello")
stack
The output is as follows:
[25, 'Hello']
Imagine you are scraping a web page and you want to follow each URL that is present there. If you insert (append) them one by one in a stack, while you read the web page, and then pop them one by one and follow the link, then you have a clean and extendable solution to the problem. We will examine part of this task in the next exercise.
Exercise 17: Implementing a Stack Using User-Defined Methods
We will continue the topic of the stack from the last exercise. But this time, we will implement the append and pop functions by ourselves. The aim of this exercise is twofold. On one hand, we will implement the stack, and this time with a real-life example, which also involves knowledge of string methods and thus serves as a reminder of the last chapter and activity. On the other hand, it will show us a subtle feature of Python and how it handles passing list variables to functions, and will bring us to the next exercise, functional programming:
- First, we will define two functions, stack_push and stack_pop. We renamed them so that we do not have a namespace conflict. Also, create a stack called url_stack for later use:
def stack_push(s, value):
return s + [value]
def stack_pop(s):
tos = s[-1]
del s[-1]
return tos
url_stack = []
- The first function takes the already existing stack and adds the value at the end of it.
Note
Notice the square brackets around the value to convert it in to a one-element list for the sake of the + operation.
- The second one reads the value that's currently at the -1 index of the stack and then uses the del operator to delete that index, and finally returns the value it read earlier.
- Now, we are going to have a string with a few URLs in it. Our job is to analyze the string so that we push the URLs in the stack one by one as we encounter them, and then finally use a for loop to pop them one by one. Let's take the first line from the Wikipedia article about data science:
wikipedia_datascience = "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge [https://en.wikipedia.org/wiki/Knowledge] and insights from data [https://en.wikipedia.org/wiki/Data] in various forms, both structured and unstructured,similar to data mining [https://en.wikipedia.org/wiki/Data_mining]"
- For the sake of the simplicity of this exercise, we have kept the links in square brackets beside the target words.
- Find the length of the string:
len(wikipedia_datascience)
The output is as follows:
347
- Convert this string into a list by using the split method from the string and then calculate its length:
wd_list = wikipedia_datascience.split()
len(wd_list)
The output is as follows:
34
- Use a for loop to go over each word and check whether it is a URL. To do that, we will use the startswith method from the string, and if it is a URL, then we push it into the stack:
for word in wd_list:
if word.startswith("[https://"):
url_stack = stack_push(url_stack, word[1:-1])
# Notice the clever use of string slicing
- Print the value in url_stack:
url_stack
The output is as follows:
['https://en.wikipedia.org/wiki/Knowledge',
'https://en.wikipedia.org/wiki/Data',
'https://en.wikipedia.org/wiki/Data_mining']
- Iterate over the list and print the URLs one by one by using the stack_pop function:
for i in range(0, len(url_stack)):
print(stack_pop(url_stack))
The output is as follows:
Figure 2.2: Output of the URLs that are printed using a stack
- Print it again to make sure that the stack is empty after the final for loop:
print(url_stack)
The output is as follows:
[]
We have noticed a strange phenomenon in the stack_pop method. We passed the list variable there, and we used the del operator inside the function, but it changed the original variable by deleting the last index each time we call the function. If you are coming from a language like C, C++, and Java, then this is a completely unexpected behavior, as in those languages this can only happen if we pass the variable by reference and it can lead to subtle bugs in Python code. So be careful. In general, it is not a good idea to change a variable's value in place, meaning inside a function. Any variable that's passed to the function should be considered and treated as immutable. This is close to the principles of functional programming. A lambda expression in Python is a way to construct one-line, nameless functions that are, by convention, side effect-free.
Exercise 18: Lambda Expression
In this exercise, we will use a lambda expression to prove the famous trigonometric identity:

Figure 2.3 Trigonometric identity
- Import the math package:
import math
- Define two functions, my_sine and my_cosine. The reason we are declaring these functions is because the original sin and cos functions from the math package take radians as input, but we are more familiar with degrees. So, we will use a lambda expression to define a nameless one-line function and use it. This lambda function will automatically convert our degree input to radians and then apply sin or cos on it and return the value:
def my_sine():
return lambda x: math.sin(math.radians(x))
def my_cosine():
return lambda x: math.cos(math.radians(x))
- Define sine and cosine for our purpose:
sine = my_sine()
cosine = my_cosine()
math.pow(sine(30), 2) + math.pow(cosine(30), 2)
The output is as follows:
1.0
Notice that we have assigned the return value from both my_sine and my_cosine to two variables, and then used them directly as the functions. It is a much cleaner approach than using them explicitly. Notice that we did not explicitly write a return statement inside the lambda function. It is assumed.
Exercise 19: Lambda Expression for Sorting
The lambda expression will take an input and sort it according to the values in tuples. A lambda can take one or more inputs. A lambda expression can also be used to reverse sort by using the parameter of reverse as True:
- Imagine you're in a data wrangling job where you are confronted with the following list of tuples:
capitals = [("USA", "Washington"), ("India", "Delhi"), ("France", "Paris"), ("UK", "London")]
capitals
The output will be as follows:
[('USA', 'Washington'),
('India', 'Delhi'),
('France', 'Paris'),
('UK', 'London')]
- Sort this list by the name of the capitals of each country, using a simple lambda expression. Use the following code:
capitals.sort(key=lambda item: item[1])
capitals
The output will be as follows:
[('India', 'Delhi'),
('UK', 'London'),
('France', 'Paris'),
('USA', 'Washington')]
As we can see, lambda expressions are powerful if we master them and use them in our data wrangling jobs. They are also side effect-free, meaning that they do not change the values of the variables that are passed to them in place.
Exercise 20: Multi-Element Membership Checking
Here is an interesting problem. Let's imagine a list of a few words scraped from a text corpus you are working with:
- Create a list_of_words list with words scraped from a text corpus:
list_of_words = ["Hello", "there.", "How", "are", "you", "doing?"]
- Find out whether this list contains all the elements from another list:
check_for = ["How", "are"]
There exists an elaborate solution, which involves a for loop and few if-else conditions (and you should try to write it!), but there also exists an elegant Pythonic solution to this problem, which takes one line and uses the all function. The all function returns True if all elements of the iterable are true.
- Using the in keyword to check membership in the list list_of_words:
all(w in list_of_words for w in check_for)
The output is as follows:
True
It is indeed elegant and simple to reason about, and this neat trick is very important when dealing with lists.
Queue
Apart from stacks, another high-level data structure that we are interested in is queue. A queue is like a stack, meaning that you continue adding elements one by one. With a queue, the reading of elements obeys a FIFO (First In First Out) strategy. Check out the following diagram to understand this better:

Figure 2.4: Pictorial representation of a queue
We will accomplish this first using list methods and we will show you that for this purpose, it is inefficient. Then, we will learn about the dequeue data structure from the collection module of Python.
Exercise 21: Implementing a Queue in Python
- Create a Python queue with the plain list methods:
%%time
queue = []
for i in range(0, 100000):
queue.append(i)
print("Queue created")
The output is as follows:
Queue created
Wall time: 11 ms
- Use the pop function to empty the queue and check items in it:
for i in range(0, 100000):
queue.pop(0)
print("Queue emptied")
The output is as follows:
Queue emptied
If we use the %%time magic command while executing the preceding code, we will see that it takes a while to finish. In a modern MacBook, with a quad-core processor and 8 GB RAM, it took around 1.20 seconds to finish. This time is taken because of the pop(0) operation, which means every time we pop a value from the left of the list (which is the current 0 index), Python has to rearrange all the other elements of the list by shifting them one space left. Indeed, it is not a very optimized implementation.
- Implement the same queue using the deque data structure from Python's collection package:
%%time
from collections import deque
queue2 = deque()
for i in range(0, 100000):
queue2.append(i)
print("Queue created")
for i in range(0, 100000):
queue2.popleft()
print("Queue emptied")
The output is as follows:
Queue created
Queue emptied
Wall time: 23 ms
- With the specialized and optimized queue implementation from Python's standard library, the time that's taken for this operation is only in the range of 28 milliseconds! This is a huge improvement on the previous one.
A queue is a very important data structure. To give one example from real life, we can think about a producer-consumer system design. While doing data wrangling, you will often come across a problem where you must process very big files. One of the ways to deal with this problem is to chunk the contents of the file in to smaller parts and then push them in to a queue while creating small, dedicated worker processes, which reads off the queue and processes one small chunk at a time. This is a very powerful design, and you can even use it efficiently to design huge multi-node data wrangling pipelines.
We will end the discussion on data structures here. What we discussed here is just the tip of the iceberg. Data structures are a fascinating subject. There are many other data structures that we did not touch and which, when used efficiently, can offer enormous added value. We strongly encourage you to explore data structures more. Try to learn about linked lists, tree, graph, trie, and all the different variations of them as much as you can. Not only do they offer the joy of learning, but they are also the secret mega weapons in the arsenal of a data practitioner that you can bring out every time you are challenged with a difficult data wrangling job.
Activity 3: Permutation, Iterator, Lambda, List
In this activity, we will be using permutations to generate all possible three-digit numbers that can be generated using 0, 1, and 2. Then, loop over this iterator, and also use isinstance and assert to make sure that the return types are tuples. Also, use a single line of code involving dropwhile and lambda expressions to convert all the tuples to lists while dropping any leading zeros (for example, (0, 1, 2) becomes [1, 2]). Finally, write a function that takes a list like before and returns the actual number contained in it.
These steps will guide you to solve this activity:
- Look up the definition of permutations and dropwhile from itertools.
- Write an expression to generate all the possible three-digit numbers using 0, 1, and 2.
- Loop over the iterator expression you generated before. Print each element that's returned by the iterator. Use assert and isinstance to make sure that the elements are of the tuple type.
- Write the loop again using dropwhile with a lambda expression to drop any leading zeros from the tuples. As an example, (0, 1, 2) will become [0, 2]. Also, cast the output of dropwhile to a list.
- Check the actual type that dropwhile returns.
- Combine the preceding code into one block, and this time write a separate function where you will pass the list generated from dropwhile, and the function will return the whole number contained in the list. As an example, if you pass [1, 2] to the function, it will return 12. Make sure that the return type is indeed a number and not a string. Although this task can be achieved using other tricks, we require that you treat the incoming list as a stack in the function and generate the number by reading the individual digits from the stack.
With this activity, we have finished this topic and we will head over to the next topic, which involves basic file-level operations. But before we leave this topic, we encourage you to think about a solution to the preceding problem without using all the advanced operations and data structures we have used here. You will soon realize how complex the naive solution is, and how much value these data structures and operations bring.
Note
The solution for this activity can be found on page 289.