- The Data Wrangling Workshop
- Brian Lipp Shubhadeep Roychowdhury Dr. Tirthajyoti Sarkar
- 2021-06-18 18:11:51
Advanced Mathematical Operations
Generating numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy functions. The arange function creates a series of numbers from a start value up to (but not including) a stop value, using the step size you specify. Another function, linspace, creates a fixed number of evenly spaced points between two extremes, with both extremes included by default.
In the next exercise, we are going to create a list and then convert that into a NumPy array. We will then show you how to perform some advanced mathematical operations on that array.
Exercise 3.04: Advanced Mathematical Operations on NumPy Arrays
In this exercise, we'll practice using some of the built-in mathematical functions of the NumPy library. Here, we are going to be creating a list and converting it into a NumPy array. Then, we will perform some advanced mathematical operations on that array. Let's go through the following steps:
Note
We're going to use the numbers.csv file in this exercise, which can be found here: https://packt.live/30Om2wC.
- Import the pandas library and read from the numbers.csv file using pandas. Then, convert it into a list:
import pandas as pd
df = pd.read_csv("../datasets/numbers.csv")
list_5 = df.values.tolist()
list_5
Note
Don't forget to change the path (highlighted) based on the location of the file on your system.
The output (partially shown) is as follows:
Figure 3.1: Partial output of the .csv file
- Convert the list into a NumPy array by using the following command:
import numpy as np
array_5 = np.array(list_5)
array_5
The output (partially shown) is as follows:
Figure 3.2: Partial output of the NumPy array
- Find the sine value of the array by using the following command:
# sine function
np.sin(array_5)
The output (partially shown) is as follows:
Figure 3.3: Partial output of the sine value
- Find the logarithmic value of the array by using the following command:
# logarithm
np.log(array_5)
The output (partially shown) is as follows:
Figure 3.4: Partial output of the logarithmic array
- Find the exponential value of the array by using the following command:
# Exponential
np.exp(array_5)
The output (partially shown) is as follows:
Figure 3.5: Partial output of the exponential array
As we can see, advanced mathematical operations are fairly easy to perform on a NumPy array using the built-in methods.
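As a quick check of how these functions behave, here is a minimal sketch that applies them to a small hand-made array instead of numbers.csv. The array values and the variable names are illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Small illustrative array (not the data from numbers.csv)
x = np.array([1.0, 2.0, 3.0])

sines = np.sin(x)   # element-wise sine (input is in radians)
logs = np.log(x)    # element-wise natural logarithm
exps = np.exp(x)    # element-wise e**x

# log and exp are inverses, so this recovers x up to rounding error
round_trip = np.log(np.exp(x))
print(sines, logs, exps, round_trip)
```

Each function returns a new array of the same shape as its input, so the results can be chained or compared directly.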
Note
To access the source code for this specific section, please refer to https://packt.live/37NIyrf.
You can also run this example online at https://packt.live/3eh0Xz6.
Exercise 3.05: Generating Arrays Using arange and linspace Methods
This exercise will demonstrate how we can create a series of numbers using the arange method. To make the list linearly spaced, we're going to use the linspace method. To do so, let's go through the following steps:
- Import the NumPy library and create a series of numbers using the arange method using the following command:
import numpy as np
np.arange(5,16)
The output is as follows:
array([ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
- Print numbers using the arange function by using the following command:
print("Numbers spaced apart by 2: ",\
np.arange(0,11,2))
print("Numbers spaced apart by a floating point number: ",\
np.arange(0,11,2.5))
print("Every 5th number from 30 in reverse order\n",\
np.arange(30,-1,-5))
The output is as follows:
Numbers spaced apart by 2: [ 0 2 4 6 8 10]
Numbers spaced apart by a floating point number:
[ 0. 2.5 5. 7.5 10. ]
Every 5th number from 30 in reverse order
[30 25 20 15 10 5 0]
- For linearly spaced numbers, we can use the linspace method, as follows:
print("11 linearly spaced numbers between 1 and 5: ",\
np.linspace(1,5,11))
The output is as follows:
11 linearly spaced numbers between 1 and 5:
[1. 1.4 1.8 2.2 2.6 3. 3.4 3.8 4.2 4.6 5. ]
As we can see, the linspace method helps us in creating linearly spaced elements in an array.
Note
To access the source code for this specific section, please refer to https://packt.live/2YOZGsy.
You can also run this example online at https://packt.live/3ddPcYG.
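The difference between the two functions is easiest to see side by side. The following sketch (with illustrative bounds chosen for this example) shows that arange works from a step size and excludes the stop value, while linspace works from a point count and includes the stop value by default. The retstep argument of linspace is used here to report the spacing it computed.

```python
import numpy as np

# arange: fixed step, the stop value is excluded
by_step = np.arange(0, 1, 0.25)

# linspace: fixed number of points, the stop value is included by default
by_count = np.linspace(0, 1, 5)

# retstep=True also returns the spacing that linspace computed
points, step = np.linspace(1, 5, 11, retstep=True)
print(by_step, by_count, step)
```

As a rule of thumb, arange fits cases where the step matters, and linspace fits cases where the number of points matters.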
So far, we have only created one-dimensional arrays. Now, let's create some multi-dimensional arrays (such as a matrix in linear algebra).
Exercise 3.06: Creating Multi-Dimensional Arrays
In this exercise, just like we created the one-dimensional array from a simple flat list, we will create a two-dimensional array from a list of lists.
Note
This exercise will use the numbers2.csv file, which can be found at https://packt.live/2V8EQTZ.
Let's go through the following steps:
- Import the necessary Python libraries, load the numbers2.csv file, and convert it into a two-dimensional NumPy array by using the following commands:
import pandas as pd
import numpy as np
df = pd.read_csv("../datasets/numbers2.csv",\
header=None)
list_2D = df.values
mat1 = np.array(list_2D)
print("Type/Class of this object:",\
type(mat1))
print("Here is the matrix\n----------\n",\
mat1, "\n----------")
Note
Don't forget to change the path (highlighted) based on the location of the file on your system.
The output is as follows:
Type/Class of this object: <class 'numpy.ndarray'>
Here is the matrix
----------
[[1 2 3]
[4 5 6]
[7 8 9]]
----------
- Tuples can be converted into multi-dimensional arrays by using the following code:
tuple_2D = [(1.5,2,3), (4,5,6)]
mat_tuple = np.array(tuple_2D)
print(mat_tuple)
The output is as follows:
[[1.5 2. 3. ]
[4. 5. 6. ]]
Thus, we have created multi-dimensional arrays using Python lists and tuples.
Note
To access the source code for this specific section, please refer to https://packt.live/30RjJcc.
You can also run this example online at https://packt.live/30QiIBm.
Now, let's determine the dimension, shape, size, and data type of the two-dimensional array.
Exercise 3.07: The Dimension, Shape, Size, and Data Type of Two-dimensional Arrays
This exercise will demonstrate a few methods that will let you check the dimension, shape, and size of the array.
Note
The numbers2.csv file can be found at https://packt.live/2V8EQTZ.
Note that if it's a 3x2 matrix, that is, it has 3 rows and 2 columns, then the shape will be (3,2), but the size will be 6, as in 6 = 3x2. To learn how to find out the dimensions of an array in Python, let's go through the following steps:
- Import the necessary Python modules and load the numbers2.csv file:
import pandas as pd
import numpy as np
df = pd.read_csv("../datasets/numbers2.csv",\
header=None)
list_2D = df.values
mat1 = np.array(list_2D)
Note
Don't forget to change the path (highlighted) based on the location of the file on your system.
- Print the number of dimensions of the matrix using the ndim attribute:
print("Dimension of this matrix: ", mat1.ndim,sep='')
The output is as follows:
Dimension of this matrix: 2
- Print the size (the total number of elements) using the size attribute:
print("Size of this matrix: ", mat1.size,sep='')
The output is as follows:
Size of this matrix: 9
- Print the shape of the matrix using the shape attribute:
print("Shape of this matrix: ", mat1.shape,sep='')
The output is as follows:
Shape of this matrix: (3, 3)
- Print the data type of the elements using the dtype attribute:
print("Data type of this matrix: ", mat1.dtype,sep='')
The output is as follows:
Data type of this matrix: int64
In this exercise, we looked at the various utility attributes available for inspecting an array. We used ndim, shape, size, and dtype to look at the number of dimensions, shape, element count, and data type of the array.
Note
To access the source code for this specific section, please refer to https://packt.live/30PVEm1.
You can also run this example online at https://packt.live/3ebSsoG.
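The 3x2 case mentioned at the start of the exercise can be checked directly. This is a minimal sketch with values chosen only for illustration; the variable name m is an assumption.

```python
import numpy as np

# A 3x2 matrix: 3 rows, 2 columns (values chosen only for illustration)
m = np.array([[1, 2], [3, 4], [5, 6]])

print(m.ndim)    # number of axes
print(m.shape)   # (rows, columns)
print(m.size)    # total number of elements = rows * columns
print(m.dtype)   # element type; the exact integer width is platform dependent
```

Note that these are attributes, not functions, so they are read without parentheses.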
Now that we are familiar with basic vector (one-dimensional) and matrix data structures in NumPy, we will be able to create special matrices with ease. Often, you may have to create matrices filled with zeros, ones, random numbers, or ones in a diagonal fashion. An identity matrix is a square matrix with ones on the main diagonal (from the top left to the bottom right) and zeros everywhere else.
Exercise 3.08: Zeros, Ones, Random, Identity Matrices, and Vectors
In this exercise, we will be creating a vector of zeros and a matrix of zeros using the zeros function of the NumPy library. Then, we'll create a matrix of fives using the ones function, followed by generating an identity matrix using the eye function. We will also work with the randint function, where we'll create a matrix filled with random values. To do this, let's go through the following steps:
- Print the vector of zeros by using the following command:
import numpy as np
print("Vector of zeros: ",np.zeros(5))
The output is as follows:
Vector of zeros: [0. 0. 0. 0. 0.]
- Print the matrix of zeros by using the following command:
print("Matrix of zeros: ",np.zeros((3,4)))
The output is as follows:
Matrix of zeros: [[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
- Print the matrix of fives by using the following command:
print("Matrix of 5's: ",5*np.ones((3,3)))
The output is as follows:
Matrix of 5's: [[5. 5. 5.]
[5. 5. 5.]
[5. 5. 5.]]
- Print an identity matrix by using the following command:
print("Identity matrix of dimension 2:",np.eye(2))
The output is as follows:
Identity matrix of dimension 2: [[1. 0.]
[0. 1.]]
- Print an identity matrix with a dimension of 4x4 by using the following command:
print("Identity matrix of dimension 4:",np.eye(4))
The output is as follows:
Identity matrix of dimension 4: [[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
- Print a matrix of random integers with a shape of (4,3) using the randint function:
print("Random matrix of shape (4,3):\n",\
np.random.randint(low=1,high=10,size=(4,3)))
The sample output is as follows:
Random matrix of shape (4,3):
[[6 7 6]
[5 6 7]
[5 3 6]
[2 9 4]]
As we can see from the preceding output, a matrix of shape (4,3) was generated and filled with random integers.
Note
When creating matrices, you need to pass on tuples of integers as arguments. The output is susceptible to change since we have used random numbers.
To access the source code for this specific section, please refer to https://packt.live/2UROs5f.
You can also run this example online at https://packt.live/37J5hV9.
Random number generation is a very useful utility and needs to be mastered for data science/data wrangling tasks. We will look at the topic of random variables and distributions again in the section on statistics and learn how NumPy and pandas have built-in random number and series generation, as well as manipulation functions.
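Because random output changes on every run, it is often helpful to make it repeatable while learning. The following sketch uses NumPy's default_rng generator with a fixed seed. This is an alternative to the np.random.randint call used in the exercise, and the seed value 42 is an arbitrary choice.

```python
import numpy as np

# Seeding makes the "random" numbers repeatable (seed 42 is an arbitrary choice)
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# integers() excludes the high bound by default, like np.random.randint
m_a = rng_a.integers(low=1, high=10, size=(4, 3))
m_b = rng_b.integers(low=1, high=10, size=(4, 3))

print(m_a)
print(np.array_equal(m_a, m_b))  # same seed, same matrix
```

Two generators created with the same seed produce the same sequence, which makes results easier to compare between runs.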
Reshaping an array is a very useful operation for vectors as machine learning algorithms may demand input vectors in various formats for mathematical manipulation. In this section, we will be looking at how reshaping can be done on an array. The opposite of reshape is the ravel function, which flattens any given array into a one-dimensional array. It is a very useful action in many machine learning and data analytics tasks.
Exercise 3.09: Reshaping, Ravel, Min, Max, and Sorting
In this exercise, we will generate a random one-dimensional vector of two-digit numbers and then reshape the vector into multi-dimensional vectors. Let's go through the following steps:
- Create an array of 30 random integers (sampled from 1 to 99) and reshape it into two different forms using the following code:
import numpy as np
a = np.random.randint(1,100,30)
b = a.reshape(2,3,5)
c = a.reshape(6,5)
- Print the shape using the shape function by using the following code:
print ("Shape of a:", a.shape)
print ("Shape of b:", b.shape)
print ("Shape of c:", c.shape)
The output is as follows:
Shape of a: (30,)
Shape of b: (2, 3, 5)
Shape of c: (6, 5)
- Print the arrays a, b, and c using the following code:
print("\na looks like\n",a)
print("\nb looks like\n",b)
print("\nc looks like\n",c)
The sample output is as follows:
a looks like
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98
38 23 75 50 57 32 69 34 59 98 48]
b looks like
[[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]]
[[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]]
c looks like
[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]
[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]
Note
b is a three-dimensional array – a kind of list of a list of a list. The output is susceptible to change since we have used random numbers.
- Flatten array b into a one-dimensional array with the ravel function using the following code:
b_flat = b.ravel()
print(b_flat)
The sample output is as follows (the output may be different in each iteration):
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38
23 75 50 57 32 69 34 59 98 48]
Note
To access the source code for this specific section, please refer to https://packt.live/2Y6KYh8.
You can also run this example online at https://packt.live/2N4fDFs.
In this exercise, you learned how to use the shape attribute and the reshape and ravel functions to inspect and adjust the dimensions of an array. This can be useful in a variety of cases when working with arrays.
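Two details of reshaping are worth seeing in code: NumPy can infer one dimension when you pass -1, and the total number of elements must stay the same. The sketch below uses np.arange(30) as a deterministic stand-in for the random vector. That substitution is an assumption made so that the result is repeatable.

```python
import numpy as np

a = np.arange(30)        # 0..29, deterministic stand-in for the random vector
b = a.reshape(2, 3, 5)
c = a.reshape(-1, 5)     # -1 lets NumPy infer the row count (30 / 5 = 6)

# ravel flattens back to one dimension, in the original element order
b_flat = b.ravel()
print(c.shape, b_flat.shape)

# The element count must match: 30 elements cannot fill a 4x8 array
try:
    a.reshape(4, 8)
except ValueError as err:
    print("reshape failed:", err)
```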
Indexing and slicing NumPy arrays is very similar to regular list indexing. We can even go through a vector of elements with a definite step size by providing it as an additional argument in the format start:stop:step, where the element at the stop index is not included. Furthermore, we can pass a list of indices as an argument to select specific elements.
Note
In multi-dimensional arrays, you can use two numbers to denote the position of an element. For example, if the element is in the third row and second column, its indices are 2 and 1 (because of Python's zero-based indexing).
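The slicing rules described above can be sketched on small, predictable arrays. The variable names and values here are illustrative assumptions.

```python
import numpy as np

v = np.arange(0, 11)             # [0, 1, ..., 10]

every_third = v[1:10:3]          # start=1, stop=10 (excluded), step=3
picked = v[[0, 5, 10]]           # a list of indices selects specific elements

m = np.arange(12).reshape(4, 3)  # 4 rows, 3 columns, values 0..11
third_row_second_col = m[2, 1]   # row index 2, column index 1

print(every_third, picked, third_row_second_col)
```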
Exercise 3.10: Indexing and Slicing
In this exercise, we will learn how to perform indexing and slicing on one-dimensional and multi-dimensional arrays. To complete this exercise, let's go through the following steps:
- Create an array of 11 elements (the integers 0 to 10) and examine its various elements by slicing and indexing the array with slightly different syntaxes. Do this by using the following command:
import numpy as np
array_1 = np.arange(0,11)
print("Array:",array_1)
The output is as follows:
Array: [ 0 1 2 3 4 5 6 7 8 9 10]
- Print the element at index 7 (the eighth element, since indexing starts at 0) by using the following command:
print("Element at 7th index is:", array_1[7])
The output is as follows:
Element at 7th index is: 7
- Print the elements from index 3 up to, but not including, index 6 by using the following command:
print("Elements from 3rd to 5th index are:", array_1[3:6])
The output is as follows:
Elements from 3rd to 5th index are: [3 4 5]
- Print the elements before index 4 (that is, at indices 0 to 3) by using the following command:
print("Elements up to 4th index are:", array_1[:4])
The output is as follows:
Elements up to 4th index are: [0 1 2 3]
- Print the elements backward by using the following command:
print("Elements from last backwards are:", array_1[-1::-1])
The output is as follows:
Elements from last backwards are: [10 9 8 7 6 5 4 3 2 1 0]
- Print three elements, starting from the last one and stepping backward by two, by using the following command:
print("3 Elements from last backwards are:", array_1[-1:-6:-2])
The output is as follows:
3 Elements from last backwards are: [10 8 6]
- Create a new array called array_2 by using the following command:
array_2 = np.arange(0,21,2)
print("New array:",array_2)
The output is as follows:
New array: [ 0 2 4 6 8 10 12 14 16 18 20]
- Print the elements at indices 2, 4, and 9 of the array:
print("Elements at 2nd, 4th, and 9th index are:", \
array_2[[2,4,9]])
The output is as follows:
Elements at 2nd, 4th, and 9th index are: [ 4 8 18]
- Create a multi-dimensional array by using the following command:
matrix_1 = np.random.randint(10,100,15).reshape(3,5)
print("Matrix of random 2-digit numbers\n ",matrix_1)
The sample output is as follows:
Matrix of random 2-digit numbers
[[21 57 60 24 15]
[53 20 44 72 68]
[39 12 99 99 33]]
Note
The output is susceptible to change since we have used random numbers.
- Access the values using double bracket indexing by using the following command:
print("\nDouble bracket indexing\n")
print("Element in row index 1 and column index 2:", \
matrix_1[1][2])
The sample output is as follows:
Double bracket indexing
Element in row index 1 and column index 2: 44
- Access the values using single bracket indexing by using the following command:
print("\nSingle bracket with comma indexing\n")
print("Element in row index 1 and column index 2:", \
matrix_1[1,2])
The sample output is as follows:
Single bracket with comma indexing
Element in row index 1 and column index 2: 44
- Access the values in a multi-dimensional array using a row or column by using the following command:
print("\nRow or column extract\n")
print("Entire row at index 2:", matrix_1[2])
print("Entire column at index 3:", matrix_1[:,3])
The sample output is as follows:
Row or column extract
Entire row at index 2: [39 12 99 99 33]
Entire column at index 3: [24 72 99]
- Print the matrix with the specified row and column indices by using the following command:
print("\nSubsetting sub-matrices\n")
print("Matrix with row indices 1 and 2 and column "\
"indices 3 and 4\n", matrix_1[1:3,3:5])
The sample output is as follows:
Subsetting sub-matrices
Matrix with row indices 1 and 2 and column indices 3 and 4
[[72 68]
[99 33]]
- Print the sub-matrix formed by a slice of rows and a list of column indices by using the following command:
print("Matrix with row indices 0 and 1 and column "\
"indices 1 and 3\n", matrix_1[0:2,[1,3]])
The sample output is as follows:
Matrix with row indices 0 and 1 and column indices 1 and 3
[[57 24]
[20 72]]
Note
The output is susceptible to change since we have used random numbers.
To access the source code for this specific section, please refer to https://packt.live/3fsxJ00.
You can also run this example online at https://packt.live/3hEDYjh.
In this exercise, we worked with NumPy arrays and various ways of subsetting them, such as slicing them. When working with arrays, it's very common to deal with them in this way.
Conditional Subsetting
Conditional subsetting is a way to select specific elements based on some numeric condition. It is almost like a shortened version of a SQL query to subset elements. See the following example:
matrix_1 = np.array(np.random.randint(10,100,15)).reshape(3,5)
print("Matrix of random 2-digit numbers\n",matrix_1)
print ("\nElements greater than 50\n", matrix_1[matrix_1>50])
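In the preceding code example, we created an array of 15 random integers between 10 and 99 and reshaped it into a 3x5 matrix. Then, we selected the elements that are greater than 50. The expression matrix_1>50 produces a Boolean array of the same shape, and indexing with it returns the matching elements as a one-dimensional array.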
The sample output is as follows (note that the exact output will be different for you as it is random):
Matrix of random 2-digit numbers
[[71 89 66 99 54]
[28 17 66 35 85]
[82 35 38 15 47]]
Elements greater than 50
[71 89 66 99 54 66 85 82]
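To make the mechanism repeatable, the following sketch hard-codes the matrix from the sample output above and builds the Boolean mask as a separate step. The names mask and between, and the second condition, are illustrative assumptions.

```python
import numpy as np

# Fixed values so the result is repeatable (copied from the sample output)
m = np.array([[71, 89, 66, 99, 54],
              [28, 17, 66, 35, 85],
              [82, 35, 38, 15, 47]])

mask = m > 50          # Boolean array with the same shape as m
print(mask.sum())      # how many elements satisfy the condition
print(m[mask])         # the selected elements, as a one-dimensional array

# Combine conditions with & (and) or | (or); the parentheses are required
between = m[(m > 30) & (m < 70)]
print(between)
```

The parentheses around each condition matter because & binds more tightly than the comparison operators in Python.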
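Arithmetic operators on NumPy arrays are applied element-wise. In particular, the * operator multiplies matching elements; it does not perform the row-by-column matrix multiplication of linear algebra.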
Now, let's look at an exercise to understand how we can perform array operations.
Exercise 3.11: Array Operations
In this exercise, we're going to create two matrices (multi-dimensional arrays) with random integers and demonstrate element-wise mathematical operations such as addition, multiplication, and division, as well as a linear combination. We will also show the exponentiation (raising a number to a certain power) operation. Let's perform the following steps:
Note
Due to random number generation, your specific output could be different than what is shown here.
- Import the NumPy library and create two matrices:
import numpy as np
matrix_1 = np.random.randint(1,10,9).reshape(3,3)
matrix_2 = np.random.randint(1,10,9).reshape(3,3)
print("\n1st Matrix of random single-digit numbers\n",\
matrix_1)
print("\n2nd Matrix of random single-digit numbers\n",\
matrix_2)
The sample output is as follows (note that the exact output will be different for you as it is random):
1st Matrix of random single-digit numbers
[[6 5 9]
[4 7 1]
[3 2 7]]
2nd Matrix of random single-digit numbers
[[2 3 1]
[9 9 9]
[9 9 6]]
- Perform addition, multiplication, division, and linear combination on the matrices:
print("\nAddition\n", matrix_1+matrix_2)
print("\nMultiplication\n", matrix_1*matrix_2)
print("\nDivision\n", matrix_1/matrix_2)
print("\nLinear combination: 3*A - 2*B\n", \
3*matrix_1-2*matrix_2)
The sample output is as follows (note that the exact output will be different for you as it is random):
Addition
[[ 8 8 10]
[13 16 10]
[12 11 13]]
Multiplication
[[12 15 9]
[36 63 9]
[27 18 42]]
Division
[[3. 1.66666667 9. ]
[0.44444444 0.77777778 0.11111111]
[0.33333333 0.22222222 1.16666667]]
Linear combination: 3*A - 2*B
[[ 14 9 25]
[ -6 3 -15]
[ -9 -12 9]]
- Add a scalar to the matrix, raise each element to the third power, and take the square root of each element:
print("\nAddition of a scalar (100)\n", 100+matrix_1)
print("\nExponentiation, matrix cubed here\n", matrix_1**3)
print("\nExponentiation, square root using 'pow' function\n", \
pow(matrix_1,0.5))
The sample output is as follows (note that the exact output will be different for you as it is random):
Addition of a scalar (100)
[[106 105 109]
[104 107 101]
[103 102 107]]
Exponentiation, matrix cubed here
[[216 125 729]
[ 64 343 1]
[ 27 8 343]]
Exponentiation, square root using 'pow' function
[[2.44948974 2.23606798 3. ]
[2. 2.64575131 1. ]
[1.73205081 1.41421356 2.64575131]]
Note
The output is susceptible to change since we have used random numbers.
To access the source code for this specific section, please refer to https://packt.live/3fC1ziH.
You can also run this example online at https://packt.live/3fy6j96.
We have now seen how to work with arrays to perform various mathematical functions, such as scalar addition and matrix cubing.
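Since every operator in this exercise works element by element, it can help to contrast element-wise multiplication with the matrix product from linear algebra. The sketch below uses two small fixed matrices (an assumption made for repeatability) and NumPy's @ operator for the matrix product.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b      # multiplies matching positions
matrix_product = a @ b   # row-by-column matrix multiplication

print(elementwise)
print(matrix_product)
```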
Stacking Arrays
Stacking arrays on top of each other (or side by side) is a useful operation for data wrangling. Stacking is a way to concatenate two NumPy arrays together. Here is the code:
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
print("Matrix a\n",a)
print("Matrix b\n",b)
print("Vertical stacking\n",np.vstack((a,b)))
print("Horizontal stacking\n",np.hstack((a,b)))
The output is as follows:
Matrix a
[[1 2]
[3 4]]
Matrix b
[[5 6]
[7 8]]
Vertical stacking
[[1 2]
[3 4]
[5 6]
[7 8]]
Horizontal stacking
[[1 2 5 6]
[3 4 7 8]]
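The same results can be obtained with the more general np.concatenate function, where the axis argument chooses the direction. This is a minimal sketch reusing the a and b matrices from above.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 adds rows (like vstack); axis=1 adds columns (like hstack)
rows_added = np.concatenate((a, b), axis=0)
cols_added = np.concatenate((a, b), axis=1)

print(rows_added.shape, cols_added.shape)
```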
NumPy has many other advanced features, mainly related to statistics and linear algebra functions, which are used extensively in machine learning and data science tasks. However, not all of that is directly useful for beginner-level data wrangling, so we won't cover it here.
In the next section, we'll talk about pandas DataFrames.
Pandas DataFrames
The pandas library is a Python package that provides fast, flexible, and expressive data structures that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool that's available in any language.
The two primary data structures of pandas are Series (one-dimensional) and DataFrames (two-dimensional) and they handle the vast majority of typical use cases. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.
Let's look at a few exercises in order to understand data handling techniques using the pandas library.
Exercise 3.12: Creating a Pandas Series
In this exercise, we will learn how to create a pandas series object from the data structures that we created previously. If you have imported pandas as pd, then the function to create a series is simply pd.Series. Let's go through the following steps:
- Import the NumPy library and initialize the labels, lists, and a dictionary:
import numpy as np
labels = ['a','b','c']
my_data = [10,20,30]
array_1 = np.array(my_data)
d = {'a':10,'b':20,'c':30}
- Import pandas as pd by using the following command:
import pandas as pd
- Create a series from the array_1 NumPy array (built from the my_data list) by using the following command:
print("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(array_1))
The output is as follows:
Holding numerical data
-------------------------
0 10
1 20
2 30
dtype: int64
- Create a series from the labels list as follows:
print("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))
The output is as follows:
Holding text labels
--------------------
0 a
1 b
2 c
dtype: object
- Then, create a series that holds Python built-in functions, as follows:
print("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))
The output is as follows:
Holding functions
--------------------
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
- Create a series that holds the methods of the dictionary, as follows:
print("\nHolding objects from a dictionary\n",'-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))
The output is as follows:
Holding objects from a dictionary
----------------------------------------
0 <built-in method keys of dict object at 0x7fb8...
1 <built-in method items of dict object at 0x7fb...
2 <built-in method values of dict object at 0x7f...
dtype: object
Note
You may get a different final output because the system may store the object in the memory differently.
To access the source code for this specific section, please refer to https://packt.live/2BkMJOL.
You can also run this example online at https://packt.live/30XhxzQ.
In this exercise, we created pandas series, which are the building blocks of pandas DataFrames. The pandas series object can hold many types of data, such as integers, objects, floats, doubles, and others. This is the key to constructing a bigger table where multiple series objects are stacked together to create a database-like entity.
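The series above all use the default integer index. As a small sketch of how the labels list and the dictionary d defined in this exercise could be used as an index instead, consider the following. The variable names s_labeled and s_from_dict are assumptions introduced for illustration.

```python
import pandas as pd

labels = ['a', 'b', 'c']
my_data = [10, 20, 30]
d = {'a': 10, 'b': 20, 'c': 30}

# Attach labels to the values through the index argument
s_labeled = pd.Series(data=my_data, index=labels)

# A dictionary becomes a Series whose index is the dictionary keys
s_from_dict = pd.Series(d)

print(s_labeled)
print(s_labeled['b'])    # look up a value by its label
print(s_from_dict)
```

A labeled index is what later lets several series line up as the columns of a DataFrame.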
Exercise 3.13: Pandas Series and Data Handling
In this exercise, we will create pandas series using the pd.Series function and look at how a series holds different kinds of data. Perform the following steps:
- Create a pandas series with numerical data by using the following command:
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_data = [10,20,30]
array_1 = np.array(my_data)
d = {'a':10,'b':20,'c':30}
print("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(array_1))
The output is as follows:
Holding numerical data
-------------------------
0 10
1 20
2 30
dtype: int32
- Create a pandas series with labels by using the following command:
print("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))
The output is as follows:
Holding text labels
--------------------
0 a
1 b
2 c
dtype: object
- Create a pandas series with functions by using the following command:
print("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))
The output is as follows:
Holding functions
--------------------
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
- Create a pandas series with a dictionary by using the following command:
print("\nHolding objects from a dictionary\n",'-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))
The output is as follows:
Holding objects from a dictionary
----------------------------------------
0 <built-in method keys of dict object at 0x0000...
1 <built-in method items of dict object at 0x000...
2 <built-in method values of dict object at 0x00...
dtype: object
Note
To access the source code for this specific section, please refer to https://packt.live/3hzXRIr.
You can also run this example online at https://packt.live/3endeC9.
In this exercise, we created pandas series objects using various types of lists.
Exercise 3.14: Creating Pandas DataFrames
The pandas DataFrame is similar to an Excel table or relational database (SQL) table, which consists of three main components: the data, the index (or rows), and the columns. Under the hood, it is a stack of pandas series objects, which are themselves built on top of NumPy arrays. So, all of our previous knowledge of NumPy arrays applies here. Let's perform the following steps:
- Create a simple DataFrame from a two-dimensional matrix of numbers. First, the code draws 20 random integers from the uniform distribution. Then, we need to reshape it into a (5,4) NumPy array – 5 rows and 4 columns:
import numpy as np
import pandas as pd
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
- Define the row labels as ('A','B','C','D','E') and column labels as ('W','X','Y','Z'):
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
- Create a DataFrame using pd.DataFrame:
df = pd.DataFrame(data=matrix_data, index=row_labels, \
columns=column_headings)
- Print the DataFrame:
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)
The sample output is as follows:
Figure 3.6: Output of the DataFrame
- Create a DataFrame from a Python dictionary of the lists of integers by using the following command:
d={'a':[10,20],'b':[30,40],'c':[50,60]}
- Pass this dictionary as a data argument to the pd.DataFrame function. Pass on a list of rows or indices. Notice how the dictionary keys became the column names and that the values were distributed among multiple rows:
df2=pd.DataFrame(data=d,index=['X','Y'])
print(df2)
The output is as follows:
Figure 3.7: Output of DataFrame df2
Note
To access the source code for this specific section, please refer to https://packt.live/2UVTz4u.
You can also run this example online at https://packt.live/2CgBkAd.
In this exercise, we created DataFrames manually from scratch, which will allow us to understand DataFrames better.
Note
The most common way that you will create a pandas DataFrame will be to read tabular data from a file on your local disk or over the internet – CSV, text, JSON, HTML, Excel, and so on. We will cover some of these in the next chapter.
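The three components of a DataFrame mentioned at the start of the exercise (data, index, and columns) can each be inspected directly. The following sketch rebuilds df2 from the dictionary used above and prints each component.

```python
import pandas as pd

d = {'a': [10, 20], 'b': [30, 40], 'c': [50, 60]}
df2 = pd.DataFrame(data=d, index=['X', 'Y'])

print(df2.shape)           # (rows, columns)
print(list(df2.index))     # row labels
print(list(df2.columns))   # column labels, taken from the dictionary keys
print(df2.values)          # the underlying data as a NumPy array
```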
Exercise 3.15: Viewing a DataFrame Partially
In the previous exercise, we used print(df) to print the whole DataFrame. For a large dataset, we would like to print only sections of data. In this exercise, we will read a part of the DataFrame. Let's learn how to do so:
- Import the NumPy library and execute the following code to create a DataFrame with 25 rows. Then, fill it with random numbers:
# 25 rows and 4 columns
import numpy as np
import pandas as pd
matrix_data = np.random.randint(1,100,100).reshape(25,4)
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data,columns=column_headings)
- Run the following code to view only the first five rows of the DataFrame:
df.head()
The sample output is as follows (note that your output could be different due to randomness):
Figure 3.8: The first five rows of the DataFrame
By default, head shows only five rows. If you want to see any specific number of rows, just pass that as an argument.
- Print the first eight rows by using the following command:
df.head(8)
The sample output is as follows:
Figure 3.9: The first eight rows of the DataFrame
Just like head shows the first few rows, tail shows the last few rows.
- Print the DataFrame using the tail command, as follows:
df.tail(10)
The sample output (partially shown) is as follows:
Figure 3.10: The last few rows of the DataFrame
Note
To access the source code for this specific section, please refer to https://packt.live/30UiXLB.
You can also run this example online at https://packt.live/2URYCTz.
In this section, we learned how to view portions of the DataFrame without looking at the whole DataFrame. In the next section, we're going to look at two functionalities: indexing and slicing columns in a DataFrame.
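For a repeatable version of the head and tail calls, the sketch below fills the DataFrame with the integers 0 to 99 instead of random numbers. That substitution is an assumption made so that the rows are predictable.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data: 25 rows, 4 columns, values 0..99
df = pd.DataFrame(np.arange(100).reshape(25, 4), columns=['W', 'X', 'Y', 'Z'])

first_rows = df.head(3)   # first three rows
last_rows = df.tail(3)    # last three rows
print(first_rows)
print(last_rows)
print(df.shape)           # the full DataFrame is unchanged
```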
Indexing and Slicing Columns
There are two methods for indexing and slicing columns in a DataFrame. They are as follows:
- The DOT method
- The bracket method
The DOT method accesses a column as an attribute of the DataFrame, for example, df.X. It is convenient for quick access but works only when the column name is a valid Python identifier. The bracket method, for example, df['X'], is intuitive and easy to follow. In this method, you access the data by the name (header) of the column, and you can pass a list of names to select several columns at once.
The following code illustrates the bracket method. We can execute it in our Jupyter Notebook:
print("\nThe 'X' column\n",'-'*25, sep='')
print(df['X'])
print("\nType of the column: ", type(df['X']), sep='')
print("\nThe 'X' and 'Z' columns indexed by passing a list\n",\
'-'*55, sep='')
print(df[['X','Z']])
print("\nType of the pair of columns: ", \
type(df[['X','Z']]), sep='')
The output is as follows (only the partial output is shown here because the actual column is long):

Figure 3.11: Rows of the 'X' column
This is the output showing the type of column:

Figure 3.12: Type of 'X' column
This is the output showing the X and Z columns indexed by passing a list:

Figure 3.13: Rows of the 'X' and 'Z' columns
This is the output showing the type of the pair of columns:

Figure 3.14: Type of the 'X' and 'Z' columns
Note
When you select more than one column, the result is a DataFrame. When you select a single column, the result is a pandas Series object.
So far, we have seen how to access the columns of DataFrames using the DOT method and the bracket method. DataFrames are commonly used for row/column data.
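To compare the two access styles directly, here is a minimal sketch on a small hand-made DataFrame (the data and the variable names are made up for illustration). It also shows one limitation of the DOT method: it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute.

```python
import pandas as pd

# Small illustrative DataFrame (values are arbitrary)
df = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6], 'Z': [7, 8, 9]})

col_dot = df.X           # DOT (attribute) access
col_bracket = df['X']    # bracket access
print(col_dot.equals(col_bracket))   # both return the same Series

pair = df[['X', 'Z']]          # a list of names returns a DataFrame
one_col_frame = df[['X']]      # a one-item list still returns a DataFrame
print(type(col_bracket).__name__, type(pair).__name__,
      type(one_col_frame).__name__)

# A column named like a DataFrame method is shadowed under DOT access
df2 = pd.DataFrame({'count': [1, 2]})
print(callable(df2.count))     # df2.count is the method, not the column
print(list(df2['count']))      # bracket access still reaches the column
```

For this reason, the bracket method is the safer default when column names come from an external file.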
Now, let's look at indexing and slicing rows.
Indexing and Slicing Rows
Indexing and slicing rows in a DataFrame can also be done using the following methods:
- The label-based loc method
- The index-based iloc method
The loc method is intuitive and easy to follow. In this method, you access the data by the label of the row. On the other hand, the iloc method allows you to access the rows by their numerical position. This can be very useful for a large table with thousands of rows, especially when you want to iterate over the table in a loop with a numerical counter. The following code illustrates the concepts of loc and iloc:
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels, \
columns=column_headings)
print("\nLabel-based 'loc' method for selecting row(s)\n",\
'-'*60, sep='')
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nIndex position based 'iloc' method for selecting "\
"row(s)\n", '-'*70, sep='')
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])
The sample output is as follows:

Figure 3.15: Output of the loc and iloc methods
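Beyond single rows and lists of rows, both methods accept slices, and they treat the end of a slice differently. The following is a minimal sketch using a deterministic table built with np.arange instead of random numbers (an assumption made here so that the values are predictable):

```python
import numpy as np
import pandas as pd

# Deterministic 5x4 table: row 'A' holds 0..3, row 'B' holds 4..7, and so on
df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

by_label = df.loc['B':'D']    # label slice: both endpoints are included
by_position = df.iloc[1:3]    # position slice: the end position is excluded
print(list(by_label.index))     # ['B', 'C', 'D']
print(list(by_position.index))  # ['B', 'C']

# A row and a column can be given together to reach a single value
cell_label = df.loc['C', 'Y']
cell_pos = df.iloc[2, 2]
print(cell_label, cell_pos)     # 10 10

# iloc with a numerical counter, as described in the text
row_sums = [int(df.iloc[i].sum()) for i in range(len(df))]
print(row_sums)                 # [6, 22, 38, 54, 70]
```

The inclusive end of a loc slice is a common source of off-by-one surprises for readers used to Python list slicing.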
One of the most common tasks in data wrangling is creating or deleting columns or rows of data from your DataFrame. Sometimes, you want to create a new column based on some mathematical operation or transformation involving the existing columns. This is similar to manipulating database records and inserting a new column based on simple transformations. We'll look at some of these concepts in the upcoming exercises.
Exercise 3.16: Creating and Deleting a New Column or Row
In this exercise, we're going to create and delete a column or a row in the stock.csv dataset. We'll also use the inplace argument to modify the original DataFrame.
Note
The stock.csv file can be found here: https://packt.live/3hxvPNP.
Let's go through the following steps:
- Import the necessary Python modules, load the stock.csv file, and create a new column using the following snippet:
import pandas as pd
df = pd.read_csv("../datasets/stock.csv")
df.head()
print("\nA column is created by assigning it in relation\n",\
'-'*75, sep='')
df['New'] = df['Price']+df['Price']
df['New (Sum of X and Z)'] = df['New']+df['Price']
print(df)
Note
Don't forget to change the path (highlighted) based on the location of the file on your system.
The sample output (partially shown) is as follows:
Figure 3.16: Partial output of the DataFrame
- Drop a column using the df.drop method:
print("\nA column is dropped by using df.drop() method\n",\
'-'*55, sep='')
df = df.drop('New', axis=1) # Notice the axis=1 option
# axis = 0 is default, so one has to change it to 1
print(df)
The sample output (partially shown) is as follows:
Figure 3.17: Partial output of the DataFrame
- Drop a specific row using the df.drop method:
df1=df.drop(1)
print("\nA row is dropped by using df.drop method and axis=0\n",\
'-'*65, sep='')
print(df1)
The partial output is as follows:
Figure 3.18: Partial output of the DataFrame
The drop method creates a copy of the DataFrame and does not change the original DataFrame.
- Change the original DataFrame by setting the inplace argument to True:
print("\nAn in-place change can be done by making ",\
"inplace=True in the drop method\n",\
'-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)
The sample output is as follows:
Figure 3.19: Partial Output of the DataFrame
Note
To access the source code for this specific section, please refer to https://packt.live/3frxthU.
You can also run this example online at https://packt.live/2USxJyA.
We have now learned how to modify DataFrames by dropping or adding rows and columns.
Note
By default, these operations are not in-place; that is, they do not modify the original DataFrame object but return a copy of it with the addition (or deletion) applied. The last step of the preceding code shows how to change the existing DataFrame by passing the inplace=True argument. Please note that this change cannot be undone on the DataFrame object itself, so it should be used with caution.
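The difference between the default behavior and inplace=True can be verified on a small example. This is a minimal sketch that uses a hand-made DataFrame rather than stock.csv; the 'Double' column and the variable names are hypothetical:

```python
import pandas as pd

# Stand-in data (not the stock.csv file); 'Double' is a derived column
df = pd.DataFrame({'Price': [10.0, 20.0, 30.0]})
df['Double'] = df['Price'] * 2

# Default behavior: drop returns a new DataFrame and leaves df untouched
dropped = df.drop('Double', axis=1)
cols_after_copy = list(df.columns)
print(cols_after_copy)          # ['Price', 'Double']
print(list(dropped.columns))    # ['Price']

# inplace=True modifies df itself and returns None
result = df.drop('Double', axis=1, inplace=True)
print(result)                   # None
print(list(df.columns))         # ['Price']
```

Because the in-place call returns None, writing df = df.drop(..., inplace=True) would replace the DataFrame with None. Use either reassignment or inplace=True, not both.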