- Data Wrangling with Python
- Dr. Tirthajyoti Sarkar Shubhadeep Roychowdhury
- 2021-06-11 13:40:27
NumPy Arrays
In the life of a data scientist, reading and manipulating arrays is of prime importance, and it is also the most frequently encountered task. These arrays could be a one-dimensional list or a multi-dimensional table or a matrix full of numbers.
The array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant.
Some example scenarios where you will need to handle numeric arrays are as follows:
- To read a list of phone numbers and postal codes and extract a certain pattern
- To create a matrix with random numbers to run a Monte Carlo simulation on some statistical process
- To scale and normalize a sales figure table, with lots of financial and transactional data
- To create a smaller table of key descriptive statistics (for example, mean, median, min/max range, variance, inter-quartile ranges) from a large raw data table
- To read in and analyze daily time series data in a one-dimensional array, such as the stock price of an organization over a year or daily temperature data from a weather station
In short, arrays and numeric data tables are everywhere. As a data wrangling professional, the importance of the ability to read and process numeric arrays cannot be overstated. In this regard, NumPy arrays will be the most important object in Python that you need to know about.
NumPy Array and Features
NumPy and SciPy are open source add-on modules for Python that provide common mathematical and numerical routines in pre-compiled, fast functions. These have grown into highly mature libraries that provide functionality that meets, or perhaps exceeds, what is associated with common commercial software such as MATLAB or Mathematica.
One of the main advantages of the NumPy module is its ability to create and handle one-dimensional or multi-dimensional arrays. This advanced data structure/class is at the heart of the NumPy package and it serves as the fundamental building block of more advanced classes such as the pandas DataFrame, which we will cover shortly in this chapter.
NumPy arrays are different from common Python lists, which can be thought of as simple general-purpose arrays. NumPy arrays are built for vectorized operations that process a lot of numerical data with just a single line of code. Many built-in mathematical functions for NumPy arrays are written in low-level languages such as C or Fortran and pre-compiled for fast execution.
Note
NumPy arrays are optimized data structures for numerical analysis, and that's why they are so important to data scientists.
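To make the idea of a vectorized operation concrete, here is a minimal sketch that squares the same numbers once with an explicit Python loop and once with a single NumPy expression. The names values, squares_loop, and squares_vec are illustrative and do not come from the text.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Plain Python: each element is visited explicitly
squares_loop = [v * v for v in values]

# NumPy: one expression operates on the whole array at once
squares_vec = np.array(values) ** 2

print(squares_loop)   # [1, 4, 9, 16, 25]
print(squares_vec)    # [ 1  4  9 16 25]
```

Both approaches give the same numbers; the NumPy version pushes the loop into pre-compiled code.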
Exercise 26: Creating a NumPy Array (from a List)
In this exercise, we will create a NumPy array from a list:
- To work with NumPy, we must import it. By convention, we give it a short name, np, while importing:
import numpy as np
- Create a list with three elements, 1, 2, and 3:
list_1 = [1,2,3]
- Use the array function to convert it into an array:
array_1 = np.array(list_1)
We just created a NumPy array object called array_1 from the regular Python list object, list_1.
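- For comparison, create an array of floating-point elements 1.2, 3.4, and 5.6 using the array module from Python's standard library (note that this is not a NumPy array):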
import array as arr
a = arr.array('d', [1.2, 3.4, 5.6])
print(a)
The output is as follows:
array('d', [1.2, 3.4, 5.6])
- Let's check the type of the NumPy array, array_1, by using the type function:
type(array_1)
The output is as follows:
numpy.ndarray
- Use type on list_1:
type(list_1)
The output is as follows:
list
So, array_1 is indeed a different kind of object from the regular list.
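A floating-point array can also be created directly with NumPy instead of the standard-library array module. The following is a small sketch of that; the name array_f is an assumption introduced here for illustration.

```python
import numpy as np

# NumPy infers a floating-point data type from the values
array_f = np.array([1.2, 3.4, 5.6])

print(array_f)          # [1.2 3.4 5.6]
print(type(array_f))    # <class 'numpy.ndarray'>
print(array_f.dtype)    # float64
```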
Exercise 27: Adding Two NumPy Arrays
This simple exercise will demonstrate the addition of two NumPy arrays, and thereby show the key difference between a regular Python list/array and a NumPy array:
- Consider list_1 and array_1 from the preceding exercise. If you have changed the Jupyter notebook, you will have to declare them again.
- Use the + notation to add two list_1 objects and save the result in list_2:
list_2 = list_1 + list_1
print(list_2)
The output is as follows:
[1, 2, 3, 1, 2, 3]
- Use the same + notation to add two array_1 objects and save the result in array_2:
array_2 = array_1 + array_1
print(array_2)
The output is as follows:
[2 4 6]
Did you notice the difference? The first print shows a list with 6 elements [1, 2, 3, 1, 2, 3]. But the second print shows another NumPy array (or vector) with the elements [2, 4, 6], which are just the sum of the individual elements of array_1.
NumPy arrays are like mathematical objects – vectors. They are built for element-wise operations, that is, when we add two NumPy arrays, we add the first element of the first array to the first element of the second array – there is an element-to-element correspondence in this operation. This is in contrast to Python lists, where the elements are simply appended and there is no element-to-element relation. This is the real power of a NumPy array: they can be treated just like mathematical vectors.
A vector is a collection of numbers that can represent, for example, the coordinates of a point in three-dimensional space or the color (RGB) values of a pixel in a picture. Naturally, relative order is important for such a collection and, as we discussed previously, a NumPy array maintains such order relationships. That's why NumPy arrays are well suited to numerical computations.
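As an illustration of treating arrays as vectors, the sketch below computes the distance between two points in three-dimensional space using element-wise subtraction. The points p and q are made-up values chosen so that the arithmetic is easy to check by hand.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])   # first point (x, y, z)
q = np.array([4.0, 6.0, 3.0])   # second point (x, y, z)

# Element-wise subtraction pairs up matching coordinates
diff = q - p

# Euclidean distance: square each component, sum, take the square root
distance = np.sqrt(np.sum(diff ** 2))

print(diff)       # [3. 4. 0.]
print(distance)   # 5.0
```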
Exercise 28: Mathematical Operations on NumPy Arrays
Now that you know that these arrays are like vectors, we will try some mathematical operations on arrays.
NumPy arrays even support element-wise exponentiation. For example, suppose there are two arrays – the elements of the first array will be raised to the power of the elements in the second array:
- Multiply two arrays using the following command:
print("array_1 multiplied by array_1: ",array_1*array_1)
The output is as follows:
array_1 multiplied by array_1: [1 4 9]
- Divide two arrays using the following command:
print("array_1 divided by array_1: ",array_1/array_1)
The output is as follows:
array_1 divided by array_1: [1. 1. 1.]
- Raise one array to the power of a second array using the following command:
print("array_1 raised to the power of array_1: ",array_1**array_1)
The output is as follows:
array_1 raised to the power of array_1: [ 1 4 27]
Exercise 29: Advanced Mathematical Operations on NumPy Arrays
NumPy has a large collection of built-in mathematical functions. Here, we are going to create a list and convert it into a NumPy array. Then, we will perform some advanced mathematical operations on that array:
- Create a list with five elements:
list_5=[i for i in range(1,6)]
print(list_5)
The output is as follows:
[1, 2, 3, 4, 5]
- Convert the list into a NumPy array by using the following command:
array_5=np.array(list_5)
array_5
The output is as follows:
array([1, 2, 3, 4, 5])
- Find the sine value of the array by using the following command:
# sine function
print("Sine: ",np.sin(array_5))
The output is as follows:
Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 -0.95892427]
- Find the logarithmic value of the array by using the following command:
# logarithm
print("Natural logarithm: ",np.log(array_5))
print("Base-10 logarithm: ",np.log10(array_5))
print("Base-2 logarithm: ",np.log2(array_5))
The output is as follows:
Natural logarithm: [0. 0.69314718 1.09861229 1.38629436 1.60943791]
Base-10 logarithm: [0. 0.30103 0.47712125 0.60205999 0.69897 ]
Base-2 logarithm: [0. 1. 1.5849625 2. 2.32192809]
- Find the exponential value of the array by using the following command:
# Exponential
print("Exponential: ",np.exp(array_5))
The output is as follows:
Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ]
Exercise 30: Generating Arrays Using arange and linspace
Generation of numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy methods.
The arange function creates a series of numbers from a start value up to, but not including, a stop value, using the step size you specify. Another function, linspace, creates a fixed number of evenly spaced points between two endpoints, both of which are included by default:
- Create a series of numbers using the arange method, by using the following command:
print("A series of numbers:",np.arange(5,16))
The output is as follows:
A series of numbers: [ 5 6 7 8 9 10 11 12 13 14 15]
- Print numbers using the arange function by using the following command:
print("Numbers spaced apart by 2: ",np.arange(0,11,2))
print("Numbers spaced apart by a floating point number: ",np.arange(0,11,2.5))
print("Every 5th number from 30 in reverse order\n",np.arange(30,-1,-5))
The output is as follows:
Numbers spaced apart by 2: [ 0 2 4 6 8 10]
Numbers spaced apart by a floating point number: [ 0. 2.5 5. 7.5 10. ]
Every 5th number from 30 in reverse order
[30 25 20 15 10 5 0]
- For linearly spaced numbers, we can use the linspace method, as follows:
print("11 linearly spaced numbers between 1 and 5: ",np.linspace(1,5,11))
The output is as follows:
11 linearly spaced numbers between 1 and 5: [1. 1.4 1.8 2.2 2.6 3. 3.4 3.8 4.2 4.6 5. ]
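The difference between the two functions is easiest to see side by side. This sketch builds the same range both ways and uses the retstep option of linspace to recover the spacing; the variable names are illustrative only.

```python
import numpy as np

# arange: you choose the step; the stop value (1) is excluded
by_step = np.arange(0, 1, 0.25)

# linspace: you choose the number of points; both endpoints are included
by_count = np.linspace(0, 1, 5)

# retstep=True also returns the spacing that linspace computed
points, step = np.linspace(0, 1, 5, retstep=True)

print(by_step.tolist())    # [0.0, 0.25, 0.5, 0.75]
print(by_count.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(step)                # 0.25
```

With a non-integer step, arange can be sensitive to floating-point rounding near the stop value, so linspace is often the safer choice when the number of points matters.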
Exercise 31: Creating Multi-Dimensional Arrays
So far, we have created only one-dimensional arrays. Now, let's create some multi-dimensional arrays (such as a matrix in linear algebra). Just like we created the one-dimensional array from a simple flat list, we can create a two-dimensional array from a list of lists:
- Create a list of lists and convert it into a two-dimensional NumPy array by using the following command:
list_2D = [[1,2,3],[4,5,6],[7,8,9]]
mat1 = np.array(list_2D)
print("Type/Class of this object:",type(mat1))
print("Here is the matrix\n----------\n",mat1,"\n----------")
The output is as follows:
Type/Class of this object: <class 'numpy.ndarray'>
Here is the matrix
----------
[[1 2 3]
[4 5 6]
[7 8 9]]
----------
- Tuples can be converted into multi-dimensional arrays by using the following code:
tuple_2D = ((1.5,2,3), (4,5,6))
mat_tuple = np.array(tuple_2D)
print (mat_tuple)
The output is as follows:
[[1.5 2. 3. ]
[4. 5. 6. ]]
Thus, we have created multi-dimensional arrays using Python lists and tuples.
Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array
The following attributes let you check the dimension, shape, size, and data type of the array. Note that if it's a 3x2 matrix, that is, it has 3 rows and 2 columns, then the shape will be (3,2), but the size will be 6, as 6 = 3x2:
- Print the dimension of the matrix using ndim by using the following command:
print("Dimension of this matrix: ",mat1.ndim,sep='')
The output is as follows:
Dimension of this matrix: 2
- Print the size using size:
print("Size of this matrix: ", mat1.size,sep='')
The output is as follows:
Size of this matrix: 9
- Print the shape of the matrix using shape:
print("Shape of this matrix: ", mat1.shape,sep='')
The output is as follows:
Shape of this matrix: (3, 3)
- Print the data type of the elements using dtype:
print("Data type of this matrix: ", mat1.dtype,sep='')
The output is as follows (depending on your platform and NumPy version, you may see int64 instead):
Data type of this matrix: int32
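Because the default integer type can vary between platforms, it is sometimes useful to request a data type explicitly. The sketch below shows one way to do that with the dtype argument and with astype; the names mat_default, mat_int64, and mat_float are assumptions made for this example.

```python
import numpy as np

list_2D = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Default: NumPy picks an integer type for you (platform dependent)
mat_default = np.array(list_2D)

# Explicit: request 64-bit integers regardless of platform
mat_int64 = np.array(list_2D, dtype=np.int64)

# astype returns a converted copy with the requested type
mat_float = mat_int64.astype(np.float64)

print(mat_default.dtype)
print(mat_int64.dtype)    # int64
print(mat_float.dtype)    # float64
```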
Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors
Now that we are familiar with basic vector (one-dimensional) and matrix data structures in NumPy, we will take a look at how to create special matrices easily. Often, you may have to create matrices filled with zeros, ones, random numbers, or ones in the diagonal:
- Print the vector of zeros by using the following command:
print("Vector of zeros: ",np.zeros(5))
The output is as follows:
Vector of zeros: [0. 0. 0. 0. 0.]
- Print the matrix of zeros by using the following command:
print("Matrix of zeros: ",np.zeros((3,4)))
The output is as follows:
Matrix of zeros: [[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
- Print the matrix of fives by using the following command:
print("Matrix of 5's: ",5*np.ones((3,3)))
The output is as follows:
Matrix of 5's: [[5. 5. 5.]
[5. 5. 5.]
[5. 5. 5.]]
- Print an identity matrix by using the following command:
print("Identity matrix of dimension 2:",np.eye(2))
The output is as follows:
Identity matrix of dimension 2: [[1. 0.]
[0. 1.]]
- Print an identity matrix with a dimension of 4x4 by using the following command:
print("Identity matrix of dimension 4:",np.eye(4))
The output is as follows:
Identity matrix of dimension 4: [[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
- Print a matrix of random integers with shape (4,3) using the randint function (the low value is included and the high value is excluded):
print("Random matrix of shape (4,3):\n",np.random.randint(low=1,high=10,size=(4,3)))
The sample output is as follows:
Random matrix of shape (4,3):
[[6 7 6]
[5 6 7]
[5 3 6]
[2 9 4]]
Note
When creating matrices, you need to pass the shape as a tuple of integers.
Random number generation is a very useful utility and needs to be mastered for data science/data wrangling tasks. We will look at the topic of random variables and distributions again in the section on statistics and see how NumPy and pandas have built-in random number and series generation, as well as manipulation functions.
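Random output changes on every run, which makes results hard to compare. One common way to get repeatable numbers is to seed a generator. The sketch below assumes NumPy 1.17 or later, where np.random.default_rng is available; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

# integers() excludes the high value by default, like randint
m_a = rng_a.integers(low=1, high=10, size=(4, 3))
m_b = rng_b.integers(low=1, high=10, size=(4, 3))

print(m_a)
print(np.array_equal(m_a, m_b))   # True
```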
Exercise 34: Reshaping, Ravel, Min, Max, and Sorting
Reshaping an array is a very useful operation, as machine learning algorithms may demand input vectors in various formats for mathematical manipulation. In this section, we will look at how reshaping can be done on an array. The opposite of reshape is the ravel function, which flattens any given array into a one-dimensional array. It is a very useful action in many machine learning and data analytics tasks.
The following steps demonstrate the reshape function. We will first generate a random one-dimensional vector of integers and then reshape the vector into multi-dimensional arrays:
- Create an array of 30 random integers (sampled from 1 to 99) and reshape it into two different forms using the following code:
a = np.random.randint(1,100,30)
b = a.reshape(2,3,5)
c = a.reshape(6,5)
- Print the shape of each array using the shape attribute:
print ("Shape of a:", a.shape)
print ("Shape of b:", b.shape)
print ("Shape of c:", c.shape)
The output is as follows:
Shape of a: (30,)
Shape of b: (2, 3, 5)
Shape of c: (6, 5)
- Print the arrays a, b, and c using the following code:
print("\na looks like\n",a)
print("\nb looks like\n",b)
print("\nc looks like\n",c)
The sample output is as follows:
a looks like
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]
b looks like
[[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]]
[[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]]
c looks like
[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]
[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]
Note
"b" is a three-dimensional array – a kind of list of a list of a list.
- Flatten array b with ravel using the following code:
b_flat = b.ravel()
print(b_flat)
The sample output is as follows:
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]
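The exercise title also mentions min, max, and sorting. The following is a minimal sketch of those operations on a small fixed array, so that the results can be verified by hand; the array values and the names a and m are assumptions made for this example.

```python
import numpy as np

a = np.array([7, 82, 9, 29, 50, 50])

# -1 asks NumPy to infer that dimension (here, 3 columns)
m = a.reshape(2, -1)

print(a.min(), a.max())   # 7 82
print(a.argmax())         # 1 (index of the largest element)
print(np.sort(a))         # [ 7  9 29 50 50 82]

# axis=1 reduces across each row, axis=0 down each column
print(m.max(axis=1))      # [82 50]
print(m.min(axis=0))      # [ 7 50  9]
```

Note that np.sort returns a sorted copy and leaves a unchanged.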
Exercise 35: Indexing and Slicing
Indexing and slicing of NumPy arrays is very similar to regular list indexing. We can even step through a vector of elements with a definite step size by providing it as an additional argument in the format start:stop:step, where the element at stop is excluded. Furthermore, we can pass a list as the index to select specific elements.
In this exercise, we will learn about indexing and slicing on one-dimensional and multi-dimensional arrays:
Note
In multi-dimensional arrays, you can use two numbers to denote the position of an element. For example, if the element is in the third row and second column, its indices are 2 and 1 (because of Python's zero-based indexing).
- Create an array of 11 elements (the integers 0 through 10) and examine its various elements by slicing and indexing the array with slightly different syntaxes. Do this by using the following command:
array_1 = np.arange(0,11)
print("Array:",array_1)
The output is as follows:
Array: [ 0 1 2 3 4 5 6 7 8 9 10]
- Print the element at index 7 (the eighth position) by using the following command:
print("Element at 7th index is:", array_1[7])
The output is as follows:
Element at 7th index is: 7
- Print the elements from index 3 up to, but not including, index 6 by using the following command:
print("Elements from 3rd to 5th index are:", array_1[3:6])
The output is as follows:
Elements from 3rd to 5th index are: [3 4 5]
- Print the elements before index 4 by using the following command:
print("Elements up to 4th index are:", array_1[:4])
The output is as follows:
Elements up to 4th index are: [0 1 2 3]
- Print the elements backwards by using the following command:
print("Elements from last backwards are:", array_1[-1::-1])
The output is as follows:
Elements from last backwards are: [10 9 8 7 6 5 4 3 2 1 0]
- Print three elements starting from the end and stepping backward by two, by using the following command:
print("3 Elements from last backwards are:", array_1[-1:-6:-2])
The output is as follows:
3 Elements from last backwards are: [10 8 6]
- Create a new array called array_2 by using the following command:
array_2 = np.arange(0,21,2)
print("New array:",array_2)
The output is as follows:
New array: [ 0 2 4 6 8 10 12 14 16 18 20]
- Print the elements at indices 2, 4, and 9 of the array by passing a list of indices:
print("Elements at 2nd, 4th, and 9th index are:", array_2[[2,4,9]])
The output is as follows:
Elements at 2nd, 4th, and 9th index are: [ 4 8 18]
- Create a multi-dimensional array by using the following command:
matrix_1 = np.random.randint(10,100,15).reshape(3,5)
print("Matrix of random 2-digit numbers\n ",matrix_1)
The sample output is as follows:
Matrix of random 2-digit numbers
[[21 57 60 24 15]
[53 20 44 72 68]
[39 12 99 99 33]]
- Access the values using double bracket indexing by using the following command:
print("\nDouble bracket indexing\n")
print("Element in row index 1 and column index 2:", matrix_1[1][2])
The sample output is as follows:
Double bracket indexing
Element in row index 1 and column index 2: 44
- Access the values using single bracket indexing by using the following command:
print("\nSingle bracket with comma indexing\n")
print("Element in row index 1 and column index 2:", matrix_1[1,2])
The sample output is as follows:
Single bracket with comma indexing
Element in row index 1 and column index 2: 44
- Access the values in a multi-dimensional array using a row or column by using the following command:
print("\nRow or column extract\n")
print("Entire row at index 2:", matrix_1[2])
print("Entire column at index 3:", matrix_1[:,3])
The sample output is as follows:
Row or column extract
Entire row at index 2: [39 12 99 99 33]
Entire column at index 3: [24 72 99]
- Print the matrix with the specified row and column indices by using the following command:
print("\nSubsetting sub-matrices\n")
print("Matrix with row indices 1 and 2 and column indices 3 and 4\n", matrix_1[1:3,3:5])
The sample output is as follows:
Subsetting sub-matrices
Matrix with row indices 1 and 2 and column indices 3 and 4
[[72 68]
[99 33]]
- Print the sub-matrix selected by a row slice and a list of column indices by using the following command:
print("Matrix with row indices 0 and 1 and column indices 1 and 3\n", matrix_1[0:2,[1,3]])
The sample output is as follows:
Matrix with row indices 0 and 1 and column indices 1 and 3
[[57 24]
[20 72]]
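Since the matrices above are random, the indexing rules can be easier to verify on fixed data. The sketch below reuses the sample values printed earlier in this exercise as a hard-coded matrix and applies the start:stop:step form along both axes; the names m and v are illustrative.

```python
import numpy as np

# Fixed copy of the sample matrix shown in the exercise output
m = np.array([[21, 57, 60, 24, 15],
              [53, 20, 44, 72, 68],
              [39, 12, 99, 99, 33]])

# Both indexing styles reach the same element
print(m[1, 2], m[1][2])   # 44 44

# Every second row (0, 2) and every second column starting at 1 (1, 3)
print(m[::2, 1::2])

# Negative indices count from the end: the last column
print(m[:, -1])           # [15 68 33]

# start:stop:step on a one-dimensional array
v = np.arange(0, 11)
print(v[2:9:3])           # [2 5 8]
```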
Conditional Subsetting
Conditional subsetting is a way to select specific elements based on some numeric condition. It is almost like a shortened version of a SQL query to subset elements. See the following example:
matrix_1 = np.array(np.random.randint(10,100,15)).reshape(3,5)
print("Matrix of random 2-digit numbers\n",matrix_1)
print ("\nElements greater than 50\n", matrix_1[matrix_1>50])
The sample output is as follows (note that the exact output will be different for you as it is random):
Matrix of random 2-digit numbers
[[71 89 66 99 54]
[28 17 66 35 85]
[82 35 38 15 47]]
Elements greater than 50
[71 89 66 99 54 66 85 82]
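Under the hood, the comparison produces a Boolean array (a mask) that is then used as the index. The sketch below reuses the sample values above as a fixed matrix, combines two conditions with the & operator, and uses np.where to replace values instead of extracting them. The names mask, between, and capped are assumptions made for this example.

```python
import numpy as np

# Fixed copy of the sample matrix shown above
m = np.array([[71, 89, 66, 99, 54],
              [28, 17, 66, 35, 85],
              [82, 35, 38, 15, 47]])

# The condition itself is a Boolean array with the same shape as m
mask = m > 50
print(mask.sum())   # 8 (number of elements greater than 50)

# Combine conditions with & (and) or | (or); the parentheses are required
between = m[(m > 30) & (m < 70)]
print(between)      # [66 54 66 35 35 38 47]

# np.where keeps the shape: values above 50 are replaced by 50
capped = np.where(m > 50, 50, m)
print(capped.max())  # 50
```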
Exercise 36: Array Operations (array-array, array-scalar, and universal functions)
NumPy arrays resemble mathematical matrices, but arithmetic operators on them work element-wise. In particular, * multiplies corresponding elements and is not the matrix product of linear algebra.
Create two matrices (multi-dimensional arrays) with random integers and demonstrate element-wise mathematical operations such as addition, subtraction, multiplication, and division. Show the exponentiation (raising a number to a certain power) operation, as follows:
Note
Due to random number generation, your specific output could be different to what is shown here.
- Create two matrices:
matrix_1 = np.random.randint(1,10,9).reshape(3,3)
matrix_2 = np.random.randint(1,10,9).reshape(3,3)
print("\n1st Matrix of random single-digit numbers\n",matrix_1)
print("\n2nd Matrix of random single-digit numbers\n",matrix_2)
The sample output is as follows (note that the exact output will be different for you as it is random):
1st Matrix of random single-digit numbers
[[6 5 9]
[4 7 1]
[3 2 7]]
2nd Matrix of random single-digit numbers
[[2 3 1]
[9 9 9]
[9 9 6]]
- Perform addition, multiplication, division, and linear combination on the matrices:
print("\nAddition\n", matrix_1+matrix_2)
print("\nMultiplication\n", matrix_1*matrix_2)
print("\nDivision\n", matrix_1/matrix_2)
print("\nLinear combination: 3*A - 2*B\n", 3*matrix_1-2*matrix_2)
The sample output is as follows (note that the exact output will be different for you as it is random):
Addition
[[ 8 8 10]
[13 16 10]
[12 11 13]]
Multiplication
[[12 15 9]
[36 63 9]
[27 18 42]]
Division
[[3. 1.66666667 9. ]
[0.44444444 0.77777778 0.11111111]
[0.33333333 0.22222222 1.16666667]]
Linear combination: 3*A - 2*B
[[ 14 9 25]
[ -6 3 -15]
[ -9 -12 9]]
- Add a scalar to the matrix, cube each element, and take the element-wise square root:
print("\nAddition of a scalar (100)\n", 100+matrix_1)
print("\nExponentiation, matrix cubed here\n", matrix_1**3)
print("\nExponentiation, square root using 'pow' function\n",pow(matrix_1,0.5))
The sample output is as follows (note that the exact output will be different for you as it is random):
Addition of a scalar (100)
[[106 105 109]
[104 107 101]
[103 102 107]]
Exponentiation, matrix cubed here
[[216 125 729]
[ 64 343 1]
[ 27 8 343]]
Exponentiation, square root using 'pow' function
[[2.44948974 2.23606798 3. ]
[2. 2.64575131 1. ]
[1.73205081 1.41421356 2.64575131]]
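To contrast element-wise multiplication with the linear-algebra matrix product, and to show a universal function doing the same job as pow, here is a short sketch that hard-codes the two sample matrices from this exercise. The names A, B, and roots are illustrative; the @ operator requires Python 3.5 or later.

```python
import numpy as np

# Fixed copies of the two sample matrices shown above
A = np.array([[6, 5, 9], [4, 7, 1], [3, 2, 7]])
B = np.array([[2, 3, 1], [9, 9, 9], [9, 9, 6]])

# * multiplies matching elements
elementwise = A * B
print(elementwise[0])       # [12 15  9]

# @ computes the matrix product (rows of A times columns of B)
matrix_product = A @ B
print(matrix_product[0])    # [138 144 105]

# np.sqrt is a universal function: it is applied to every element
roots = np.sqrt(A)
print(np.allclose(roots, pow(A, 0.5)))   # True
```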
Stacking Arrays
Stacking arrays on top of each other (or side by side) is a useful operation for data wrangling. Here is the code:
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
print("Matrix a\n",a)
print("Matrix b\n",b)
print("Vertical stacking\n",np.vstack((a,b)))
print("Horizontal stacking\n",np.hstack((a,b)))
The output is as follows:
Matrix a
[[1 2]
[3 4]]
Matrix b
[[5 6]
[7 8]]
Vertical stacking
[[1 2]
[3 4]
[5 6]
[7 8]]
Horizontal stacking
[[1 2 5 6]
[3 4 7 8]]
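Both stacking functions can also be expressed with np.concatenate and an axis argument, which some readers find easier to remember. The following sketch checks that equivalence for the two matrices above; the names rows and cols are assumptions made for this example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 joins along rows (same result as vstack for 2-D arrays)
rows = np.concatenate((a, b), axis=0)

# axis=1 joins along columns (same result as hstack for 2-D arrays)
cols = np.concatenate((a, b), axis=1)

print(np.array_equal(rows, np.vstack((a, b))))   # True
print(np.array_equal(cols, np.hstack((a, b))))   # True
print(rows.shape, cols.shape)                    # (4, 2) (2, 4)
```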
NumPy has many other advanced features, mainly related to statistics and linear algebra functions, which are used extensively in machine learning and data science tasks. However, not all of that is directly useful for beginner level data wrangling, so we won't cover it here.