- Data Wrangling with Python
- Dr. Tirthajyoti Sarkar Shubhadeep Roychowdhury
- 2021-06-11 13:40:27
Pandas DataFrames
The pandas library is a Python package that provides fast, flexible, and expressive data structures that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool that's available in any language.
The two primary data structures of pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.
Exercise 37: Creating a Pandas Series
In this exercise, we will learn about how to create a pandas series object from the data structures that we created previously. If you have imported pandas as pd, then the function to create a series is simply pd.Series:
- Initialize labels, lists, and a dictionary:
import numpy as np
labels = ['a','b','c']
my_data = [10,20,30]
array_1 = np.array(my_data)
d = {'a':10,'b':20,'c':30}
print("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)
The output is as follows:
Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}
- Import pandas as pd by using the following command:
import pandas as pd
- Create a series from the my_data list by using the following command:
series_1=pd.Series(data=my_data)
print(series_1)
The output is as follows:
0 10
1 20
2 30
dtype: int64
- Create a series from the my_data list along with the labels as follows:
series_2=pd.Series(data=my_data, index = labels)
print(series_2)
The output is as follows:
a 10
b 20
c 30
dtype: int64
- Then, create a series from the NumPy array, as follows:
series_3=pd.Series(array_1,labels)
print(series_3)
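The output is as follows (the dtype is inherited from the NumPy array, so it may show int32 or int64 depending on your platform and NumPy version):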
a 10
b 20
c 30
dtype: int32
- Create a series from the dictionary, as follows:
series_4=pd.Series(d)
print(series_4)
The output is as follows:
a 10
b 20
c 30
dtype: int64
Exercise 38: Pandas Series and Data Handling
The pandas series object can hold many types of data. This is the key to constructing a bigger table where multiple series objects are stacked together to create a database-like entity:
- Create a pandas series with numerical data by using the following command:
print ("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(array_1))
The output is as follows:
Holding numerical data
-------------------------
0 10
1 20
2 30
dtype: int32
- Create a pandas series with labels by using the following command:
print ("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))
The output is as follows:
Holding text labels
--------------------
0 a
1 b
2 c
dtype: object
- Create a pandas series with functions by using the following command:
print ("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))
The output is as follows:
Holding functions
--------------------
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
- Create a pandas series that holds the methods of a dictionary object by using the following command:
print ("\nHolding objects from a dictionary\n",'-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))
The output is as follows:
Holding objects from a dictionary
----------------------------------------
0 <built-in method keys of dict object at 0x0000...
1 <built-in method items of dict object at 0x000...
2 <built-in method values of dict object at 0x00...
dtype: object
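A series that holds functions stores them as ordinary Python objects, so they remain callable after retrieval. The following is a small sketch under that assumption; the names funcs and sample are made up for illustration:

```python
import pandas as pd

# A series whose elements are built-in functions.
funcs = pd.Series(data=[sum, len, max])
sample = [1, 2, 3]

# Retrieve each function by position and call it on the sample list.
results = [funcs.iloc[i](sample) for i in range(len(funcs))]
print(funcs.dtype)
print(results)
```

Here sum, len, and max applied to [1, 2, 3] give 6, 3, and 3 respectively.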
Exercise 39: Creating Pandas DataFrames
The pandas DataFrame is similar to an Excel table or relational database (SQL) table that consists of three main components: the data, the index (or rows), and the columns. Under the hood, it is a stack of pandas series objects, which are themselves built on top of NumPy arrays. So, all of our previous knowledge of NumPy arrays applies here:
- Create a simple DataFrame from a two-dimensional matrix of numbers. First, the code draws 20 random integers between 1 and 9 from a discrete uniform distribution. Then, it reshapes them into a (5,4) NumPy array with 5 rows and 4 columns:
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
- Define the row labels as ('A','B','C','D','E') and the column labels as ('W','X','Y','Z'), then pass them to pd.DataFrame together with the data:
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels,
columns=column_headings)
- The function that creates a DataFrame is pd.DataFrame, which was called in the previous step. Print the resulting DataFrame:
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)
The sample output is as follows:
The data frame looks like
---------------------------------------------
W X Y Z
A 6 3 3 3
B 1 9 9 4
C 4 3 6 9
D 4 8 6 7
E 6 6 9 1
- Create a DataFrame from a Python dictionary of some lists of integers by using the following command:
d={'a':[10,20],'b':[30,40],'c':[50,60]}
- Pass this dictionary as the data argument to the pd.DataFrame function, along with a list of row labels as the index. Notice how the dictionary keys became the column names and that the values in each list were distributed across the rows:
df2=pd.DataFrame(data=d,index=['X','Y'])
print(df2)
The output is as follows:
a b c
X 10 30 50
Y 20 40 60
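The three components described at the start of this exercise (data, index, and columns) can be inspected directly. The sketch below uses a fixed matrix from np.arange instead of random numbers, purely so that the values are predictable; that substitution is an assumption made for illustration:

```python
import numpy as np
import pandas as pd

# Deterministic 5x4 matrix holding 1..20 (instead of random integers).
df = pd.DataFrame(
    data=np.arange(1, 21).reshape(5, 4),
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

print(list(df.index))        # row labels
print(list(df.columns))      # column labels
print(type(df.to_numpy()))   # the data as a NumPy array
print(type(df['W']))         # a single column is a pandas series

# NumPy operations work on the underlying data.
total = df.to_numpy().sum()
print(total)
```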
Note
In practice, the most common way to create a pandas DataFrame is to read tabular data from a file on your local disk or over the internet: CSV, text, JSON, HTML, Excel, and so on. We will cover some of these in the next chapter.
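As a brief preview of reading tabular data, the sketch below parses CSV text with pd.read_csv. To keep it self-contained, the CSV content is an in-memory string wrapped in io.StringIO rather than a real file; the data itself is invented for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = "W,X,Y,Z\n1,2,3,4\n5,6,7,8\n"

# read_csv accepts any file-like object, so StringIO stands in for a file.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

The first line of the CSV becomes the column headings, and the rows receive a default integer index.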
Exercise 40: Viewing a DataFrame Partially
In the previous section, we used print(df) to print the whole DataFrame. For a large dataset, we usually want to print only a section of the data. In this exercise, we will view only part of the DataFrame:
- Execute the following code to create a DataFrame with 25 rows and fill it with random numbers:
# 25 rows and 4 columns
matrix_data = np.random.randint(1,100,100).reshape(25,4)
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data,columns=column_headings)
- Run the following code to view only the first five rows of the DataFrame:
df.head()
The sample output is as follows (note that your output could be different due to randomness):
Figure 3.1: First five rows of the DataFrame
By default, head shows only five rows. If you want to see a specific number of rows, pass that number as an argument.
- Print the first eight rows by using the following command:
df.head(8)
The sample output is as follows:
Figure 3.2: First eight rows of the DataFrame
Just like head shows the first few rows, tail shows the last few rows.
- Print the last ten rows of the DataFrame using the tail method, as follows:
df.tail(10)
The sample output is as follows:

Figure 3.3: Last ten rows of the DataFrame
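Because the DataFrame in this exercise is filled with random numbers, the values differ from run to run. A minimal sketch of a reproducible variant follows, assuming NumPy's default_rng generator with an arbitrary seed (42 is chosen only for illustration); it also checks how many rows head and tail return:

```python
import numpy as np
import pandas as pd

# Seeded generator so the same numbers appear on every run.
rng = np.random.default_rng(42)
matrix_data = rng.integers(1, 100, size=(25, 4))  # values from 1 to 99
df = pd.DataFrame(data=matrix_data, columns=['W', 'X', 'Y', 'Z'])

first_five = df.head()     # default: five rows
first_eight = df.head(8)
last_ten = df.tail(10)

print(first_five.shape, first_eight.shape, last_ten.shape)
print(list(last_ten.index))  # row positions 15 through 24
```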
Indexing and Slicing Columns
There are two methods for indexing and slicing columns from a DataFrame. They are as follows:
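- Dot method
- Bracket method
The dot method accesses a single column as an attribute of the DataFrame (for example, df.X) and is a convenient shorthand when the column name is a valid Python identifier. The bracket method is intuitive and easy to follow: you access the data by the name (header) of the column, and you can pass a list of names to select several columns at once.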
The following code illustrates these concepts. Execute them in your Jupyter notebook:
print("\nThe 'X' column\n",'-'*25, sep='')
print(df['X'])
print("\nType of the column: ", type(df['X']), sep='')
print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')
print(df[['X','Z']])
print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')
The output is as follows (a screenshot is shown here because the actual column is long):

Figure 3.4: Rows of the 'X' column
This is the output showing the type of column:

Figure 3.5: Type of 'X' column
This is the output showing the X and Z columns indexed by passing a list:

Figure 3.6: Rows of the 'X' and 'Z' columns
This is the output showing the type of the pair of columns:

Figure 3.7: Type of the pair of 'X' and 'Z' columns
Note
For more than one column, the object turns into a DataFrame. But for a single column, it is a pandas series object.
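The listing in this section uses only the bracket method. As a small sketch of the dot method alongside it (the deterministic data and the column name 'X plus Z' are assumptions made for illustration), the following shows that both return the same series, and that a column whose name contains spaces can only be reached with brackets:

```python
import numpy as np
import pandas as pd

# Rows: [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]
df = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
                  columns=['W', 'X', 'Y', 'Z'])

dot_col = df.X          # dot (attribute) method
bracket_col = df['X']   # bracket method
print(dot_col.equals(bracket_col))

# A name with spaces is not a valid attribute, so brackets are required.
df['X plus Z'] = df['X'] + df['Z']
print(list(df['X plus Z']))

# One name gives a series; a list of names gives a DataFrame.
print(type(df['X']).__name__, type(df[['X', 'Z']]).__name__)
```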
Indexing and Slicing Rows
Indexing and slicing rows in a DataFrame can be done using the following methods:
- Label-based 'loc' method
- Index-based 'iloc' method
The loc method is intuitive and easy to follow: you access the data by the label of the row. The iloc method, on the other hand, lets you access rows by their numerical position. It can be very useful for a large table with thousands of rows, especially when you want to iterate over the table in a loop with a numerical counter. The following code illustrates both loc and iloc:
matrix_data = np.random.randint(1,10,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels,
columns=column_headings)
print("\nLabel-based 'loc' method for selecting row(s)\n",'-'*60, sep='')
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nIndex position based 'iloc' method for selecting row(s)\n",'-'*70, sep='')
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])
The sample output is as follows:

Figure 3.8: Output of the loc and iloc methods
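The listing above selects rows by single labels and lists. Slices work as well, with one difference worth noting: a loc slice includes its end label, whereas an iloc slice excludes its end position. The sketch below uses deterministic data (an assumption made so the values are predictable) and also shows the loop-with-counter pattern mentioned earlier:

```python
import numpy as np
import pandas as pd

# Rows A..E hold 1-4, 5-8, 9-12, 13-16, 17-20.
df = pd.DataFrame(np.arange(1, 21).reshape(5, 4),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

loc_slice = df.loc['B':'D']   # label slice: end label 'D' is included
iloc_slice = df.iloc[1:3]     # position slice: position 3 is excluded
cell = df.loc['C', 'Y']       # single value by row and column label

print(list(loc_slice.index))
print(list(iloc_slice.index))
print(cell)

# Iterating over rows with a numerical counter and iloc.
row_sums = [int(df.iloc[i].sum()) for i in range(len(df))]
print(row_sums)
```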
Exercise 41: Creating and Deleting a New Column or Row
One of the most common tasks in data wrangling is creating or deleting columns or rows of data from your DataFrame. Sometimes, you want to create a new column based on some mathematical operation or transformation involving the existing columns. This is similar to manipulating database records and inserting a new column based on simple transformations. We show some of these concepts in the following code blocks:
- Create two new columns, each holding the sum of the X and Z columns, using the following snippet:
print("\nA column is created by assigning it in relation\n",'-'*75, sep='')
df['New'] = df['X']+df['Z']
df['New (Sum of X and Z)'] = df['X']+df['Z']
print(df)
The sample output is as follows:
Figure 3.9: Output after adding a new column
- Drop a column using the df.drop method:
print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')
df = df.drop('New', axis=1)  # axis=0 (rows) is the default, so pass axis=1 to drop a column
print(df)
The sample output is as follows:
Figure 3.10: Output after dropping a column
- Drop a specific row using the df.drop method:
df1=df.drop('A')
print("\nA row is dropped by using df.drop method and axis=0\n",'-'*65, sep='')
print(df1)
The sample output is as follows:
Figure 3.11: Output after dropping a row
The drop method returns a new DataFrame and does not change the original DataFrame.
- Change the original DataFrame by setting the inplace argument to True:
print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)
A sample output is as follows:

Figure 3.12: Output after using the inplace argument
Note
By default, methods such as drop do not operate in place; that is, they do not modify the original DataFrame object but return a copy of it with the addition (or deletion) applied. The last bit of code shows how to change the existing DataFrame with the inplace=True argument. Note that this change cannot be undone, so use it with caution.
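The difference between the default behavior and inplace=True can be checked directly. The following is a minimal sketch with deterministic data and illustrative variable names (neither is part of the exercise):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
                  index=['A', 'B', 'C'],
                  columns=['W', 'X', 'Y', 'Z'])
df['New'] = df['X'] + df['Z']

# Default: drop returns a new DataFrame; the original keeps the column.
dropped = df.drop('New', axis=1)
original_still_has_new = 'New' in df.columns
copy_has_new = 'New' in dropped.columns
print(original_still_has_new, copy_has_new)

# inplace=True: the original is modified and the call returns None.
result = df.drop('New', axis=1, inplace=True)
original_has_new_after = 'New' in df.columns
print(result, original_has_new_after)
```

Reassigning the result, as in df = df.drop('New', axis=1), reaches the same end state while keeping the step explicit.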