- Mastering Python for Data Science
- Samir Madhavan
Data cleansing
The data in its raw form generally requires some cleaning so that it can be analyzed or a dashboard can be created on it. There are many reasons that data might have issues. For example, the Point of Sale system at a retail shop might have malfunctioned and inputted some data with missing values. We'll be learning how to handle such data in the following section.
Checking the missing data
Generally, most data will have some missing values. There could be various reasons for this: the source system which collects the data might not have collected the values or the values may never have existed. Once you have the data loaded, it is essential to check the missing elements in the data. Depending on the requirements, the missing data needs to be handled. It can be handled by removing a row or replacing a missing value with an alternative value.
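The check itself is a one-liner. As a quick sketch of the idea (using a small hypothetical frame, not the book's dataset), `isnull()` flags each missing cell and `sum()` counts them per column:

```python
import pandas as pd
import numpy as np

# Small hypothetical frame; not the Student Weight dataset
toy = pd.DataFrame({'name': ['a', 'b', None, 'd'],
                    'score': [1.0, np.nan, 3.0, np.nan]})

mask = toy.isnull()        # True wherever a value is missing
per_column = mask.sum()    # missing-value count per column
```
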
In the Student Weight data, to check whether the Location 1 column has missing values, the following command can be used:
>>> d['Location 1'].isnull()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
The isnull() method outputs True or False for each row of the column. If it's True, then the value is missing. This output can be aggregated to find the number of missing values:
>>> d['Location 1'].isnull().value_counts()
False    3246
True       24
dtype: int64
The preceding command shows that the Location 1 column has 24 instances of missing values. These missing values can be handled either by removing the rows that contain them or by replacing them with alternative values. To remove only the rows where Location 1 is missing, execute the following command:
>>> d = d.dropna(subset=['Location 1'])
To remove all rows that contain at least one missing value, use the following command:
>>> d = d.dropna(how='any')
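The how argument controls how aggressively rows are dropped. A minimal sketch on a hypothetical frame (not the book's data): how='any' drops a row containing at least one missing value, while how='all' drops only rows where every value is missing:

```python
import pandas as pd
import numpy as np

# Hypothetical frame: row 1 is fully missing, rows 2 and 3 are partially missing
df = pd.DataFrame({'X': [1.0, np.nan, 3.0, np.nan],
                   'Y': [4.0, np.nan, np.nan, 8.0]})

dropped_any = df.dropna(how='any')   # keeps only fully complete rows
dropped_all = df.dropna(how='all')   # drops only the all-NaN row
```
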
Filling the missing data
Let's define some DataFrames to work with:
>>> df = pd.DataFrame(np.random.randn(5, 3),
...                   index=['a0', 'a10', 'a20', 'a30', 'a40'],
...                   columns=['X', 'Y', 'Z'])
>>> df
            X         Y         Z
a0  -0.854269  0.117540  1.515373
a10 -0.483923 -0.379934  0.484155
a20 -0.038317  0.196770 -0.564176
a30  0.752686  1.329661 -0.056649
a40 -1.383379  0.632615  1.274481
We'll now reindex the DataFrame with some extra row labels, which will create null values for the new rows:
>>> df2 = df.reindex(['a0', 'a1', 'a10', 'a11', 'a20', 'a21', 'a30', 'a31', 'a40', 'a41'])
>>> df2
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1        NaN       NaN       NaN
a10  1.413044  0.615997  0.947334
a11       NaN       NaN       NaN
a20  1.583516  1.388921  0.458771
a21       NaN       NaN       NaN
a30  0.479579  1.427625  1.407924
a31       NaN       NaN       NaN
a40  0.455510 -0.880937  1.375555
a41       NaN       NaN       NaN
To replace the null values in the df2 DataFrame with zeros, execute the following command:
>>> df2.fillna(0)
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1   0.000000  0.000000  0.000000
a10  1.413044  0.615997  0.947334
a11  0.000000  0.000000  0.000000
a20  1.583516  1.388921  0.458771
a21  0.000000  0.000000  0.000000
a30  0.479579  1.427625  1.407924
a31  0.000000  0.000000  0.000000
a40  0.455510 -0.880937  1.375555
a41  0.000000  0.000000  0.000000
If you want to fill the value with forward propagation, which means that the value previous to the null value in the column will be used to fill the null value, the following command can be used:
>>> df2.fillna(method='pad')  # filling with forward propagation
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1  -1.193371  0.912654 -0.780461
a10  1.413044  0.615997  0.947334
a11  1.413044  0.615997  0.947334
a20  1.583516  1.388921  0.458771
a21  1.583516  1.388921  0.458771
a30  0.479579  1.427625  1.407924
a31  0.479579  1.427625  1.407924
a40  0.455510 -0.880937  1.375555
a41  0.455510 -0.880937  1.375555
If you want to fill the null values of the column with the column mean, then the following command can be utilized:
>>> df2.fillna(df2.mean())
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1   0.547655  0.692852  0.681825
a10  1.413044  0.615997  0.947334
a11  0.547655  0.692852  0.681825
a20  1.583516  1.388921  0.458771
a21  0.547655  0.692852  0.681825
a30  0.479579  1.427625  1.407924
a31  0.547655  0.692852  0.681825
a40  0.455510 -0.880937  1.375555
a41  0.547655  0.692852  0.681825
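The mean fill can be verified on a small reproducible frame (hypothetical values). Passing a Series to fillna fills each column with its own statistic, and NaNs are excluded when the mean is computed:

```python
import pandas as pd

# Reproduce the reindex-then-fill pattern above with fixed values
df = pd.DataFrame({'X': [1.0, 3.0], 'Y': [2.0, 4.0]}, index=['a0', 'a10'])
df2 = df.reindex(['a0', 'a1', 'a10'])   # 'a1' has no data, so it becomes NaN

filled = df2.fillna(df2.mean())         # column means: X -> 2.0, Y -> 3.0
```
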
String operations
Sometimes, you may want to modify a string column in your data. The following techniques cover some common string operations:
- Substring: Let's start by choosing the first five rows of the AREA NAME column in the data as our sample data to modify:
>>> df = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
>>> df['AREA NAME'][0:5]
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT
Name: AREA NAME, dtype: object
In order to extract the first word from the AREA NAME column, we'll use the extract function as shown in the following command:
>>> df['AREA NAME'][0:5].str.extract(r'(\w+)')
0    RAVENA
1    RAVENA
2    RAVENA
3    COHOES
4    COHOES
Name: AREA NAME, dtype: object
In the preceding command, the str attribute of the Series is utilized. The str class provides an extract method that accepts a regular expression, which makes it very powerful. It is also possible to extract the second word of AREA NAME as a separate column:
>>> df['AREA NAME'][0:5].str.extract(r'(\w+)\s(\w+)')
        0         1
0  RAVENA  COEYMANS
1  RAVENA  COEYMANS
2  RAVENA  COEYMANS
3  COHOES      CITY
4  COHOES      CITY
To extract data into separate columns, each pattern needs to be enclosed in its own pair of parentheses.
- Filtering: If we want to filter rows with data on ELEMENTARY schools, then the following command can be used:
>>> df[df['GRADE LEVEL'] == 'ELEMENTARY']
- Uppercase: To convert the area name to uppercase, we'll use the following command:
>>> df['AREA NAME'][0:5].str.upper()
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT
Name: AREA NAME, dtype: object
Since the strings are already in uppercase, there is no visible difference in the output.
- Lowercase: To convert AREA NAME to lowercase, we'll use the following command:
>>> df['AREA NAME'][0:5].str.lower()
0    ravena coeymans selkirk central school district
1    ravena coeymans selkirk central school district
2    ravena coeymans selkirk central school district
3                        cohoes city school district
4                        cohoes city school district
Name: AREA NAME, dtype: object
- Length: To find the length of each element of the AREA NAME column, we'll use the following command:
>>> df['AREA NAME'][0:5].str.len()
0    47
1    47
2    47
3    27
4    27
Name: AREA NAME, dtype: int64
- Split: To split AREA NAME on whitespace, we'll use the following command:
>>> df['AREA NAME'][0:5].str.split(' ')
0    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
1    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
2    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
3                     [COHOES, CITY, SCHOOL, DISTRICT]
4                     [COHOES, CITY, SCHOOL, DISTRICT]
Name: AREA NAME, dtype: object
- Replace: If we want to replace the trailing DISTRICT in the area names with DIST, then the following command can be used:
>>> df['AREA NAME'][0:5].str.replace('DISTRICT$', 'DIST')
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
3                        COHOES CITY SCHOOL DIST
4                        COHOES CITY SCHOOL DIST
Name: AREA NAME, dtype: object
The first argument of the replace method is the regular expression used to identify the portion of the string to replace. The second argument is the value to replace it with.
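Two notes for newer pandas versions, sketched on a hypothetical two-row sample: extract accepts named groups, which label the resulting columns, and str.replace requires regex=True before it treats the pattern as a regular expression:

```python
import pandas as pd

# Hypothetical sample mirroring the AREA NAME column
s = pd.Series(['RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT',
               'COHOES CITY SCHOOL DISTRICT'])

# Named groups label the extracted columns 'first' and 'second'
parts = s.str.extract(r'(?P<first>\w+)\s(?P<second>\w+)')

# regex=True is needed in newer pandas for pattern-based replacement
shortened = s.str.replace(r'DISTRICT$', 'DIST', regex=True)
```
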
Merging data
To combine datasets, the concat function of pandas can be utilized. Let's take the AREA NAME and COUNTY columns with their first five rows:
>>> d[['AREA NAME', 'COUNTY']][0:5]
                                         AREA NAME  COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
3                      COHOES CITY SCHOOL DISTRICT  ALBANY
4                      COHOES CITY SCHOOL DISTRICT  ALBANY
We can divide the data as follows:
>>> p1 = d[['AREA NAME', 'COUNTY']][0:2]
>>> p2 = d[['AREA NAME', 'COUNTY']][2:5]
The first two rows of the data are in p1 and the last three rows are in p2. These pieces can be combined using the concat() function:
>>> pd.concat([p1, p2])
                                         AREA NAME  COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
3                      COHOES CITY SCHOOL DISTRICT  ALBANY
4                      COHOES CITY SCHOOL DISTRICT  ALBANY
The combined pieces can be identified by assigning a key:
>>> concatenated = pd.concat([p1, p2], keys=['p1', 'p2'])
>>> concatenated
                                            AREA NAME  COUNTY
p1 0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
   1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
p2 2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
   3                      COHOES CITY SCHOOL DISTRICT  ALBANY
   4                      COHOES CITY SCHOOL DISTRICT  ALBANY
Using the keys, the pieces can be extracted back from the concatenated data:
>>> concatenated.loc['p1']
                                         AREA NAME  COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
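The whole split-and-recombine round trip can be sketched with toy county values (not the full dataset); the keys become the outer level of a MultiIndex, and .loc retrieves a piece by its key:

```python
import pandas as pd

# Hypothetical pieces standing in for p1 and p2
p1 = pd.DataFrame({'COUNTY': ['ALBANY', 'ALBANY']}, index=[0, 1])
p2 = pd.DataFrame({'COUNTY': ['ALBANY', 'ALBANY', 'ALBANY']}, index=[2, 3, 4])

# keys label each piece in the outer index level
concatenated = pd.concat([p1, p2], keys=['p1', 'p2'])
piece = concatenated.loc['p1']   # extract the first piece back by its key
```
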