- Machine Learning with scikit-learn Quick Start Guide
- Kevin Jolly
- 2021-06-24 18:15:56
Missing values
Another constraint of scikit-learn is that most of its estimators cannot handle data with missing values. Therefore, we must first check whether any column in our dataset contains missing values. We can do this by using the following code:
#Checking every column for missing values
df.isnull().any()
The output is a Boolean value for each column, which is True if that column contains at least one missing value.
Here, we note that every column has some missing values.
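Knowing how many values are missing in each column is often as useful as knowing whether any are missing. The following is a minimal sketch of that check; the toy DataFrame and its column names (amount, balance) are hypothetical stand-ins rather than the dataset used in this chapter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the chapter's DataFrame
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 250.0, np.nan],
    "balance": [0.0, 50.0, np.nan, 75.0],
})

# True for every column that contains at least one missing value
has_missing = toy.isnull().any()

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

print(has_missing)
print(missing_counts)
print(missing_share)
```

The same three calls can be applied to df directly.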
Missing values can be handled in a variety of ways, such as the following:
- Median imputation
- Mean imputation
- Filling them with the most frequent value (the mode)
The number of available techniques is quite large, and the right choice depends on the nature of your dataset. Filling in missing values in this way is known as imputation, and it is generally treated as one part of feature engineering.
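The three approaches listed above can be sketched with plain pandas. This is only an illustration under assumed data: the toy DataFrame, its column names, and its values are hypothetical, and this is not the approach this book goes on to use:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 100.0],
    "type": ["PAYMENT", "TRANSFER", None, "PAYMENT"],
})

# Median imputation: replace missing numbers with the column median
median_filled = toy["amount"].fillna(toy["amount"].median())

# Mean imputation: replace missing numbers with the column mean
mean_filled = toy["amount"].fillna(toy["amount"].mean())

# Majority-value imputation: replace missing categories with the mode
mode_filled = toy["type"].fillna(toy["type"].mode()[0])

print(median_filled.tolist())
print(mean_filled.tolist())
print(mode_filled.tolist())
```

The median is less sensitive to extreme values than the mean, which is why the two filled columns differ here.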
Feature engineering can be done for both categorical and numerical columns and would require an entire book to explain the various methodologies that comprise the topic.
Since this book focuses on applying the various machine learning algorithms that scikit-learn offers, feature engineering will not be covered.
So, in keeping with the goals of this book, we will impute all the missing values with a zero.
We can do this by using the following code:
#Imputing the missing values with a 0
df = df.fillna(0)
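After imputing, it is worth confirming that no missing values remain. A minimal sketch of that check, again using a hypothetical toy DataFrame in place of df:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "amount": [100.0, np.nan],
    "balance": [np.nan, 40.0],
})

# Impute every missing value with a 0, as done for df above
toy = toy.fillna(0)

# Total number of missing values left across all columns
remaining = int(toy.isnull().sum().sum())
print(remaining)
```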
We now have a dataset that is ready for machine learning with scikit-learn. We will use this dataset in all the remaining chapters. To make this easy, we will export the dataset as a .csv file and store it in the same directory as the Jupyter Notebook that you are working in.
We can do this by using the following code:
#Exporting the dataset as a .csv file
df.to_csv('fraud_prediction.csv')
This will create a .csv file of this dataset in the directory that you are working in, which you can load into the notebook again using pandas.
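Loading the file back can be sketched as follows. Because to_csv writes the DataFrame index by default, a plain read_csv call returns it as an extra column named Unnamed: 0. The toy data, the temporary directory, and the file name toy_export.csv are assumptions made so that the example is self-contained:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset
toy = pd.DataFrame({"amount": [100.0, 0.0], "balance": [0.0, 40.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_export.csv")

    # By default, the index is written as the first, unnamed column
    toy.to_csv(path)

    # A plain read keeps that column under the name "Unnamed: 0"
    reloaded = pd.read_csv(path)

    # Treating the first column as the index restores the original layout
    reloaded_with_index = pd.read_csv(path, index_col=0)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

Either form works, as long as the extra column is dropped or ignored before the data is passed to a model.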