
Data cleaning

Data cleaning is a fundamental process to make sure we are able to produce good results at the end. It is task-specific: the cleaning you have to perform on audio data will be different from what you perform on images, text, or time series data.

We will need to make sure there is no missing data, and if there is, we can decide how to deal with it. For example, if an instance is missing a few variables, we can fill them with the average of that variable, fill them with a value the variable cannot assume (such as -1 if the variable is between 0 and 1), or disregard the instance altogether if we have plenty of data.
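A minimal sketch of these three options, using pandas on a small hypothetical DataFrame (the column names and values are purely illustrative), could look like this:

import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({"temperature": [21.5, np.nan, 19.0],
                   "humidity": [0.40, 0.55, np.nan]})

# Option 1: fill missing values with the column average
filled_mean = df.fillna(df.mean())

# Option 2: fill with a value the variable cannot assume,
# such as -1 for a variable bounded between 0 and 1
filled_sentinel = df.fillna(-1)

# Option 3: drop incomplete instances if we have plenty of data
dropped = df.dropna()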

Also, it's good to check whether the data respects the limits of the quantity we are measuring. For example, a temperature in Celsius cannot be lower than -273.15 degrees; if it is, we know straight away that the data point is unreliable.

Other checks include the format, the data types, and the variance in the dataset.
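A rough sketch of such checks, again assuming a small hypothetical pandas DataFrame, might be:

import pandas as pd

# Hypothetical readings; the second temperature is physically impossible
df = pd.DataFrame({"temperature_c": [18.2, -300.0, 25.1],
                   "humidity": [0.40, 0.55, 0.55]})

# Range check: Celsius values below absolute zero are unreliable
print(df[df["temperature_c"] < -273.15])

# Data types: each column should have the type we expect
print(df.dtypes)

# Variance: a column with (near-)zero variance carries little information
print(df.var())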

It's possible to load some clean data directly from scikit-learn. There are many datasets for all sorts of tasks. For example, if we want to load some image data, we can use the following Python code:

from sklearn.datasets import fetch_lfw_people
# Fetch faces of people with at least 70 images, rescaled to 40% of the original size
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

This data is known as Labeled Faces in the Wild, a dataset for face recognition.
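Once loaded, we can take a quick look at what the fetched object contains; the attributes below are the standard fields of the Bunch object that scikit-learn returns for this dataset:

# Inspect the fetched Bunch object
print(lfw_people.images.shape)   # (n_samples, height, width) image array
print(lfw_people.data.shape)     # (n_samples, n_features) flattened pixels
print(lfw_people.target_names)   # names of the people in the dataset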
