官术网_书友最值得收藏!

Classic datasets versus real-world datasets

Data scientists and machine-learning practitioners often use classic datasets to demonstrate the behavior of certain models. The Iris dataset, composed of 150 samples of three types of iris flowers, is one of the most commonly used to demonstrate or to teach predictive analytics. It has been around since 1936!

The Boston housing dataset and the Titanic dataset are other very popular datasets for predictive analytics. For text classification, the Reuters or the 20 newsgroups text datasets are very common, while image recognition datasets are used to benchmark deep learning models. These classic datasets are used to establish baselines when evaluating the performances of algorithms and models. Their characteristics are well known, and data scientists know what performances to expect.

These classic datasets can be downloaded:

However, classic datasets can be weak equivalents of real datasets, which have been extracted and aggregated from a perse set of sources: databases, APIs, free form documents, social networks, spreadsheets, and so on. In a real-life situation, the data scientist must often deal with messy data that has missing values, absurd outliers, human errors, weird formatting, strange inputs, and skewed distributions.

The first task in a predictive analytics project is to clean up the data. In the following section, we will look at the main issues with raw data and what strategies can be applied. Since we will ultimately be using a linear model for our predictions, we will process the data with that in mind.

主站蜘蛛池模板: 甘肃省| 杭锦后旗| 确山县| 三原县| 石柱| 延长县| 香格里拉县| 塔河县| 石屏县| 瑞丽市| 竹溪县| 新河县| 三原县| 西藏| 桑植县| 新晃| 塔河县| 百色市| 洛扎县| 体育| 仁寿县| 龙江县| 当阳市| 蕉岭县| 武宣县| 丹寨县| 都兰县| 灵石县| 右玉县| 洞头县| 常熟市| 奈曼旗| 巨鹿县| 凭祥市| 瑞昌市| 郸城县| 井冈山市| 慈利县| 东丽区| 景德镇市| 秀山|