
scikit-learn toy datasets

scikit-learn provides several built-in datasets that can be used for testing purposes. They are all available in the sklearn.datasets package and share a common structure: the data attribute contains the whole input set X, while target contains the labels for classification or the target values for regression. For example, for the Boston house pricing dataset (used for regression), we have:

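>>> # load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
>>> # so this example requires an older release.
>>> from sklearn.datasets import load_boston

>>> boston = load_boston()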
>>> X = boston.data
>>> Y = boston.target

>>> X.shape
(506, 13)
>>> Y.shape
(506,)

In this case, we have 506 samples with 13 features and a single target value. In this book, we're going to use it for regression, and the handwritten digits dataset (load_digits(), a small MNIST-like collection of 8 × 8 images) for classification tasks. scikit-learn also provides functions for creating dummy datasets from scratch: make_classification(), make_regression(), and make_blobs() (particularly useful for testing clustering algorithms). They're very easy to use and, in many cases, they are the best way to test a model without loading more complex datasets.
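The three generator functions mentioned above can be sketched as follows. This is only a minimal illustration: the sample counts, feature counts, noise level, and random_state values are arbitrary assumptions chosen for the example, not values taken from the book.

```python
from sklearn.datasets import make_blobs, make_classification, make_regression

# Binary classification problem: 500 samples, 10 features (5 informative, 2 redundant).
# All sizes and the seed are illustrative assumptions.
X_clf, y_clf = make_classification(n_samples=500, n_features=10, n_informative=5,
                                   n_redundant=2, n_classes=2, random_state=1000)

# Regression problem with Gaussian noise added to the target values.
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=5.0,
                               random_state=1000)

# Three isotropic Gaussian blobs in two dimensions, handy for clustering tests.
X_blob, y_blob = make_blobs(n_samples=500, n_features=2, centers=3,
                            cluster_std=1.0, random_state=1000)

print(X_clf.shape, y_clf.shape)    # (500, 10) (500,)
print(X_reg.shape, y_reg.shape)    # (500, 10) (500,)
print(X_blob.shape, y_blob.shape)  # (500, 2) (500,)
```

Each function returns the input set and the targets as a pair of NumPy arrays, so the result can be passed directly to an estimator. Fixing random_state makes the generated data reproducible between runs.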

Visit http://scikit-learn.org/stable/datasets/ for further information.
The digits dataset bundled with scikit-learn (load_digits()) is much smaller than the original MNIST: it contains 1,797 images of 8 × 8 pixels. If you want to experiment with the original one, refer to the website managed by Y. LeCun, C. Cortes, and C. Burges: http://yann.lecun.com/exdb/mnist/. There you can download the full version, made up of 70,000 handwritten digits already split into training and test sets.
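As a quick check of the bundled digits dataset returned by load_digits(), the following sketch inspects its structure. It follows the same data/target convention as the other toy datasets; the variable names are only illustrative.

```python
from sklearn.datasets import load_digits

digits = load_digits()

X = digits.data    # each 8 x 8 image flattened into 64 features
Y = digits.target  # digit labels from 0 to 9

print(X.shape)              # (1797, 64)
print(Y.shape)              # (1797,)
print(digits.images.shape)  # (1797, 8, 8)
```

The images attribute holds the same samples in their original 8 × 8 layout, which is convenient for plotting.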