scikit-learn toy datasets

scikit-learn provides some built-in datasets that can be used for testing purposes. They're all available in the package sklearn.datasets and share a common structure: the data attribute contains the whole input set X, while target contains the labels for classification or the target values for regression. For example, considering the Boston house pricing dataset (used for regression; note that load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so this snippet requires an older release), we have:

>>> from sklearn.datasets import load_boston

>>> boston = load_boston()
>>> X = boston.data
>>> Y = boston.target

>>> X.shape
(506, 13)
>>> Y.shape
(506,)

In this case, we have 506 samples with 13 features and a single target value. In this book, we're going to use it for regression and the MNIST-like handwritten digit dataset (load_digits()) for classification tasks. Note that load_digits() doesn't return the original MNIST: it loads a much smaller set of 1,797 grayscale images of 8 × 8 pixels. scikit-learn also provides functions for creating dummy datasets from scratch: make_classification(), make_regression(), and make_blobs() (particularly useful for testing clustering algorithms). They're very easy to use, and in many cases they are the best choice for testing a model without loading more complex datasets.
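As a minimal sketch of the functions mentioned above, the following snippet loads the digits dataset and generates three dummy datasets. The sample counts, feature counts, noise level, and random_state value are arbitrary choices made for illustration and aren't taken from the text:

```python
from sklearn.datasets import (
    load_digits,
    make_blobs,
    make_classification,
    make_regression,
)

# Bundled digits dataset: same data/target structure as the other loaders
digits = load_digits()
X_digits, Y_digits = digits.data, digits.target
print(X_digits.shape, Y_digits.shape)  # (1797, 64) (1797,)

# Dummy binary classification problem (all sizes are illustrative assumptions)
X_clf, Y_clf = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=1000,
)
print(X_clf.shape, Y_clf.shape)  # (200, 5) (200,)

# Dummy regression problem with Gaussian noise added to the target
X_reg, Y_reg = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=1000,
)
print(X_reg.shape, Y_reg.shape)  # (200, 5) (200,)

# Three isotropic Gaussian blobs, handy for testing clustering algorithms
X_blobs, Y_blobs = make_blobs(
    n_samples=200, n_features=2, centers=3, random_state=1000,
)
print(X_blobs.shape, Y_blobs.shape)  # (200, 2) (200,)
```

Unlike the loaders, the make_*() functions return the pair (X, Y) directly, and fixing random_state makes the generated data reproducible across runs.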

Visit http://scikit-learn.org/stable/datasets/ for further information.
The digit dataset bundled with scikit-learn (load_digits()) is deliberately small: 1,797 images of 8 × 8 pixels, rather than the original MNIST. If you want to experiment with the original one, refer to the website managed by Y. LeCun, C. Cortes, and C. Burges: http://yann.lecun.com/exdb/mnist/. There you can download the full version, made up of 70,000 handwritten digits already split into training and test sets.