- Python Data Science Essentials
- Alberto Boschetti Luca Massaron
- 2021-08-13 15:19:38
Scikit-learn sample generators
As a last learning resource, the Scikit-learn package also offers the possibility to quickly create synthetic datasets for regression, binary and multilabel classification, cluster analysis, and dimensionality reduction.
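As a rough illustration of the families just listed, the following sketch calls one generator per task. The choice of generators (make_regression, make_multilabel_classification, make_blobs, and make_swiss_roll) and all the parameter values are assumptions made for this example, not a selection taken from the text:

```python
from sklearn import datasets

# Regression: continuous target built from a few informative features
X_reg, y_reg = datasets.make_regression(
    n_samples=100, n_features=5, n_informative=3, noise=0.1, random_state=101)

# Multilabel classification: y is a binary indicator matrix, one column per label
X_ml, y_ml = datasets.make_multilabel_classification(
    n_samples=100, n_features=8, n_classes=3, random_state=101)

# Cluster analysis: isotropic Gaussian blobs around three centers
X_blobs, y_blobs = datasets.make_blobs(
    n_samples=100, n_features=2, centers=3, random_state=101)

# Dimensionality reduction: points lying on a 3-D Swiss roll manifold
X_roll, t_roll = datasets.make_swiss_roll(n_samples=100, random_state=101)

generated_shapes = {
    "regression": (X_reg.shape, y_reg.shape),
    "multilabel": (X_ml.shape, y_ml.shape),
    "blobs": (X_blobs.shape, y_blobs.shape),
    "swiss_roll": (X_roll.shape, t_roll.shape),
}
for name, shapes in generated_shapes.items():
    print(name, shapes)
```

Each call returns the data matrix together with its target (or, for the Swiss roll, the position of each point along the manifold), so the arrays can be passed directly to an estimator.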
The main advantage of resorting to synthetic data lies in its instantaneous creation in the working memory of your Python console. You can therefore create bigger data examples without engaging in long downloading sessions from the internet (and without saving a lot of data on your disk).
For example, you may need to work on a classification problem involving a million data points:
In: from sklearn import datasets
    X, y = datasets.make_classification(n_samples=10**6,
                                        n_features=10, random_state=101)
    print(X.shape, y.shape)
Out: (1000000, 10) (1000000,)
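After importing just the datasets module, we use the make_classification function to ask for one million examples (the n_samples parameter) and 10 features (n_features). Note that n_features is the total number of features: with the default settings, only two of them are informative and two more are redundant combinations of the informative ones. We set random_state to 101, so we are assured that we can replicate the same dataset at a different time and on a different machine.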
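If you want to control how many of the generated features actually carry signal, make_classification exposes the n_informative, n_redundant, and n_repeated parameters. The following is a minimal sketch; the specific values are arbitrary choices for illustration, not settings recommended by the text:

```python
import numpy as np
from sklearn import datasets

# 10 features in total: 5 informative, 2 redundant (linear combinations
# of the informative ones), 0 repeated; the remaining 3 are random noise
X, y = datasets.make_classification(
    n_samples=1000, n_features=10,
    n_informative=5, n_redundant=2, n_repeated=0,
    n_classes=2, random_state=101)

class_counts = np.bincount(y)  # number of examples in each class
print(X.shape, y.shape)
print(class_counts)
```

By default the classes are roughly balanced; the counts are not exactly equal because a small fraction of the labels is randomly flipped (the flip_y parameter).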
For instance, you can type the following command:
In: datasets.make_classification(1, n_features=4, random_state=101)
This will always give you the following output:
Out: (array([[-3.31994186, -2.39469384, -2.35882002, 1.40145585]]),
array([0]))
Whatever the computer and the specific situation, random_state ensures deterministic results that make your experiments replicable, provided that the same versions of Scikit-learn and NumPy are installed.
Setting the random_state parameter to a specific integer (in this case 101, but it may be any number that you prefer or find useful) allows easy replication of the same dataset on your own machine, on different operating systems, and on different machines.
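You can check this behavior yourself with a short sketch such as the following, which compares two calls that share a seed with a third call that uses a different one. The variable names and the second seed (102) are assumptions chosen for this example:

```python
import numpy as np
from sklearn import datasets

# Two calls with the same seed, one call with a different seed
X_a, y_a = datasets.make_classification(n_samples=100, n_features=4, random_state=101)
X_b, y_b = datasets.make_classification(n_samples=100, n_features=4, random_state=101)
X_c, y_c = datasets.make_classification(n_samples=100, n_features=4, random_state=102)

# Same seed: features and labels are identical, element by element
same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
# Different seed: a different dataset is drawn
different_seed_matches = np.array_equal(X_a, X_c)

print(same_seed_matches)       # True
print(different_seed_matches)  # False
```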
By the way, did it take too long?
On a machine with an i3-2330M CPU @ 2.20 GHz, it takes the following:
In: %timeit X, y = datasets.make_classification(n_samples=10**6,
                                                n_features=10, random_state=101)
Out: 1 loops, best of 3: 1.17 s per loop
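The %timeit magic is only available inside IPython or Jupyter. If you are working in a plain Python script, a similar measurement can be sketched with time.perf_counter. The helper time_make_classification below is a hypothetical function written for this example, and it uses a smaller sample size so that it runs quickly:

```python
import time
from sklearn import datasets

def time_make_classification(n_samples, n_features=10, repeats=3):
    """Return the best wall-clock time (in seconds) over a few runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        datasets.make_classification(n_samples=n_samples,
                                     n_features=n_features,
                                     random_state=101)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Smaller than the one million examples used above, to keep the check fast
best = time_make_classification(n_samples=10**5)
print(f"best of 3: {best:.3f} s per run")
```

The figure you obtain depends on your hardware and on the installed library versions, so it is not directly comparable with the timing reported above.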
If it did not take too long on your machine either, and if you have set up and tested everything up to this point, you are ready to start your data science journey.