Pipelines
As experiments grow, so does the complexity of the operations. We may split up our dataset, binarize features, perform feature-based scaling, perform sample-based scaling, and many more operations.
Keeping track of all of these operations can get quite confusing and can result in being unable to replicate the result. Problems include forgetting a step, incorrectly applying a transformation, or adding a transformation that wasn't needed.
Another issue is the order of the code. In the previous section, we created our X_transformed dataset and then created a new estimator for the cross-validation. If we had multiple steps, we would need to track all of these changes to the dataset in the code.
Pipelines are a construct that addresses these problems (and others, which we will see in the next chapter). Pipelines store the steps in your data mining workflow. They can take your raw data in, perform all the necessary transformations, and then create a prediction. This allows us to use pipelines in functions such as cross_val_score, where they expect an estimator. First, import the Pipeline object:
from sklearn.pipeline import Pipeline
Pipelines take a list of steps as input, representing the chain of the data mining application. The last step needs to be an Estimator, while all previous steps are Transformers. The input dataset is altered by each Transformer, with the output of one step being the input of the next step. Finally, the samples are classified by the last step's estimator. In our pipeline, we have two steps:
- Use MinMaxScaler to scale the feature values from 0 to 1
- Use KNeighborsClassifier as the classification algorithm
Each step is then represented by a tuple ('name', step). We can then create our pipeline:
scaling_pipeline = Pipeline([('scale', MinMaxScaler()), ('predict', KNeighborsClassifier())])
The key here is the list of tuples. The first tuple is our scaling step and the second tuple is the predicting step. We give each step a name: the first we call scale and the second we call predict, but you can choose your own names. The second part of the tuple is the actual Transformer or estimator object.
Running this pipeline is now very easy, using the cross validation code from before:
scores = cross_val_score(scaling_pipeline, X_broken, y, scoring='accuracy')
print("The pipeline scored an average accuracy of {0:.1f}%".format(np.mean(scores) * 100))
This gives us the same score as before (82.3 percent), which is expected, as we are effectively running the same steps.
In later chapters, we will use more advanced testing methods, and setting up pipelines is a great way to ensure that the code complexity does not grow unmanageably.
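To see the whole workflow in one place, here is a self-contained sketch of the pipeline above. It substitutes scikit-learn's built-in Iris data for the X_broken array built earlier in the chapter (so the exact accuracy will differ from the 82.3 percent quoted above), and imports cross_val_score from sklearn.model_selection, as in current scikit-learn versions:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset; in the chapter, the X_broken and y arrays are used instead
X, y = load_iris(return_X_y=True)

# Scale each feature to the range 0 to 1, then classify with nearest neighbours
scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])

scores = cross_val_score(scaling_pipeline, X, y, scoring='accuracy')
print("The pipeline scored an average accuracy of {0:.1f}%".format(np.mean(scores) * 100))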