- Feature Engineering Made Easy
- Sinan Ozdemir Divya Susarla
Evaluating unsupervised learning algorithms
This is a bit trickier. Because unsupervised learning is not concerned with predictions, we cannot directly evaluate performance based on how well the model can predict a value. That being said, if we are performing a cluster analysis, such as in the previous marketing segmentation example, then we will usually utilize the silhouette coefficient (a measure of separation and cohesion of clusters between -1 and 1) and some human-driven analysis to decide if a feature engineering procedure has improved model performance or if we are merely wasting our time.
Here is an example of using Python and scikit-learn to import and calculate the silhouette coefficient for some fake data:
from sklearn.metrics import silhouette_score

# attributes is a matrix (rows = observations, columns = features) of tabular data
attributes = tabular_data
# cluster_labels is the list of labels output by a clustering algorithm
cluster_labels = outputted_labels_from_clustering

silhouette_score(attributes, cluster_labels)
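Because `tabular_data` and `outputted_labels_from_clustering` are placeholders, the snippet above will not run on its own. Here is a minimal self-contained sketch, assuming scikit-learn's `make_blobs` and `KMeans` (neither of which appears in the original snippet) as stand-ins for real data and a real clustering step:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# generate fake tabular data with three well-separated clusters
attributes, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# fit a clustering model and use its predicted labels as cluster_labels
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(attributes)

# scores near 1 indicate tight, well-separated clusters; near -1, poor clustering
score = silhouette_score(attributes, cluster_labels)
print(score)
```

With well-separated blobs like these, the score lands comfortably above zero; on messier real-world data, a rising silhouette score after a feature engineering step is the signal we are looking for.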
We will spend much more time on unsupervised learning later on in this book as it becomes more relevant. Most of our examples will revolve around predictive analytics/supervised learning.
It is important to remember that the reason we are standardizing algorithms and metrics is so that we may showcase the power of feature engineering and so that you may repeat our procedures with success. In practice, you may well be optimizing for something other than accuracy (a true positive rate, for example) and wish to use decision trees instead of logistic regression. This is not only fine but encouraged. Remember, though, to follow the steps for evaluating a feature engineering procedure and to compare baseline and post-engineering performance.
It is possible that you are not reading this book for the purpose of improving machine learning performance. Feature engineering is useful in other domains, such as hypothesis testing and general statistics. In a few examples in this book, we will look at feature engineering and data transformations as applied to the statistical significance of various statistical tests. We will explore metrics such as R² and p-values in order to make judgments about how our procedures are helping.
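As a small illustration of the statistical side, here is a hedged sketch using SciPy's `ttest_ind` (the specific data values are made up for this example) to compare two hypothetical groups; a small p-value suggests their means genuinely differ:

```python
from scipy import stats

# hypothetical measurements from two groups
group_a = [2.1, 2.5, 2.3, 2.7, 2.4]
group_b = [3.0, 3.2, 2.9, 3.4, 3.1]

# two-sample t-test: t_stat measures the difference in means relative to
# the spread; p_value is the probability of seeing a difference this large
# by chance if the groups actually had equal means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```

If a data transformation makes a previously insignificant relationship significant (or makes an existing one stronger), that is evidence the transformation is extracting real signal.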
In general, we will quantify the benefits of feature engineering in the context of three categories:
- Supervised learning: Otherwise known as predictive analytics
    - Regression analysis—predicting a quantitative variable:
        - Will utilize MSE as our primary metric of measurement
    - Classification analysis—predicting a qualitative variable:
        - Will utilize accuracy as our primary metric of measurement
- Unsupervised learning: Clustering—the assigning of meta-attributes as denoted by the behavior of data:
    - Will utilize the silhouette coefficient as our primary metric of measurement
- Statistical testing: Using correlation coefficients, t-tests, chi-squared tests, and others to evaluate and quantify the usefulness of our raw and transformed data
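The two supervised metrics above are both one-liners in scikit-learn. As a quick sketch (the toy values here are invented for illustration):

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# regression: MSE averages the squared gap between predicted and true values
y_true_reg = [3.0, 2.5, 4.0]
y_pred_reg = [2.8, 2.7, 3.6]
mse = mean_squared_error(y_true_reg, y_pred_reg)

# classification: accuracy is the fraction of labels predicted correctly
y_true_clf = [0, 1, 1, 0]
y_pred_clf = [0, 1, 0, 0]
acc = accuracy_score(y_true_clf, y_pred_clf)

print(mse, acc)  # → 0.08 (approximately) and 0.75
```

Lower is better for MSE; higher is better for accuracy and the silhouette coefficient. Keeping that direction straight matters when comparing baseline and post-engineering scores.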
In the following few sections, we will look at what will be covered throughout this book.