- Hands-On Data Science with R
- Vitor Bianchi Lanzetta, Nataraj Dasgupta, Ricardo Anjoleto Farias
- 2021-06-10 19:12:25
Predictive analytics (machine learning)
In popular media and literature, predictive analytics goes by various names, which are often used interchangeably depending on personal preference and interpretation. In practice, the terms predictive analytics, machine learning, and statistical learning are treated as synonyms: all refer to the practice of applying algorithms to data in order to generate predictions.
The algorithm could be as simple as a line of best fit, also known as linear regression, which you may have already used in Excel. Or it could be a complex deep learning model with multiple hidden layers and many inputs. In both cases, the mere fact that a statistical model or algorithm was applied to generate a prediction qualifies the exercise as machine learning.
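To make the line-of-best-fit case concrete, here is a minimal sketch of ordinary least squares for a single predictor, written in plain Python. The function names `fit_line` and `predict` and the toy data are assumptions made for illustration; they do not come from the book or from any library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data that lies exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

model = fit_line(xs, ys)
prediction = predict(model, 6.0)
print(model, prediction)
```

Even this small example follows the pattern described above: a model is fitted to observed data and then used to predict a value for an input it has not seen.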
In general, creating a machine learning model involves a sequence of steps such as the following:
- Cleanse and curate the dataset to extract the cohort on which the model will be built.
- Analyze the data using descriptive statistics, for example, distributions and visualizations.
- Perform feature engineering, preprocessing, and any other steps necessary to add or remove variables/predictors.
- Split the data into training and test sets (for example, set aside 80% of the data for training and the remaining 20% for testing your model).
- Select appropriate machine learning algorithms and train candidate models using cross-validation.
- Select the final model after comparing the candidates on one or more performance metrics. Note that the final model could be an ensemble, that is, a combination of more than one model.
- Perform predictions on the test dataset.
- Deliver the final model.
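The splitting, cross-validation, model selection, and test-set prediction steps above can be sketched as follows. This is an illustrative outline in plain Python under stated assumptions, not the book's own workflow: the synthetic data, the two candidate models (a mean baseline and a simple linear regression), the helper names, and the choice of mean squared error as the metric are all hypothetical.

```python
import random


def fit_line(xs, ys):
    """Candidate 1: least-squares line, returned as (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_mean(xs, ys):
    """Candidate 2: baseline that always predicts the mean of y."""
    return 0.0, sum(ys) / len(ys)


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(fit, xs, ys, k=5):
    """Average validation MSE over k folds (every k-th row forms a fold)."""
    fold_errors = []
    for i in range(k):
        val_idx = set(range(i, len(xs), k))
        tr_x = [x for j, x in enumerate(xs) if j not in val_idx]
        tr_y = [y for j, y in enumerate(ys) if j not in val_idx]
        va_x = [x for j, x in enumerate(xs) if j in val_idx]
        va_y = [y for j, y in enumerate(ys) if j in val_idx]
        fold_errors.append(mse(fit(tr_x, tr_y), va_x, va_y))
    return sum(fold_errors) / k


# Synthetic data (an assumption for this sketch): y = 3x + 2 plus noise
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)))

# Split: 80% for training, 20% held out for testing
rng.shuffle(data)
cut = len(data) * 8 // 10
train, test = data[:cut], data[cut:]
train_x, train_y = [x for x, _ in train], [y for _, y in train]
test_x, test_y = [x for x, _ in test], [y for _, y in test]

# Cross-validate each candidate on the training set only
candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
cv_scores = {name: cross_val_mse(fit, train_x, train_y)
             for name, fit in candidates.items()}

# Select the candidate with the lowest cross-validated error,
# refit it on the full training set, then score it on the test set
best_name = min(cv_scores, key=cv_scores.get)
final_model = candidates[best_name](train_x, train_y)
test_mse = mse(final_model, test_x, test_y)
print(best_name, cv_scores, test_mse)
```

The test set is touched only once, after the model has been chosen, so the reported test error is not influenced by the selection step.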
The most commonly used languages for machine learning today are R and Python. In Python, the most popular package for machine learning is scikit-learn (http://scikit-learn.org), while in R, there are multiple packages, such as randomForest (random forests), gbm (Gradient Boosting Machine, GBM), kernlab (Support Vector Machines, SVMs, and other kernel methods), and others.
Although scikit-learn is extremely versatile and elaborate, and Python is in fact often the preferred language in production settings, the ease of use and diversity of packages in R give it an advantage in terms of early adoption and use for machine learning exercises.
Popular machine learning tools such as TensorFlow from Google (https://www.tensorflow.org), XGBoost (http://xgboost.readthedocs.io/en/latest/), and H2O (https://www.h2o.ai) have also released R packages that act as wrappers around the underlying machine learning algorithms implemented in the respective tools.
It is a common misconception that machine learning is just about creating models. While a model is indeed the end goal, there is a subtle yet fundamental difference between a model and a good model. With the functions available today, it is relatively easy for anyone to create a model by running a couple of lines of code. A good model, however, has business value, whereas a model built without the rigor of formal machine learning principles is practically unusable. A key requirement of a good machine learning model is the judicious use of domain expertise: subject matter experts help evaluate results, identify and analyze errors, and refine the model further with their insights. This is where domain knowledge plays an indispensable role.