- Healthcare Analytics Made Simple
- Vikas (Vik) Kumar
Corresponding machine learning algorithms – decision tree and random forest
In the preceding diagram, you may have noticed that the example tree most likely uses subjectively determined cutpoints in deciding which route to follow. For example, Diamond #5 uses a BMI cutoff of 25, and Diamond #7 uses a BMI cutoff of 30. Nice, round numbers! In the decision analysis field, trees are usually constructed based on human inference and discussion. What if we could objectively determine the best variables on which to split (and the corresponding cutpoints at which to split them) in order to minimize the error of the algorithm?
This is just what we do when we train a formal decision tree using a machine learning algorithm. Decision tree learning algorithms were developed during the 1980s and 1990s and use principles of information theory to optimize the branching variables and cutpoints of the tree in order to maximize classification accuracy. The most common and simple algorithm for training a decision tree proceeds using what is known as a greedy approach. Starting at the first node, we take the training set of our data and split it on each variable in turn, trying a variety of cutpoints for each variable. After each candidate split, we calculate the entropy or information gain of the resulting split. Don't worry about the formulas for calculating these quantities; just know that they measure how much information is gained from the split, which reflects how much purer (more homogeneous) the resulting groups are compared with the original node. For example, using the PUL algorithm shown previously, for a node containing eight normal intrauterine pregnancies and seven ectopic pregnancies, a split that sends the normal pregnancies down one branch and the ectopic pregnancies down the other would be favored over a split that leaves both branches with roughly the same mix as the original node. Once we have the variable and cutpoint for the best split, we make that split and then repeat the procedure on each resulting subset, using the remaining variables. To prevent overfitting the model to the training data, we stop splitting the tree when certain criteria are reached; alternatively, we can train a large tree with many nodes and then remove (prune) some of the nodes.
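To make the entropy and information-gain comparison concrete, here is a minimal Python sketch (not the book's code). The `entropy` and `information_gain` helpers, the 15-patient PUL-style node, and the two candidate splits are illustrative assumptions; the point is simply that the split producing purer child nodes yields the larger information gain.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

# Hypothetical parent node of 15 PUL patients: 8 normal intrauterine
# pregnancies ("IUP") and 7 ectopic pregnancies ("EP").
parent = ["IUP"] * 8 + ["EP"] * 7

# Split A separates the classes perfectly (two pure child nodes).
gain_a = information_gain(parent, ["IUP"] * 8, ["EP"] * 7)

# Split B leaves both children with roughly the same mix as the parent.
gain_b = information_gain(parent,
                          ["IUP"] * 4 + ["EP"] * 3,
                          ["IUP"] * 4 + ["EP"] * 4)

print(f"Gain from split A: {gain_a:.3f}")  # ~1.0 bit -- this split is favored
print(f"Gain from split B: {gain_b:.3f}")  # ~0.004 bits -- almost nothing gained
```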
Decision trees have some limitations. For one thing, a decision tree can only split the decision space along a single variable at each step, producing axis-aligned boundaries. Another problem is that decision trees are prone to overfitting. Because of these issues, decision trees typically aren't competitive with most state-of-the-art machine learning algorithms in terms of minimizing error. However, the random forest, which is essentially an ensemble of de-correlated decision trees, is currently among the most popular and accurate machine learning methods in medicine. We will build decision trees and random forests in Chapter 7, Making Predictive Models in Healthcare, of this book.
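As a preview of the models built in Chapter 7, the following is a minimal scikit-learn sketch comparing the two approaches. The synthetic dataset and the chosen parameters (tree depth, number of trees) are assumptions made for illustration, not the book's actual code or data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a clinical dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single decision tree; limiting max_depth restricts tree size to reduce
# overfitting (an alternative to growing a large tree and pruning it).
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# A random forest: an ensemble of de-correlated trees, each trained on a
# bootstrap sample with a random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("Decision tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
```

On most datasets, the forest's averaging over many de-correlated trees reduces the variance (overfitting) that limits a single tree, which is why it usually scores higher on held-out data.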