- IBM SPSS Modeler Cookbook
- Keith McCormick, Dean Abbott, Meta S. Brown, Tom Khabaza, Scott R. Mutchler
Detecting potential model instability early using the Partition node and Feature Selection node
Model instability typically surfaces as an issue during the evaluation phase, where it manifests as substantially stronger performance on the Train data set than on the Test data set. This bodes ill for the performance of the model on new data; in other words, it bodes ill for the practical application of the model to any business problem. Veteran data miners see this coming well before the evaluation phase, however, or at least they hope they do. The trick is to spot one of the most common causes: model instability is much more likely to occur when several inputs are competing for the same variance in the model. In other words, when the inputs are strongly correlated with each other, problems can arise. Data miners can also get themselves into hot water through their own imprudence. Overfitting, discussed in the Introduction of Chapter 7, Modeling – Assessment, Evaluation, Deployment, and Monitoring, can also cause model instability. The key is to spot potential problems early. If the issue lies in the set of inputs, this recipe can help to identify which inputs are at fault; the correlation matrix recipe and the other data reduction recipes can assist in corrective action.
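The Train-versus-Test signature described above is easy to reproduce outside Modeler. The following is a hypothetical Python sketch (synthetic data, not taken from the book or the Stability.str stream): an unconstrained decision tree fit to noisy data scores near-perfectly on its Train partition and far worse on its Test partition.

```python
# Hypothetical illustration of the instability signature: Train >> Test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # five synthetic inputs
y = X[:, 0] + rng.normal(scale=2.0, size=200)       # weak signal, heavy noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)  # unlimited depth

train_r2 = model.score(X_tr, y_tr)   # near-perfect fit to Train
test_r2 = model.score(X_te, y_te)    # collapses on Test: instability
```

A model whose Train score dwarfs its Test score, as here, is exactly the warning sign this recipe tries to surface before the evaluation phase.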
This recipe also serves as a cautionary tale about giving the Feature Selection node a heavier burden than it is capable of carrying. The node examines only the bivariate relationships between each input and the target. Bivariate simply means two variables at a time, which makes Feature Selection blind to what can happen when many inputs collaborate to predict the target. Bivariate analyses are not without value; they are critical to the Data Understanding phase. The data miner's goal, however, is to recruit a team of variables, and the team's performance depends on a number of factors, only one of which is the ability of each individual input to predict the target.
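That bivariate blind spot can be shown with a classic synthetic example (a hypothetical Python sketch, not from the book): in an XOR pattern, each input is nearly uncorrelated with the target on its own, yet a model that uses both inputs together predicts the target perfectly.

```python
# Hypothetical XOR example: bivariate screening would discard both inputs,
# but together they determine the target exactly.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
x1 = rng.integers(0, 2, size=1000)
x2 = rng.integers(0, 2, size=1000)
y = np.logical_xor(x1, x2).astype(int)   # target depends on the pair

# Bivariate view: each input looks almost worthless alone.
r1 = abs(np.corrcoef(x1, y)[0, 1])
r2 = abs(np.corrcoef(x2, y)[0, 1])

# Multivariate view: a shallow tree using both inputs recovers the target.
X = np.column_stack([x1, x2])
acc = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).score(X, y)
```

A bivariate filter would score `x1` and `x2` near zero and might drop both; the "team" of the two is what carries the signal.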
Getting ready
We will start with the Stability.str stream.
How to do it...
To detect potential model instability using the Partition and Feature Selection nodes, perform the following steps:
- Open the Stability.str stream.
- Edit the Partition node, click on the Generate seed button, and run it. (Since you will not get the same seed as the figure shown, your results will differ. This is not a concern. In fact, it helps illustrate the point behind the recipe.)
- Run the Feature Selection Modeling node and then edit the resulting generated model. Note that the ranking of potential inputs may differ if the seed is different.
- Edit the Partition node, generate a new seed, and then run the Feature Selection again.
- Edit the Feature Selection generated model.
- For a third and final time, edit the Partition node, generate a new seed, and then run the Feature Selection. Edit the generated model.
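The seed-to-seed behavior these steps are designed to expose can be mimicked outside Modeler. The sketch below is a hypothetical Python simulation (synthetic data; the RFA-style names are invented to echo the stream's variables, not read from it): three correlated inputs are ranked by bivariate correlation with the target on three differently seeded partitions, and the ranking typically shuffles from run to run.

```python
# Hypothetical simulation of the recipe: correlated inputs ranked on
# three differently seeded sample partitions.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
behaviour = rng.normal(size=n)                       # shared past-donation factor
names = ["RFA_2", "RFA_4", "RFA_6"]
X = {name: behaviour + rng.normal(scale=0.6, size=n) for name in names}
y = behaviour + rng.normal(scale=0.6, size=n)        # target driven by the same factor

rankings = []
for seed in (101, 202, 303):                         # three "Generate seed" runs
    idx = np.random.default_rng(seed).choice(n, size=300, replace=False)
    corrs = {name: abs(np.corrcoef(col[idx], y[idx])[0, 1])
             for name, col in X.items()}
    rankings.append(sorted(corrs, key=corrs.get, reverse=True))
```

All three inputs stay strongly correlated with the target in every partition, but which one sits on top is decided in the small decimal places, so the order is unstable across seeds.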
How it works...
At first glance, one might anticipate no major problems ahead. RFA_6, which is the donor status calculated six campaigns ago, is in first place twice and in third place once. Clearly it provides some value, so what is the danger in proceeding to the next phase? The change in ranking from seed to seed reveals something important about this set of variables: they are behaving like variables that are similar to each other. They are all descriptions of past donation behavior at different times; the larger the number after the underscore, the further back in time the variable reaches. Why isn't the most recent variable, RFA_2, shown as the most predictive? Frankly, there is a good chance that it is the most predictive, but these variables are fighting for top status in the small decimal places of this analysis. We can trust Feature Selection to alert us that they are potentially important, but it is dangerous to trust the ranking under these circumstances, and it certainly does not mean that restricting our inputs to the top ten would give us a good model.
The behavior revealed here is not a good indication of how these variables will behave in a model, whether a classification tree or any other multiple-input technique. In a tree, once a branch is formed using RFA_6, the tendency is for the model to seek a variable that sheds light on some other aspect of the data. The variable used to form the second branch would likely not be the second variable on the list, because the first and second variables are similar to each other. The implication is that, if RFA_4 were chosen for the first branch, RFA_6 might not be chosen at all.
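A related way to see this redundancy, sketched here in hypothetical Python (synthetic data; the names rfa_4, rfa_6, and other are invented for illustration, not the stream's actual fields): a tree denied the near-duplicate rfa_6 loses essentially no accuracy, because rfa_4 already carries the same information.

```python
# Hypothetical redundancy check: dropping a near-duplicate input
# barely changes a tree's Test accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 4000
rfa_4 = rng.normal(size=n)
rfa_6 = rfa_4 + rng.normal(scale=0.1, size=n)   # near-duplicate of rfa_4
other = rng.normal(size=n)                      # a different aspect of the data
y = ((rfa_4 + other) > 0).astype(int)

X_all = np.column_stack([rfa_4, rfa_6, other])
X_no6 = np.column_stack([rfa_4, other])

tr, te = train_test_split(np.arange(n), random_state=0)
acc_all = DecisionTreeClassifier(max_depth=4, random_state=0) \
    .fit(X_all[tr], y[tr]).score(X_all[te], y[te])
acc_no6 = DecisionTreeClassifier(max_depth=4, random_state=0) \
    .fit(X_no6[tr], y[tr]).score(X_no6[te], y[te])
```

The two accuracies land within a few points of each other: once one of the twins is in the model, the other contributes almost nothing, which is why a high bivariate rank for RFA_6 does not guarantee it a place in the tree.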
Each situation is different, but perhaps the best option here is to identify what these related variables have in common and distill it into a smaller set of variables. To the extent that these variables have a unique contribution to make—perhaps in the magnitude of their distance in the past—that too could be brought into higher relief during data preparation.
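One common way to distill a set of correlated variables into a smaller set is a principal-component-style composite. The sketch below shows the general idea in hypothetical Python (synthetic columns sharing one latent factor; this is not the book's specific procedure): the first component of the correlated group captures most of their joint variance, so it can stand in for the group.

```python
# Hypothetical distillation: four correlated columns reduce to one
# dominant principal component.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
latent = rng.normal(size=n)                     # shared "past behaviour" factor
X = np.column_stack([latent + rng.normal(scale=0.3, size=n) for _ in range(4)])

cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]         # eigenvalues, descending
explained = eigvals / eigvals.sum()             # variance share per component
```

With most of the variance in the first component, the four columns can be replaced by a single composite during data preparation, while anything unique to an individual column (such as its distance in the past) can be engineered as a separate input.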