- IBM SPSS Modeler Cookbook
- Keith McCormick, Dean Abbott, Meta S. Brown, Tom Khabaza, Scott R. Mutchler
Using an @NULL multiple Derive to explore missing data
With great regularity, the mere presence or absence of data in an input variable tells you a great deal. Dates are a classic example. Suppose LastDateRented_HorrorCategory is NULL. Does that mean the value is unknown? Should we replace it with the average date among horror movie renters? Please don't! If the data is complete, the failure to find Jane Renter in the horror movie rental transactions far more likely means that she never rented a horror movie. This scenario is so common that you will want a handful of simple tricks for dealing with this type of missing data efficiently, so that when the situation calls for it you can easily create NULL flag variables for dozens (or even all) of your variables.
Getting ready
We will start with the NULL Flags.str stream.
How to do it...
To use an @NULL multiple Derive node to explore missing data, perform the following steps:
- Run the Data Audit and examine the resulting Quality tab. Note that a number of variables are complete but many have more than 5 percent NULL. The Filter node on the stream allows only the variables with a substantial number of NULL values to flow downstream.
- Add a Derive node and edit it, selecting the Multiple option. Include all of the scale variables that are downstream of the Filter node. Use the suffix _null, and select Flag from the Derive as drop-down menu.
- Add another Filter node and set it to allow only the new variables plus TARGET_B to flow downstream.
- Add a Type node forcing TARGET_B to be the target. Ensure that it is a flag measurement type.
- Add a Data Audit node. Note that some of the new NULL flag variables may be related to the target, but it is not easy to see which variables are the most related.
- Add a Feature Selection Modeling node and run it. Edit the resulting generated model. Note that a number of variables are predictive of the target.
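Outside Modeler, the same idea can be sketched in pandas. The sketch below uses hypothetical column names from the donor scenario (TIMELAG, WEALTH2, TARGET_B), and a simple target-rate difference stands in for the Feature Selection node's ranking; it is an illustration of the technique, not Modeler's actual algorithm.

```python
import pandas as pd

def add_null_flags(df, cols, suffix="_null"):
    """Derive one flag column per input: True where the value is missing."""
    out = df.copy()
    for c in cols:
        out[c + suffix] = df[c].isna()
    return out

def rank_null_flags(df, target, suffix="_null"):
    """Rank flag columns by how much the target rate differs between
    missing and non-missing rows (a crude stand-in for Modeler's
    Feature Selection node)."""
    scores = {}
    for c in [c for c in df.columns if c.endswith(suffix)]:
        p_null = df.loc[df[c], target].mean()    # target rate when missing
        p_ok = df.loc[~df[c], target].mean()     # target rate when present
        scores[c] = abs(p_null - p_ok)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy data: TIMELAG is missing for every one-time donor (hypothetical values).
donors = pd.DataFrame({
    "TIMELAG":  [4.0, None, 7.0, None, 2.0, None],
    "WEALTH2":  [3.0, 1.0, None, 2.0, None, 1.0],
    "TARGET_B": [1, 0, 1, 0, 1, 0],
})
flagged = add_null_flags(donors, ["TIMELAG", "WEALTH2"])
ranking = rank_null_flags(flagged, "TARGET_B")
```

In this toy example the missingness of TIMELAG alone separates the target perfectly, which is exactly the kind of pattern the recipe's Data Audit and Feature Selection steps surface.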
How it works...
There is no substitute for lots of hard work during Data Understanding. Some of the patterns here could be capitalized upon, and others could indicate the need for data cleaning. The Using the Feature Selection node creatively to remove or decapitate perfect predictors recipe in Chapter 2, Data Preparation – Select, shows how circular logic can creep into our analysis.
Note the large number of date- and amount-related variables in the generated model. These variables indicate that the potential donor did not give in those time periods. Failing to give in one time period is predicted by failing to give in another; it makes sense. Is this the best way to get at this? Perhaps a simple count would do the trick, or perhaps the number of recent donations versus total donations.
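The "simple count" alternative suggested above might look like the following pandas sketch. The gift-history schema (donor_id, gift_date) and the cutoff date are assumptions for illustration, not the actual layout of the donor file.

```python
import pandas as pd

# Hypothetical donation history: one row per gift (assumed schema).
gifts = pd.DataFrame({
    "donor_id": [1, 1, 1, 2, 3, 3],
    "gift_date": pd.to_datetime(
        ["2019-01-05", "2020-03-10", "2021-06-01",
         "2018-11-20", "2020-12-25", "2021-02-14"]),
})

# "Recent" is defined by an arbitrary cutoff for this sketch.
cutoff = pd.Timestamp("2020-01-01")
summary = gifts.groupby("donor_id")["gift_date"].agg(
    total_gifts="count",
    recent_gifts=lambda s: (s >= cutoff).sum(),
).reset_index()
summary["recent_ratio"] = summary["recent_gifts"] / summary["total_gifts"]
```

A pair of derived counts like these replaces dozens of per-period giving flags with two interpretable features.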

Also note the TIMELAG_null variable. TIMELAG is the distance between the first and second donation. What would be a common reason for it to be NULL? Obviously, the lack of a second donation. Perhaps analyzing new donors and established donors separately would be a good way of tackling this. The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean, is built around this very idea. Note that neither imputing with the mean nor filling with zero would be a good idea at all. We have no reason to think that one-time and two-time donors are similar, and we know for a fact that the time distance is never zero.
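A TIMELAG-style derivation makes the point concrete: for a one-time donor the value is genuinely undefined, so the honest representation is NaN, not zero or a mean. The schema below is hypothetical.

```python
import pandas as pd

# Hypothetical gift dates per donor (assumed schema).
gifts = pd.DataFrame({
    "donor_id": [1, 1, 2, 3, 3, 3],
    "gift_date": pd.to_datetime(
        ["2020-01-01", "2020-04-01", "2020-02-15",
         "2019-06-01", "2019-06-20", "2020-01-10"]),
})

def timelag(dates):
    """Days between first and second gift; NaN for one-time donors.

    Zero-filling would wrongly claim two same-day gifts, and mean
    imputation would invent a second gift that never happened.
    """
    dates = dates.sort_values()
    if len(dates) < 2:
        return float("nan")
    return (dates.iloc[1] - dates.iloc[0]).days

lags = gifts.groupby("donor_id")["gift_date"].apply(timelag)
```

Donor 2 has a single gift, so the NaN survives into the derived field, and downstream you can either flag it (as in this recipe) or split the population, as in the full data model/partial data model recipe.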
Note the Wealth2_null variable. What might cause this variable to be missing, and for the missing status alone to be predictive? Perhaps a new donor must be on the mailing list for a substantial time before our list vendor can provide us that information. This too might be tackled with a new donor/established donor approach.
See also
- The Using the Feature Selection node creatively to remove or decapitate perfect predictors recipe in Chapter 2, Data Preparation – Select
- The Using CHAID stumps when interviewing an SME recipe in this chapter
- The Binning scale variables to address missing data recipe in Chapter 3, Data Preparation – Clean
- The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean