- IBM SPSS Modeler Cookbook
- Keith McCormick Dean Abbott Meta S. Brown Tom Khabaza Scott R.Mutchler
Using an @NULL multiple Derive to explore missing data
Quite often, the mere presence or absence of a value in an input variable tells you a great deal. Dates are a classic example. Suppose LastDateRented_HorrorCategory is NULL. Does that mean the value is unknown? Should we replace it with the average date among horror movie renters? Please don't! If the data is complete, the failure to find Jane Renter in the horror movie rental transactions far more likely means that she never rented a horror movie. This scenario is so common that you will want a handful of simple tricks for dealing with this type of missing data efficiently, so that when the situation calls for it you can easily create NULL flag variables for dozens (or even all) of your variables.
Getting ready
We will start with the NULL Flags.str stream.
How to do it...
To use an @NULL multiple Derive node to explore missing data, perform the following steps:
- Run the Data Audit and examine the resulting Quality tab. Note that a number of variables are complete but many have more than 5 percent NULL. The Filter node on the stream allows only the variables with a substantial number of NULL values to flow downstream.
- Add a Derive node and edit it, selecting the Multiple option. Include all of the scale variables that are downstream of the Filter node. Use the suffix _null, and select Flag from the Derive as drop-down menu.
- Add another Filter node and set it to allow only the new variables plus TARGET_B to flow downstream.
- Add a Type node forcing TARGET_B to be the target. Ensure that it is a flag measurement type.
- Add a Data Audit node. Note that some of the new NULL flag variables may be related to the target, but it is not easy to see which variables are the most related.
- Add a Feature Selection Modeling node and run it. Edit the resulting generated model. Note that a number of variables are predictive of the target.
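The flag-creation step above can be sketched outside Modeler as well. The following is a minimal pure-Python sketch of the same idea; in the recipe itself this is done with a multiple Derive node using @NULL, and the field names below (TIMELAG, Wealth2, TARGET_B) echo the discussion that follows but the sample values are invented for illustration:

```python
# For each input field with missing values, derive a companion
# <field>_null flag: "T" if the value is missing, "F" otherwise.
# (In Modeler, the multiple Derive node does this with @NULL(@FIELD)
# and the _null suffix; the records below are made-up sample data.)

records = [
    {"TIMELAG": 7,    "Wealth2": 5,    "TARGET_B": 0},
    {"TIMELAG": None, "Wealth2": 3,    "TARGET_B": 1},
    {"TIMELAG": 4,    "Wealth2": None, "TARGET_B": 0},
    {"TIMELAG": None, "Wealth2": None, "TARGET_B": 1},
]

def add_null_flags(rows, fields, suffix="_null"):
    """Return copies of rows with one flag per field marking missing values."""
    flagged = []
    for row in rows:
        out = dict(row)
        for f in fields:
            out[f + suffix] = "T" if row[f] is None else "F"
        flagged.append(out)
    return flagged

flagged = add_null_flags(records, ["TIMELAG", "Wealth2"])
print(flagged[1]["TIMELAG_null"])  # "T": this donor has no TIMELAG value
```

The flags themselves then become candidate inputs, which is exactly what the Feature Selection node evaluates against TARGET_B in the final step.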
How it works...
There is no substitute for lots of hard work during Data Understanding. Some of the patterns here could be capitalized upon, and others could indicate the need for data cleaning. The Using the Feature Selection node creatively to remove or decapitate perfect predictors recipe in Chapter 2, Data Preparation – Select, shows how circular logic can creep into our analysis.
Note the large number of date- and amount-related variables in the generated model. These variables indicate that the potential donor did not give in those time periods. Failing to give in one time period is predicted by failing to give in another; it makes sense. Is this the best way to get at this? Perhaps a simple count would do the trick, or perhaps the number of recent donations versus total donations.

Also note the TIMELAG_null variable. TIMELAG is the time between the first and second donation. What would be a common reason for it to be NULL? Obviously, the lack of a second donation would cause exactly that. Perhaps analyzing new donors and established donors separately would be a good way of tackling this. The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean, is built around this very idea. Note that neither imputing with the mean nor filling with zero would be a good idea: we have no reason to think that one-time and two-time donors are similar, and we know for a fact that the time distance is never zero.
Note the Wealth2_null variable. What might cause this variable to be missing, and for the missing status alone to be predictive? Perhaps we need a new donor to be on the mailing list for a substantial time before our list vendor can provide us that information. This too might be tackled with a new donor/established donor approach.
See also
- The Using the Feature Selection node creatively to remove or decapitate perfect predictors recipe in Chapter 2, Data Preparation – Select
- The Using CHAID stumps when interviewing an SME recipe in this chapter
- The Binning scale variables to address missing data recipe in Chapter 3, Data Preparation – Clean
- The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean