- IBM SPSS Modeler Cookbook
- Keith McCormick Dean Abbott Meta S. Brown Tom Khabaza Scott R.Mutchler
Running a Statistics node on an anti-join to evaluate potential missing data
There is typically some data loss when various data tables are integrated. Although we won't discuss data integration until a later chapter, it is important to gauge what (and how much) is lost at this stage. Financial variables are usually aggregated in very different ways for the financial planner and the data miner. It is critical that the data miner periodically translate their data back into a form that middle and senior management will recognize, so that the two groups can communicate better.
The data miner deals with transactions and individual customer data, the language of individual rows of data. The manager speaks, generally, the language of spreadsheets: regions, product lines, months rolled up into aggregated cells in Excel.
On a project, we once discovered that a small percentage of missing rows represented a much larger fraction of revenue than average. We suddenly revisited our decision to drop those rows. Dropping them had seemed the right decision; they were just bad IDs, weren't they? Well, it is never that simple. There are few accidents in data. That experience produced a lesson:
Always include a revenue assessment in your decisions even when revenue is neither your input nor your target.
In this recipe we will learn a simple trick for assessing these variables at times when there is the potential for data loss.
Getting ready
We will start with a blank stream, and will be using the retail Transactions file and the Products_Missing file.
How to do it...
To evaluate potential missing data when integrating data:
- Build a stream with both of the Source nodes, two Type nodes, and a Merge node.
- Perform an anti-join and make a note of the record count.
- Run a Statistics node and request: Count, Sum, Mean, and Median.
- Set Products_Missing to be the first input and run the Merge node, making a note of the record count. Since there is only one record, we will not run a Statistics node.
- Reverse the inputs and repeat the merge, again making note of the record count.
- Re-run the Statistics node.
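In Modeler these steps are performed through the Merge and Statistics node dialogs, but the underlying logic is easy to see in code. The following sketch is a plain-Python analogue of the anti-join and the requested statistics; the records, field names, and values are hypothetical placeholders, not the book's actual data.

```python
# Hypothetical stand-ins for the retail Transactions and
# Products_Missing files.
transactions = [
    {"product_id": "P1", "revenue": 120.0},
    {"product_id": "P2", "revenue": 80.0},
    {"product_id": "P9", "revenue": 300.0},  # no matching product record
]
products = {"P1": "Widget", "P2": "Gadget"}

# Anti-join: transactions whose product_id has no match in products.
anti = [t for t in transactions if t["product_id"] not in products]

# Inner join: transactions that do match a product record.
inner = [t for t in transactions if t["product_id"] in products]

def summarize(rows):
    """Count, sum, mean, and median of revenue - the same statistics
    requested from Modeler's Statistics node."""
    values = sorted(r["revenue"] for r in rows)
    n = len(values)
    total = sum(values)
    mid = n // 2
    median = values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2
    return {"count": n, "sum": total, "mean": total / n, "median": median}

print("anti-join:", summarize(anti))    # statistics on the unmatched rows
print("inner join:", summarize(inner))  # statistics on the matched rows
```

Comparing the two summaries is the point of the recipe: the anti-join's record count tells you how much is lost, and its revenue statistics tell you how much that loss matters.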
How it works...
Years ago on a project we discovered that 5 percent of the data—data that happened to be missing—represented more than 20 percent of the revenue. We expect, or perhaps more likely, we hope, that missing data will not derail us, but sometimes it certainly threatens the whole project. This recipe is about choosing your battles, identifying when missing data is a particularly serious problem and when we can move on to other aspects of the project.
Here, the bad news is that there is a substantial amount of missing data in the products file. How could this ever occur? The novice might be surprised, but it occurs frequently. Perhaps the company just acquired a smaller retailer, and the acquired vendor's old transactional data does not match up properly. There is good news, however. The average and median of the inner join and the anti-join are similar, which suggests that the data are missing at random rather than systematically. It would be a much bigger problem if particular customers (maybe even our best customers) were the ones that were missing. This is not usually the case, although sometimes it takes detective work to figure out why. There is one additional bit of bad news: the total amount of the missing data is not trivial. While it is dwarfed by the nearly 90 million that we can analyze, the missing 8 million might be large enough to warrant extensive data cleaning. One would now move to diagnose the problem and, if it seems achievable at a reasonable cost, address it.
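The "missing at random" judgment above rests on comparing the revenue distributions of the matched and unmatched rows. That comparison can be sketched as a quick check; the revenue lists and the 25 percent tolerance below are illustrative assumptions, not figures from the book.

```python
import statistics

# Illustrative revenue values for matched (inner join) and unmatched
# (anti-join) transactions; in practice these come from the two merges.
matched_revenue = [80.0, 95.0, 110.0, 120.0, 105.0]
missing_revenue = [90.0, 100.0, 115.0]

def close(a, b, tolerance=0.25):
    """True if a and b differ by less than `tolerance` (relative)."""
    return abs(a - b) <= tolerance * max(abs(a), abs(b))

# If the missing rows' mean and median are close to the matched rows',
# the data are plausibly missing at random; a large gap suggests a
# systematic problem (e.g., the best customers are the ones missing).
mean_ok = close(statistics.mean(matched_revenue),
                statistics.mean(missing_revenue))
median_ok = close(statistics.median(matched_revenue),
                  statistics.median(missing_revenue))
print("plausibly missing at random:", mean_ok and median_ok)
```

A check like this flags only gross distortions; it does not prove randomness, so the detective work mentioned above is still worthwhile when the stakes are high.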
See also
- The Using an @NULL multiple Derive to explore missing data recipe in Chapter 1, Data Understanding
- The Creating an Outlier report to give to SMEs recipe in Chapter 1, Data Understanding
- The Using a full data model/partial data model approach to address missing data recipe in Chapter 3, Data Preparation – Clean