Apache Spark Machine Learning Blueprints
Alex Liu
Identity matching
In this section, we will cover one important data preparation topic, which is about identity matching and related solutions. We will discuss some of Spark's special features for solving identity issues and also some data matching solutions made easy with Spark.
After this section, we will be capable of taking care of some common data identity problems with Apache Spark.
Identity issues
For data preparation, we often need to deal with data elements that belong to the same person or unit but do not look alike. For example, we may have purchased some data for the customer Larry Z. and obtained web activity data for L. Zhang. Is Larry Z. the same person as L. Zhang? How many identity variations of this kind are in the data?
Matching entities is a big challenge in machine learning data preparation because such entity variations are very common and can arise for many reasons, such as duplication, errors, name variants, and intentional aliasing. Sometimes it can be very difficult to complete the matching, or even to find the links, and the work is definitely time consuming. However, it is necessary and extremely important: mismatching produces errors, and failing to match produces bias. At the same time, correct matching has additional value as an aid to group detection, such as uncovering terror cells and drug cartels.
Some newer methods, such as fuzzy matching, have been developed to attack this issue. In this section, however, we will focus on the following commonly used approaches:
- Manual search with SQL queries.
This is labor intensive, with few discoveries but good accuracy.
- Automated data cleansing.
This type of approach often adopts a few rules that use the most informative attributes.
- Lexical similarity.
This approach is rational and useful but can generate many false alarms.
- Feature and relationship statistics.
This approach is a good one but does not address nonlinear effects.
The accuracy of any of the preceding methods often depends on the sparseness and size of the data, and also on whether the task is to resolve duplications, errors, variants, or aliases.
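As a rough sketch of the first and third approaches on Spark, the following snippet performs a manual search with a SQL query and then scores name pairs with Spark SQL's built-in levenshtein function. The customers table and its columns are hypothetical:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("IdentityMatching").getOrCreate()
// Manual search: hunt for possible variants of one customer by hand.
spark.sql("SELECT id, name FROM customers WHERE name LIKE '%Zhang%' OR name LIKE '%Z.%'").show()
// Lexical similarity: compare every pair of names by edit distance;
// small distances are match candidates, but expect false alarms.
val pairs = spark.sql(
  """SELECT a.id AS id_a, b.id AS id_b,
            a.name AS name_a, b.name AS name_b,
            levenshtein(a.name, b.name) AS dist
     FROM customers a JOIN customers b ON a.id < b.id""")
pairs.filter("dist <= 3").show()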
Identity matching on Spark
As in the previous section, we would like to review some methods that utilize SampleClean to deal with entity matching issues, even though the most commonly used tools are SparkSQL or R.
Entity resolution
SampleClean provides an easy-to-use interface for some basic entity matching tasks. It provides the EntityResolution class, which wraps some common deduplication programming patterns.
A basic EntityResolution workflow involves the following steps:
- Identifying a column of inconsistent categorical attributes.
- Linking together similar attributes.
- Selecting a single canonical representation of the linked attributes.
- Applying changes to the data.
Here, we have a column of short strings that are inconsistently represented (for example, several different spellings of United States). The EntityResolution.shortAttributeCanonicalize function takes as input the current context, the name of the working set to clean, the column to fix, and a threshold in [0, 1] (0 merges everything, and 1 merges only exact matches). It uses EditDistance as its default similarity metric. The following is a coding example:
val algorithm = EntityResolution.shortAttributeCanonicalize(scc, workingSetName, columnName, threshold)
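For instance, assuming a SampleCleanContext named scc has already been created and a sample loaded as a working set (the working set and column names below are hypothetical), a conservative run might look like this:
// Merge only very similar spellings in the hypothetical "city" column
val algorithm = EntityResolution.shortAttributeCanonicalize(scc, "restaurants_sample", "city", 0.9)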
Here, we have a column of long strings, such as addresses, that are close but not exact matches. The basic strategy is to tokenize the strings and compare the sets of words rather than the whole strings. The EntityResolution.longAttributeCanonicalize function uses the WeightedJaccard similarity metric by default. The following is a coding example:
val algorithm = EntityResolution.longAttributeCanonicalize(scc, workingSetName, columnName, threshold)
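Again with hypothetical names, a looser threshold suits long strings, since tokenized addresses rarely match exactly:
// Link address variants that share most of their words
val algorithm = EntityResolution.longAttributeCanonicalize(scc, "customers_sample", "address", 0.7)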
A more advanced deduplication task arises when entire records, rather than individual columns, are inconsistent; that is, multiple records refer to the same real entity. RecordDeduplication uses the long attribute similarity metrics by default. The following is a coding example:
RecordDeduplication.deduplication(scc, workingSetName, columnProjection, threshold)
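For illustration, and assuming columnProjection is the list of columns to compare (the names here are hypothetical):
// Compare records on the columns most likely to identify an entity
val columnProjection = List("name", "address", "city")
RecordDeduplication.deduplication(scc, "customers_sample", columnProjection, 0.8)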
Note
For more information, see the SampleClean guide at http://sampleclean.org/guide/.
Identity matching made better
As with data cleaning, SampleClean and Spark together can make things easy, that is, we can write less code and utilize less data, as demonstrated in the previous section. As discussed, automated cleaning is easy and fast, but its accuracy may not be good. A common way to improve it is to bring more people into the labor-intensive verification work through crowdsourcing.
Here, SampleClean combines Algorithms, Machines, and People, all in its crowd-sourced deduplication.
As crowdsourcing scales poorly to very large datasets, the SampleClean system asks the crowd to deduplicate only a sample of the data and then trains predictive models to generalize the crowd's work to the entire dataset. In particular, SampleClean applies active learning to sample the points that lead to a good model quickly.
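SampleClean encapsulates this in the crowd operators described next, but the underlying pattern can be sketched with plain Spark ML: take crowd labels for a small sample of candidate pairs, train a classifier on pair-similarity features, and let it judge the remaining pairs. Everything below (the pair_features table and its columns) is illustrative, not SampleClean's API, and a SparkSession named spark is assumed, as in the earlier sketch:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
// Candidate pairs with numeric similarity features; label is the crowd's
// judgment as a Double (1.0 = duplicate), present only for sampled pairs.
val allPairs = spark.table("pair_features")
val assembler = new VectorAssembler()
  .setInputCols(Array("name_sim", "address_sim"))
  .setOutputCol("features")
// Train on the crowd-labeled sample...
val model = new LogisticRegression()
  .fit(assembler.transform(allPairs.where("label IS NOT NULL")))
// ...then generalize the crowd's work to the unlabeled remainder.
val predicted = model.transform(assembler.transform(allPairs.where("label IS NULL")))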
To clean data using crowd workers, SampleClean uses the open source AMPCrowd service, which supports multiple crowd platforms and provides automated quality control. Users must therefore have a running installation of AMPCrowd. In addition, crowd operators must be configured to point to the AMPCrowd server by passing CrowdConfiguration objects.
SampleClean currently provides one main crowd operator: ActiveLearningMatcher. This is an add-on step to an existing EntityResolution algorithm that trains a crowd-supervised model to predict duplicates. Take a look at the following code:
createCrowdMatcher(scc: SampleCleanContext, attribute: String, workingSetName: String)
val crowdMatcher = EntityResolution.createCrowdMatcher(scc, attribute, workingSetName)
Make sure to configure the matcher here, as follows:
crowdMatcher.alstrategy.setCrowdParameters(crowdConfig)
To add this matcher to existing algorithms, use the following function:
addMatcher(matcher: Matcher)
algorithm.components.addMatcher(crowdMatcher)
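Putting the documented calls together, a complete crowd-assisted flow might look as follows; the attribute and working set names are hypothetical, and crowdConfig is assumed to be a CrowdConfiguration pointing at a running AMPCrowd server:
// Build a crowd matcher for the attribute being resolved
val crowdMatcher = EntityResolution.createCrowdMatcher(scc, "city", "restaurants_sample")
// Point the active learning strategy at the AMPCrowd server
crowdMatcher.alstrategy.setCrowdParameters(crowdConfig)
// Attach the crowd matcher to the existing EntityResolution algorithm
algorithm.components.addMatcher(crowdMatcher)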