Eliminate near duplicates with the Jaccard distance
It often happens that the data has duplicates or near duplicates that should be filtered. Twitter data has lots of duplicates that can be quite frustrating to work with, even with the -filter:retweets
option available for the search API. A quick way to see this is to sort the text in the spreadsheet; tweets with common prefixes will be neighbors:

[Figure: Duplicate tweets that share a prefix]
This sort only reveals shared prefixes; there are many more duplicates that don't share a prefix. This recipe will let you find other sources of overlap and set the threshold, the proximity at which duplicates are removed.
How to do it…
Perform the following steps to eliminate near duplicates with the Jaccard distance:
- Type in the command prompt:
java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.DeduplicateCsvData
- You will be overwhelmed with a torrent of text:
Tweets too close, proximity 1.00
    @britneyspears do you ever miss the Disney days? and iilysm please follow me. kiss from Turkey #AskBritneyJean ??
    @britneyspears do you ever miss the Disney days? and iilysm please follow me. kiss from Turkey #AskBritneyJean ???
Tweets too close, proximity 0.50
    Sooo, I want to have a Disney Princess movie night....
    I just want to be a Disney Princess
- Two example outputs are shown. The first is a near-exact duplicate, differing only in a final ?, and has a proximity of 1.0; the next example has a proximity of 0.50, and the tweets are different but have a good deal of word overlap. Note that the second case does not share a prefix.
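As a check on where the 0.50 comes from, tokenize both tweets with the recipe's \w+ tokenizer (described in the next section): the first yields the token set {Sooo, I, want, to, have, a, Disney, Princess, movie, night} and the second {I, just, want, to, be, a, Disney, Princess}. They share 6 tokens (I, want, to, a, Disney, Princess) out of 12 distinct tokens overall, and 6/12 = 0.50.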
How it works…
This recipe jumps a bit ahead of the sequence, using a tokenizer to drive the deduplication process. It is here because the following recipe, for sentiment, really needs deduplicated data to work well. Chapter 2, Finding and Working with Words, covers tokenization in detail.
The source for main() is:
String inputPath = args.length > 0 ? args[0] : "data/disney.csv";
String outputPath = args.length > 1 ? args[1] : "data/disneyDeduped.csv";
List<String[]> data = Util.readCsvRemoveHeader(new File(inputPath));
System.out.println(data.size());
There is nothing new in the preceding code snippet, but the following code snippet has TokenizerFactory:
TokenizerFactory tokenizerFactory = new RegExTokenizerFactory("\\w+");
Briefly, the tokenizer breaks the text into tokens defined by matching the regular expression \w+ (in the string literal "\\w+", the first backslash escapes the second; it is a Java thing). It matches contiguous word characters, so the string "Hi, you here??" produces the tokens "Hi", "you", and "here"; the punctuation is ignored.
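As a quick check, this minimal standalone sketch (the class and main() wrapper are mine, not from the book; tokenizer() and tokenize() are the standard LingPipe tokenizer API) prints those tokens:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class TokenizeDemo {
    public static void main(String[] args) {
        TokenizerFactory tokenizerFactory = new RegExTokenizerFactory("\\w+");
        String text = "Hi, you here??";
        Tokenizer tokenizer
            = tokenizerFactory.tokenizer(text.toCharArray(), 0, text.length());
        for (String token : tokenizer.tokenize()) {
            System.out.println(token); // prints Hi, you, here on separate lines
        }
    }
}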
Next up, Util.filterJaccard is called with a cutoff of .5, which roughly eliminates tweets that overlap in half of their words. The filtered data is then written to disk:
double cutoff = .5;
List<String[]> dedupedData = Util.filterJaccard(data, tokenizerFactory, cutoff);
System.out.println(dedupedData.size());
Util.writeCsvAddHeader(dedupedData, new File(outputPath));
}
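Because of these defaults, you can point the program at your own data by passing the input and output paths as the first two arguments; the file names below are placeholders:

java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.DeduplicateCsvData myTweets.csv myTweetsDeduped.csv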
The Util.filterJaccard() method's source is as follows:
public static List<String[]> filterJaccard(List<String[]> texts, TokenizerFactory tokFactory, double cutoff) {
    JaccardDistance jaccardD = new JaccardDistance(tokFactory);
In the preceding snippet, a JaccardDistance class is constructed with a tokenizer factory. The Jaccard proximity divides the size of the intersection of the tokens from the two strings by the size of the union of the tokens from both strings; the Jaccard distance is one minus that proximity. Look at the Javadoc for more information.
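To make that concrete, this standalone sketch (the wrapper class is mine; JaccardDistance and its proximity() and distance() methods are LingPipe's) reproduces the 0.50 proximity from the earlier output:

import com.aliasi.spell.JaccardDistance;
import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class JaccardDemo {
    public static void main(String[] args) {
        TokenizerFactory tokenizerFactory = new RegExTokenizerFactory("\\w+");
        JaccardDistance jaccardD = new JaccardDistance(tokenizerFactory);
        String first = "Sooo, I want to have a Disney Princess movie night....";
        String second = "I just want to be a Disney Princess";
        // 6 shared tokens out of 12 distinct tokens: 6/12 = 0.50
        System.out.printf("proximity %.2f%n", jaccardD.proximity(first, second));
        // distance is 1 - proximity, so also 0.50 here
        System.out.printf("distance %.2f%n", jaccardD.distance(first, second));
    }
}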
The nested for loops in the following example compare each row with every subsequent row until a proximity at or above the cutoff is found or until all the data has been looked at. Do not use this for large datasets, because the algorithm is O(n²). If no other row reaches the cutoff proximity, the row is added to filteredTexts:
List<String[]> filteredTexts = new ArrayList<String[]>();
for (int i = 0; i < texts.size(); ++i) {
    String targetText = texts.get(i)[TEXT_OFFSET];
    boolean addText = true;
    for (int j = i + 1; j < texts.size(); ++j) {
        String comparisonText = texts.get(j)[TEXT_OFFSET];
        double proximity = jaccardD.proximity(targetText, comparisonText);
        if (proximity >= cutoff) {
            addText = false; // a later row is too close, so drop this one
            System.out.printf(" Tweets too close, proximity %.2f\n", proximity);
            System.out.println("\t" + targetText);
            System.out.println("\t" + comparisonText);
            break;
        }
    }
    if (addText) {
        filteredTexts.add(texts.get(i));
    }
}
return filteredTexts;
}
There are much better ways to filter the texts efficiently, at a cost of extra complexity: a simple reverse word-lookup index to compute an initial covering set will be vastly more efficient; search for shingling text lookup for O(n) to O(n log n) approaches.
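As a rough illustration of the reverse-index idea, the following is a sketch, not the book's code: the method name filterJaccardIndexed is made up, it borrows TEXT_OFFSET from the surrounding Util class, and unlike filterJaccard it keeps the first of a duplicate pair rather than the last. Since a nonzero Jaccard proximity requires at least one shared token, only rows that share a token with a previously kept row need to be compared:

// requires java.util.*, com.aliasi.spell.JaccardDistance,
// com.aliasi.tokenizer.Tokenizer, com.aliasi.tokenizer.TokenizerFactory
public static List<String[]> filterJaccardIndexed(List<String[]> texts,
        TokenizerFactory tokFactory, double cutoff) {
    JaccardDistance jaccardD = new JaccardDistance(tokFactory);
    List<String[]> kept = new ArrayList<String[]>();
    // inverted index: token -> indices of kept rows containing it
    Map<String,Set<Integer>> tokenToKept = new HashMap<String,Set<Integer>>();
    for (String[] row : texts) {
        String text = row[TEXT_OFFSET];
        Tokenizer tokenizer
            = tokFactory.tokenizer(text.toCharArray(), 0, text.length());
        Set<String> tokens = new HashSet<String>(Arrays.asList(tokenizer.tokenize()));
        // candidate rows: kept rows sharing at least one token (assumes cutoff > 0)
        Set<Integer> candidates = new HashSet<Integer>();
        for (String token : tokens) {
            Set<Integer> ids = tokenToKept.get(token);
            if (ids != null) {
                candidates.addAll(ids);
            }
        }
        boolean addText = true;
        for (int id : candidates) {
            if (jaccardD.proximity(text, kept.get(id)[TEXT_OFFSET]) >= cutoff) {
                addText = false; // near duplicate of an earlier kept row
                break;
            }
        }
        if (addText) {
            int id = kept.size();
            kept.add(row);
            for (String token : tokens) {
                Set<Integer> ids = tokenToKept.get(token);
                if (ids == null) {
                    ids = new HashSet<Integer>();
                    tokenToKept.put(token, ids);
                }
                ids.add(id);
            }
        }
    }
    return kept;
}

Note that very common tokens (Disney appears in nearly every tweet here) inflate the candidate sets, so in practice a stop list or shingling is what keeps the comparisons close to linear.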
Setting the threshold can be a bit tricky, but looking at a bunch of data should make the appropriate cutoff fairly clear for your needs.
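One way to do that looking is to print the sorted pairwise proximities for a small random sample and look for the gap between ordinary word overlap and near duplicates. This helper is hypothetical (the name, the fixed seed, and the reuse of TEXT_OFFSET are mine):

// requires java.util.*, com.aliasi.spell.JaccardDistance,
// com.aliasi.tokenizer.TokenizerFactory
public static void printProximitySample(List<String[]> texts,
        TokenizerFactory tokFactory, int sampleSize) {
    JaccardDistance jaccardD = new JaccardDistance(tokFactory);
    List<String[]> sample = new ArrayList<String[]>(texts);
    Collections.shuffle(sample, new Random(42)); // fixed seed for repeatability
    int n = Math.min(sampleSize, sample.size());
    List<Double> proximities = new ArrayList<Double>();
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            proximities.add(jaccardD.proximity(sample.get(i)[TEXT_OFFSET],
                                               sample.get(j)[TEXT_OFFSET]));
        }
    }
    Collections.sort(proximities);
    for (double p : proximities) {
        System.out.printf("%.2f%n", p);
    }
}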