- Natural Language Processing with Java and LingPipe Cookbook
- Breck Baldwin Krishna Dayanidhi
- 2021-08-05 17:12:50
Viewing error categories – false positives
We can achieve the best possible classifier performance by examining the errors and making changes to the system. Developers and machine-learning practitioners have a very bad habit of not looking at errors, particularly as systems mature. To be clear, by the end of a project, the developers responsible for tuning the classifier should be very familiar with the domain being classified, if not expert in it, because they have looked at so much data while tuning the system. If you cannot do a reasonable job of emulating the classifier that you are tuning, then you are not looking at enough data.
This recipe performs the most basic form of error analysis: looking at false positives, which are examples from the training data that the classifier assigned to a category when the correct category was something else.
How to do it...
Perform the following steps in order to view error categories using false positives:
- This recipe extends the previous How to train and evaluate with cross validation recipe by accessing more of what the evaluation class provides. Get a command prompt and type:
java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.ReportFalsePositivesOverXValidation
- This will result in:
Training data is: data/disney_e_n.csv
reference\response
\e,n,
e 10,1,
n 6,4,
False Positives for e
Malisímos los nuevos dibujitos de disney, nickelodeon, cartoon, etc, no me gustannn : n
@meeelp mas que venha um filhinho mais fofo que o próprio pai, com covinha e amando a Disney kkkkkkkkkkkkkkkkk : n
@HedyHAMIDI au quartier pas a Disney moi : n
@greenath_ t'as de la chance d'aller a Disney putain j'y ai jamais été moi. : n
Prefiro gastar uma baba de dinheiro pra ir pra cancun doq pra Disney por exemplo : n
ES INSUPERABLE DISNEY !! QUIERO VOLVER:( : n
False Positives for n
request now "let's get tricky" by @bellathorne and @ROSHON on @radiodisney!!! just call 1-877-870-5678 or at http://t.co/cbne5yRKhQ!! <3 : e
- The output starts with a confusion matrix. Then, we see the six instances of false positives for e, which come from the lower left-hand cell of the confusion matrix, in the column labeled with the category that the classifier guessed. Then, we see the false positives for n, which is a single example. The true category is appended after the colon, which is helpful for classifiers that have more than two categories.
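As a rough cross-check on the output above, the error counts can be read directly off the confusion matrix: the false positives for a category are the sum of its column minus the diagonal cell, and the false negatives are the sum of its row minus the diagonal cell. The following is a minimal plain-Java sketch, not LingPipe code; the class and method names are hypothetical, and the matrix values are copied from the output shown in this recipe.

```java
class ConfusionMatrixErrors {

    // Rows are reference (truth) categories and columns are response
    // (classifier) categories, matching the "reference\response" layout.

    // False positives: the classifier said `category`, but the truth differed.
    static int falsePositives(int[][] matrix, int category) {
        int count = 0;
        for (int ref = 0; ref < matrix.length; ref++) {
            if (ref != category) {
                count += matrix[ref][category];
            }
        }
        return count;
    }

    // False negatives: the truth was `category`, but the classifier said otherwise.
    static int falseNegatives(int[][] matrix, int category) {
        int count = 0;
        for (int resp = 0; resp < matrix[category].length; resp++) {
            if (resp != category) {
                count += matrix[category][resp];
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] categories = {"e", "n"};
        int[][] matrix = {{10, 1}, {6, 4}}; // values from the recipe output
        for (int i = 0; i < categories.length; i++) {
            System.out.println(categories[i]
                    + ": falsePositives=" + falsePositives(matrix, i)
                    + ", falseNegatives=" + falseNegatives(matrix, i));
        }
        // prints:
        // e: falsePositives=6, falseNegatives=1
        // n: falsePositives=1, falseNegatives=6
    }
}
```

The counts agree with the listing: six tweets under "False Positives for e" and one under "False Positives for n".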
How it works…
This recipe is based on the previous one, but it has its own source in com/lingpipe/cookbook/chapter1/ReportFalsePositivesOverXValidation.java. There are two differences. First, storeInputs is set to true for the evaluator:
boolean storeInputs = true;
BaseClassifierEvaluator<CharSequence> evaluator
    = new BaseClassifierEvaluator<CharSequence>(null, categories, storeInputs);
Second, a Util method is added to print false positives:
for (String category : categories) {
    Util.printFalsePositives(category, evaluator, corpus);
}
The preceding code works by identifying a category of focus, such as e for English tweets, and extracting all the false positives for it from the classifier evaluator. For this category, false positives are tweets that are non-English in truth, but the classifier thought they were English. The referenced Util method is as follows:
public static <E> void printFalsePositives(String category,
        BaseClassifierEvaluator<E> evaluator,
        Corpus<ObjectHandler<Classified<E>>> corpus) throws IOException {
    final Map<E,Classification> truthMap = new HashMap<E,Classification>();
    corpus.visitCorpus(new ObjectHandler<Classified<E>>() {
        @Override
        public void handle(Classified<E> data) {
            truthMap.put(data.getObject(), data.getClassification());
        }
    });
The preceding code takes the corpus that contains all the truth data and populates a Map<E,Classification> to allow lookup of the truth annotation, given the input. This method is not robust if the same input exists in two categories: the map will only record the last example seen. The method continues:
    List<Classified<E>> falsePositives = evaluator.falsePositives(category);
    System.out.println("False Positives for " + category);
    for (Classified<E> classified : falsePositives) {
        E data = classified.getObject();
        Classification truthClassification = truthMap.get(data);
        System.out.println(data + " : " + truthClassification.bestCategory());
    }
}
The second half of the method gets the false positives from the evaluator, iterates over all of them, looks up each one in the truthMap built in the first half, and prints out the relevant information. The evaluator also has methods to get false negatives, true positives, and true negatives.
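The last-example-wins behavior of the truth map can be made visible before it causes confusion. The following is a minimal sketch in plain Java, not part of the cookbook's Util class: it assumes the corpus has been reduced to hypothetical {input, category} string pairs, and the class name, method name, and sample rows are all made up for illustration. It relies on Map.put returning the previous value for a key, so a differing previous category flags a conflict.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class TruthMapConflicts {

    // Fills truthMap with input -> category and returns the inputs that
    // were seen with more than one category. As in the recipe, the map
    // keeps only the last category seen for each input.
    static Set<String> findConflicts(String[][] labeledRows,
                                     Map<String, String> truthMap) {
        Set<String> conflicts = new LinkedHashSet<>();
        for (String[] row : labeledRows) {
            String input = row[0];
            String category = row[1];
            // put() returns the previous value, or null if the key was absent
            String previous = truthMap.put(input, category);
            if (previous != null && !previous.equals(category)) {
                conflicts.add(input);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // Made-up rows: "Disney" is annotated as both e and n
        String[][] rows = {
            {"I love Disney", "e"},
            {"Disney", "e"},
            {"Disney", "n"},
        };
        Map<String, String> truthMap = new HashMap<>();
        Set<String> conflicts = findConflicts(rows, truthMap);
        System.out.println("Conflicting inputs: " + conflicts);
        System.out.println("Recorded truth for Disney: " + truthMap.get("Disney"));
        // prints:
        // Conflicting inputs: [Disney]
        // Recorded truth for Disney: n
    }
}
```

Any input reported here would print with whichever truth category happened to come last in the corpus, so it is worth resolving such rows in the data before reading the false positive listing.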
The ability to identify mistakes is crucial to improving performance. The advice seems obvious, but it is very common for developers to not look at mistakes. They will look at system output and make a rough estimate of whether the system is good enough; this does not result in top-performing classifiers.
The next recipe works through more evaluation metrics and their definitions.