Natural Language Processing with Java and LingPipe Cookbook
Breck Baldwin, Krishna Dayanidhi
Getting confidence estimates from a classifier
Classifiers tend to be a lot more useful if they give more information about how confident they are of the classification—this is usually a score or a probability. We often threshold classifiers to help fit the performance requirements of an installation. For example, if it was vital that the classifier never makes a mistake, then we could require that the classification be very confident before committing to a decision.
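For example, here is a minimal sketch of such a threshold. It assumes a ConditionalClassifier<CharSequence> named classifier (described in the hierarchy that follows) has already been deserialized, and the 0.95 cutoff is an arbitrary example value, not a LingPipe recommendation:
String text = "The rain in Spain falls mainly on the plain.";
ConditionalClassification classification = classifier.classify(text);
if (classification.conditionalProbability(0) >= 0.95) {   // confidence in the best (rank 0) category
    System.out.println("Committing to: " + classification.bestCategory());
} else {
    System.out.println("Not confident enough; defer the decision.");
}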
LingPipe classifiers exist on a hierarchy based on the kinds of estimates they provide. The backbone is a series of interfaces—don't freak out; it is actually pretty simple. You don't need to understand it now, but we do need to write it down somewhere for future reference:
- BaseClassifier<E>: This is just your basic classifier of objects of type E. It has a classify() method that returns a classification, which in turn has a bestCategory() method and a toString() method that is of some informative use.
- RankedClassifier<E> extends BaseClassifier<E>: The classify() method returns a RankedClassification, which extends Classification and adds a category(int rank) method that says what the 1st to nth classifications are. There is also a size() method that indicates how many classifications there are.
- ScoredClassifier<E> extends RankedClassifier<E>: The returned ScoredClassification adds a score(int rank) method.
- ConditionalClassifier<E> extends RankedClassifier<E>: The ConditionalClassification produced by this has the property that the scores for all categories sum to 1, as accessed via the conditionalProbability(int rank) and conditionalProbability(String category) methods. There's more; you can read the Javadoc for this. This classification will be the workhorse of the book when things get fancy and we want to know the confidence that the tweet is English versus Japanese versus Spanish. These estimates will have to sum to 1.
- JointClassifier<E> extends ConditionalClassifier<E>: This provides a JointClassification of the input and category in the space of all the possible inputs, and all such estimates sum to 1. This is a sparse space, so values are log based to avoid underflow errors. We don't see a lot of use of this estimate directly in production.
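To make the hierarchy concrete, here is a minimal sketch that pulls one value from each level out of a single JointClassification. It assumes a JointClassifier<CharSequence> named classifier has already been deserialized (as shown later in this recipe); confirm the exact accessor signatures in the com.aliasi.classify Javadoc:
JointClassification jc = classifier.classify("The rain in Spain falls mainly on the plain.");
String best = jc.bestCategory();                 // Classification: the rank 0 category
for (int rank = 0; rank < jc.size(); ++rank) {   // RankedClassification: ordering and size()
    String category = jc.category(rank);
    double score = jc.score(rank);                           // ScoredClassification
    double conditional = jc.conditionalProbability(rank);    // ConditionalClassification: sums to 1 over ranks
    double jointLog2 = jc.jointLog2Probability(rank);        // JointClassification: log2 P(category, input)
    System.out.printf("%d %s %.5f %.5g %.3f%n", rank, category, score, conditional, jointLog2);
}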
It is obvious that there has been a great deal of thought put into the classification stack presented. This is because huge numbers of industrial NLP problems are handled by a classification system in the end.
It turns out that our simplest classifier—in some arbitrary sense of simple—produces the richest estimates, which are joint classifications. Let's dive in.
Getting ready
In the previous recipe, we blithely deserialized to BaseClassifier<String>, which hid all the details of what was going on. The reality is a bit more complex than suggested by the hazy abstract class. Note that the file on disk that was loaded is named 3LangId.LMClassifier. By convention, we name serialized models with the type of object they will deserialize to, which, in this case, is LMClassifier, and it extends BaseClassifier. The most specific typing for the classifier is:
LMClassifier<CompiledNGramBoundaryLM, MultivariateDistribution> classifier
    = (LMClassifier<CompiledNGramBoundaryLM, MultivariateDistribution>)
      AbstractExternalizable.readObject(new File(args[0]));
The cast to LMClassifier<CompiledNGramBoundaryLM, MultivariateDistribution> specifies the type of distribution to be MultivariateDistribution. The Javadoc for com.aliasi.stats.MultivariateDistribution is quite explicit and helpful in describing what this is.
Note
MultivariateDistribution implements a discrete distribution over a finite set of outcomes, numbered consecutively from zero.
The Javadoc goes into a lot of detail about MultivariateDistribution, but it basically means that we can have an n-way assignment of probabilities that sum to 1.
The next class in the cast is CompiledNGramBoundaryLM, which is the "memory" of the LMClassifier. In fact, each language gets its own. This means that English will have a separate language model from Spanish, and so on. There are eight different kinds of language models that could have been used as this part of the classifier; consult the Javadoc for the LanguageModel interface. Each language model (LM) has the following properties:
- The LM will provide a probability that it generated the text provided. It is robust against data that it has not seen before, in the sense that it won't crash or give a zero probability. Arabic just comes across as a sequence of unknown characters for our example.
- The sum of all the possible character sequence probabilities of any length is 1 for boundary LMs. Process LMs sum the probability to 1 over all sequences of the same length. Look at the Javadoc for how this bit of math is done.
- Each language model has no knowledge of data outside of its category.
- The classifier keeps track of the marginal probability of the category and factors this into the results for the category. Marginal probability is saying that we tend to see two-thirds English, one-sixth Spanish, and one-sixth Japanese in Disney tweets. This information is combined with the LM estimates.
- The LM is a compiled version of LanguageModel.Dynamic that we will cover in the later recipes that discuss training.
The LMClassifier that is constructed wraps these components into a classifier.
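As a rough sketch of how these pieces combine (not the exact LingPipe internals; consult the Javadoc for the real computation), the joint estimate adds the log marginal category probability to the LM's log estimate of the text. The numbers below are illustrative, reusing the Disney-tweet marginals mentioned above:
// Illustrative arithmetic only; the variable names and the -165.0 estimate are made up for this sketch.
double log2PriorEnglish = Math.log(2.0 / 3.0) / Math.log(2.0);  // log2 P(english) from the marginal
double log2TextGivenEnglish = -165.0;                           // hypothetical log2 P(text | english) from the English LM
double log2Joint = log2PriorEnglish + log2TextGivenEnglish;     // log2 P(english, text)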
Luckily, the interface saves the day with a more aesthetic deserialization:
JointClassifier<String> classifier = (JointClassifier<String>) AbstractExternalizable.readObject(new File(classifierPath));
The interface hides the guts of the implementation nicely and this is what we are going with in the example program.
How to do it…
This recipe is the first time we start peeling away the layers of what classifiers can do, but first, let's play with it a bit:
- Get your magic shell genie to conjure a command prompt with a Java interpreter and type:
java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar: com.lingpipe.cookbook.chapter1.RunClassifierJoint
- We will enter the same data as we did earlier:
Type a string to be classified. Empty string to quit.
The rain in Spain falls mainly on the plain.
Rank Categ      Score      P(Category|Input)    log2 P(Category,Input)
0=english   -3.60092   0.9999999999         -165.64233893156052
1=spanish   -4.50479   3.04549412621E-13    -207.2207276413206
2=japanese  -14.369    7.6855682344E-150    -660.989401136873
As described, JointClassification carries through all the classification metrics in the hierarchy rooted at Classification. Each level of classification shown as follows adds to the ones preceding it:
- Classification provides the first best category as the rank 0 category.
- RankedClassification adds an ordering of all the possible categories, with a lower rank corresponding to a greater likelihood of the category. The Rank column reflects this ordering.
- ScoredClassification adds a numeric score to the ranked output. Note that scores might or might not compare well against other strings being classified, depending on the type of classifier. This is the column labeled Score. To understand the basis of this score, consult the relevant Javadoc.
- ConditionalClassification further refines the score by making it a category probability conditioned on the input. The probabilities of all categories will sum to 1. This is the column labeled P(Category|Input), which is the traditional way to write the probability of the category given the input.
- JointClassification adds the log2 (log base 2) probability of the input and the category; this is the joint probability. The probabilities of all categories and inputs will sum to 1, which is a very large space indeed, with very low probabilities assigned to any pair of category and string. This is why log2 values are used to prevent numerical underflow. This is the column labeled log2 P(Category,Input), which is read as the log2 probability of the category and input.
Look at the Javadoc for the com.aliasi.classify package for more information on the metrics and the classifiers that implement them.
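The last two columns are related by simple arithmetic: the conditional probability is just the joint probability renormalized over the three categories. Here is a quick sketch that recovers P(Category|Input) for English from the log2 joint values printed above (plain probability math, not a LingPipe call):
// Recover P(Category|Input) from the log2 P(Category,Input) column of the output above.
double[] log2Joint = { -165.64233893156052, -207.2207276413206, -660.989401136873 };
double total = 0.0;
for (double lp : log2Joint) {
    total += Math.pow(2.0, lp);   // safe at these magnitudes; far smaller log values would underflow
}
double pEnglish = Math.pow(2.0, log2Joint[0]) / total;
System.out.println(pEnglish);     // approximately 0.9999999999, matching the P(Category|Input) column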
How it works…
The code is in src/com/lingpipe/cookbook/chapter1/RunClassifierJoint.java, and it deserializes to a JointClassifier<CharSequence>:
public static void main(String[] args) throws IOException, ClassNotFoundException {
  String classifierPath = args.length > 0 ? args[0] : "models/3LangId.LMClassifier";
  @SuppressWarnings("unchecked")
  JointClassifier<CharSequence> classifier = (JointClassifier<CharSequence>) AbstractExternalizable.readObject(new File(classifierPath));
  Util.consoleInputPrintClassification(classifier);
}
It makes a call to Util.consoleInputPrintClassification(classifier), which minimally differs from Util.consoleInputBestCategory(classifier) in that it uses the toString() method of the classification to print. The code is as follows:
public static void consoleInputPrintClassification(BaseClassifier<CharSequence> classifier) throws IOException {
  BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
  while (true) {
    System.out.println("\nType a string to be classified. " + "Empty string to quit.");
    String data = reader.readLine();
    if (data.equals("")) {
      return;
    }
    Classification classification = classifier.classify(data);
    System.out.println(classification);
  }
}
We got a richer output than we expected, because the type is Classification, but the toString() method will be applied to the runtime type, JointClassification.
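This is ordinary Java dynamic dispatch: the variable's static type only limits which methods we may call, while the toString() body that actually runs is the override on the runtime object. A minimal, self-contained illustration with hypothetical classes (not LingPipe types):
// DispatchDemo.java: hypothetical classes showing why println(classification) prints the subclass view.
class Base {
    @Override public String toString() { return "base view"; }
}
class Derived extends Base {
    @Override public String toString() { return "derived view"; }
}
public class DispatchDemo {
    public static void main(String[] args) {
        Base b = new Derived();      // static type Base, runtime type Derived
        System.out.println(b);       // prints "derived view"
    }
}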
See also
- There is detailed information on language models in the later recipes that cover training.