- Natural Language Processing with Java and LingPipe Cookbook
- Breck Baldwin Krishna Dayanidhi
- 2021-08-05 17:12:49
Deserializing and running a classifier
This recipe does two things: introduces a very simple and effective language ID classifier and demonstrates how to deserialize a LingPipe class. If you find yourself here from a later chapter, trying to understand deserialization, I encourage you to run the example program anyway. It will take 5 minutes, and you might learn something useful.
Our language ID classifier is based on character language models. Each language model gives you the probability of the text, given that it is generated in that language. The model that is most familiar with the text is the first best fit. This one has already been built, but later in the chapter, you will learn to make your own.
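To make the idea of "the model most familiar with the text wins" concrete, here is a minimal sketch of character language model classification. It is not LingPipe's implementation: the class name CharLmSketch, the bigram order, the add-one smoothing, the alphabet size, and the tiny training strings are all assumptions chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy character-bigram language ID. Hypothetical; not LingPipe code. */
final class CharLmSketch {

    /** Assumed alphabet size for smoothing: one slot per UTF-16 code unit. */
    static final int ALPHABET_SIZE = 65536;

    /** Character-bigram counts for one language. */
    static final class Model {
        private final Map<String, Integer> bigramCounts = new HashMap<>();
        private final Map<Character, Integer> contextCounts = new HashMap<>();

        void train(String text) {
            for (int i = 1; i < text.length(); i++) {
                bigramCounts.merge(text.substring(i - 1, i + 1), 1, Integer::sum);
                contextCounts.merge(text.charAt(i - 1), 1, Integer::sum);
            }
        }

        /** Add-one smoothed log probability of the text under this model. */
        double logProb(String text) {
            double sum = 0.0;
            for (int i = 1; i < text.length(); i++) {
                int bigram = bigramCounts.getOrDefault(text.substring(i - 1, i + 1), 0);
                int context = contextCounts.getOrDefault(text.charAt(i - 1), 0);
                sum += Math.log((bigram + 1.0) / (context + ALPHABET_SIZE));
            }
            return sum;
        }
    }

    /** Returns the label of the model that gives the text the highest probability. */
    static String classify(Map<String, Model> models, String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Model> entry : models.entrySet()) {
            double score = entry.getValue().logProb(text);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Two models trained on one made-up sentence each (illustrative data only). */
    static Map<String, Model> demoModels() {
        Map<String, Model> models = new LinkedHashMap<>();
        Model english = new Model();
        english.train("the rain in spain falls mainly on the plain "
                + "and the quick brown fox jumps over the lazy dog");
        Model spanish = new Model();
        spanish.train("la lluvia cae principalmente en el llano "
                + "y el perro perezoso salta sobre el zorro");
        models.put("english", english);
        models.put("spanish", spanish);
        return models;
    }

    public static void main(String[] args) {
        Map<String, Model> models = demoModels();
        System.out.println(classify(models, "the dog"));
        System.out.println(classify(models, "el perro"));
    }
}
```

Each model scores the input independently, and the classifier simply returns the label with the highest score. A real model would be trained on far more text and would likely use longer character n-grams.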
How to do it...
Perform the following steps to deserialize and run a classifier:
- Go to the cookbook directory for the book and run the command for OSX, Unix, and Linux:
  java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.RunClassifierFromDisk
  For Windows invocation (quote the classpath and use ; instead of :):
  java -cp "lingpipe-cookbook.1.0.jar;lib\lingpipe-4.1.0.jar" com.lingpipe.cookbook.chapter1.RunClassifierFromDisk
  We will use the Unix style command line in this book.
- The program reports the model being loaded (the default, in this case) and prompts for a sentence to classify:
  Loading: models/3LangId.LMClassifier
  Type a string to be classified. Empty string to quit.
  The rain in Spain falls mainly on the plain.
  english
  Type a string to be classified. Empty string to quit.
  la lluvia en España cae principalmente en el llano.
  spanish
  Type a string to be classified. Empty string to quit.
  スペインの雨は主に平野に落ちる。
  japanese
- The classifier is trained on English, Spanish, and Japanese. We have entered an example of each—to get some Japanese, go to http://ja.wikipedia.org/wiki/. These are the only languages it knows about, but it will guess on any text. So, let's try some Arabic:
  Type a string to be classified. Empty string to quit.
  ????? ?? ??????? ??? ????? ??? ???.
  japanese
- The classifier guesses Japanese because Japanese has many more characters than English or Spanish, which in turn leads the Japanese model to expect more unknown characters. All the Arabic characters are unknown.
- If you are working with a Windows terminal, you might encounter difficulty entering UTF-8 characters.
How it works...
The code in the jar is cookbook/src/com/lingpipe/cookbook/chapter1/RunClassifierFromDisk.java. A pre-built model for language identification is deserialized and made available. It has been trained on English, Japanese, and Spanish. The training data came from Wikipedia pages for each language. You can see the data in data/3LangId.csv. The focus of this recipe is to show you how to deserialize the classifier and run it; training is handled in the Training your own language model classifier recipe in this chapter. The RunClassifierFromDisk.java file starts with the package declaration, followed by the imports, the start of the RunClassifierFromDisk class, and the start of main():
package com.lingpipe.cookbook.chapter1;

import java.io.File;
import java.io.IOException;

import com.aliasi.classify.BaseClassifier;
import com.aliasi.util.AbstractExternalizable;
import com.lingpipe.cookbook.Util;

public class RunClassifierFromDisk {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
The preceding code is standard Java, and we present it without explanation. Next is a feature found in most recipes: a default value is supplied for a file if the command line does not contain one. This allows you to use your own data if you have it; otherwise, the program runs from files in the distribution. In this case, a default classifier is supplied if there is no argument on the command line:
        String classifierPath = args.length > 0 ? args[0] : "models/3LangId.LMClassifier";
        System.out.println("Loading: " + classifierPath);
Next, we will see how to deserialize a classifier or another LingPipe object from disk:
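        File serializedClassifier = new File(classifierPath);
        // Typed on CharSequence so that it matches the parameter of
        // Util.consoleInputBestCategory(BaseClassifier<CharSequence>).
        @SuppressWarnings("unchecked")
        BaseClassifier<CharSequence> classifier
            = (BaseClassifier<CharSequence>) AbstractExternalizable.readObject(serializedClassifier);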
The preceding code snippet is the first LingPipe-specific code, where the classifier is built using the static AbstractExternalizable.readObject method.
This class is employed throughout LingPipe to carry out a compilation of classes for two reasons. First, it allows the compiled objects to have final variables set, which supports LingPipe's extensive use of immutables. Second, it avoids the messiness of exposing the I/O methods required for externalization and deserialization, most notably, the no-argument constructor. This class is used as the superclass of a private internal class that does the actual compilation. This private internal class implements the required no-arg constructor and stores the object required for readResolve().
Note
The reason we use Externalizable instead of Serializable is to avoid breaking backward compatibility when changing any method signatures or member variables. Externalizable extends Serializable and allows control of how the object is read or written. For more information on this, refer to the excellent chapter on serialization in Josh Bloch's book, Effective Java, 2nd Edition.
BaseClassifier<E> is the foundational classifier interface, with E being the type of object being classified in LingPipe. Look at the Javadoc to see the range of classifiers that implement the interface; there are 10 of them. Deserializing to BaseClassifier<E> hides a good bit of complexity, which we will explore later in the How to serialize a LingPipe object – classifier example recipe in this chapter.
The last line calls a utility method, which we will use frequently in this book:
Util.consoleInputBestCategory(classifier);
This method handles interactions with the command line. The code is in src/com/lingpipe/cookbook/Util.java:
public static void consoleInputBestCategory(
        BaseClassifier<CharSequence> classifier) throws IOException {
    BufferedReader reader
        = new BufferedReader(new InputStreamReader(System.in));
    while (true) {
        System.out.println("\nType a string to be classified. "
            + " Empty string to quit.");
        String data = reader.readLine();
        // readLine() returns null at end of input, so check for that as well
        if (data == null || data.equals("")) {
            return;
        }
        // Classify the line and print the top-scoring category
        Classification classification = classifier.classify(data);
        System.out.println("Best Category: "
            + classification.bestCategory());
    }
}
Once the string is read in from the console, classifier.classify(data) is called, which returns a Classification. This, in turn, provides a String label that is printed out. That's it! You have run a classifier.
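The read, classify, and stop-on-empty loop can also be exercised without a terminal. The following is a minimal sketch under stated assumptions: ClassifyLoopSketch and classifyLines are hypothetical names, and a plain java.util.function.Function stands in for the LingPipe classifier so that the example runs without the LingPipe jar.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper; mirrors the console loop with a pluggable classifier. */
final class ClassifyLoopSketch {

    /** Reads lines until an empty line or end of input and collects one label per line. */
    static List<String> classifyLines(BufferedReader reader,
                                      Function<CharSequence, String> bestCategory)
            throws IOException {
        List<String> labels = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            labels.add(bestCategory.apply(line));
        }
        return labels;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in rule for illustration only: any non-ASCII character means "other".
        Function<CharSequence, String> toy =
            text -> text.chars().allMatch(c -> c < 128) ? "ascii" : "other";
        BufferedReader reader =
            new BufferedReader(new StringReader("hello\nespa\u00f1a\n\nignored\n"));
        System.out.println(classifyLines(reader, toy));
    }
}
```

Passing the reader in as a parameter means the same loop can be driven from System.in, a file, or a string in a unit test.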