Natural Language Processing with Java and LingPipe Cookbook
Breck Baldwin and Krishna Dayanidhi
Training your own language model classifier
The world of NLP really opens up when classifiers are customized. This recipe provides details on how to customize a classifier by collecting examples for it to learn from; this is called training data. It is also known as gold standard data, truth, or ground truth. We already have some from the previous recipe, and we will use it here.
Getting ready
We will create a customized language ID classifier for English and other languages. Creating training data involves getting access to text data and then annotating it with the categories of the classifier; in this case, the annotation is the language of the text. Training data can come from a range of sources. Some possibilities include:
- Gold standard data such as the one created in the preceding evaluation recipe.
- Data that is somehow already annotated for the categories you care about. For example, Wikipedia has language-specific versions, which make easy pickings to train up a language ID classifier. This is how we created the 3LangId.LMClassifier model.
- Be creative: where is the data that helps guide a classifier in the right direction?
Language ID doesn't require much data to work well, so 20 tweets per language will start to reliably distinguish strongly different languages. The amount of training data will be driven by evaluation—more data generally improves performance.
The example assumes that around 10 tweets of English and 10 non-English tweets have been annotated by people and put in data/disney_e_n.csv.
How to do it...
In order to train your own language model classifier, perform the following steps:
- Fire up a terminal and type the following:
java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TrainAndRunLMClassifier
- Then, type some English at the command prompt, perhaps a Kurt Vonnegut quotation, to see the resulting JointClassification. See the Getting confidence estimates from a classifier recipe for an explanation of the following output:

Type a string to be classified. Empty string to quit.
So it goes.
Rank Categ Score          P(Category|Input)  log2 P(Category,Input)
0=e  -4.24592987919  0.9999933712053   -55.19708842949149
1=n  -5.56922173547  6.62884502334E-6  -72.39988256112824
- Type in some non-English, such as the Spanish title of Borges's The Garden of Forking Paths:

Type a string to be classified. Empty string to quit.
El Jardín de senderos que se bifurcan
Rank Categ Score          P(Category|Input)  log2 P(Category,Input)
0=n  -5.6612148689  0.999989087229795  -226.44859475801326
1=e  -6.0733050528  1.091277041753E-5  -242.93220211249715
How it works...
The program is in src/com/lingpipe/cookbook/chapter1/TrainAndRunLMClassifier.java; the contents of the main() method start with:
String dataPath = args.length > 0 ? args[0] : "data/disney_e_n.csv";
List<String[]> annotatedData = Util.readAnnotatedCsvRemoveHeader(new File(dataPath));
String[] categories = Util.getCategories(annotatedData);
The preceding code gets the contents of the .csv file and then extracts the list of categories that were annotated; these categories will be all the non-empty strings in the annotation column.
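The Util class ships with the cookbook's source, but as a rough sketch, the category extraction presumably amounts to collecting the distinct non-empty annotation strings. The following helper is a hypothetical reconstruction, not the book's actual code:

import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical reconstruction of Util.getCategories(); the shipped
// utility may differ in detail.
static String[] getCategories(List<String[]> annotatedData) {
  Set<String> categories = new TreeSet<String>();
  for (String[] row : annotatedData) {
    String annotation = row[Util.ANNOTATION_OFFSET];
    if (annotation != null && annotation.length() > 0) {
      categories.add(annotation); // each distinct label once, sorted
    }
  }
  return categories.toArray(new String[0]);
}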
The following DynamicLMClassifier is created using a static method that requires the array of categories and an int, which is the order of the language models. With an order of 3, the language model will be trained on all 1 to 3 character sequences of the training text. So "I luv Disney" will produce training instances of "I", "I ", "I l", " l", " lu", "u", "uv", "luv", and so on. The createNGramBoundary method appends a special token to the beginning and end of each text sequence; this token can help if the beginnings or ends of texts are informative for classification. Most text data is sensitive to beginnings/ends, so we will choose this model:
int maxCharNGram = 3;
DynamicLMClassifier<NGramBoundaryLM> classifier = DynamicLMClassifier.createNGramBoundary(categories, maxCharNGram);
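To make the order-3 training concrete, here is a small standalone sketch that enumerates the 1 to 3 character sequences such a model is trained on; the boundary-token handling of NGramBoundaryLM is simplified away:

// Illustrative only: list the 1- to 3-character substrings of a text,
// roughly the events an order-3 character language model counts.
// NGramBoundaryLM's real boundary handling is more involved.
public class CharNGrams {
  public static void main(String[] args) {
    String text = "I luv Disney";
    int maxCharNGram = 3;
    for (int n = 1; n <= maxCharNGram; ++n) {
      for (int i = 0; i + n <= text.length(); ++i) {
        System.out.println("\"" + text.substring(i, i + n) + "\"");
      }
    }
  }
}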
The following code iterates over the rows of training data and creates a Classified<CharSequence> in the same way as shown in the Evaluation of classifiers – the confusion matrix recipe. However, instead of passing the Classified object to an evaluation handler, it is used to train the classifier.
for (String[] row : annotatedData) {
  String truth = row[Util.ANNOTATION_OFFSET];
  String text = row[Util.TEXT_OFFSET];
  Classification classification = new Classification(truth);
  Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
  classifier.handle(classified);
}
No further steps are necessary, and the classifier is ready for use by the console:
Util.consoleInputPrintClassification(classifier);
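Under the hood, that utility does little more than read a line, classify it, and print the ranked results. The following is a minimal sketch of an equivalent loop, assuming LingPipe 4.1's JointClassification accessors (category(), score(), conditionalProbability(), and jointLog2Probability()):

// Minimal stand-in for Util.consoleInputPrintClassification().
// Assumes java.io.BufferedReader/InputStreamReader and
// com.aliasi.classify.JointClassification are imported.
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
while (true) {
  System.out.println("Type a string to be classified. Empty string to quit.");
  String line = reader.readLine();
  if (line == null || line.length() == 0) {
    break;
  }
  JointClassification jc = classifier.classify(line);
  System.out.println("Rank Categ Score P(Category|Input) log2 P(Category,Input)");
  for (int rank = 0; rank < jc.size(); ++rank) {
    System.out.println(rank + "=" + jc.category(rank)
        + " " + jc.score(rank)
        + " " + jc.conditionalProbability(rank)
        + " " + jc.jointLog2Probability(rank));
  }
}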
There's more...
Training and using the classifier can be interspersed for classifiers based on DynamicLM. This is generally not the case with other classifiers such as LogisticRegression, because they use all the data to compile a model that can carry out classifications.
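A related point: once training is finished, a DynamicLMClassifier can be compiled into a faster, immutable model and saved to disk. The following is a sketch, assuming LingPipe 4.1's AbstractExternalizable.compileTo() and readObject() helpers; check the Javadoc before relying on it:

// Compile the trained classifier to a file, then read it back.
// Assumes java.io.File and com.aliasi.util.AbstractExternalizable.
File modelFile = new File("disney_e_n.LMClassifier");
AbstractExternalizable.compileTo(classifier, modelFile);

// The compiled form is no longer trainable, but classifies faster.
@SuppressWarnings("unchecked")
JointClassifier<CharSequence> compiled =
    (JointClassifier<CharSequence>) AbstractExternalizable.readObject(modelFile);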
There is another method for training the classifier that gives you more control over how the training goes. The following is the code snippet for this:
Classification classification = new Classification(truth);
Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
classifier.handle(classified);
Alternatively, we can have the same effect with:
int count = 1;
classifier.train(truth, text, count);
The train() method allows an extra degree of control over training, because it allows the count to be set explicitly. As we explore LingPipe classifiers, we will often see an alternative way of training that allows for some additional control beyond what the handle() method provides.
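The count behaves like a weight: training with a count of n has the same effect on the underlying character counts as handling the identical example n times. A hypothetical use is up-weighting a scarce category:

// Hypothetical: give a rare non-English example five times the weight
// of a single observation, instead of calling handle() in a loop.
int count = 5;
classifier.train("n", "El Jardín de senderos que se bifurcan", count);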
Character-language model-based classifiers work very well for tasks where character sequences are distinctive. Language identification is an ideal candidate, but this approach can also be used for tasks such as sentiment analysis, topic assignment, and question answering.
See also
The Javadoc for LingPipe's classifiers is quite extensive on the underlying math that drives the technology.