MALLET

The Machine Learning for Language Toolkit (MALLET) is a large library of natural language processing algorithms and utilities. It can be used for a variety of tasks, such as document classification, document clustering, information extraction, and topic modelling. It provides a command-line interface as well as a Java API for several algorithms, including Naive Bayes, hidden Markov models (HMM), Latent Dirichlet topic models, logistic regression, and conditional random fields.

MALLET is available under the Common Public License 1.0, which means that you can even use it in commercial applications. It can be downloaded from http://mallet.cs.umass.edu. A MALLET instance is represented by a name, a label, data, and a source. There are two ways to import data into the MALLET format, as shown in the following list:

  • Instance per file: Each file or document corresponds to an instance and MALLET accepts the directory name for the input.
  • Instance per line: Each line corresponds to an instance and is assumed to follow the format instance_name label token, that is, the instance name, then the label, then the tokens. The data will be a feature vector consisting of the distinct words that appear as tokens and their occurrence counts.
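
To make the instance-per-line layout concrete, the following is a minimal plain-Java sketch of how one such line could be split into a name, a label, and word counts. It does not use or reproduce MALLET's own importer: the class name InstanceLineSketch and its parse method are hypothetical, and splitting on whitespace is a simplifying assumption (the real toolkit offers configurable tokenization).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of MALLET: parses one line laid out as
// "instance_name label token token ..." into a name, a label, and word counts.
final class InstanceLineSketch {
    final String name;
    final String label;
    final Map<String, Integer> counts; // distinct token -> occurrence count

    private InstanceLineSketch(String name, String label, Map<String, Integer> counts) {
        this.name = name;
        this.label = label;
        this.counts = counts;
    }

    static InstanceLineSketch parse(String line) {
        // Assumption: fields are separated by runs of whitespace.
        // Field 0 is the name, field 1 is the label, the rest are tokens.
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected: instance_name label token ...");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 2; i < fields.length; i++) {
            counts.merge(fields[i], 1, Integer::sum); // increment the count for this token
        }
        return new InstanceLineSketch(fields[0], fields[1], counts);
    }

    public static void main(String[] args) {
        InstanceLineSketch inst = parse("doc1 sports the game and the team");
        System.out.println(inst.name + " [" + inst.label + "] " + inst.counts);
    }
}
```

In this sketch, the map of counts plays the role of the feature vector described above: each distinct word appears once as a key, with the number of times it occurred on the line as its value.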

The library comprises the following packages:

  • cc.mallet.classify: These are algorithms for training classifiers and classifying instances, including AdaBoost, bagging, C4.5 and other decision tree models, multivariate logistic regression, Naive Bayes, and Winnow2.
  • cc.mallet.cluster: These are unsupervised clustering algorithms, including greedy agglomerative, hill climbing, k-best, and k-means clustering.
  • cc.mallet.extract: This implements tokenizers, document extractors, document viewers, cleaners, and so on.
  • cc.mallet.fst: This implements sequence models, including conditional random fields, HMM, maximum entropy Markov models, and corresponding algorithms and evaluators.
  • cc.mallet.grmm: This implements graphical models and factor graphs, together with algorithms for inference, learning, and testing, such as loopy belief propagation, Gibbs sampling, and so on.
  • cc.mallet.optimize: These are optimization algorithms for finding the maximum of a function, such as gradient ascent, limited-memory BFGS, stochastic meta ascent, and so on.
  • cc.mallet.pipe: These are processing steps that can be chained into pipelines to convert data into MALLET instances.
  • cc.mallet.topics: These are topic modelling algorithms, such as Latent Dirichlet allocation, four-level pachinko allocation, hierarchical PAM, DMRT, and so on.
  • cc.mallet.types: This implements fundamental data types such as dataset, feature vector, instance, and label.
  • cc.mallet.util: These are miscellaneous utility functions such as command-line processing, search, math, test, and so on.