Andrew McCallum



Code

  • MALLET is a library of Java code for machine learning applied to text. It provides facilities not only for document classification, but also for information extraction, part-of-speech tagging, noun phrase segmentation, and much more. The library is quite mature, but its front-ends and documentation are not yet as polished as rainbow's.
  • Libbow is a library of C code for document classification, clustering and retrieval. The library also includes rainbow, its popular front-end for document classification, and archer, a speedy disk-based document retrieval engine with an AltaVista-like query interface that can handle several gigabytes of text.
  • Cora HMM is the C implementation of HMMs used for information extraction in Cora. It was written by Kristie Seymore.
  • RLKIT is a software library that makes it easy to test various reinforcement learning algorithms in different environments with different sensory-motor systems. It's implemented in Objective-C and GNU Guile (Scheme).

Data

SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification]
73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
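
As a rough illustration of the two binary tasks, the sketch below derives both labelings from a single four-way group label. The short group names (`sim-auto` and so on) and the toy article list are assumptions made for this example; they are not the real newsgroup names or the data's file layout.

```python
from collections import Counter

# Hypothetical short names for the four discussion groups (assumed for
# illustration; the actual newsgroup names are not listed here).
GROUPS = ("sim-auto", "sim-aviation", "real-auto", "real-aviation")


def real_vs_simulated(group: str) -> str:
    """Binary task 1: is the article about the real or the simulated activity?"""
    return group.split("-")[0]  # "sim" or "real"


def auto_vs_aviation(group: str) -> str:
    """Binary task 2: is the article about autos or aviation?"""
    return group.split("-")[1]  # "auto" or "aviation"


# Toy stand-in for (article_id, group) pairs.
articles = [
    ("a1", "sim-auto"),
    ("a2", "sim-aviation"),
    ("a3", "real-auto"),
    ("a4", "real-aviation"),
    ("a5", "real-auto"),
]

# The same articles, counted under each of the two labelings.
by_reality = Counter(real_vs_simulated(g) for _, g in articles)
by_vehicle = Counter(auto_vs_aviation(g) for _, g in articles)

print(dict(by_reality))
print(dict(by_vehicle))
```

Each article keeps one underlying group, but which pairs of groups count as the same class changes with the task.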

Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.

Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.
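
To make "relational" concrete, here is a minimal sketch of papers that carry both a topic label and a list of cited papers, with a helper that tallies the topics of the papers a given paper cites. The paper ids, topic names, and record layout are invented for illustration and do not reflect the data's actual format.

```python
from collections import Counter

# Hypothetical toy records: (paper_id, topic_leaf, ids_of_cited_papers).
# A topic of None marks a paper whose label is unknown.
papers = [
    ("p1", "Reinforcement_Learning", ["p2", "p3"]),
    ("p2", "Reinforcement_Learning", []),
    ("p3", "Probabilistic_Methods", ["p2"]),
    ("p4", None, ["p1", "p2", "p3"]),
]

topic_of = {pid: topic for pid, topic, _ in papers}
cites = {pid: cited for pid, _, cited in papers}


def cited_topic_counts(paper_id: str) -> Counter:
    """Count the known topics of the papers that `paper_id` cites."""
    return Counter(
        topic_of[c] for c in cites[paper_id] if topic_of.get(c) is not None
    )


# The citation links give evidence about the unlabeled paper's topic.
print(cited_topic_counts("p4"))
```

A relational classifier could combine such neighbor-topic counts with the paper's own text; the tally here only shows where the extra signal comes from.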

Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

Frequently Asked Questions [information extraction]
Several UseNet FAQs segmented into questions and answers. Data gathered and labeled by Dayne Freitag and Andrew McCallum.

CMU Seminar Announcements [information extraction]
48 emailed seminar announcements, with labeled segments for speaker, title, start-time, end-time. Labeled by Dayne Freitag.

Industry Sector [document classification]
Corporate web pages classified into a topic hierarchy with about 70 leaves.

20 Newsgroups [document classification]
About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.