Andrew McCallum

Code

* MALLET is a library of Java code for machine learning applied to text. It provides facilities not only for document classification, but also for information extraction, part-of-speech tagging, noun phrase segmentation, and much more. The library is quite mature; however, its front-ends and documentation are not yet as polished as rainbow's.
* Libbow is a library of C code for document classification, clustering and retrieval. Also provided with the library are rainbow, its popular front-end for document classification, and archer, a speedy disk-based document retrieval engine with an AltaVista-like query interface that can handle several gigabytes of text.
* Cora HMM is the C implementation of HMMs used for information extraction in Cora. It was written by Kristie Seymore.
* RLKIT is a software library that makes it easy to test various reinforcement learning algorithms in different environments with different sensory-motor systems. It is implemented in Objective-C and GNU Guile (Scheme).

Data

SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification] 73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
Cora Citation Matching [reference matching, object correspondence] Text of citations hand-clustered into groups referring to the same paper.
Cora Research Paper Classification [relational document classification] Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set because the citations provide relations among papers.
Cora Information Extraction [information extraction] Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
Frequently Asked Questions [information extraction] Several UseNet FAQs segmented into questions and answers. Data gathered and labeled by Dayne Freitag and Andrew McCallum.
CMU Seminar Announcements [information extraction] 48 emailed seminar announcements, with labeled segments for speaker, title, start-time, end-time. Labeled by Dayne Freitag.
Industry Sector [document classification] Corporate web pages classified into a topic hierarchy with about 70 leaves.
20 Newsgroups [document classification] About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.
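The SRAA entry above notes that the same four discussion groups support two different binary classification tasks. A minimal sketch of that relabeling follows. The group identifiers (`sim_auto`, `sim_aviation`, `real_auto`, `real_aviation`) and the helper names are hypothetical, chosen for illustration; they may not match the names used in the actual data distribution.

```python
# Hypothetical group identifiers; the actual directory or newsgroup
# names in the SRAA distribution may differ.
GROUPS = ("sim_auto", "sim_aviation", "real_auto", "real_aviation")


def real_vs_simulated(group: str) -> str:
    """Label a group by whether it discusses real or simulated vehicles."""
    return group.split("_", 1)[0]  # "real" or "sim"


def auto_vs_aviation(group: str) -> str:
    """Label the same group by vehicle type instead."""
    return group.split("_", 1)[1]  # "auto" or "aviation"


def relabel(groups, labeler):
    """Map each group name to its binary label under the chosen task."""
    return {g: labeler(g) for g in groups}


# Two binary labelings of the same four groups.
by_reality = relabel(GROUPS, real_vs_simulated)
by_vehicle = relabel(GROUPS, auto_vs_aviation)
```

Each labeling collapses the four groups into two classes, so the same documents can serve either task without any change to the underlying text.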