Declarative Multilingual Information Extraction with SystemT and Polyglot

Presented by Laura Chiticariu and Alan Akbik

Thursday, November 17, 2016
11:00 a.m.
ICSI Lecture Hall

Abstract:

Information extraction (IE), the task of extracting structured information from unstructured or semi-structured data, is increasingly important to a wide array of enterprise applications, ranging from Business Intelligence to Data-as-a-Service.

In the first part of the talk, we give an overview of SystemT, a declarative IE system designed and developed to address the requirements driven by modern applications: scalability, expressivity, and transparency. SystemT is based on the basic principle underlying relational database technology: complete separation of specification from execution. SystemT uses AQL, a declarative language for expressing NLP algorithms, and an optimizer that generates high-performance algebraic execution plans for AQL rules. It makes IE orders of magnitude more scalable and easier to use, maintain, and customize. Today, SystemT ships with multiple products across 4 IBM Software Brands and is being taught in universities. Our ongoing research and development efforts focus on making SystemT more usable for both technical and business users, and on continuing to enhance its core functionalities based on natural language processing, machine learning, and database technology.

In the second part of the talk, we present POLYGLOT, a multilingual semantic role labeling (SRL) system capable of semantically parsing sentences in 9 different languages from 4 different language groups. The key feature of the system is that it treats the semantic labels of the English Proposition Bank (PropBank) as “universal semantic labels”: given a sentence in any of the supported languages, POLYGLOT will predict the appropriate English PropBank frame and role annotations. We illustrate how these universal semantic labels can be used within SystemT to create information extractors that immediately work across different languages. In addition, we illustrate how we automatically generate Proposition Banks for new languages in order to enable multilingual SRL, and we discuss some challenges of crosslingual semantics.

Speaker Bios:

Laura Chiticariu is a Research Staff Member in the Scalable Natural Language Processing group at IBM Research-Almaden. Her primary research is in Database Systems and Natural Language Processing. Laura joined IBM Research after obtaining her Ph.D. in Computer Science from the University of California, Santa Cruz in 2008. Her current work focuses on building development support for information extraction systems, utilizing a range of techniques including data provenance, information integration, and machine learning.

Alan Akbik is a postdoctoral researcher in the Scalable Natural Language Processing group at IBM Research-Almaden. He joined IBM Research earlier this year, after obtaining his Ph.D. in Computer Science from the Berlin Institute of Technology. His current research focuses on "universal" Semantic Role Labeling, i.e., enabling the shallow semantic parsing of textual data from many different languages into a shared, language-independent representation of semantics. He pursues this research to enable text analytics on multilingual data, for applications ranging from Information Extraction to Question Answering.