Corpus Linguistics and the Semantic Web

Youn Noh
Department of Information Studies
University of California, Los Angeles
IS 277 -- Term Paper
Dr. Philip Agre
June 15, 2004

Table of Contents
  1. Introduction
  2. Corpus Linguistics
  3. Domain Ontologies
     3.1 Syntax and Morphology
     3.2 Semantics
     3.3 Phonetics and Phonology
  4. Document Ontologies
  5. Service Ontologies
  6. Metadata Ontologies
  7. Research Implications
  8. References

List of Tables
  1. Major Suppliers of Electronic Corpora
  2. A Comparison of Tag Sets
  3. Features of the Word Class Adjective in the ICE Tag Set
  4. Software Tools for Corpus Research

1. Introduction

Tim Berners-Lee defines the Semantic Web as a web of machine-readable information whose meaning is well-defined by standards [Berners-Lee 2003]. Semantic Web activity, led by the World Wide Web Consortium [W3C], builds upon existing standards (e.g., [URI], [Unicode], [XML]) and conventions of practice (e.g., the use of namespaces in programming languages) to define the syntax and semantics of data structures for Web applications. The foundational data model is the Resource Description Framework [RDF], which provides a means of making statements about the properties of resources that may be identified on the Web. An RDF statement describes the attributes of a resource or expresses its relationship to other resources. RDF Schema [RDFS] extends RDF with the means of representing simple set-theoretic concepts (e.g., set membership, the subset relation, domain and range restrictions on properties, inheritance of properties) and creating user-defined types. The Web Ontology Language [OWL] extends RDFS with the means of representing properties of properties (transitivity, symmetry, inverse, functions, local domain and range restrictions, cardinality, specification of a set), relations between resources (equivalence or identity, disjointness or difference), and operations on sets (union, intersection, complement) to support logical inference.
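
To make the data model concrete, the following sketch builds two such statements with the Python rdflib library; the URIs and property names are hypothetical, invented for illustration:

    from rdflib import Graph, Literal, Namespace, URIRef

    g = Graph()
    ex = Namespace("http://example.org/terms/")           # hypothetical vocabulary
    corpus = URIRef("http://example.org/corpora/sample")  # the resource described

    # One statement describes an attribute of the resource...
    g.add((corpus, ex.wordCount, Literal(1000000)))
    # ...another expresses its relationship to a second resource.
    g.add((corpus, ex.derivedFrom, URIRef("http://example.org/texts/source")))

    print(g.serialize(format="xml"))  # the same two statements in RDF/XML syntax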

The purpose of Semantic Web activity is to provide languages for communities to create ontologies that represent the information that they wish to make available on the Web. Typically, the ontologies will include representations of the domain of interest, the distinctive document genres of the domain, metadata for tracking documents or ontologies, and technical requirements for accessing documents in Web applications [Agre 2004]. Communities can use the ontologies to create structured Web documents with markup that makes assertions about the document, the data it contains, and its intended use. Computer programs can then use the markup to handle the documents or data in a more meaningful way.

Semantic Web languages are layered to support phased development. XML provides advantages over HTML by allowing users to define their own elements and element attributes by means of a document type definition (DTD). A DTD provides a template for a particular kind of information structure. In a DTD, users may specify valid elements, the contents of these elements, and which attributes may modify an element. The intended use of an element may usually be inferred by a human reader from its identifier, but this information is not explicitly represented in its DTD. The correct handling of elements must be built into tools used to interpret or translate XML documents. There is no way for software tools to acquire this knowledge from the DTD itself. Furthermore, in order to exchange XML documents, the parties involved must agree upon a DTD or define mappings between elements in distinct DTDs, possibly with some loss of data.
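
For example, the following sketch defines and applies a small DTD using the lxml library; the element names are hypothetical. The validator confirms that a document matches the template, but the meaning of the pos attribute is nowhere represented in the DTD itself:

    from io import StringIO
    from lxml import etree

    # A template for a tiny information structure: texts contain sentences,
    # sentences contain words, and a word may carry a 'pos' attribute.
    dtd = etree.DTD(StringIO("""
    <!ELEMENT text (s+)>
    <!ELEMENT s (w+)>
    <!ELEMENT w (#PCDATA)>
    <!ATTLIST w pos CDATA #IMPLIED>
    """))

    doc = etree.XML('<text><s><w pos="AJ0">happy</w><w pos="NN1">child</w></s></text>')
    print(dtd.validate(doc))  # True: the structure conforms, but what 'pos'
                              # means is left to the tools that read it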

RDF takes some preliminary steps toward representing, in machine-processible form, the real world knowledge that elements are intended to capture. In plain XML, the constraints on attribute values are at the level of character data. In RDF, attribute values are specified using URIs, which represent things such as Web pages, people, or places. RDF statements attempt to make machine-processible assertions about the real world. Consequently, an RDF statement may fail for syntactic or semantic reasons: the statement may not adhere to RDF/XML syntax, or the URI given as an attribute value may not identify any resource. RDF achieves some level of interoperability by using a consistent syntax for making statements, so that agreement upon a DTD is unnecessary. It still relies, however, upon implicit agreement on the meaning of elements. RDF/XML syntax provides a convenient means of leveraging this implicit knowledge through namespaces: elements defined by different communities may be reused by means of namespace declarations. It is up to document creators to use elements appropriately or to work out discrepancies.
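
The following sketch, again with rdflib, mixes a locally defined vocabulary with the widely reused Dublin Core namespace (the ex names are hypothetical). The serializer emits the namespace declarations automatically, but nothing checks that the borrowed element is used as its creators intended:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC  # Dublin Core: elements defined elsewhere

    g = Graph()
    ex = Namespace("http://example.org/terms/")  # a community's own elements
    g.bind("dc", DC)
    g.bind("ex", ex)

    doc = URIRef("http://example.org/docs/1")
    g.add((doc, DC.creator, Literal("Youn Noh")))    # element reused via namespace
    g.add((doc, ex.reviewStatus, Literal("draft")))  # locally defined element

    print(g.serialize(format="xml"))  # both namespace declarations appear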

RDFS and OWL make more real world knowledge accessible to machines by providing means to represent inheritance and other semantic relationships. Automated reasoning tools can use the semantic relationships captured in an ontology to process data more efficiently and effectively. Statistical methods may be used with Semantic Web technologies to analyze structured data in applications such as information extraction.

Assuming that a critical mass of users exists in a given domain and that the benefit of standardization is constant across domains, the level of investment will determine the level of benefit. Communities that do not have the resources for ontology development still gain access to tools by having their data in a standard format, even if the meaning of the representation is not fully specified. Communities that develop ontologies can make better use of their resources by having their data represented in a machine-understandable form. It is questionable whether the same principle applies with respect to interoperability. The effort required to agree upon an ontology is considerably greater than that required to agree upon a DTD. The level of benefit is likely to depend upon a number of factors, such as the complexity of the data, the complexity of the operations defined on the data, the nature of the representations needed to support these operations, and the homogeneity of the community or the degree of similarity between different communities sharing the ontology.

This paper examines the application of Semantic Web technologies and principles to the domain of corpus linguistics. It attempts to answer the following questions:

  1. Do Semantic Web technologies make it easier to do research or improve the quality of research?
  2. Do Semantic Web principles apply to the design of linguistic annotation schemes and languages for encoding them?
  3. What makes it easy or difficult for communities of practice, other than the W3C, to become involved in Semantic Web activity?

Semantic Web technologies are the languages and specifications for creating and using documents and ontologies for Web applications. The [W3C Technical Reports] page gives an indication of their scope. Semantic Web principles reflect the worldview of [W3C Members] in the context of the Web. The Web community has traditionally valued autonomy and open access. While holding to these values, the Semantic Web community is concerned with the design of an intelligent, elegant, efficient infrastructure for the Web. The idea is not to enforce conformity but to make technologies available for communities to create their own standardized tools for collaboration.

After providing a brief introduction to corpus linguistics in section 2, I will discuss the four types of ontologies mentioned above. Section 3 on domain ontologies reviews projects and types of annotation of interest to linguists working in a selection of research areas: syntax and morphology (section 3.1), semantics (section 3.2), and phonetics and phonology (section 3.3). Section 4 on document ontologies examines the views of the data and the types of display needed to support research. Section 5 on service ontologies takes a closer look at XML applications in this domain. Section 6 on metadata ontologies describes markup for tracking the creation and maintenance of a corpus, including any annotations. The paper concludes with a discussion of the research implications of the Semantic Web for corpus linguistics.

2. Corpus Linguistics

The basic aim of linguistic theory is to provide a principled account of the rules that generate natural language expressions. Two assumptions of this approach are that language is rule-based and that the rules for generating language have psychological reality in the minds of speakers. A distinction is made between linguistic production (e.g., speech disfluencies) and underlying linguistic competence. There are two objections to this approach:

  1. Variation in linguistic production is not random but can be used to characterize dialects or idiolects.
  2. The analysis of linguistic production can provide insight into the cognitive processes involved.

Sociolinguists and psycholinguists do not contest the hypothesis that language is rule-based but have a different perspective on the role of empirical research in linguistics. This is where corpora come in.

A corpus, in the linguist's sense, is a body of written or spoken material upon which a linguistic analysis is based [OED 2]. A corpus may include anything from all the documents produced by a company in the conduct of its affairs to a field linguist's recordings of a native speaker of an endangered language. In recent years, computers and software have made corpus work possible on a scale previously unimagined. Some of the main organizations that distribute linguistic corpora are listed below, along with their contact URLs:

Table 1. Major Suppliers of Electronic Corpora
Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/
European Language Resources Association (ELRA) http://www.elra.info/
International Computer Archive of Modern English (ICAME) http://helmer.hit.uib.no/icame.html
British National Corpus (BNC) http://www.natcorp.ox.ac.uk/
International Corpus of English (ICE) http://www.ucl.ac.uk/english-usage/ice/
Child Language Data Exchange System (CHILDES) http://childes.psy.cmu.edu/

Readers may also wish to consult the web site maintained by Michael Barlow at Rice University for links to different types of corpora in multiple languages and to other resources for corpus linguistics: http://www.ruf.rice.edu/~barlow/corpus.html.

Corpus research has been conducted in different areas of linguistic inquiry to answer questions such as the following [Aston & Burnard 1998]:

Lexical information:
How often does a particular word form appear in a corpus? Is it more or less common than some other variant? Is 'start' more common than 'begin'?
What are the collocation (co-occurrence) patterns observed for a particular word form? How often is 'ass' immediately preceded by 'you silly'? (due to Firth, p. 13)
Morphosyntactic information:
How frequent is a particular morphological form or grammatical structure? How common is recursive central embedding in sentences such as 'the mouse the cat the dog chased caught squeaked'? (examined by Sampson, p. 6)
How often does a particular structure appear in a particular type of text? Are passives more common in scientific texts?
Semantic information:
What fields of metaphor are employed in economic discourse?
Pragmatic information:
How do speakers close conversations, or open lectures?
Phonetic and phonological information:
What is the phonetic environment for t-deletion for native English speakers from Edinburgh?
What phonological rules of standard American English are most difficult for native French speakers to learn?

Patterns of linguistic production and usage observed in corpora provide linguists with the means of formulating, refining, and testing theories. The insights gained from corpus research are not necessarily conclusive, but they can clear up misconceptions and generally provide a useful check on the linguist's intuitions.

In working with a corpus, there are two major criteria that determine the validity of results obtained through statistical analysis: the sample size and the representativeness of the sample. Since the distribution of word forms in a corpus generally follows Zipf's law, any study based on patterns of occurrence of word forms or larger linguistic units will face the problem of low counts. Increasing corpus size can improve sample size, often at the cost of less detailed analyses. Automatic and semi-automatic means for annotating corpora and processing queries are active areas of research, but much work remains to be done. The question of representativeness arises because researchers often work with a corpus in a domain of interest without participating in its construction. The criteria according to which material was selected for inclusion in the corpus might not completely match the criteria for texts characteristic of the researcher's domain of interest. In that case, two options are available:

  1. Create a subcorpus that meets the researcher's criteria for texts in the domain of interest.
  2. Use more training data to determine the types of text the corpus represents and whether the results obtained transfer to the domain of interest.
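
To make the low-count problem concrete, a rank-frequency table of word forms takes only a few lines of Python (toy data; in a real corpus, the long tail of forms occurring once or twice is far more extreme):

    from collections import Counter

    tokens = "the cat saw the dog and the dog saw the mouse".split()
    freqs = Counter(tokens)

    # Rank-frequency listing: under Zipf's law, frequency falls off roughly
    # as 1/rank, so most word forms occur only once or twice.
    for rank, (word, count) in enumerate(freqs.most_common(), start=1):
        print(rank, word, count)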

The other major consideration in corpus design is how the texts are to be encoded, that is, how features of the text are to be made explicit for processing and analysis by humans or by machines. However, in the words of Henry S. Thompson, since "linguistic analysis is annotation," encoding will be addressed in the following section on domain ontologies [Thompson & Carletta 2001]. Other important aspects of encoding such as normalization of white space, tokenization, and segmentation are beyond the scope of this paper.

3. Domain Ontologies

[Gruber 1993] introduces a definition of ontology commonly cited in the literature on the Semantic Web: "An ontology is a formal explicit specification of a shared conceptualization." Of course, ontologies predate the Semantic Web. The Oxford English Dictionary provides the following definition: "The science or study of being; that department of metaphysics which relates to the being or essence of things, or to being in the abstract" [OED 2]. Both definitions assume a level of abstraction, but for different purposes. In the context of the Semantic Web, abstraction is used to model the relevant concepts of some phenomenon, in other words, to make the ontology as general as possible [Fensel et al. 2003]. The essential characteristics of a phenomenon are those that are useful in a representation. Ontologies are created to support activities. The Semantic Web definition also emphasizes formalization and the capture of consensual knowledge. The former is necessary for the ontology to be machine-readable, the latter to obtain broad-based support from different communities.

Domain ontologies for the annotation of linguistic corpora need to be theoretically motivated. The Semantic Web community advocates a layered approach to ontology engineering. It is possible to create upper level ontologies that reflect shared paradigms and lower level ontologies that express different theoretical perspectives. In this section, I will review the types and uses of annotation in different research areas. The basic purpose of annotation is to serve as input to further analysis. Any structure induced on a text makes it easier to learn more. The most basic type of annotation is for word, sentence, or paragraph boundaries. Most annotation schemes go beyond tokenization and segmentation to represent some level of linguistic analysis. These are discussed below.

3.1 Syntax and Morphology

A common first step of analysis in preparing a text corpus is to tag each word for its grammatical category or part of speech. Modern electronic corpora are tagged automatically, although in the early days, the output of the tagger had to be manually corrected, a process that could take several years. No standard tag set for encoding part of speech has yet been adopted. Historically, annotation schemes were developed on an ad hoc basis for individual corpora as the need arose and were generally not reused for other corpora. Given the effort that goes into creating a corpus, many of these ad hoc systems are still in use. The absence of standardization makes it inconvenient to conduct studies with material drawn from different corpora. Mappings have to be defined between tag sets.

Tag sets from widely used corpora are the best candidates for standards. The [Brown Corpus], the first modern, machine-readable corpus, consists of just over a million words of written American English from 1961; it was compiled by W. Nelson Francis and Henry Kucera and is documented in [Francis & Kucera 1964]. Its tag set was copied for the [LOB Corpus] (Lancaster-Oslo/Bergen Corpus) and has subsequently been modified for use in various corpora, such as the [Penn Treebank Project]. The modifications, however, are not consistent with one another, and the original tag set is somewhat outdated. More recently, the CLAWS5 tag set, used to tag the BNC [Table 1], has been widely adopted and forms the basis of [CES], the Corpus Encoding Standard recommended by [EAGLES], the European Union's Expert Advisory Group on Language Engineering Standards [Aston & Burnard 1998]. CES, an SGML application, and XCES, the XML version, are still under development.

A comparison of adjective tags from the CLAWS5, Brown, and Penn tag sets illustrates some of the differences, which are discussed below:

Table 2. A Comparison of Tag Sets
Category Examples CLAWS5 Brown Penn
Adjective happy, bad AJ0 JJ JJ
Adjective, comparative happier, worse AJC JJR JJR
Adjective, superlative happiest, worst AJS JJT JJS
Adjective, superlative, semantically chief, top AJ0 JJS JJ
Adjective, ordinal number sixth, 72nd, last ORD OD JJ
Adjective, cardinal number 3, fifteen CRD CD CD
Source: [Manning & Schütze 1999], p. 141

Tag sets differ in a number of respects. They are encoded using different notational formats. In terms of the information they represent, the greatest source of difference is overall size, but since tag sets draw their distinctions in different areas, a larger set will not necessarily make more fine-grained distinctions with respect to the phenomenon of interest. Tag sets also differ in which distinctions they choose to make. Some schemes use morphological criteria, whereby words with common affixes are grouped together; others use syntactic criteria based upon a word's combinatorial or distributional properties. (The latter strategy is probably better for English, which does not have very productive inflectional morphology and in which different (combinations of) morphological features are frequently represented by a single morpheme.) Finally, the mapping between notation and representation is not always straightforward. For example, in the [Brown Corpus], contractions are represented using combined tags -- two tags joined with a plus sign -- whereas the recent trend has been towards dividing such graphic words into two for the purposes of tagging [Manning & Schütze 1999].
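
To make the mapping problem concrete, the adjective correspondences in Table 2 can be written down directly as a partial, illustrative CLAWS5-to-Penn mapping. Information is lost wherever one scheme draws a distinction the other does not:

    # Partial CLAWS5 -> Penn mapping, read off the adjective rows of Table 2.
    claws5_to_penn = {
        "AJ0": "JJ",   # happy, bad; also semantically superlative: chief, top
        "AJC": "JJR",  # happier, worse
        "AJS": "JJS",  # happiest, worst
        "ORD": "JJ",   # sixth, 72nd, last (Brown keeps a separate OD tag)
        "CRD": "CD",   # 3, fifteen
    }

    def retag(tagged):
        """Map (word, CLAWS5 tag) pairs to (word, Penn tag) pairs."""
        return [(word, claws5_to_penn.get(tag, tag)) for word, tag in tagged]

    print(retag([("happier", "AJC"), ("sixth", "ORD")]))
    # [('happier', 'JJR'), ('sixth', 'JJ')] -- the ordinal distinction is lost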

The different distinctions made by tag sets described above do not reflect different theoretical assumptions so much as different design decisions. [Manning & Schütze 1999] remark, "There has [...] been some apparent development in people's ideas of what to encode. The early tag sets made very fine distinctions in a number of areas such as the treatment of certain sorts of qualifiers and determiners that were relevant to only a few words, albeit common ones. More recent tag sets have generally made fewer distinctions in such areas" (p. 144).

It is possible for tags to represent more structured information. The ICE [Table 1] tag set takes this approach [Nelson, Wallis, & Aarts 2002]. Each lexical item is assigned a word class tag. Each word class is further distinguished by zero or more features. The features can take different values. For example, the word class adjective, tagged ADJ, is further distinguished by a morphology feature and by an optional comparison feature:

Table 3. Features of the Word Class Adjective in the ICE Tag Set
Feature Feature Value Code Example Tag
morphology general ge happy ADJ(ge)
-ed participle edp fluted ADJ(edp)
-ing participle ingp sweeping ADJ(ingp)
comparison comparative comp happier ADJ(ge,comp)
superlative sup happiest ADJ(ge,sup)
Source: [Nelson, Wallis, & Aarts 2002], p. 25 (column labels modified and examples added)

The comparison feature only applies to adjectives whose morphology feature has value general, but this is not represented in the annotation scheme.

Structured tags could also be used to represent the constructs of a syntactic theory. For example, the basic construct of [HPSG] (Head Driven Phrase Structure Grammar) is the lexical head, a typed feature structure that represents either a word or its phrasal projection. Heads are labeled with part of speech information, and additional features are used to encode arguments (e.g., the object of a verb), modifiers, and semantic information that bears upon the syntactic properties of the word or phrase (e.g., that the subject of the verb 'run' must be animate).

Simple and structured part of speech tags could easily be represented in XML, and RDF triples could be used for tagging. For example, ICE word classes could be represented by elements, and word class features by element attributes. The fact that the morphology feature must have the value general in order for the comparison feature to be defined could be captured by declaring three subclasses of the word class adjective and making the comparison feature an attribute of the subclass general but not of the others. (On the other hand, there may not be any benefit to doing this.) Lower level ontologies of highly structured tags could represent relationships between the constructs of a syntactic theory, such as the type hierarchy of HPSG (e.g., that the type auxiliary verb is a subtype of verb). Constraints on the presence of features (i.e., attributes) and their values, formalized in HPSG's type hierarchy, could also be represented in an ontology.
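
A sketch of this design in RDF Schema terms, using rdflib (all URIs here are hypothetical):

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS

    ice = Namespace("http://example.org/ice-tags#")  # hypothetical namespace
    g = Graph()

    # One subclass of the word class adjective per morphology value.
    for sub in ("GeneralAdjective", "EdParticipleAdjective",
                "IngParticipleAdjective"):
        g.add((ice[sub], RDFS.subClassOf, ice.Adjective))

    # Restricting the domain of 'comparison' to general adjectives captures
    # the constraint that only ADJ(ge) items take comp/sup values.
    g.add((ice.comparison, RDFS.domain, ice.GeneralAdjective))

    # An instance: 'happier' is ADJ(ge,comp).
    g.add((ice.happier, RDF.type, ice.GeneralAdjective))
    g.add((ice.happier, ice.comparison, ice.comp))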

3.2 Semantics

Most of the corpus-based research in semantics has been at the lexical level, primarily because syntactic parsing has not been robust enough to support compositional semantic analyses that build upon syntactic structure. Two notable efforts are [WordNet] and [FrameNet].

WordNet is an electronic dictionary of English developed in the Cognitive Science Laboratory at Princeton University under the direction of George A. Miller since 1985. The dictionary is represented as a network. The nodes of the network are lists of synonyms called synsets. Each synset represents a meaning. That meaning is the overlap in the meanings of the words contained in the synset plus anything that sets it apart from related synsets in the network. The network is partitioned according to part of speech. Different relations are defined for each part of speech category: hyponymy for nouns, antonymy for adjectives, implication for verbs, etc. WordNet is more than a dictionary since the network establishes relations between meanings, not lexical items. The conceptual nature of the network is apparent in the use of WordNet categories as semantic tags for corpus annotation in applications such as query expansion in information retrieval [Voorhees 1994].

WordNet synsets are used as training data in statistically based corpus work, such as word sense disambiguation. WordNet synsets can be used to seed a learning algorithm to develop a domain-specific ontology from a specialized corpus. An example of this type of research is [Cucchiarelli & Velardi 1998]. Although words may have specialized meanings within a domain that do not agree with their conventional usage, starting from pre-established categories reduces ambiguity and makes the learning task easier. Given the large number of research projects that use WordNet, its high-level category labels could become a de facto standard for semantic annotation. The relations defined between more abstract synsets in the network could serve as an upper level ontology in many domains besides linguistics.

FrameNet is a lexicon-building project at the University of California, Berkeley, based upon Charles Fillmore's Case Grammar [Fillmore 1968], a theory of case relations, or in the contemporary linguistics literature, thematic roles. Thematic roles indicate the way in which entities participate in, or are related to, the event or state described by a sentence. Proposed thematic roles include agent, theme, goal, and instrument. Fillmore's original hypothesis was that there are a finite number of thematic roles and that predicates are universally specified in terms of the thematic roles associated with their arguments. Transformational rules such as passivization account for the realization of thematic roles on different syntactic arguments. Since Case Grammar was introduced, linguistic theory has become more surface-oriented (i.e., transformational rules are deprecated) and lexically based. Reflecting this paradigm shift, the FrameNet project takes an inductive approach, attempting to characterize the thematic roles of individual lexical items -- now frames (the term for case relations in the artificial intelligence community in which Fillmore's ideas were highly influential) -- and to detect patterns that define lexical classes. An ontology for representing frame-to-frame relations, such as inheritance, is under development.

Unlike WordNet, FrameNet takes constituents, rather than words, as the basis of annotation. The primary form of annotation is relative to the syntactic governor of a frame. Syntactic dependents within the frame are annotated with the following types of information: Frame Element (FE), Grammatical Function (GF), and Phrasal Type (PT). The annotations need to be layered because semantic arguments may be realized on different syntactic arguments, and different phrasal types may fill the same grammatical function. Below is a FrameNet-style annotation for 'give'; note the use of FE labels specific to the verb rather than generic labels such as agent. Since GF and PT are not tagged in the examples provided with the published annotation, the example shows all three levels of annotation:

He gives money to local charities.

     He           money        to local charities
FE   Donor        Theme        Recipient
GF   Subject      Object       Indirect Object
PT   Noun Phrase  Noun Phrase  Prepositional Phrase
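
The layering lends itself to a simple data structure. The following sketch is illustrative only and is not FrameNet's actual format, which is a proprietary XML:

    # One annotated sentence; each dependent of the target carries three
    # aligned labels: Frame Element, Grammatical Function, Phrasal Type.
    annotation = {
        "sentence": "He gives money to local charities.",
        "target": "gives",
        "layers": [
            ("He",                 "Donor",     "Subject",         "Noun Phrase"),
            ("money",              "Theme",     "Object",          "Noun Phrase"),
            ("to local charities", "Recipient", "Indirect Object",
             "Prepositional Phrase"),
        ],
    }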

Unlike WordNet, FrameNet is founded upon corpus attestations. Sentences containing the target word are extracted from corpora -- so far, the BNC and the LDC North American Newswire Corpora in American English [Table 1] -- and frames are manually identified and annotated. Tagging for GF and PT is automatic. FrameNet data are written in a proprietary XML format. A conversion to OWL has been written by Srini Narayanan and was presented at the Second International Semantic Web Conference [Narayanan et al. 2003]. A project to convert FrameNet data to OWL using [XSLT] is reported in [Burchardt 2004].

FrameNet data have been used for research in lexicography and as input to machine learning algorithms to develop automatic and semi-automatic taggers for the annotation of frame elements. An atheoretical approach to classification based upon statistical natural language processing techniques is reported in [Gildea & Jurafsky 2002]. In addition to supporting the development of automatic taggers for the FrameNet project, the study compares the usefulness of different features and feature combinations for the annotation of frame elements. [Frank & Erk 2004] present a system to develop semi-automatic taggers based upon machine learning techniques that use combined syntactic and semantic representations as input. The system architecture is intended to support semi-automatic semantic annotation for German in [SALSA] (Saarbrücken Semantics Annotation and Analysis Project). The first step is to develop semi-automatic taggers for frame elements using the [TIGER Corpus]. The TIGER Corpus is a large syntactically annotated corpus of German with a surface-oriented, theory neutral annotation scheme. The second step is to represent associations between syntactic constituents and frame elements using [LFG] (Lexical Functional Grammar). Lexical Functional Grammar has been used to analyze free word order languages, such as Warlpiri, which require more than one level of representation for mapping between word order and syntactic structure. Different levels of representation can also be used to represent mappings from syntactic to semantic arguments. LFG representations that show the associations between syntactic arguments and frame elements induced in the first training task serve as input to the third step in which learning algorithms are applied to the combined representations. The basic idea is that any generalizations about these associations captured in the first training task will improve performance in the third step, and in any subsequent iterations. The entire process is interactive, with human annotators accepting or correcting assignments proposed by the system.

3.3 Phonetics and Phonology

Corpus research in phonetics and phonology differs from corpus research in syntax and semantics in a number of ways. First of all, the environment for research is different in that commercial and industrial applications are already well-defined. Text to speech (TTS) synthesis systems are used in numerous commercial applications. Less progress has been made in the development of speech recognition systems, but it is an extremely active area of research. Recognition is a much harder task than generation. It presupposes a model of our ability to normalize variation in the speech signal (e.g., according to age, sex, or dialect) and our ability to tune in to certain linguistic or extralinguistic cues to identify components of a signal in noisy conditions. The second major difference is that phoneticians and phonologists work with audio files, which have different hardware and software requirements and are generally more difficult to manipulate than text files. The third distinction is that transcription and annotation are performed by humans rather than by machines. To date, the focus in speech corpus research has been on developing tools for making the transcription and annotation task easier rather than on developing automatic processing techniques.

Orthographic transcription is not properly within the domain of linguistics but is a preliminary step in the creation of a corpus, analogous to the tokenization and segmentation of text corpora. The [DARPA Broadcast News] transcription task, started in 1995, was the first wide scale effort to create corpora from spontaneous speech not produced in an experimental setting [Barras et al. 2001]. The corpora were intended for use in the development of automatic speech recognition systems for the indexing and retrieval of Broadcast News in several languages. The orthographic transcription task sets the baseline data modeling requirements for linguistic annotation. These are listed below:

  1. Transcription
  2. Representation of speaker turns
  3. Representation of overlapping speech
  4. Representation of background conditions
  5. Representation of the unique characteristics of spoken, as opposed to written, language (e.g., disfluencies, fragments, discourse markers, etc.)

Each level of representation must be time-aligned to the audio file. Linking between levels is insufficient since annotations may be modified.
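
A minimal sketch of what these requirements imply for the data model: each annotation carries its own time anchors into the audio file, so overlapping speech and background conditions are simply intervals that intersect (the field names are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class Annotation:
        start: float  # seconds from the start of the audio file
        end: float
        tier: str     # e.g. "transcription", "speaker-turn", "background"
        label: str

    annotations = [
        Annotation(0.00, 2.35, "speaker-turn", "speaker A"),
        Annotation(2.10, 4.80, "speaker-turn", "speaker B"),   # overlapping speech
        Annotation(0.00, 4.80, "background", "street noise"),  # background condition
    ]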

There are two basic levels of representation of speech in phonetics and phonology: segmentals and suprasegmentals. The segmental elements of speech are the individual consonants and vowels, or phonemes, of a language. Suprasegmentals are those features that apply to a sequence of phonemes, such as syllables, words, or sentences. Suprasegmental features include intonation, stress, tone, and length. Tone and length are contrastive in some languages, such as Mandarin Chinese and Czech, respectively, but not in English. Contrastive sounds are those that change the meaning of a word (e.g., 'pit' vs. 'pet'). Intonation is the most commonly studied suprasegmental feature in English and the only suprasegmental feature that will be discussed in this paper. The categories required to represent segmental features are independent of the categories required to represent suprasegmental features. Suprasegmental features convey syntactic, semantic, and pragmatic information. Segmental features convey information about morphology but do not directly contribute to other levels of analysis. Phonological rules for the distribution of phonemes or morphemes in a language do not interact with the syntactic, semantic, or pragmatic components of a grammar. In other words, the sequence of sounds in a word is arbitrary.

Segmental features are represented with a phonetic or phonemic transcription. Only contrastive sounds are represented in a phonemic, or broad, transcription. Phonemes are represented using the [IPA] (International Phonetic Alphabet). Noncontrastive phonetic detail is included in a phonetic, or narrow, transcription. The IPA includes diacritics for representing phonetic features. In addition to the transcription, a gloss, or word by word translation, is provided. The gloss will generally not read as a fluent sentence, so a free sentence-level translation may also be provided. The gloss may serve as input to further morphological, syntactic, semantic, or pragmatic analyses. To summarize, a segmental representation includes the following:

  1. A phonemic or phonetic transcription
  2. A word by word gloss
  3. A free sentence-level translation

The transcription is time-aligned to the speech file and to a spectrogram showing changes in the fundamental frequency and resonant frequencies, or formants, over time. Transcription is based upon information in the speech file and the spectrogram. The transcription need not be associated with annotation at other levels of analysis.
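
A sketch of such a record for a short French utterance ('le chat dort'), with the transcription and gloss aligned word by word; in practice, each word would also carry time anchors into the speech file:

    # The three components of a segmental representation listed above.
    segmental = {
        "transcription": ["lə", "ʃa", "dɔʁ"],          # phonemic (IPA)
        "gloss":         ["the", "cat", "sleep.3SG"],  # word by word
        "translation":   "The cat is sleeping.",       # free sentence-level
    }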

[ToBI] (Tones and Break Indices) is a common system for representing intonation. A ToBI transcription is represented on four tiers:

  1. An orthographic tier: a word by word transcription of the utterance
  2. A tone tier: a representation of pitch events associated with phrases or accented syllables
  3. A break-index tier: an indication of the degree of juncture between phrases, with integer values ranging from 0 to 4
  4. A miscellaneous tier for comments (e.g., for silence, audible breaths, laughter, disfluencies, etc.)

Tones are built from two basic components: high (H) in the local pitch range versus low (L) in the local pitch range. The transcription is time-aligned to the speech file and to a record of the fundamental frequency contour. Transcription is based upon information in the speech file and the fundamental frequency contour. Since intonation conveys syntactic, semantic, and pragmatic information, intonation labels need to be associated with annotation at those levels of analysis.
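
A sketch of a ToBI record for the standard example sentence 'Marianna made the marmalade' (the time values are hypothetical; H* is a high pitch accent, and L-L% is a low phrase accent followed by a low boundary tone):

    # Four time-aligned tiers, as listed above; the break index 4 marks a
    # full intonation phrase boundary at the end of the utterance.
    tobi = {
        "orthographic": [(0.00, 0.45, "Marianna"), (0.45, 0.70, "made"),
                         (0.70, 0.80, "the"), (0.80, 1.60, "marmalade")],
        "tone":         [(0.25, "H*"), (1.10, "H*"), (1.60, "L-L%")],
        "break_index":  [(0.45, 1), (0.70, 1), (0.80, 1), (1.60, 4)],
        "miscellaneous": [],
    }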

Although phonologists or phoneticians may disagree upon a particular transcription, the categories for representing segmental features are standardized and widely accepted. The IPA, developed by British and French phoneticians under the auspices of the International Phonetic Association, dates from 1886. It is an official standard with nearly universal acceptance in the linguistics community, and it has undergone numerous revisions to make it more representative of the world's languages. The IPA can be represented in Unicode and is the most commonly used alphabet in software tools for linguistic transcription.

The theoretical status of suprasegmental features is more controversial. Hence there is less agreement about their representation. Systems for representing intonation, for example, differ with respect to the number of components used to represent tones and the degree to which fundamental frequency information is specified [Kochanski & Shih 2003]. In phonological theory, tone has traditionally been specified categorically. This is the approach of the ToBI system. Other schools of thought consider different realizations of tone to be continuous variations of a single class. A similar divergence of opinion exists with respect to the degree of specification believed to be necessary to represent linguistically significant changes in fundamental frequency. From a comprehension point of view, an information-rich system is preferable; from a production point of view, a minimal system is preferable. Then again, since intonation transcriptions are used for pitch generation in TTS systems, if the goal is not to model language as a cognitive phenomenon but to solve technical problems, an information-rich system is preferable. A technical solution to the problem of multiple systems for representing intonation is to define an annotation scheme that can be used with different theoretical models. [Kochanski & Shih 2003] present one such model.

4. Document Ontologies

P. Agre (personal communication, April 7, 2004) points out that on the Web, the distinction between documents and services is often blurred. Documents may be created interactively, require software to be rendered, or incorporate multimedia. One possible conceptual difference is that we think of documents as artifacts, however transient, but associate services with activities. Consequently, we might be attentive to the physical properties of documents but expect services to operate smoothly and naturally so that they appear invisible. For the purpose of organizing this paper, I will attempt to observe a distinction between the presentation of data (this section) and its use (section 5).

Research activity will involve working with XML documents. Researchers may also wish to consult documentation, manuals, and other metadata for the corpus; mailing lists for users of the corpus; and similar resources. To present research results, portions of an XML file may be extracted and dumped into a plain text or PostScript file in which annotations may be displayed in a more user-friendly format, e.g., as triples, directed acyclic graphs, predicate logic formulae, trees, or feature structures. Speech files and spectrograms or waveform displays may be associated with the XML files. These may be printed out or rendered by software during live presentations, and spectrograms and waveform displays may be included in a printed text.

Presentation of research results outside the linguistics community may involve more elaborate presentation forms. Possible scenarios include applications for grants, presentations to industry, or classroom demos for undergraduate courses. (At some universities, undergraduate students may be considered a part of the research community and have the opportunity to work with data in their native format.)

Document genres in the domain of linguistics include books; book chapters written for festschrifts, anthologies, handbooks, etc.; scientific articles published in journals or deposited in digital repositories; papers presented at conferences and published in conference proceedings; working papers published by individual departments; web sites or web pages for departments, research centers, research projects, conferences, summer schools, faculty members, etc.; contributions to mailing lists or discussion boards; and squibs or blog entries. Some forms, such as books and scientific articles, are well-established. Digital repositories, whether organized by discipline or institutionally based, are gaining ground as an alternative publishing model. Part of the reason for their success may be that they are based upon a well-established document genre, namely, the scientific article. New document genres such as web sites and blogs that require continual maintenance often die out. Part of the reason for this may be that researchers do not have the time or the support staff to maintain pages after the activities they document are concluded, or even as they are taking place. Cooperative efforts such as team blogs, mailing lists, and discussion boards often have greater success, in part because they do not require as much investment, but also because the payoff is immediate and obvious. In order for web sites to become a productive document genre, it has to become common knowledge that they are one; it has to be taken for granted that others will participate. Many web sites for research projects in this area have not been updated for the past few years. Unless explicit reciprocal relationships exist between universities, the status of a project with an abandoned web site will be unclear to other community members. The overall effect of this might be to diminish the authority or credibility of web-based resources in academic research.

5. Service Ontologies

The overview of domain ontologies in section 3 brought to light a number of requirements for content models that need to be supported by components of a service ontology for corpus linguistics. I will discuss characteristics of the content models, the types of interaction that need to be supported, and conventions of practice for creating and sharing software tools.

Many situations call for the representation of information on multiple tiers, each of which needs to be anchored to the source material. For audio, annotations need to be time-aligned. For text, alignment is by token. Pointers are required to represent associations between elements on different tiers, but these cannot be substituted for anchors to the source material since annotations may be modified. These associations may cross structural boundaries on one or more of the tiers. The scope of annotations on the same tier may overlap, e.g., in the transcription of overlapping speech. It should be possible for annotations to consist of partial information but still be considered well-formed for situations in which data are lost or incomplete from the start. The hierarchical structure of XML is not convenient for representing this type of content. A workaround introduced by the [Language Technology Group] at the University of Edinburgh is to use stand-off annotation, in which each layer of annotation is kept in a separate file and hyperlinks are used as pointers between elements in different files [Isard 2001]. [Thompson & McKelvie 1997] also provide a useful discussion of the motivations behind stand-off annotation.
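
A minimal sketch of the stand-off idea, with the annotation layer kept apart from the tokenized base text (the element names are hypothetical, and bare id references stand in for the hyperlink mechanisms used in practice):

    import xml.etree.ElementTree as ET

    # Base file: the tokenized source text, each token anchored by an id.
    base = ET.fromstring(
        '<text><w id="w1">you</w><w id="w2">silly</w><w id="w3">ass</w></text>')

    # Stand-off layer, kept in a separate file: annotations point at token
    # ids instead of wrapping the text, so overlapping and crossing spans
    # do not violate XML's hierarchical nesting.
    standoff = ET.fromstring(
        '<phrases><np start="w1" end="w3"/><adjp start="w2" end="w2"/></phrases>')

    ids = [w.get("id") for w in base.iter("w")]
    words = {w.get("id"): w.text for w in base.iter("w")}
    for ann in standoff:
        span = ids[ids.index(ann.get("start")): ids.index(ann.get("end")) + 1]
        print(ann.tag, [words[i] for i in span])
    # np ['you', 'silly', 'ass']
    # adjp ['silly']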

In creating, retrieving, editing, or browsing annotated text or speech, it should be possible to take aerial and ground views of the data to improve navigation. Responses to queries should be returned in context. It should be possible to highlight the phenomenon of interest. When there are multiple layers of annotation, it will be useful to be able to hide some layers. When working with more than one layer of annotation or more than one file, e.g., a speech file and a spectrogram, it should be possible to align the layers or files and synchronize the cursor display. Some actual requests from users of Transcriber, a tool for orthographic transcription, include playback speed control, the ability to manage many small signal (i.e., waveform) files rather than a single big file, the ability to manage video in addition to sound, and the autosaving of backup transcriptions. Practical experience will be required to fine tune any particular tool.

A selection of research projects that provide XML-based software tools for use with their corpora are listed below, along with their contact URLs. Also listed are projects to develop XML-based software tools.

Table 4. Software Tools for Corpus Research
Transcriber Orthographic Transcription http://www.ldc.upenn.edu/mirror/Transcriber/en/refFrame.html
LACITO Phonetic Transcription http://lacito.vjf.cnrs.fr/archivage/index.html
MARY Text to Speech Synthesis http://mary.dfki.de/
Language Technology Group Syntactic Annotation http://www.ltg.ed.ac.uk/software/index.html
MATE Multi-level Annotation http://mate.nis.sdu.dk/
NITE Multi-level Annotation http://nite.nis.sdu.dk/

Transcriber was developed for use in the [DARPA Broadcast News] project [Barras et al. 2001]. It is intended for use by nonlinguists but makes use of some of the technologies used in phonetic transcription. Transcriber has potential for use in industrial applications. The goal of the LACITO (Langues et Civilisations à Tradition Orale) Linguistic Archive Project is to conserve and disseminate the oral literature of unwritten languages, giving simultaneous access to sound recordings and text annotations [Jacobson, Michailovsky, & Lowe 2001]. The tools of the project are intended for use by trained linguists. The archive is intended for use in research and teaching. MARY (Modular Architecture for Research on speech sYnthesis) defines a markup language called MaryXML, which provides a low level translation of high level languages such as [SSML], developed primarily for use in industry [Schröder & Trouvain 2003]. Users can interactively modify MaryXML annotations at each stage of the synthesis process to see how the modifications affect the output. MARY is intended for use in research, development, and teaching.

As explained in section 3, syntactic annotation is frequently performed automatically or semi-automatically. Since the data are not handled by humans, the text-based document structure of an XML serialization does not provide any advantage. Corpora with highly structured syntactic annotations may be more conveniently stored in relational databases such as MySQL, which are more compact and which process queries more efficiently. C. Brew (personal communication, June 15, 2004) suggests, "For corpus work, you may want the best of both worlds. For example, you could read information out of an XML document into a relational database then run complex queries on it." [Annotate] is one example of a tool that stores annotated corpora in a relational database. Annotate has a specified interface for communicating with external taggers and parsers.
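
A sketch of Brew's suggestion using only the Python standard library (the element and column names are hypothetical):

    import sqlite3
    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        '<text><s id="s1"><w pos="AJ0">happy</w><w pos="NN1">child</w></s>'
        '<s id="s2"><w pos="AJ0">bad</w><w pos="NN1">dog</w></s></text>')

    # Read the annotations out of the XML document into a relational table...
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE token (sent TEXT, pos TEXT, word TEXT)")
    for s in doc.iter("s"):
        for w in s.iter("w"):
            con.execute("INSERT INTO token VALUES (?, ?, ?)",
                        (s.get("id"), w.get("pos"), w.text))

    # ...then run queries that would otherwise require ad hoc code over raw XML.
    for row in con.execute("SELECT pos, COUNT(*) FROM token GROUP BY pos"):
        print(row)  # ('AJ0', 2), ('NN1', 2)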

An alternative model for collaboration is to define APIs (application programming interfaces) and create software libraries that allow programmers to access data and develop tools suited to their own needs. The Language Technology Group, the MATE project, and its successor NITE take the "toolkit" approach. [Carletta et al. 2003] discuss the philosophy behind it:

Ordinarily, when one thinks of software support, one thinks of end-user tools for coding data and analyzing the results. Although we could easily provide such tools for specific coding schemes -- sets of codes and the permitted relationships among them -- it is impossible to provide a uniform end-user interface for these functions that will always be usable. Different structures require different displays, data entry methods, and analytical techniques. (p. 355)

NXT (NITE XML Toolkit) includes support for handling data, querying them, and building interfaces, as well as sample Java programs demonstrating the use of the libraries and an engine for building limited interfaces from declarative specifications of their appearance and behavior.

6. Metadata Ontologies

A metadata ontology for corpus linguistics should include elements to document corpora, annotations, and software. Documentation of a corpus typically includes a description of the sources of texts or recordings; for speech corpora, a description of the subjects, if available; and other bibliographic information. Documentation of annotations typically includes a description of the annotations and guidelines for their use. Documentation of software typically includes software and hardware requirements; information about downloads, or other instructions for obtaining access; information about licensing; and contact information for support. Documentation for developers may include information about bugs and provide access to patches or software libraries. Mailing lists may also be available to end users and developers. Corpora are generally accompanied by manuals with all three types of documentation.

One possible future scenario for XML-based corpora is distributed annotation and software development, which immediately raises concerns about protecting the integrity of the data. One solution is for institutions to negotiate agreements to share data and tools. Another solution is for annotations to be stored separately from the source material. All three types of metadata discussed above will be necessary to track changes.

7. Research Implications

The application of Semantic Web technologies to the domain of corpus linguistics raises a number of questions:

  1. What is the appropriate level for collaboration?
  2. What is the appropriate level of encoding?

Until now, corpora have been created as individual or team projects and distributed by institutions such as the LDC [Table 1]. The widespread adoption of Semantic Web technologies would make it possible to create distributed corpora, integrating the output of several projects, and to decentralize their distribution. Annotations, software, and experience could be shared in a similar manner.

The model sketched above presupposes some common ground between participants. It is still too early to say how far that common ground extends, but that is something that will need to be explored in order for Semantic Web technologies to be adopted on a wide scale. For example, researchers and developers who work with speech corpora have devoted their efforts to creating end-user software tools. Researchers and developers who work with text corpora have devoted their efforts to developing APIs and software libraries that allow others to create tools tailored to their own needs. The latter model presupposes that users have some expertise in developing software, but this may not always be the case. In order for annotation to become a team project, standardized tag sets need to be established and adopted, or mappings need to be defined between a reasonable number of sets. In order to implement mappings, the logical structure of individual tags and the logical structure of the schemes need to be extensible. Annotation schemes are representations of linguistic theory. If linguistics as a field has shared paradigms, then these can form the basis of upper level ontologies for annotation schemes. However, as P. Agre (personal communication, June 9, 2004) points out, the intermediate layers of an ontology are the most difficult to define. This is where the paradigms of a field are tested.

The Semantic Web reflects the perspective of the artificial intelligence community. Ontology engineering might not come so naturally to other communities. As C. Brew (personal communication, June 15, 2004) remarks, "My feeling is that RDF and OWL are very similar to knowledge representation formalisms which were designed by AI people in the 80s, and those turned out to be harder to use than people imagined. There's a risk that the same thing is going to happen again." If the Semantic Web community creates tools that lower the barrier to participation by hiding complexity, then wide-scale adoption of Semantic Web technologies may be possible. The linguistics community itself is not uniform. Some members of the community may be ready to adopt Semantic Web technologies as they stand. However, in order for the Semantic Web to become a reality, it needs broad-based support.

8. References
[Agre 2004]
Agre, P.E. (2004). Information Retrieval Systems [course page]. Retrieved June 13, 2004, from http://polaris.gseis.ucla.edu/pagre/is277.html
[Annotate]
Annotate. (2000, October 17). Home page. Retrieved June 13, 2004, from http://www.coli.uni-sb.de/sfb378/negra-corpus/annotate.html
[Aston & Burnard 1998]
Aston, G., & Burnard, L. (1998). The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
[Barras et al. 2001]
Barras, C., Geoffrois, E., Wu, Z., & Liberman, M. (2001). Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication, 33, 5-22.
[Berners-Lee 2003]
Berners-Lee, T. (2003). Foreword. In D. Fensel, J. Hendler, H. Lieberman, & W. Wahlster (Eds.), Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential (xi-xxiii). Cambridge, MA: MIT Press.
[Brown Corpus]
Brown Corpus. Available for purchase on CD, from http://nora.hd.uib.no/icame/newcd.htm. Available to UCLA faculty, students, and staff for download, from http://www.sscnet.ucla.edu/issr/da/index/techinfo/M0911.HTM.
[Burchardt 2004]
Burchardt, A. (2004, April 27). Representing FrameNet's Frames in OWL. Retrieved June 13, 2004, from http://www.coli.uni-sb.de/%7Ealbu/wd/frames2owl/
[Carletta et al. 2003]
Carletta, J., Evert, S., Heid, U., Kilgour, J., Robertson, J., & Voorman, H. (2003). The NITE XML Toolkit: Flexible annotation for multimodal language data. Behavior Research Methods, Instruments, & Computers, 35 (3), 353-363.
[CES]
Corpus Encoding Standard (Version 1.5). (2000, March 20). Retrieved June 13, 2004, from http://www.cs.vassar.edu/CES/
[Cucchiarelli & Velardi 1998]
Cucchiarelli, A., & Velardi, P. (1998). Finding a domain-appropriate sense inventory for semantically tagging a corpus. Natural Language Engineering, 4 (4), 325-344.
[DARPA Broadcast News]
National Institute of Standards and Technology. Speech Group. (2000, September 27). 1999 NIST Broadcast News Evaluation. Retrieved June 13, 2004, from http://www.nist.gov/speech/tests/bnr/bnews_99/bnews_99.htm
[EAGLES]
Expert Advisory Group on Language Engineering Standards. Home page. Retrieved June 13, 2004, from http://www.ilc.cnr.it/EAGLES96/home.html
[Fensel et al. 2003]
Fensel, D., Hendler, J., Lieberman, H., & Wahlster, W. (Eds.). (2003). Introduction. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential (1-25). Cambridge, MA: MIT Press.
[Fillmore 1968]
Fillmore, C. J. (1968). The case for case. In E. Bach & R. T. Harms (Eds.), Universals in Linguistic Theory (1-88). New York: Holt, Rinehart, & Winston.
[FrameNet]
FrameNet. Home page. Retrieved June 14, 2004, from http://www.icsi.berkeley.edu/~framenet/
[Francis & Kucera 1964]
Francis, W. N., & Kucera, H. (1964). Brown Corpus Manual: Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Revised 1971. Revised and amplified 1979. Providence, RI: Department of Linguistics, Brown University. Retrieved June 14, 2004, from http://helmer.aksis.uib.no/icame/brown/bcm.html
[Frank & Erk 2004]
Frank, A., & Erk, K. (2004). Towards an LFG syntax-semantics interface for frame semantics annotation. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing: 5th International Conference, CICLing 2004, Seoul, Korea, February 15-21, 2004, Proceedings (1-13). Berlin: Springer-Verlag.
[Gildea & Jurafsky 2002]
Gildea, D., & Jurafsky, D. (2002). Automatic labeling of semantic roles. Computational Linguistics, 28 (3), 245-288.
[Gruber 1993]
Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5, 199-220.
[HPSG]
Head Driven Phrase Structure Grammar (HPSG). Project web site maintained at Stanford University. Retrieved June 14, 2004, from http://hpsg.stanford.edu/
[IPA]
International Phonetic Association. (2004, January 6). International Phonetic Alphabet. Revised to 1993. Updated 1996. Retrieved June 14, 2004, from http://www.arts.gla.ac.uk/IPA/ipa.html
[Isard 2001]
Isard, A. (2001). An XML architecture for the HCRC Map Task Corpus. In P. Kühnlein, H. Rieser, & H. Zeevat (Eds.), Bi-Dialog 2001 (1-8). Retrieved June 14, 2004, from http://www.hcrc.ed.ac.uk/maptask/maptask-papers.html
[Jacobson, Michailovsky, & Lowe 2001]
Jacobson, M., Michailovsky, B., & Lowe, J. B. (2001). Linguistic documents synchronizing sound and text. Speech Communication, 33, 79-96.
[Kochanski & Shih 2003]
Kochanski, G., & Shih, C. (2003). Prosody modeling with soft templates. Speech Communication, 39, 311-352.
[LFG]
Lexical Functional Grammar. (2003, September 23). Project web site maintained at Stanford University. Retrieved June 14, 2004, from http://www-lfg.stanford.edu/lfg/
[Language Technology Group]
University of Edinburgh. Language Technology Group. Home page. Retrieved June 14, 2004, from http://www.ltg.ed.ac.uk/index.html
[LOB Corpus]
Lancaster-Oslo/Bergen Corpus. Available for purchase on CD, from http://nora.hd.uib.no/icame/newcd.htm.
[Manning & Schütze 1999]
Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
[Narayanan et al. 2003]
Narayanan, S., Baker, C. F., Fillmore, C. J., & Petruck, M. R. L. (2003). FrameNet meets the Semantic Web: Lexical semantics for the Web. In D. Fensel, K. Sycara, & J. Mylopoulos (Eds.), The Semantic Web - ISWC 2003: Second International Semantic Web Conference, Sanibel Island, FL, USA, October 20-23, 2003, Proceedings (771-787). Heidelberg: Springer-Verlag.
[Nelson, Wallis, & Aarts 2002]
Nelson, G., Wallis, S., & Aarts, B. (2002). Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins Publishing Company.
[OED 2]
Oxford English Dictionary. 2nd ed. (2002). Oxford, England: Oxford UP.
[OWL]
Smith, M. K., Welty, C., & McGuinness, D. L. (Eds.). (2004, February 10). OWL Web Ontology Language Guide. Retrieved June 15, 2004, from http://www.w3.org/TR/2004/REC-owl-guide-20040210/
[Penn Treebank Project]
Penn Treebank Project. Home page. Retrieved June 15, 2004, from http://www.cis.upenn.edu/~treebank/home.html
[RDF]
Manola, F., & Miller, E. (Eds.). (2004, February 10). RDF Primer. Retrieved June 15, 2004, from http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
[RDFS]
Brickley, D., & Guha, R. V. (Eds.). (2004, February 10). RDF Vocabulary Description Language 1.0: RDF Schema. Retrieved June 15, 2004, from http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
[SALSA]
SALSA - The Saarbrücken Lexical Semantics Annotation and Analysis Project. Home page. Retrieved June 15, 2004, from http://www.coli.uni-sb.de/lexicon/index.phtml
[Schröder & Trouvain 2003]
Schröder, M., & Trouvain, J. (2003). The German text-to-speech synthesis system MARY: A tool for research, development, and teaching. International Journal of Speech Technology, 6, 365-377.
[SSML]
Burnett, D. C., Walker, M. R., & Hunt, A. (Eds.). (2003, December 18). Speech Synthesis Markup Language Version 1.0. Retrieved June 15, 2004, from http://www.w3.org/TR/speech-synthesis/
[Thompson & McKelvie 1997]
Thompson, H. S., & McKelvie, D. (1997, May). Hyperlink semantics for standoff markup of read-only documents. Retrieved June 15, 2004, from http://www.ltg.ed.ac.uk/~ht/sgmleu97.html
[Thompson & Carletta 2001]
Thompson, H. S., & Carletta, J. (2001, March 28). XML Markup Technologies for Working with Linguistic Data. Slides from the workshop. Part I. Retrieved June 15, 2004, from http://www.cogsci.ed.ac.uk/~jeanc/corpus-linguistics/
[TIGER Corpus]
TIGER Corpus. Home page. Retrieved June 15, 2004, from http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
[ToBI]
ToBI. Home page. Retrieved June 15, 2004, from http://www.ling.ohio-state.edu/~tobi/
[Unicode]
Unicode 4.0.1. Retrieved June 15, 2004, from http://www.unicode.org/versions/Unicode4.0.1/
[URI]
Berners-Lee, T. (Ed.). (1998). Uniform Resource Identifiers (URI): Generic Syntax. RFC 2396. Retrieved June 15, 2004, from http://www.faqs.org/rfcs/rfc2396.html
[Voorhees 1994]
Voorhees, E. M. (1994). Query expansion using lexical-semantic relations. In W. B. Croft & C. J. van Rijsbergen (Eds.), SIGIR 1994: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 3-6 July 1994, Dublin, Ireland (61-69). London: Springer-Verlag.
[W3C]
World Wide Web Consortium. Home page. Retrieved June 15, 2004, from http://www.w3.org/
[W3C Members]
World Wide Web Consortium (W3C) Members. Retrieved June 15, 2004, from http://www.w3.org/Consortium/Member/List
[W3C Technical Reports]
W3C Technical Reports and Publications. Retrieved June 15, 2004, from http://www.w3.org/TR/
[WordNet]
WordNet. Home page. Retrieved June 15, 2004, from http://www.cogsci.princeton.edu/~wn/
[XML]
Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., Yergeau, F., & Cowan, J. (Eds.). (2004, February 4). Extensible Markup Language (XML) 1.1. Retrieved June 15, 2004, from http://www.w3.org/TR/2004/REC-xml11-20040204/
[XSLT]
Clark, J. (Ed.). (1999, November 16). XSL Transformations (XSLT) Version 1.0. Retrieved June 15, 2004, from http://www.w3.org/TR/1999/REC-xslt-19991116
