
Deliverables: The FrameNet Databases

Introduction

The FrameNet project has produced two types of data: a collection of approximately 50,000 hand-annotated sentences, and a database containing information about frames, frame elements, lemmas and lexical entries. All of this data is distributed as ASCII files with markup that is compatible with both SGML and XML, with accompanying DTDs. (For brevity, this will be referred to as XML format hereafter.) If resources permit, other data formats will be made available.

These databases have not yet been released, but will soon be available for downloading from the FrameNet website. The website also contains the complete documentation of the project and will contain a web interface to a search engine which can handle a wide variety of linguistically interesting queries.

The remainder of this appendix describes the contents and structure of the data files.

Annotated files

In the FrameNet project, we have created approximately 1,600 annotated files, each comprising a set of sentences selected from the BNC containing a given lemma and grouped by syntactic pattern, as described in the section on subcorporation; the number of sentences ranges from very few to about 300 depending on the frequency of the lemma in the corpus. Typically, only about 20% of the sentences will be annotated; our objective has been to document and exemplify the range of possible patterns of occurrence, rather than to annotate everything.

Format of headers

Each file consists of a header followed by a body, all wrapped in a <CORPUS> element. The corpus element has four required attributes:
CORPNAME (always "bnc")
DOMAIN listed in link
FRAME listed in link
LEMMA listed in link

(NOTE: The semantic domains were defined mainly to ensure that our work spanned many different semantic areas. We make no ontological claims about them, and have abandoned this terminology in FrameNet II.)

The lemma value is the base (uninflected) form of the word, followed by a period and the part of speech ("n", "v", or "a"). The header itself contains two elements: CNOTES, which gives information about the creation of the file, and CHANGES, which contains the dates of each change to the file, including regular annotation and occasional systematic, global revisions, such as the renaming of frame elements, which occurred during the course of the project. The lines in the CHANGES element were produced by the RCS revision control system.
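As a rough illustration of the header attributes, the following Python sketch reads the four CORPUS attributes from a small hand-made fragment and splits the LEMMA value into base form and part of speech. The attribute values in SAMPLE (domain, frame and lemma) are invented for illustration and are not taken from the release, and the function name split_lemma is hypothetical. Files that are valid SGML but not well-formed XML would need a different parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the DOMAIN, FRAME and LEMMA values are made up.
SAMPLE = (
    '<CORPUS CORPNAME="bnc" DOMAIN="communication" '
    'FRAME="statement" LEMMA="argue.v"></CORPUS>'
)

def split_lemma(value):
    """Split a LEMMA value such as 'argue.v' into (base form, part of speech)."""
    base, pos = value.rsplit(".", 1)  # split at the final period only
    if pos not in ("n", "v", "a"):
        raise ValueError(f"unexpected part of speech: {pos!r}")
    return base, pos

corpus = ET.fromstring(SAMPLE)
# Collect the four required attributes named in the text.
header = {name: corpus.get(name) for name in ("CORPNAME", "DOMAIN", "FRAME", "LEMMA")}
base, pos = split_lemma(header["LEMMA"])
print(header["CORPNAME"], base, pos)
```

Splitting at the last period rather than the first keeps the sketch usable if a base form ever contains a period itself.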

Format of body

The body consists of a series of sentences marked up as S elements, interspersed with COMMENT elements. The COMMENT elements are used to mark the stages of the subcorpus extraction process. Each COMMENT contains a SC element giving the name of the subcorpus, and a STATS element, giving (1) the number of BNC sentences initially selected for the subcorpus, (2) the number considered "usable" after eliminating those considered too long, too short, or likely to contain sentence fragments from one of the BNC speech corpora, and (3) the number of sentences saved from these (limited to 20, if there were more than 20 usable).

Constituent Tags

Each S element has one attribute, an 8- or 9-digit number which represents the position of the target word in the BNC. This number serves both as a unique identifier and as a key for finding the sentence in the corpus.

The content of the S element is the sentence from the BNC, a series of words separated by whitespace, each containing a slash and the part-of-speech tag from the BNC (the CLAWS C5 Tagset).
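The word/tag layout described above can be unpacked with a few lines of Python. This is only a sketch: the function name split_tagged_words and the example sentence are invented here, and the tags were chosen by hand from the C5 tagset rather than copied from the BNC.

```python
def split_tagged_words(text):
    """Return (word, tag) pairs from whitespace-separated 'word/TAG' tokens."""
    pairs = []
    for token in text.split():
        if "/" not in token:
            raise ValueError(f"token without a tag: {token!r}")
        # Split at the last slash so a word that itself contains '/' stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

# Hypothetical sentence text, as it might appear with the markup removed.
example = "They/PNP argued/VVD against/PRP the/AT0 plan/NN1 ./PUN"
pairs = split_tagged_words(example)
print(pairs[1])
```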

There must be at least one tagged word, the target, enclosed in a C element, with the attribute TARGET and the value "y". If the sentence has been annotated, there will be one or more frame elements, also enclosed in C tags, each with the three attributes for frame element name (FE), phrase type (PT) and grammatical function (GF).

[To be added: an explanation of TARGET="mate".]

Implicit FEs (Null Instantiation)

In cases where some FEs are conceptually required in a frame but not expressed in a given sentence, this is indicated by a constituent tag containing no text. The FE will be a regular frame element name for the given frame, and the PT will be one of DNI, INI, or CNI; there should be no GF attribute. See the section on implicit FEs for further explanation.
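The distinction between the target, overt frame elements and implicit frame elements can be sketched in Python as follows. Everything in SAMPLE is hypothetical: the text does not name the attribute on S, so TPOS is a placeholder, and the FE, PT and GF values are illustrative rather than taken from the data. The function name classify is likewise invented.

```python
import xml.etree.ElementTree as ET

NULL_TYPES = {"DNI", "INI", "CNI"}  # phrase types that mark an implicit FE

# Hypothetical sentence; attribute name TPOS and all values are assumptions.
SAMPLE = (
    '<S TPOS="12345678">'
    '<C FE="Spkr" PT="NP" GF="Ext">They/PNP</C> '
    '<C TARGET="y">argued/VVD</C> '
    '<C FE="Msg" PT="PP" GF="Comp">against/PRP the/AT0 plan/NN1</C> '
    '<C FE="Add" PT="INI"></C> ./PUN'
    '</S>'
)

def classify(c):
    """Label a C element as 'target', 'implicit' or 'overt'."""
    if c.get("TARGET") == "y":
        return "target"
    has_text = bool((c.text or "").strip())
    # Implicit FEs carry no text, a null-instantiation PT, and no GF.
    if c.get("PT") in NULL_TYPES and not has_text and c.get("GF") is None:
        return "implicit"
    return "overt"

sentence = ET.fromstring(SAMPLE)
kinds = [(classify(c), c.get("FE")) for c in sentence.iter("C")]
print(kinds)
```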

Sentence-level Tags

There may also be sentence-level tags for features that apply to the sentence as a whole. The format is <T TYPE="type"></T>, where type names the feature. The most important types are:
sensen      An instance of sense n of the target lemma (n is the sense number)
Idiom       Idiomatic use of the target lemma
Metaphor    Metaphorical use of the target lemma
Blend       Sentence represents a blending of frames

In the data release, we have combined all the annotated sentences from these 1,600 files into one large XML file, called fn1.xml. Unannotated sentences have been omitted, as have empty subcorpora. The CORPUS element from each annotation file has been included to mark the beginning and end of sentences annotated for a particular lexical unit (i.e. a lemma in a frame). Within lexical units, the beginning of each subcorpus is marked with a <SC> element (contained within <COMMENT> tags); the content of this element shows in abbreviated form the syntactic criterion used to extract the subcorpus. For example, "V-570-np-ppagainst" means that this subcorpus contains sentences in which the target verb is followed by an NP and then a PP headed by against. The DTD for the annotation file is part of the release, in the file fnc.dtd.

Frame-and-Lexicon Database

This part of the data consists of four tables, which collectively can be thought of as a database describing all the frames and frame elements (FEs) from the project, listing the lexical units, and giving a few of the proposed inheritance (elaboration) relations between frames. The relations between the four tables are indicated by the use of unique names for frames and frame elements. (Note that there are many instances of FEs of the same name in different frames, but they refer to the same entry in the FE table. In the frame inheritance table, where it is necessary to refer to two FEs of the same name in different frames, the dotted notation frame.frame_element is used.)

Each of the four tables is provided in two formats, XML and flat, tab-separated. In the latter, the first row contains the names of the fields, also tab-separated.
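A tab-separated table with a header row can be read with Python's standard csv module. The sketch below uses the field names listed for the Frame Description Table; the single data row is invented for illustration, and the function name read_table is hypothetical. Quoting is switched off on the assumption that the flat files contain no quoted fields.

```python
import csv
import io

# Hypothetical excerpt: header row plus one made-up data row.
SAMPLE = (
    "domain\tframe\tFEs\tDescription\tExamples\n"
    "communication\tCommunication\tSpeaker, Addressee\t"
    "A speaker addresses an addressee.\tShe spoke to him.\n"
)

def read_table(stream):
    """Read a tab-separated table whose first row holds the field names."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_table(io.StringIO(SAMPLE))
# Index the rows by frame name, since frame names are unique keys.
frames = {row["frame"]: row for row in rows}
print(frames["Communication"]["FEs"])
```

For a file on disk, the same function can be given an open file object in place of the StringIO wrapper.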

Frame Description Table

This table provides the basic information about the eighty-eight frames completed to date; the fields are:
domain       semantic domain (see the note on semantic domains above)
frame        name of the frame
FEs          list of names of FEs
Description  a brief description of the frame
Examples     examples of the frame

The descriptions and examples given here are just enough to remind someone already familiar with the frame what it means; the full descriptions, with more complete examples, are given in the Appendix of Frame Descriptions of this document.

Frame Element Table

This table provides the basic information about frame elements. Conceptually, each FE is defined relative to one frame, but in practice, some FE names are specific to a particular frame, while other names are used in more than one frame. Some of the fields in the table will have little use to anyone outside the FrameNet project. As for the preceding table, fuller descriptions should be sought in the Appendix of Frame Descriptions of this document.

The fields are:
domain       semantic domain of the frame containing the FE, if unique; otherwise "NA"
frame        name of the frame containing the FE, often "NA" for FEs used in many frames
attribute    full attribute used in the XML tag, which usually contains the abbreviation of the FE name
x1           used internally by Alembic
text         display color for text of FE
bgcolor      display color for background of FE
x2           used internally by Alembic
key          keystroke used in annotation software
fename       the full name of the FE
description  a brief description of the meaning of the FE
example      a short example of the FE

Lexical Unit Table

This table contains one row for each lexical unit treated in FN1, that is, for each pairing of a lemma with a frame, roughly equivalent to each dictionary sense of each lemma. Many of the fields of this table are mostly empty, because there is no information of the relevant type on this particular lexical unit in the FN1 data. For example, if it is clear which sense of a lemma is intended on the basis of the lemma and frame, there may be no definition written in; if there are no lexically specific observations on the syntax, or FEs to be realized, or commonly null-instantiated FEs, these fields will be empty.

The fields are:
lemma             base form of word
pos               Part of Speech (usually Noun, Verb, or Adjective)
sense             number (usually 1)
domain            semantic domain (see the note on semantic domains above)
frame             name of frame for this sense
FN definition     definition of this sense, written by FN staff
Senses
FE note           Notes on FEs
SR note           Notes on syntactic realization
Collocates        Frequent collocates
Null Instantiated Constituents
Frame Elements    List of frame elements
Done              internal bookkeeping field
COD dfn           Definition from Concise Oxford Dictionary
WS dfn
WN link           Link to WordNet synset
sequence

Frame Inheritance Table

This table describes some of the inheritance relations between frames, by virtue of showing the mappings between their elements. For example, the first two lines of the file show that the Candidness frame is a child of the Communication frame, and that the FEs Speaker and Addressee of Candidness inherit from the FEs of Communication of the same name. The next two lines show that Commitment is another child of Communication, and that the FEs Commitment.Communicator and Commitment.Addressee inherit from Communication.Speaker and Communication.Addressee respectively.

The inheritance relations given in this table are very preliminary, and subject to revision. Many more such relations will probably be described as the work of FrameNet II progresses.
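The Candidness and Commitment examples above can be written out in the dotted frame.frame_element notation and queried with a short Python sketch. The layout of MAPPINGS as (child, parent) pairs is an assumption made for illustration; the actual column layout of the inheritance table is not described here. The function names split_dotted and parent_frames are hypothetical.

```python
# (child FE, parent FE) pairs in dotted notation, following the examples in the text.
# The pair layout is an assumption, not the documented file format.
MAPPINGS = [
    ("Candidness.Speaker", "Communication.Speaker"),
    ("Candidness.Addressee", "Communication.Addressee"),
    ("Commitment.Communicator", "Communication.Speaker"),
    ("Commitment.Addressee", "Communication.Addressee"),
]

def split_dotted(name):
    """Split 'Frame.FE' into (frame, frame element)."""
    frame, _, fe = name.partition(".")
    if not frame or not fe:
        raise ValueError(f"not in frame.frame_element form: {name!r}")
    return frame, fe

def parent_frames(mappings):
    """Map each child frame to the set of frames its FEs inherit from."""
    parents = {}
    for child, parent in mappings:
        child_frame, _ = split_dotted(child)
        parent_frame, _ = split_dotted(parent)
        parents.setdefault(child_frame, set()).add(parent_frame)
    return parents

parents = parent_frames(MAPPINGS)
fe_map = dict(MAPPINGS)  # child FE -> parent FE
print(parents["Commitment"], fe_map["Commitment.Communicator"])
```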

This section is currently under revision.

