Keywords: conversational agents, virtual storytelling, logic programming, natural language and speech processing, WordNet, FrameNet and Open Mind based knowledge processing
Our interest in chat agents has emerged from the development of interactive Web-based storytelling programs. Because storytelling performances are ephemeral and not replicable, important artifacts of world culture are being lost. In other types of work settings, corporate memory and organizational knowledge are similarly being lost or not exploited to full advantage. Using conversational agents for storytelling has been shown to ``bring to life'' the collaborative computer-centered work environments needed to sustain the work of distributed teams, groups and academic classes working in virtual environments. Developers approach this in a number of ways through new technologies [1], applications [2], authoring tools [3], virtual characters [4], and models for narrative construction [5].
We have coded our storytelling agents as a combination of story-specific Semantic Web metadata (encoded as XML/RDF [6,7] files) and Prolog Web services integrating the WordNet [8,9,10] and FrameNet [11,12,13] lexical knowledge bases and a subset of the Open Mind [14,15,16] collection of common sense ontologies. For some stories, online chat transcripts are used to establish the Prolog query/answer patterns through an example-driven learner implemented in Prolog, which uses WordNet-based generalizations (hypernyms) to extend its coverage. The query/answer correlations extract ontology-specific knowledge from the story's RDF metadata, text and related chat recordings. The agent's conversational capabilities are enhanced by matching content from WordNet, FrameNet and Open Mind.
A first level in our rule hierarchy provides an essentially stateless, reactive dialog layer. Because the reactive rules are specialized with respect to the content of a given story, this layer captures the most likely questions and provides them with largely predefined answers (up to a fairly flexible WordNet-based semantic equivalence relation). However, to provide access to the state of the interaction as well as to the content of the XML-based story database, we extend the shallow pattern processing with a logic-based inference engine. The engine consists of a natural language parser, a common sense database, a lexical disambiguation module, as well as a set of transformation rules mapping surface structures to semantic skeletons in a way similar to the natural language processor described in [17]. The inference engine uses a dynamic knowledge base, which accumulates facts related to the context of the interaction. Such facts can be used for future inferences. This dynamic knowledge base works as a short-term memory similar to the one implicit in human dialogue and also provides means to disambiguate anaphoric references.
On the other hand, a permanent static database obtained by scanning XML-based metadata for each story, built by a human indexer, provides more specific information, when the semantic structure of the query can be translated into predicates matching metadata tags. A fragment of such encoding follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="../specificstory.xsl"?>
<story>
<dc:title>The Ant and His Treasure</dc:title>
....
<sc:genre>folktale, animal tale, inspiration story,
success story, trickster story
</sc:genre>
<sc:theme>self-reliance, determination, </sc:theme>
<sc:tale-type>fable</sc:tale-type>
<sc:motif>breadcrumb, weak characters</sc:motif>
<sc:setting>anthill, nature</sc:setting>
<sc:characters>ant, bee, cockroach, spider</sc:characters>
<sc:archetypes>trickster animals, weak-will</sc:archetypes>
<sc:coda>Stop crying and keep trying.</sc:coda>
....
</story>
The XML/RDF metadata allows users to search the story database by title, abstract, tale type, performer name, etc., for a digital video/audio storytelling performance, select the one they want to play, view the performance via a streaming media player, focus on relevant parts of the narrative transcript of the storytelling performance, and/or interact with the agents in a dialogue about the narrative. We have integrated access to information sources related to a given story into a natural metaphor: a virtual storytelling agent, modeled after what people ask and answer about the story, that is aware of the ontology and the context of the story, modeled as a hierarchy of classes.
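As an illustration of how a query can be translated into predicates matching metadata tags, the following minimal sketch reflects part of the XML fragment above as Prolog facts. The predicate names story_meta/3, question_tag/2 and answer_from_metadata/3, as well as the story identifier ant_treasure, are assumptions made for this sketch and not the schema of the actual system.

```prolog
% Hypothetical reflection of the XML fragment as Prolog facts:
% story_meta(StoryId, Tag, Values).
story_meta(ant_treasure, title, ['The Ant and His Treasure']).
story_meta(ant_treasure, tale_type, [fable]).
story_meta(ant_treasure, setting, [anthill, nature]).
story_meta(ant_treasure, characters, [ant, bee, cockroach, spider]).
story_meta(ant_treasure, coda, ['Stop crying and keep trying.']).

% A question word is mapped to the metadata tag able to answer it
% (an illustrative mapping only).
question_tag(who, characters).
question_tag(where, setting).

% answer_from_metadata(+Story, +WhWord, -Values)
answer_from_metadata(Story, WhWord, Values) :-
    question_tag(WhWord, Tag),
    story_meta(Story, Tag, Values).

% ?- answer_from_metadata(ant_treasure, who, Cs).
% Cs = [ant, bee, cockroach, spider].
```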
The knowledge base creates an agent instance based on the class to which the story is known to belong and provides inferences about related stories and default assumptions, which are used for queries not covered by the patterns extracted from the online chat sessions. The Jinni 2003 [18] Prolog compiler's support for multiple cyclic inheritance allows stories to be organized based on multiple classification criteria, very much as if they were related Web pages linked to each other. Jinni's classes are simply Prolog files with include declarations. They can be located at arbitrary URLs on the Web and can inherit predicate definitions from each other. When story instances are created, the object constructor receives the URLs of the multimedia (digital video/audio) recording of the storytelling event, the story transcript, and the log of the story-related query/answering chat session.
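The effect of inheriting default assumptions along possibly cyclic class links can be approximated in plain Prolog as follows. This is only a sketch under stated assumptions: it emulates inheritance with explicit is_a/2 facts and a visited list, whereas Jinni implements it through include declarations; all predicate names and facts below are hypothetical.

```prolog
:- use_module(library(lists)).

% Hypothetical class links; note the cycle between fable and animal_tale.
is_a(ant_treasure, fable).
is_a(fable, animal_tale).
is_a(animal_tale, fable).
is_a(fable, story).

% Hypothetical default assumptions attached to classes.
default_fact(fable, has_moral, yes).
default_fact(animal_tale, characters, animals).
default_fact(story, has_narrator, yes).

% ancestor_class(+Class, -Ancestor): Class itself or any class reachable
% through is_a/2 links; the Seen list prevents looping on cycles.
ancestor_class(C, A) :- ancestor_class(C, [C], A).

ancestor_class(C, _, C).
ancestor_class(C, Seen, A) :-
    is_a(C, S),
    \+ memberchk(S, Seen),
    ancestor_class(S, [S|Seen], A).

% assume(+Story, ?Property, ?Value): a default inherited by Story.
assume(Story, Property, Value) :-
    ancestor_class(Story, Class),
    default_fact(Class, Property, Value).

% ?- once(assume(ant_treasure, characters, V)).
% V = animals.
```

A default may be reachable along several paths, so a caller interested in a single answer would wrap the call in once/1.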
During the iterative design and development process of our storytelling agents we have noticed that the focus can easily be lifted towards a framework supporting conversational agents in which domain-specific information, lexical and semantic knowledge and common sense rules interoperate and enhance each other's expressiveness. The resulting generic agent architecture is described in Fig. 1.
We will overview the various components of the architecture and their interactions in the following sections.
Our agents are deployed using Prolog Web servers and server-side Prolog script processing capabilities. This provides seamless integration between the knowledge base, the shallow script processor and the XML metadata reflected as a Prolog set of story-specific facts and rules. The Web components are also developed using exclusively XML/XSL/XHTML pages to ensure a natural binding of Web content to database fields and easy parsing by script processors.
We have used Microsoft Agent [23] components embedded in a dynamically generated JavaScript Web page to provide easy integration of client-side voice and animation services. Under Internet Explorer, the dynamically generated Web pages trigger automatic download of the Microsoft Agent controls from the Microsoft server on first use. Client-side voice interaction is provided through the SAPI voice API's text-to-speech component. Specific text and animation commands are generated by our Prolog Server Agent processor, which expands annotations made in a page template like the following:
<html>
<head>
<title>Prolog Server Agent Output</title>
</head>
<body language="Javascript" onLoad="OnLoad()">
<OBJECT id=
/* reference to Microsoft Agent downloads ... */
</OBJECT>
<SCRIPT language="Javascript">
var aAgent;var qAgent;var res;
function initAgent(name,url) {
AgentControl.Characters.Load(name,url);
name = AgentControl.Characters.Character(name);
return name;
}
function initQ() {
qAgent=initAgent("qAgent",
"http://agent.microsoft.com//agent2//chars//peedy//peedy.acf");
}
function initA() {
aAgent=initAgent("aAgent",
"http://agent.microsoft.com//agent2//chars//merlin//merlin.acf");
}
function agentSpeak(agent,message) {
agent.Get("state", "Showing, Speaking");
agent.Get("animation", "Greet, GreetReturn");
agent.Show();agent.Get("state", "Hiding");
agent.Play("Greet");
res=agent.Speak(message);
agent.Hide();
}
function OnLoad() {
initQ();initA();
agentSpeak(qAgent,"{{?spoken_query}}");
aAgent.Wait(res);
agentSpeak(aAgent,"{{?spoken_answer}}");
}
</SCRIPT>
<p>
<font color="#000099"><b>{{?login}}:</b>{{?query}}</font>
<br><b>agent:</b> {{?answer}}
</p>
<hr><b>History Window</b>
<pre>
{{?history}}
</pre>
</body>
</html>
Note the presence of the {{?...}} patterns, which will be expanded by our Prolog Server Agent processor into the actual text to be spoken by the client-side text-to-speech processor. The Pattern Processor described in the next section (which also provides shallow natural language processing for queries) is used to locate the patterns and replace them on the fly with content from the Prolog database or an associative list.
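The expansion step can be illustrated with a small Definite Clause Grammar working on character lists. This is a minimal sketch, assuming a standard Prolog with library(lists) (for example SWI-Prolog); expand_template/3 and its helper nonterminals are hypothetical names, and placeholders with no binding are simply replaced by the empty string here.

```prolog
:- use_module(library(lists)).

% expand_template(+Template, +Bindings, -Result)
% Template and Result are atoms; Bindings is a list of Name-Value pairs.
expand_template(Template, Bindings, Result) :-
    atom_chars(Template, Chars),
    phrase(expanded(Bindings, Out), Chars),
    !,
    atom_chars(Result, Out).

% expanded//2 copies plain text and substitutes each {{?name}} marker.
expanded(Bs, Out) -->
    ['{', '{', '?'], name_chars(Ns), ['}', '}'],
    !,
    { atom_chars(Name, Ns),
      (   memberchk(Name-Value, Bs)
      ->  atom_chars(Value, Vs)
      ;   Vs = []                       % unknown placeholder: drop it
      ),
      append(Vs, Rest, Out) },
    expanded(Bs, Rest).
expanded(Bs, [C|Out]) --> [C], !, expanded(Bs, Out).
expanded(_, []) --> [].

% name_chars//1 collects the characters of a name up to the first '}'.
name_chars([C|Cs]) --> [C], { C \== '}' }, !, name_chars(Cs).
name_chars([]) --> [].

% ?- expand_template('Hi {{?login}}: {{?query}}',
%                    [login-guest, query-hello], R).
% R = 'Hi guest: hello'.
```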
We have designed a generic Definite Clause Grammar based Pattern processor which works on arbitrary data (character codes, tokens, sentences) to detect and aggregate patterns at a given syntactic level.
The predicate match_pattern(Pattern,InputList) matches Pattern against InputList. Pattern can contain any combination of constants, constrained variables of the form X:P, where P is a predicate about X, and Gap variables, which match arbitrary sequences located between constants and constrained variables. Note that constants and constrained variables match single items and function as known index elements in the InputList, while Gap variables collect the text to be retrieved. If patterns contain other patterns (embedded as lists) the mechanism allows recursive application, but most uses of the pattern matching mechanism in our applications have been ``shallow'', i.e. recursive embedded patterns have seldom been used. The code actually handles more powerful annotations (i.e. regular expressions and disjunctive patterns), which have proven very useful in applications like text mining and Internet content processing.
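The core behavior described above (constants, constrained variables and Gap variables) can be sketched in a few clauses. This is a simplified stand-in written for illustration, not the DCG-based Pattern processor itself; it omits embedded patterns, regular expressions and disjunctions, and the helper facts wh_word/1 and is_verb/1 are toy assumptions.

```prolog
:- use_module(library(lists)).

% match_pattern(+Pattern, +Input): simplified matcher.
match_pattern([], []).
match_pattern([P|Ps], Input) :-
    var(P), !,                    % Gap variable: collects a sub-sequence
    append(P, Rest, Input),
    match_pattern(Ps, Rest).
match_pattern([X:Test|Ps], [X|Rest]) :- !,
    call(Test),                   % constrained variable: one item passing Test
    match_pattern(Ps, Rest).
match_pattern([C|Ps], [C|Rest]) :-
    match_pattern(Ps, Rest).      % constant: exactly one identical item

% Toy lexical tests used by the example query.
wh_word(what).
wh_word(who).
is_verb(know).
is_verb(think).

% ?- match_pattern([W:wh_word(W), do, you, V:is_verb(V), Obs, '?'],
%                  [what, do, you, know, about, life, '?']).
% W = what, V = know, Obs = [about, life].
```

Gap variables are tried from the shortest match upwards, so backtracking into append/3 extends the gap until the rest of the pattern fits.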
Here is an example of a rule working on a list of natural language tokens:
try_match(Login,Password,Ys,Os):-
ensure_last(Ys,'?',Is),
% What do you <know> about <life>?
match_pattern([V:wh_word(V),do,you,
Verb:is_verb_phrase([Verb|_]),Obs,'?'],Is),
!,rotate_answer(Login,Password,what_do_you(Verb,Obs,Ds),Ds),
ensure_last(Ds,'.',Os).
As the answer handler what_do_you generates multiple answers, we are applying to it a higher order transformer rotate_answer which first accumulates the answers in the dynamic Prolog database (short-term-memory) and, when no more answers can be produced, rotates the answers - this is quite important to avoid boring the users with repeated answers. The parameters Login and Password uniquely identify the user allowing allocation of user-specific server side context which provides a conversational short term memory function.
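A minimal sketch of such an answer rotation is shown below. It assumes a simplified interface (a single User key instead of Login and Password) and a hypothetical dynamic predicate given/2 playing the role of the short-term memory; the actual rotate_answer transformer may be organized differently.

```prolog
:- use_module(library(lists)).
:- dynamic given/2.          % given(User, Answer): answers already used

% rotate_answer(+User, +Goal, ?Answer)
% Goal is an answer generator sharing the variable Answer.
rotate_answer(User, Goal, Answer) :-
    findall(Answer, Goal, Answers),
    Answers = [First|_],
    (   member(A, Answers), \+ given(User, A)
    ->  true                                  % first answer not yet given
    ;   forall(member(X, Answers),
               retractall(given(User, X))),   % all used up: start over
        A = First
    ),
    assertz(given(User, A)),
    Answer = A.

% Toy answer handler producing two alternative answers.
what_do_you(know, life, [life, is, short]).
what_do_you(know, life, [life, is, a, journey]).

% Three successive calls of
% ?- rotate_answer(guest, what_do_you(know, life, Ds), Ds).
% yield [life,is,short], then [life,is,a,journey], then [life,is,short].
```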
WordNet maps word forms and word meanings as a many-to-many relation. An important characteristic of WordNet is that semantic relations (hypernymy, hyponymy, synonymy, meronymy etc.) are defined between meanings instead of being defined between words or word phrases.
Meanings are represented by integers (called synsets) associated with sets of words and word phrases that collectively define a sense element (a concept, predicate or property) and are also usable for indexing.
So, for example, the meaning identifier (synset) Id=100011413 maps to the following list of words and word phrases: [[animal], [animate,being], [beast], [brute], [creature], [fauna]], which collectively define a common meaning.
We have refactored the set of predicates provided by WordNet, closely following the WordNet relation set, to support bidirectional constant time access to the set of meanings associated with a given word phrase (indexed by a unique head word) and to the set of word phrases and relations associated with a given (unique) meaning.
The refactored WordNet Prolog database contains the following basic binary relations.
For a given word (as humble in the following example) the relation w/2 returns a list of meanings:
?- w(humble,Meanings).
Meanings=[
302269648,201415418,301839431,
201414096,302162242,301551117,109699111
].
For a meaning, we have a number of alternative words or word phrases, with attributes. Among them, the last argument provides frequency of occurrence in a corpus of texts and will be used for disambiguation:
?-i(302269648,WordInfo).
WordInfo=[
f(1,[humble],s,1,3),
f(2,[low],s,7,0),
f(3,[lowly],s,1,1),
f(4,[modest],s,5,0),
f(5,[small],s,3,155)].
Links (like the sim/1 synonymy link) are collected on a list:
?-l(302269648,Links).
Links=[sim(302269385)].
Definitions and examples originally present in WordNet are preparsed so that they can be processed efficiently, if needed, at runtime. We also collect frequency information and word forms not present in the form of WordNet entries.
?-g(302269648,DefinitionAndExamples).
DefinitionAndExamples=[
def([low,or,inferior,in,station,or,quality]),
ex([a,humble,cottage]),
ex([a,lowly,parish,priest]),
ex([a,modest,man,of,the,people]),
ex([small,beginnings])].
Note that multiple syntactic categories can be present for words like ``humble'' (v=verb and a=adjective).
i(201415418,[f(1,[humble],v,1,1)]).
...
i(301839431,[f(1,[humble],a,2,1)]).
...
i(201414096,[
f(1,[humiliate],v,1,2),
f(2,[mortify],v,3,0),
f(3,[chagrin],v,1,0),
f(4,[humble],v,2,1),
f(5,[abase],v,1,0)]).
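To illustrate how the frequency argument might support disambiguation, the following sketch picks the meaning in which a word occurs most frequently. The facts are an abridged copy of the listings above (three of the seven meanings of ``humble'' and some word entries are left out because their i/2 facts are not shown), and most_frequent_meaning/3 is a hypothetical helper, not part of the refactored database.

```prolog
:- use_module(library(lists)).

% Abridged toy subset of the w/2 and i/2 relations shown above.
w(humble, [302269648, 201415418, 301839431, 201414096]).

i(302269648, [f(1,[humble],s,1,3), f(2,[low],s,7,0),
              f(3,[lowly],s,1,1), f(5,[small],s,3,155)]).
i(201415418, [f(1,[humble],v,1,1)]).
i(301839431, [f(1,[humble],a,2,1)]).
i(201414096, [f(1,[humiliate],v,1,2), f(4,[humble],v,2,1)]).

% most_frequent_meaning(+Word, -Meaning, -Freq)
% Collects Freq-Meaning pairs for Word and keeps the largest one.
most_frequent_meaning(Word, Meaning, Freq) :-
    w(Word, Meanings),
    findall(F-M,
            ( member(M, Meanings),
              i(M, Info),
              member(f(_, [Word], _, _, F), Info) ),
            Pairs),
    msort(Pairs, Sorted),
    last(Sorted, Freq-Meaning).

% ?- most_frequent_meaning(humble, M, F).
% M = 302269648, F = 3.
```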
Note also the presence of reversed relations like hyponyms (reverse hypernyms) and reverse meronyms. These are precomputed to support high performance graph walk operations using BinProlog and Jinni's fast database indexing mechanisms (blackboard operations), to provide constant time access to edges related to a given node for a given relation.
We have also precomputed mappings from word variants to related meanings, based on the dictionary entry they belong to.
Finally, we have precomputed ``toplevel'' nouns and verbs (meanings which do not have further hypernym links).
This refactoring provides sets of facts with the following properties:
Overall, our refactoring simplifies WordNet while providing an efficient inference engine through Prolog rules that can digest the information contained in these basic relations.
In the directed graph G of meaning nodes N_G we can see various WordNet relations as defining elements of a set of edges E_G.
We implement our abstraction operators as depth-K walks in such graphs. Besides the primitive WordNet relations, we use closures (see footnote 1) as edge generators.
To explore various semantic relations provided by or inferred from WordNet, we have implemented a generic graph walking algorithm which uses arbitrary relations to dynamically define edges and applies arbitrary conditions to the nodes.
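A depth-limited walk of this kind, with edges supplied as closures and a condition applied to the nodes, can be sketched with call/N as follows. The predicate names walk/4, walk_if/5, inverse/3 and differs_from/2 and the toy link/3 facts are assumptions made for this illustration; they are not the actual implementation.

```prolog
% Toy edges: link(Relation, From, To) over symbolic meaning nodes.
link(hyper, dog, canine).
link(hyper, canine, carnivore).
link(hyper, carnivore, mammal).
link(hyper, mammal, animal).

% walk(+Edge, +K, +From, -To): To is reachable from From in at most
% K steps; Edge is a closure called with two extra node arguments.
walk(_, _, Node, Node).
walk(Edge, K, From, To) :-
    K > 0,
    call(Edge, From, Next),
    K1 is K - 1,
    walk(Edge, K1, Next, To).

% walk_if(+Edge, +Cond, +K, +From, -To): keep only nodes passing Cond.
walk_if(Edge, Cond, K, From, To) :-
    walk(Edge, K, From, To),
    call(Cond, To).

% inverse(+Edge, ?X, ?Y): the reversed relation, computed on the fly.
inverse(Edge, X, Y) :- call(Edge, Y, X).

% differs_from(+X, ?Y): example node condition.
differs_from(X, Y) :- X \== Y.

% ?- findall(N, walk(link(hyper), 2, dog, N), Ns).
% Ns = [dog, canine, carnivore].
% ?- findall(N, walk(inverse(link(hyper)), 1, mammal, N), Ns).
% Ns = [mammal, carnivore].
```

Here link(hyper) is a predicate name plus one argument, which becomes a callable goal once the two graph nodes are appended by call/3.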
Among the interesting applications to chat agents, a fast least common ancestor computation provides answers like the following:
guest: Relate woman and love!
agent: After scratching my head 1 times, relating woman to
love makes me think about girlfriend.
As WordNet seems to know about logic programming (the technology used in this project), the following comes out quite easily:
guest: Relate logic and programming.
agent: After scratching my head 1 times, relating logic to programming makes me
think about logic programming.
Unfortunately, with depth limited to 3, our agent will try to save face after not finding a connection (see footnote 2):
guest: Relate logic and love!
agent: About love. Love is a kind of emotion, are you
thinking about any strong feeling?
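The depth-limited least common ancestor computation behind such answers can be sketched as follows. The hyper/2 facts are toy data over symbolic nodes, and up/4 and lca/4 are hypothetical names; the sketch ranks common ancestors by the total number of hypernym steps and fails when none is found within the depth limit, which is the situation where the agent falls back to a face-saving reply.

```prolog
% Toy hypernym links between symbolic meaning nodes.
hyper(dog, canine).
hyper(canine, carnivore).
hyper(cat, feline).
hyper(feline, carnivore).
hyper(carnivore, mammal).

% up(+Node, +K, ?Anc, -D): Anc lies D hypernym steps above Node, D =< K.
up(Node, _, Node, 0).
up(Node, K, Anc, D) :-
    K > 0,
    hyper(Node, Parent),
    K1 is K - 1,
    up(Parent, K1, Anc, D1),
    D is D1 + 1.

% lca(+A, +B, +K, -Anc): nearest common ancestor of A and B within depth K.
lca(A, B, K, Anc) :-
    findall(D-N,
            ( up(A, K, N, DA), up(B, K, N, DB), D is DA + DB ),
            Pairs),
    msort(Pairs, [_-Anc|_]).

% ?- lca(dog, cat, 3, X).
% X = carnivore.
% ?- lca(dog, cat, 1, X).
% false.
```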
Word lifting is a simple but useful abstraction mechanism with obvious applications to indexing. It replaces words and word phrases extracted from the story transcripts with more general equivalents obtained by following upward links in the WordNet hypernym hierarchy, as in the following example:
?- lift_word(spy,1,S).
S=[[[secret,agent],
[intelligence,officer],
[intelligence,agent],
[operative],...]];
no
?- lift_word(spy,2,S).
S=[[[perceiver],[observer],[beholder],[agent],...]];
no
?- lift_word(spy,3,S).
S=[[[person],[individual],[someone],[somebody],[mortal],[human],...]];
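A possible reading of this operation is sketched below: a word is mapped to its meanings, each meaning is lifted by exactly K hypernym links, and the word phrases of the resulting meanings are returned. The facts use made-up meaning identifiers (m1, m2, m3) and abridged phrase lists, and phrases/2 and lift/3 are hypothetical helpers, so the sketch only approximates the behavior shown in the queries above.

```prolog
:- use_module(library(lists)).

% Toy data: word to meanings, hypernym links, meaning to word phrases.
w(spy, [m1]).

hyper(m1, m2).
hyper(m2, m3).

phrases(m1, [[spy]]).
phrases(m2, [[secret,agent], [intelligence,officer], [operative]]).
phrases(m3, [[perceiver], [observer], [beholder]]).

% lift(+K, +Meaning, -Lifted): follow exactly K hypernym links upward.
lift(0, M, M).
lift(K, M, Lifted) :-
    K > 0,
    hyper(M, Parent),
    K1 is K - 1,
    lift(K1, Parent, Lifted).

% lift_word(+Word, +K, -PhraseSets): one phrase set per lifted meaning.
lift_word(Word, K, PhraseSets) :-
    w(Word, Meanings),
    findall(Ps,
            ( member(M, Meanings), lift(K, M, L), phrases(L, Ps) ),
            PhraseSets).

% ?- lift_word(spy, 1, S).
% S = [[[secret,agent],[intelligence,officer],[operative]]].
```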
Hypernym relations are more meaningful for nouns than for other syntactic categories. Noun-related synsets form relatively deep (up to 10-12 levels) hierarchies coming from fairly reliable common sense and natural science classifications. By restricting a story trace to nouns, one can get an approximation of what the story is about at different levels of abstraction.
Pure verb traces (obtained by selecting only verb sequences) provide an abstract view of a story's dynamics. Like Web links, WordNet graphs exhibit small-world structure (strong clustering and small diameter). At a deep enough level (5-10), all stories will map to sequences like [act] [transfer] [change] [rest] and similar verbs, organized as collections of story-independent patterns. Such patterns indicate dramatic intensity and can be used to spot climactic points in a story.
WordNet provides a cs(Cause, Effect) and an ent(Action,Consequence) relation applying only to verbs. By using them in the context of a story, one can derive hints about what might happen next or explain why something has happened.
Causal relations provide possible explanations and, as such, answers to why-questions about a story; they can also be used to generate explanatory sentences usable in abstracts.
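The use of these two relations for explanations and for hints about what might happen next can be sketched as follows. For readability the toy facts below relate verbs directly rather than synset identifiers, and why/2 and what_next/2 are hypothetical predicates introduced for this illustration.

```prolog
% Toy facts in the cs(Cause, Effect) and ent(Action, Consequence) format.
cs(give, have).
cs(kill, die).
cs(show, see).

ent(snore, sleep).
ent(buy, pay).

% why(+Event, -Explanation): a possible cause of an observed event.
why(Effect, because(Cause)) :-
    cs(Cause, Effect).

% what_next(+Event, -Hint): what an event may cause or what it implies.
what_next(Event, may_cause(Effect)) :-
    cs(Event, Effect).
what_next(Event, implies(Consequence)) :-
    ent(Event, Consequence).

% ?- why(die, E).
% E = because(kill).
% ?- what_next(buy, H).
% H = implies(pay).
```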
The FrameNet corpus provides ontology data at a higher level than WordNet and allows detection of most of the semantic roles [12] relevant to the understanding of XML/RDF story-specific data, as well as detection of key elements of the conversational context. The ``granularity'' of FrameNet data, which is described in terms of predicates (corresponding to verbs) and their arguments, matches our internal Prolog representations well, as these also consist of predicate definitions. We use a SAX-based event-driven parser (part of Jinni 2003) to extract only relevant data from FrameNet (a collection of a few thousand XML files of around 1000 MB total size).
After simple syntactic transformations we have mapped various Open Mind and human chat transcripts to a Prolog database of ``canned'' question/answer facts. Through the use of noun abstractions and verb abstractions as well as synonymy relations we have significantly extended the coverage of this database.
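The way synonymy can extend the coverage of a database of canned question/answer facts is illustrated by the following minimal sketch. The qa/2 and syn/2 facts and the predicates same_word/2, same_question/2 and answer/2 are toy assumptions; the actual database also relies on noun and verb abstractions, which are not modeled here.

```prolog
% Toy canned question/answer fact and a toy synonymy link.
qa([what, is, a, dog], [a, dog, is, a, loyal, animal]).

syn(dog, hound).

% same_word(?W1, ?W2): identical words or words linked by synonymy.
same_word(W, W).
same_word(W1, W2) :- syn(W1, W2).
same_word(W1, W2) :- syn(W2, W1).

% same_question(+Q1, +Q2): token lists equal up to same_word/2.
same_question([], []).
same_question([W1|Ws1], [W2|Ws2]) :-
    same_word(W1, W2),
    same_question(Ws1, Ws2).

% answer(+Question, -Answer): first canned answer whose question matches.
answer(Question, Answer) :-
    qa(Canned, Answer),
    same_question(Question, Canned),
    !.

% ?- answer([what, is, a, hound], A).
% A = [a, dog, is, a, loyal, animal].
```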
Conversational agents [25,26,27,4,28] have been identified as an effective multi-modal interface element with applications ranging from user support automation to video games and interactive fiction [3,5,29]. Interestingly enough, conversational agents are reopening some 50-year-old methodological dilemmas and challenges of artificial intelligence. We will overview them informally here, based on our own journey through various architecture and implementation decisions in building a fairly large Prolog-based conversational agent in a virtual storytelling application which integrates more than a gigabyte of knowledge base data from WordNet, FrameNet and Open Mind. The distinctions stem from aspects related to conversational intelligence (reasoning) as well as (factual) conversational knowledge.
Symbolic vs. statistical inference processing. This is an instance of the old AI dilemma between using logic/predicate calculus or semantic nets/conceptual graphs as a symbolic reasoning mechanism versus statistical mechanisms like Bayesian networks, genetic algorithms or artificial neural networks.
Programming vs. machine learning. Should conversational agents be coded in (possibly customized) high level languages, or should we use various machine learning/data mining algorithms to extract conversational intelligence from various online and offline information sources?
Hand-coded vs. automatically acquired knowledge. In a way similar to the choice between hand-coded and machine-learned conversational intelligence, hand-built conversational knowledge competes against knowledge acquired automatically by transformation/adaptation of existing knowledge repositories.
Shallow vs. deep natural language understanding. This distinction is somewhat related to effectiveness in the ``Turing test'' (the ability to fool humans about an agent being or not being human) in half-serious contests like the Loebner prize. Also, in terms of conversational realism, in the context of video gaming or interactive fiction, shallow natural language processing using a large set of patterns, often collected through an open-source, user-contributed process (see [28]), has proven a valid de facto alternative to sophisticated morphological, syntactic or semantic ``deep'' natural language understanding techniques, which translate natural language to various formal representations and query processing languages.
Mimetic vs. real conversational intelligence. Realistic human-like synthetic actors are now routinely used in movies and video games. The need for domain-independent conversational intelligence is particularly important in applications from interactive fiction to user support. The subconscious or explicit user expectation in all these areas is that if an agent looks like a human and speaks like a human, then it has all the other human attributes. This has led chatbot implementors to focus on ``deception'' techniques ranging from Eliza-style rephrasings and oracular/ambiguous/unspecific answer generation to shifting the focus of the user (offering them a drink or talking to another agent, as often seen in video games or interactive fiction), sometimes quite successfully in terms of psychological realism, as is the case in Façade [30].
Knowledge intensive vs. inference intensive AI. With the advent of large, freely available lexical and common sense knowledge repositories like WordNet [8,9] and Open Mind [16,14] it is becoming increasingly possible to cover a large spectrum of natural language questions by finding satisfactory answers through relatively shallow (but computationally efficient) pattern matching. Interestingly, issues like ``story consistency'', coherent handling of the context of the conversation and its implicit assumptions require refocusing our inferential techniques towards textual rather than sentential aspects of conversational intelligence, while balancing mimetic realism and usefulness of our agents as information sources.
The agent technology involved in the automation of the interactive chat query/answer patterns needs both a natural language analysis and a natural language generation component. The analysis capabilities are needed to understand the question and the generation capabilities are needed to construct the answer.
Answering what-is-this-about questions is relatively easy, by extraction of dominant nouns and noun phrases from each story. However, creating a dialogue to get at the deeper hermeneutics of the story or the impact of a storytelling performance narrative upon an individual is harder. Different people will select a different trace in a story to chat about. A story trace is a sequence of meanings extracted from the lexical material of a story to which one or more meaning transformations are applied. The semantic ambiguity coming from the polysemy of the lexical material is intensified by the pragmatic ambiguity of the listener's personal experience, the parameters of the storytelling performance, and the nature of the multimedia experience (seeing video, hearing audio tracks, reading a transcript, listening to a musical story, etc.).
Through the use of WordNet, abstractions can be traced to help determine what a given story and its parts are about. WordNet contains semantic links that allow users to navigate a network of meaning-to-meaning relationships. Meaning elements obtained by navigating WordNet concept hierarchies naturally generalize the meaning of individual sentences. By starting from a story's lexical material and working upward in word meaning hierarchies to higher level indexing terms, story similarities and differences can be compared and query/answer patterns can be automatically extracted.
In the context of our general architecture, we have identified the following issues for the future developments of our conversational agent technology:
1 Our closures are predicate name+argument combinations, which receive two graph nodes as extra arguments to make up a callable predicate.
2 Well, the same happens at level 12 - and that's because in WordNet's view, there is no connection!