Spatio-temporal networks for speech and visual pattern recognition
I am interested in the representational, computational, and
adaptive properties of spatio-temporal networks and the use of such
nets in speech and visual pattern recognition.
A spatio-temporal
neural net differs from other neural networks in two ways.
First, propagation delays along links and recurrence
play an important role in the computations carried out by the network.
Second, the representational state of the network depends not only on
which nodes are firing, but also on the relative firing times of
nodes.
Consequently, the representational significance of a node varies
with time and depends on the firing state of other nodes. The use of
recurrence and multiple links with variable propagation delays provides
a rich mechanism for integration, context sensitivity, feature extraction
and pattern recognition. Recurrent links enable nodes to integrate and
differentiate inputs, detect the onset of features and measure their
duration. At the same time, multiple links with variable propagation
delays between nodes serve as a short-term memory and allow the network
to maintain context over a window of time.
The combination of the above characteristics makes spatio-temporal
neural networks a potentially powerful mechanism for pattern recognition.
Needless to say, spatio-temporal neural networks have a sound basis in
biology.
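As an illustration of the mechanisms described above, the following minimal Python sketch simulates a single sigmoid unit in discrete time. It is not the Temporal Flow Model itself: the function `run_unit`, its parameters, and the weights chosen are assumptions made purely for this example. The first run pairs an undelayed excitatory link with a delayed inhibitory link, so the unit responds only at the onset of its input. The second run adds a recurrent self-link, so the unit retains a brief input pulse after the pulse has ended.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_unit(inputs, delayed_weights, self_weight=0.0, bias=0.0):
    """Simulate one unit over discrete time steps (illustrative only).

    inputs          -- list of input values, one per time step
    delayed_weights -- dict {delay: weight}; each entry models one link from
                       the input to the unit with that propagation delay
    self_weight     -- weight of a recurrent link (delay 1) from the unit
                       back to itself
    """
    outputs = []
    for t in range(len(inputs)):
        total = bias
        # Each link delivers the input value from `delay` steps ago.
        for delay, weight in delayed_weights.items():
            if t - delay >= 0:
                total += weight * inputs[t - delay]
        # The recurrent link feeds back the unit's own previous output.
        if t > 0:
            total += self_weight * outputs[t - 1]
        outputs.append(sigmoid(total))
    return outputs


# Onset detection: excitation now, inhibition one step later.
onset = run_unit([0, 0, 1, 1, 1, 0, 0], {0: 6.0, 1: -6.0}, bias=-3.0)

# Short-term retention: a self-link keeps the unit active after a pulse.
latched = run_unit([0, 1, 0, 0, 0, 0], {0: 6.0}, self_weight=6.0, bias=-3.0)
resting = run_unit([0, 0, 0, 0, 0, 0], {0: 6.0}, self_weight=6.0, bias=-3.0)

print([round(v, 2) for v in onset])
print([round(v, 2) for v in latched])
print([round(v, 2) for v in resting])
```

In the first run the output is high only at the step where the input switches on. In the second, the unit stays active long after the one-step pulse, whereas the same unit with no input stays near its resting level.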
In our research we use the Temporal Flow Model (TFM)
[Watrous and Shastri, 1986][Watrous, 1988]. For some details on how
we train TFM, look here.
The need for spatio-temporal networks arises naturally when dealing
with problems such as speech recognition and time series prediction
where the input signal has an explicit temporal aspect.
In work with Thomas Fontaine
[Fontaine and Shastri, 1993][Shastri and Fontaine, 1995]
we have demonstrated that certain
tasks that do not have an explicit temporal aspect can also be
processed advantageously with neural networks capable of dealing with
temporal information. In particular, we have proposed that converting
static patterns into time-varying (spatio-temporal) signals
by scanning the image would lead to a number of significant
advantages.
- An obvious but significant advantage of the scanning approach
is that it naturally leads to a recognition system that is
shift invariant along the temporalized axis (or axes).
- The spatio-temporal approach explicates the image geometry since
the local spatial relationships in the image along the
temporalized dimension are naturally expressed as local temporal variations
in the scanned input.
- In the spatio-temporal approach a spatial dimension is replaced by
the temporal dimension and this leads to models that are
architecturally less complex than similar models that use two spatial
dimensions. This reduction of complexity occurs because
the spatial extent of any feature in an object's image
is much less than the spatial extent of the object's image.
If we assume that internal nodes act as feature detectors then the
moving receptive field of an internal node in the spatio-temporal
model leads to an effective tessellation of the feature detector
over the image without the actual (physical) replication of the
feature detector.
- Work on visual pattern recognition often treats an image as a
static, fixed-length two-dimensional pattern. This is unrealistic
since in general an agent must scan its environment to locate
and identify objects of interest. Observe that even reading text
involves processing a continuous stream of visual data having
an essentially arbitrary extent. The scanning approach allows
a visual pattern recognition system to deal with inputs of arbitrary
extent.
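The first and third advantages listed above can be made concrete with a small Python sketch. It assumes a left-to-right, column-by-column scan and a single linear feature detector whose receptive field spans two consecutive time steps. The helper names (`scan_columns`, `detector_response`, `shift_right`), the toy image, and the template are hypothetical and are not taken from the system described here. The same detector weights are reused at every time step, so the detector is effectively applied across the whole image without being replicated, and shifting the image along the scanned axis simply shifts the detector's response in time.

```python
def scan_columns(image):
    """Turn a 2-D image (list of rows) into a time sequence of column vectors."""
    height, width = len(image), len(image[0])
    return [[image[r][c] for r in range(height)] for c in range(width)]


def detector_response(frames, template):
    """Apply one feature detector at every time step of the scanned input.

    template is a short list of column weight vectors covering a few
    consecutive time steps (the detector's moving receptive field).
    """
    span = len(template)
    responses = []
    for t in range(len(frames) - span + 1):
        score = sum(
            w * x
            for frame, weights in zip(frames[t:t + span], template)
            for x, w in zip(frame, weights)
        )
        responses.append(score)
    return responses


def shift_right(image, k):
    """Shift image content k columns along the scanned axis, padding with 0."""
    return [[0] * k + row[:len(row) - k] for row in image]


# Toy 3x8 image containing one small feature (a vertical bar with a stub).
image = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
]
template = [[1, 1, 1], [0, 1, 0]]  # weights for two consecutive columns

original = detector_response(scan_columns(image), template)
shifted = detector_response(scan_columns(shift_right(image, 3)), template)

peak_original = original.index(max(original))
peak_shifted = shifted.index(max(shifted))
print(original, peak_original)
print(shifted, peak_shifted)
```

The peak response has the same value in both runs and occurs three time steps later for the shifted image, so a recognizer that reads the detector's output over time sees the same event regardless of where the feature lies along the scanned axis.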
The effectiveness of the above ideas has
been demonstrated in the dissertation work of Thomas Fontaine
who designed and trained a system for recognizing sequences of handwritten
digits. The system has a 96% recognition rate on a dataset of 2,700
isolated digits provided by USPS and a 96.5% recognition rate on a
set of 207,000 isolated digits provided by NIST. On a set of 540
real-world ZIP code images provided by USPS, the system achieved a
raw accuracy of 66.0%.
A PostScript paper describing this work may be found here.
Another paper (in PostScript form) describing work done with C. Privitera on a
hierarchical self-organizing model for visual trajectory classification
may be found here. This paper appeared in the Proceedings of
ICANN-96 -- the 1996
International Conference on Artificial Neural Networks, Bochum, Germany.
An extended version of the above paper is also available.
Support:
This research has been supported by:
NSF grant SBR-9720398, ONR grant N00014-93-1-1149 to L. Shastri, and
ARO grants DAA29-84-9-0027 and DAAL03-89-C-0031 to the AI Research Center,
the University of Pennsylvania.