For multiple choice questions, also write a brief (1-2
sentence) explanation of why the answer is correct. The parenthesized
numbers after each question number give the relative point value
of each question.
-
(2) The excitation model for the Voder was
a) a pulse generator
b) a collection of sine generators
c) wide band noise
d) (some other answer - specify your choice).

d) A combination of a) a pulse generator (for voiced sounds) and c) a
wide band noise source (for unvoiced sounds) was used in the Voder.
-
(2) A range of different vowels can be synthetically produced by
a) exciting a single uniform tube with periodic pressure pulses.
b) exciting a multi-tube structure with the sum of sine waves.
c) exceptionally smart single-cell animals
d) (some other answer - specify your choice).

b) works, since a correctly weighted sum of sine waves can generate
any excitation function -- OR d) exciting a multi-tube structure with
a periodic pulse train.
-
(2) An acoustic tube closed at both ends and excited at its midpoint
will resonate at frequencies
a) higher than
b) lower than
c) equal to
that of an acoustic tube open at one end and excited at the other end.

a) since the boundary conditions of the first tube force the volume
velocity of the standing wave to be zero at both (closed) ends, so
that the lowest mode has a wavelength of twice the tube length. I.e.,
the volume velocity is maximal at the center and 0 at both ends, and
hence a half-wave can fit in the tube. The second tube's wavelength is
four times the length of the tube, since the pressure is maximal at
the driven end and zero at the open end. (I.e., a quarter-wave can fit
in the tube.) Thus, since the wavelength in the first case is half the
wavelength of the second case, the resonant frequencies of the first
case are higher than those of the second case.
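The factor-of-two relation between the two tubes' fundamentals can be
checked numerically. The speed of sound and the tube length below are
illustrative values, not from the question:

```python
# Fundamental resonances of the two tubes discussed above.
c = 343.0   # speed of sound in air, m/s (assumed value)
L = 0.17    # tube length, m (hypothetical, roughly a vocal tract)

# Closed at both ends, driven at the midpoint: half-wave resonator,
# lowest-mode wavelength = 2L.
f_closed_closed = c / (2 * L)

# Open at one end, driven at the other: quarter-wave resonator,
# lowest-mode wavelength = 4L.
f_open_closed = c / (4 * L)

print(f_closed_closed)  # ~1009 Hz
print(f_open_closed)    # ~504 Hz: the closed-closed tube resonates higher
```

The ratio of the two fundamentals is exactly 2, independent of L,
which is the point of the boundary-condition argument above.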
-
(2) In response to pure sinusoidal tones, an auditory nerve will spike
at different rates, depending on the frequency of the stimulus. These
rates are predominantly determined by
a) the properties of the hair cell stereocilia
b) the basilar membrane vibration in the vicinity of the hair cell
c) the specific parameters of the neuron
d) the strength of the dollar in European markets

b) is correct, since the basilar membrane will achieve maximum
vibration at some point along its length, depending on the frequency
of the tone. Hair cells in that vicinity will react and cause attached
auditory nerve fibers to spike.
-
(2) Ignoring air absorption, frequency dependencies, nonlinearities,
or spatial dependencies, what is the effect on the reverberation time
(e.g., RT60) of doubling the absorption coefficient of a room's
surfaces?
a) increase the reverberation time for quiet sounds and decrease it
for loud sounds
b) double the reverberation time
c) halve the reverberation time
d) reduce the reverberation time by a factor of 4.

c) is correct, since reverberation time is inversely proportional to
absorption, so doubling the absorption coefficient reduces the reverb
time by a factor of 2.
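The inverse proportionality can be illustrated with Sabine's formula,
RT60 = 0.161 V / (S * alpha) in SI units. The room volume, surface
area, and absorption coefficients below are hypothetical:

```python
def rt60_sabine(volume_m3, surface_m2, alpha):
    """Reverberation time in seconds from Sabine's equation (SI units)."""
    return 0.161 * volume_m3 / (surface_m2 * alpha)

V, S = 200.0, 210.0          # hypothetical room volume and surface area
t1 = rt60_sabine(V, S, 0.2)  # original absorption coefficient
t2 = rt60_sabine(V, S, 0.4)  # doubled absorption coefficient
print(t1 / t2)               # -> 2.0: doubling alpha halves RT60
```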
-
(2) Three tones (100 Hz, 2000 Hz, and 7000 Hz) are presented
monaurally over wideband headphones (40 Hz - 16 kHz, 1 dB) to a young
adult subject with normal hearing. In each case, the Sound Pressure
Level (SPL) at the subject's ear is 40 dB. What would be the expected
order of loudness for the tones, going from the loudest to the least
loud?
a) 100, 2000, 7000
b) 2000, 100, 7000
c) 2000, 7000, 100
d) All would have roughly the same loudness.

c), as determined by the Fletcher-Munson (equal-loudness) curves. The
main point is that the subjective sensation of ``loudness'' is a
function of both frequency and intensity.
-
(2) Dynamic programming has been applied to speech recognition
for many years. One major advantage of this approach (as it has been
commonly used) is
a) The effects of different vocal tract lengths are normalized
b) The effects of different durations for the same sounds are normalized
c) The effects of different loudnesses for the same sounds are normalized
d) The effects of post-nasal drip are normalized

b) Dynamic programming has been applied to speech recognition
primarily for the purpose of time normalization. It is used to find
the best warping of the match between the input speech frames and
the reference frames.
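The time-normalization idea can be sketched with a minimal dynamic
time warping (DTW) routine. The 1-D "frame" sequences and the absolute
local distance here are toy stand-ins, not actual recognizer features:

```python
def dtw_distance(input_seq, ref_seq):
    """Total cost of the best time warping between two 1-D sequences."""
    n, m = len(input_seq), len(ref_seq)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(input_seq[i - 1] - ref_seq[j - 1])  # local frame distance
            # Best of: stretch the input, stretch the reference,
            # or advance both sequences together.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

# A slowed-down rendition of the template matches perfectly (cost 0),
# while a different pattern of the same length does not.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 3.0, 1.0, 1.0, 3.0]
print(dtw_distance(slow, template) < dtw_distance(other, template))  # True
```

The nested minimum is the dynamic-programming step: duration
differences are absorbed by the "stretch" moves, so the same sound
spoken slowly still aligns well with its reference.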
-
(2) A classifier is trained on a set of examples that are labeled for
class membership. The resulting classifier is used to estimate the
class of each example in the training set, and is correct 90% of the
time. The same classifier is then tested on a new set of examples that
were not used in the training. For real-world data, the most likely
result would be
a) The test set performance would be at most 90%
b) The test set performance would be at least 90%
c) The test set performance would be close to zero
d) The test set performance would depend on the eye of the beholder

a) Performance on the training set is an optimistic estimate of the
pattern classifier's accuracy, since training and testing are done
with the same examples. In some sense, the classifier is typically
trained to be optimal for that group of examples. Examples from
outside of the training set are likely to differ due to sampling
variability, so the system trained on the first group is unlikely to
do as well on the second. Thus, since the accuracy on the training set
was 90%, the performance on the second is likely to be lower. It was
fair to note that in any individual example it is possible to get
``lucky'' and have a higher test set accuracy than this, but this is
not ``the most likely result''. Similarly, it is also possible to get
very ``unlucky'' and get nearly zero accuracy, but the system trained
on the training set is in fact your current best guess of the pattern
characteristics, so this would not be the most likely result either.
-
(6) Consider a digital filter consisting of two cascaded sections:
Section 1 is defined by the equation y1[n] = x[n] - x[n-K], and
Section 2 is defined by the equation
y[n] = y1[n] + 2 cos(theta) y[n-1] - y[n-2], giving the overall
transfer function
H(z) = Y(z)/X(z) = (1 - z^(-K)) / (1 - 2 cos(theta) z^(-1) + z^(-2)).
a) Sketch the filter's poles and zeros for K=12 and theta = 60 degrees.

The numerator (Section 1) contributes K=12 zeros evenly spaced about
the unit circle, since these are the 12th complex roots of 1. It also
contributes 12 poles at z=0, since z^(-12) goes to infinity when z=0.
The denominator, from Section 2, can be factored into
(1 - e^(j theta) z^(-1)) (1 - e^(-j theta) z^(-1)), and thus Y(z) has
poles at z = e^(j theta) and z = e^(-j theta) and two zeros at z=0.
With theta = 60 degrees, the two poles in the denominator cancel the
two zeros of the numerator located at +-60 degrees, and the two zeros
in the denominator cancel two of the poles at the origin. Thus we are
left with 10 zeros on the unit circle (at multiples of 30 degrees
other than +-60 degrees) and 10 poles at the origin.
b) Sketch the frequency response, i.e. the amplitude of Y(z) evaluated
on the unit circle, vs. omega.

Working from the pole-zero plot, as we trace around the unit circle,
when we hit a zero, the magnitude spectrum goes to 0, and as we get
further from the zeros or closer to poles, the magnitude spectrum
increases. This gives a fine sketch: nulls at multiples of 30 degrees
(except at +-60 degrees), with a broad peak near +-60 degrees where
the cancelled zeros leave a gap.
If you wanted to plot it exactly (which you probably do not have time
for in an exam), you can plug z = e^(j omega) into the equation for
Y(z) to get the amplitude response
|H(e^(j omega))| =
|1 - e^(-j 12 omega)| / |1 - 2 cos(theta) e^(-j omega) + e^(-j 2 omega)|.
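The pole-zero cancellation can also be checked numerically. The comb
numerator and two-pole denominator below are reconstructed from the
pole/zero description in the answer and should be treated as an
assumption about the original problem statement:

```python
import numpy as np

K = 12
theta = np.pi / 3  # 60 degrees

# Evaluate the assumed transfer function on the upper unit circle.
w = np.linspace(0, np.pi, 1001)
z = np.exp(1j * w)
num = 1 - z**(-K)                                # 12 zeros on the unit circle
den = 1 - 2 * np.cos(theta) * z**(-1) + z**(-2)  # poles at e^(+-j theta)
H = num / den

# The magnitude response peaks near 60 degrees, where the numerator
# zero has been cancelled by a denominator pole.
print(w[np.argmax(np.abs(H))] * 180 / np.pi)  # close to 60 (degrees)
```

The nulls of |H| sit at the surviving zeros (multiples of 30 degrees
other than +-60), matching the sketch described above.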
-
(4) Define the following terms (1 sentence per definition):

a) interval histogram

An interval histogram is a histogram of the times between successive
neural spikes for a given neuron. For example, a pure or complex tone
is repeated until a reasonable number of spikes is obtained.

b) efferent

Efferent describes neurons projecting from deeper in the brain towards
lower levels -- a form of feedback.

c) stapes

The stapes is the third bone in the middle ear; it strikes the oval
window of the inner ear to excite the inner ear.

d) oval window

The oval window is the outer membrane of the cochlea that, when struck
by the stapes, sets the cochlea vibrating.
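Definition (a) can be illustrated with a toy computation: given
synthetic (hypothetical) spike times locked to a 5 ms stimulus period,
the interval histogram shows modes at multiples of that period:

```python
from collections import Counter

period = 0.005  # 5 ms stimulus period (a 200 Hz tone), assumed
# Hypothetical phase-locked firing: the neuron spikes on some cycles
# of the stimulus and skips others.
cycle_hits = [0, 1, 2, 4, 5, 7, 8, 9, 11, 13, 14]
spike_times = [c * period for c in cycle_hits]

# Intervals between successive spikes.
intervals = [t2 - t1 for t1, t2 in zip(spike_times, spike_times[1:])]

# Histogram of intervals, binned to the nearest millisecond.
hist = Counter(round(iv * 1000) for iv in intervals)
print(hist)  # modes at 5 ms and 10 ms, i.e. multiples of the period
```

Skipped cycles produce the secondary mode at twice the period, which
is the characteristic comb-like shape of interval histograms for
phase-locked auditory fibers.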
-
(4) Show that the constraint u(x,t) = w(x)v(t) can be applied to the
1-D wave equation, and that exponentials are a solution to the
resulting form. Recall that the 1-D wave equation is
d^2 u / dx^2 = (1/c^2) d^2 u / dt^2.

This can be shown as follows. Substituting u(x,t) = w(x)v(t) gives
v(t) w''(x) = (1/c^2) w(x) v''(t), and dividing through by w(x)v(t),
w''(x)/w(x) = (1/c^2) v''(t)/v(t).
The left side depends only on x and the right side only on t, so both
must equal a constant, say -k^2. This leaves two ordinary differential
equations, w'' = -k^2 w and v'' = -(kc)^2 v, and the exponentials
w(x) = e^(+-jkx) and v(t) = e^(+-jkct) satisfy them, since
differentiating an exponential twice returns the same exponential
scaled by the square of its exponent coefficient.
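The separation-of-variables argument can be verified symbolically.
This is a quick check with sympy, with k and c left as free symbols:

```python
import sympy as sp

x, t = sp.symbols('x t', real=True)
c, k = sp.symbols('c k', positive=True)

w = sp.exp(sp.I * k * x)      # spatial factor w(x)
v = sp.exp(sp.I * k * c * t)  # temporal factor v(t)
u = w * v                     # separable candidate u(x,t) = w(x)v(t)

# Residual of the 1-D wave equation u_xx - (1/c^2) u_tt.
residual = sp.diff(u, x, 2) - sp.diff(u, t, 2) / c**2
print(sp.simplify(residual))  # -> 0: the exponentials are a solution
```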
-
(4) Give two dimensions of difficulty for a speech recognition task.
For each dimension, describe an easier and a harder example,
and explain (in one sentence) the difference in difficulty.

Obviously there are many ``dimensions of difficulty'' for ASR,
and anything reasonable was accepted. Some of the common ones are:
- Speaker-dependent vs. speaker-independent - the latter is harder, as
it refers to training on some speakers and testing on
others; The recognizer must be trained to ignore speaker-specific
characteristics that are irrelevant to the message.
- Vocabulary size - small is easier, large is harder; for more words,
acoustically confusable words are more likely.
- Perplexity - low perplexity is easier, high perplexity is harder -
a very constrained task may have low perplexity (uncertainty about
the next word) even if the vocabulary is large, and in that case
the acoustic classifier does not have to be as accurate to get the
right answer.
- Isolated vs. continuous - words separated by pauses are easier
to recognize than (for the same task) words spoken continuously;
in the latter case, there is coarticulation across words and
the pronunciations and spectral content are changed.
- Speaking style - read vs conversational; speech that is read from
a written text is easier to recognize than speech from a natural
conversation. Speech in the latter case tends to have many phone
deletions, slurred sounds, varied pronunciations, and more rapid speech
with increased coarticulation effects.
- Environmental noise - low noise is easier, high noise is harder.
Particularly if the same noise is not present during training,
the speech features used for recognition are a worse match
to the training set; even if noisy data are available during training,
more noise tends to blur the boundaries between potentially confusable
sounds.
- Reverberation - low reverberation time (and energy)
is easier, high reverberation is harder. Longer reverberation times
cause smearing between neighboring phonetic elements.
- Microphone type - a close microphone is easier and a distant microphone
is harder, due to increased noise and reverberation for the latter
case. It is also easier if the training and testing microphones are
the same.
-
(6) Let x be a discrete random variable, and let a and b be two
classes that examples corresponding to x can belong to. Further, let
the 2 class-conditional densities p(x|a) and p(x|b) be known, along
with the class priors P(a) and P(b).
Find the optimal decision rule to decide on class a or b given x.

Here the point was to use the Bayes decision rule, in which you pick
the class with the largest posterior (in this case, pick a if
p(a|x) > p(b|x), and pick b if p(b|x) > p(a|x)). An equivalent
formulation is based on Bayes' rule,
p(class|x) = p(x|class) P(class) / p(x),
and checks whether p(x|a) P(a) > p(x|b) P(b) is true (since p(x) is
the same regardless of class) or vice versa. Computing these products
from the given densities and priors (both products are zero outside
their ranges), class a ``wins'' if 1<=x<=8, and class b ``wins'' for
9<=x<=10. This is the optimal (Bayes) decision.
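Since the specific densities and priors from the question are not
reproduced in this text, the sketch below uses hypothetical stand-ins
(uniform densities and equal priors) chosen so that the same decision
regions result:

```python
P_a, P_b = 0.5, 0.5  # hypothetical equal priors

def p_x_given_a(x):
    return 1 / 8 if 1 <= x <= 8 else 0.0    # hypothetical: uniform on 1..8

def p_x_given_b(x):
    return 1 / 10 if 1 <= x <= 10 else 0.0  # hypothetical: uniform on 1..10

def decide(x):
    # Bayes rule: compare p(x|a)P(a) with p(x|b)P(b); p(x) cancels.
    return 'a' if p_x_given_a(x) * P_a > p_x_given_b(x) * P_b else 'b'

print([decide(x) for x in range(1, 11)])
# eight 'a' decisions followed by two 'b' decisions:
# a wins for 1<=x<=8, b wins for 9<=x<=10
```

The comparison of prior-weighted likelihoods is exactly the
formulation in the answer above, with p(x) dropped from both sides.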