David Schlangen : Home Page > minutes230807
- present: Timo, David
- move of system architecture to transport of audio via RTP almost
completed; massive efficiency gain.
- audio now exists in two forms in the system: transported via RTP
and inside the -- extended / modified -- Sphinx pipeline (i.e., as
Java objects in memory). All other communication between modules
will be purely symbolic, and handled via OAA (or perhaps some
purpose-built extension / simplification)
- open work areas (Baustellen):
(0. overhearer module that allows listening in on the conversation
between agents)
1. extraction of prosodic features, beginning with pitch; then
intensity, vowel lengthening, ... "rhythm" of speaker /
conversation, ...
2. interface ASR <-> parser: who detects changes in hypotheses,
and how are they passed on (lattices?); how does information
from the parser get filtered back in?
3. incremental parsing of lattices; first demo system
4. syntactic model of `remaining utterance length': given the string
of current input (word or POS string, or partial parse?), how
many words / constituents / etc. are predicted to follow?
5. get predictions out of acoustic model on how long *current*
word might still go on for
- talked a bit again about autumn project: dynamic prediction of
transition relevance point. During processing of input, create a
distribution over future time points of the probability that the
utterance will end. This distribution will be very broad and
unfocussed at first (basically corresponding to the distribution of
utterance lengths in the corpus, with a peak at average length --
a simple normal distribution). With increasing utterance length,
more information sources become relevant. At the latest during the
final word, the hypothesis should become stable.
Predictions can also be about the *past*: in a silence after an
utterance end, there should be a very sharp peak at the last sound
frame. Conversely, silences that are mid-utterance hesitations
should not trigger such predictions.
- two dynamic factors:
- when is the first time that the correct point is identified (is
within a certain confidence interval)?
- when does the information become available from different
modules (syntactic model / parser, feature extraction, ASR)?
That is, what's the difference in prediction accuracy between a
situation where processing time is slowed down so that all
modules work in real-time (= with minimal latency) and actual
real-time operation? If the system is run in real time, how much
does it lag behind, and is it still better than a simple
pause-length threshold system?
- this dynamic updating of predictions / hypotheses business sounds
like it's made for display in Zeitwort...
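
To make the prediction idea above concrete, here is a minimal Python sketch. It is an illustration only, not the project's implementation: the function names, the time grid, the prior parameters (mean 3 s, standard deviation 1 s) and the 0.5 s pause threshold are all assumptions made for this example. It shows (a) a normal prior over utterance end points that is cut off and renormalised as speech time elapses, and (b) the simple pause-length threshold baseline that the predictions would be compared against.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def end_time_distribution(elapsed, mean_len=3.0, std_len=1.0,
                          horizon=10.0, step=0.1):
    """Distribution over utterance end points, given elapsed speech time.

    Prior: utterance lengths follow Normal(mean_len, std_len); the
    default values are hypothetical, not corpus estimates.
    Update: end points earlier than `elapsed` are ruled out and the
    remaining mass is renormalised.
    Returns a list of (time, probability) pairs on a grid of width `step`.
    """
    n = int(round(horizon / step))
    grid = [i * step for i in range(n + 1)]
    weights = [normal_pdf(t, mean_len, std_len) if t >= elapsed else 0.0
               for t in grid]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no probability mass left within the horizon")
    return [(t, w / total) for t, w in zip(grid, weights)]


def most_likely_end(dist):
    """Grid point with the highest probability."""
    return max(dist, key=lambda pair: pair[1])[0]


def pause_threshold_baseline(frames, frame_len=0.01, threshold=0.5):
    """Baseline: report an utterance end once silence has lasted
    `threshold` seconds after some speech.

    frames: sequence of booleans, True = speech frame.
    Returns (decided_at, utterance_end) in seconds, or None if the
    threshold is never reached.
    """
    needed = int(round(threshold / frame_len))
    silent = 0
    seen_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            seen_speech = True
            silent = 0
        elif seen_speech:
            silent += 1
            if silent >= needed:
                decided_at = (i + 1) * frame_len
                utterance_end = (i + 1 - silent) * frame_len
                return decided_at, utterance_end
    return None


if __name__ == "__main__":
    # The prediction sharpens towards "now" as the utterance gets long.
    for elapsed in (0.0, 2.0, 4.0):
        dist = end_time_distribution(elapsed)
        print(elapsed, most_likely_end(dist))
    # 1 s of speech followed by 0.6 s of silence.
    print(pause_threshold_baseline([True] * 100 + [False] * 60))
```

In this sketch the baseline can only decide `threshold` seconds after the utterance has ended, which is the lag the predictive approach is meant to beat. Evidence from the parser, prosodic features or the ASR would enter as further factors multiplied into the weights before renormalising; that part is left out here.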
das, 08/23/07 01:42 (GMT)
Keyword: inpro, meetings, minutes