David Schlangen : Home Page > minutes230807

  - present: Timo, David
  - move of system architecture to transport of audio via RTP almost
    completed; massive efficiency gain.
  - audio now exists in two forms in the system: transported via RTP
    and inside the -- extended / modified -- Sphinx pipeline (i.e., as
    Java objects in memory). All other communication between modules
    will be purely symbolic, and handled via OAA (or perhaps some
    purpose-built extension / simplification)
  - open work items ("Baustellen"):
    (0. overhearer module that allows listening in on the conversation
        between agents)
    1. extraction of prosodic features, beginning with pitch; then
       intensity, vowel lengthening, ... "rhythm" of speaker /
       conversation
    2. interface ASR <-> Parser: which side detects changes in
       hypotheses, and how are they passed on (lattices?); how does
       information from the parser get filtered back in?
    3. incremental parsing of lattices; first demo system
    4. syntactic model of `remaining utterance length', given string
       of current input (word or pos string, or partial parse?), how
       many words / constituents / etc are predicted to follow?
    5. get predictions out of acoustic model on how long *current*
       word might still go on for
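
A rough sketch for item 4, assuming the simplest possible starting point: a length-only baseline that ignores the word / POS string and just asks how many more words follow, given how many have been heard so far. The function name and the toy corpus are made up for illustration; a syntactic model would replace the plain conditioning on word count.

```python
def expected_remaining_words(utterance_lengths, words_so_far):
    """Mean number of words still to come, given that the utterance
    already contains `words_so_far` words.

    Length-only baseline: conditions on utterances in the corpus that
    are at least as long as the current input, nothing syntactic.
    """
    candidates = [n for n in utterance_lengths if n >= words_so_far]
    if not candidates:
        # Input is already longer than anything seen in the corpus.
        return 0.0
    return sum(n - words_so_far for n in candidates) / len(candidates)


# Toy corpus of utterance lengths in words (invented numbers).
toy_lengths = [2, 3, 3, 5, 8]
```

With the toy corpus, three words of input leave the candidates 3, 3, 5 and 8, so the baseline predicts (0 + 0 + 2 + 5) / 4 = 1.75 further words.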
  - talked a bit again about the autumn project: dynamic prediction of
    the transition relevance point. During processing of the input,
    maintain a distribution over future time points giving the
    probability that the utterance will end there. This distribution
    will be very broad and unfocussed at first (basically the
    distribution of utterance lengths in the corpus, with a peak at
    the average length -- a simple normal distribution). As the
    utterance gets longer, more information sources become relevant
    and the distribution should sharpen. The hypothesis should become
    stable during the final word at the latest.
    Predictions can also be about the *past*: in a silence after an
    utterance end, there should be a very sharp peak at the last sound
    frame. Conversely, silences that are mid-utterance hesitations
    should not trigger such predictions.
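
A minimal sketch of the starting point described above, assuming time is counted in integer frames and the prior over utterance end is a discretised normal distribution. All names and parameters are hypothetical. The only "information source" used here is elapsed time: mass on frames that have already passed is removed and the rest renormalised. Evidence from parser, prosody or ASR would enter as further per-frame weights.

```python
import math


def end_frame_distribution(elapsed_frames, mean_frames, std_frames,
                           horizon_frames):
    """Probability, per frame 0..horizon_frames, that the utterance
    ends at that frame, given that it is still running at
    `elapsed_frames`.

    Prior: normal distribution over utterance length (assumption).
    Update: frames before `elapsed_frames` get zero mass, then the
    remaining weights are renormalised to sum to 1.
    """
    weights = []
    for frame in range(horizon_frames + 1):
        if frame < elapsed_frames:
            weights.append(0.0)
        else:
            z = (frame - mean_frames) / std_frames
            weights.append(math.exp(-0.5 * z * z))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no probability mass left within the horizon")
    return [w / total for w in weights]


def expected_end_frame(distribution):
    """Mean of the end-frame distribution."""
    return sum(frame * p for frame, p in enumerate(distribution))
```

Calling `end_frame_distribution(0, 200, 80, 600)` gives the broad prior; calling it again with a larger `elapsed_frames` shifts the remaining mass onto later frames. The case of a prediction about the past (sharp peak at the last sound frame) is not covered by this sketch.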
  - two dynamic factors:
    - when is the first time that the correct point is identified (is
      within a certain confidence interval)?
    - when does the information become available from different
      modules (syntactic model / parser, feature extraction, ASR)?
      That is, what is the difference in prediction accuracy between
      a run where the clock is slowed down so that all modules
      effectively work in real time (= with minimal latency) and a
      run in actual real time? If the system is run in real time, how
      much does it lag behind, and is it still better than a simple
      pause-length threshold system?
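
For comparison, a sketch of the two things mentioned above: the simple pause-length threshold baseline, and a measure of when a prediction is first within tolerance of the true end point. It assumes a per-frame speech / non-speech decision is already available. Function names, the frame-based representation and the tolerance are illustrative assumptions, not part of the current system.

```python
def pause_threshold_endpoint(is_speech, threshold_frames):
    """Baseline: declare the utterance finished once `threshold_frames`
    consecutive non-speech frames follow some speech.

    Returns (decision_frame, predicted_end_frame), or None if the
    threshold is never reached. The predicted end is the last speech
    frame, so the decision always lags it by the threshold.
    """
    last_speech = None
    silence_run = 0
    for frame, speech in enumerate(is_speech):
        if speech:
            last_speech = frame
            silence_run = 0
        elif last_speech is not None:
            silence_run += 1
            if silence_run >= threshold_frames:
                return frame, last_speech
    return None


def first_correct_frame(predictions, true_end, tolerance):
    """Earliest decision frame whose predicted end lies within
    `tolerance` frames of `true_end`, or None.

    `predictions` is a list of (decision_frame, predicted_end_frame)
    pairs in chronological order.
    """
    for decision_frame, predicted_end in predictions:
        if abs(predicted_end - true_end) <= tolerance:
            return decision_frame
    return None
```

With the frames `[False, True, True, False, True, False, False, False]`, a threshold of 3 yields `(7, 4)`, while a threshold of 1 fires at the one-frame hesitation and yields `(3, 2)`. This is the trade-off a predictive system would have to beat.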
  - this dynamic updating of predictions / hypotheses business sounds
    like it's made for display in Zeitwort...



das, 08/23/07 01:42 (GMT)

Keyword: inpro, meetings, minutes
