RIDR pipeline: Integrating Jovo with Hermod


#1

Hey All,
For the one-line readers:

Is there any documentation or sample code detailing how the RIDR pipeline feature of JOVO3 works?

Just read Jan’s article on context-first computing, “Alexa, Please Send This to My Screen”.

I’ve been noticing Google struggle with voice ecosystem collaboration. With my old phone, when I asked a question, the Google Home in the next room would answer. A new phone fixed it.
They use speaker identification to solve a bunch of problems, and I think they have logic in place to decide which device is closest and should respond when many hear a request. Alexa answers everywhere. Home Assistant uses GPS logging from a phone to place users in space.
It’s complicated.

Chromecast is a great multi-modal device. Miracast proved that screens need to support streaming media without draining power on a portable device. WebSockets can be used to synchronise a Chromecast with a phone web application for UI control via a shared server that also hosts an Alexa app for voice control.
I used this approach in a very early Dialogflow application, Meeka Music, which could ask Google Home to stream music to my phone (if it was logged into a web page; I don’t think it was possible to stream directly to the Google Home at that stage). Just Dance is possibly the premier commercial example of multiple devices collaborating around a single Chromecast.

I haven’t built anything with JOVO as I only discovered it after building a couple of apps and then finding Snips, which captivated me with the possibility of a standalone voice recognition system.
By the time I found Mycroft I was well down the path of building my own voice dialog manager, modelled on Snips, using a central MQTT server to coordinate flow between a suite of machine learning services. As of last week it has ended up reasonably polished as Hermod https://github.com/syntithenai/hermod.
My focus is integrating voice, using RASA NLU and routing, into websites. With authenticated MQTT, web clients are first-class citizens.
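For a web client that just means connecting to the broker over WebSockets with credentials. A minimal sketch (using mqtt.js; the broker URL, credentials and topic here are placeholders, not real Hermod configuration):

```typescript
// Browser-side MQTT over WebSockets (mqtt.js). The broker URL, credentials
// and topic are placeholders, not real Hermod configuration.
import mqtt from "mqtt";

const client = mqtt.connect("wss://broker.example.com:9001", {
  username: "webuser",
  password: "secret",
});

client.on("connect", () => {
  // Subscribe to whatever the site/topic convention is.
  client.subscribe("hermod/site1/tts/say");
});

client.on("message", (topic, payload) => {
  console.log(topic, payload.toString());
});
```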

A demo uses voice to fill crosswords and ask questions of Wikipedia, YouTube and Unsplash. https://edison.syntithenai.com.

I’m interested in integrating JOVO to allow a RASA actions backend to serve Alexa, Google, Web applications or standalone devices.
Obviously, format conversion of training data in build scripts (RASA -> JOVO -> *) is required.
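To make that concrete, here is a rough converter sketch for the NLU side. It assumes Rasa 1.x markdown training data and a simplified Jovo v3 model layout, and ignores entities/input types entirely:

```typescript
// Rough sketch: convert Rasa 1.x markdown NLU training data
// ("## intent:foo" followed by "- example" lines) into a Jovo-style
// language model. Entities/inputTypes are ignored and the Jovo model
// shape is simplified.
import * as fs from "fs";

interface JovoIntent {
  name: string;
  phrases: string[];
}

function rasaMarkdownToJovoModel(markdown: string, invocation: string) {
  const intents: JovoIntent[] = [];
  let current: JovoIntent | undefined;

  for (const line of markdown.split("\n")) {
    const header = line.match(/^##\s*intent:\s*(\S+)/);
    if (header) {
      current = { name: header[1], phrases: [] };
      intents.push(current);
    } else if (current && line.trim().startsWith("- ")) {
      // Strip Rasa entity annotations like [Sydney](city) down to plain text.
      const phrase = line.trim().slice(2).replace(/\[([^\]]+)\]\([^)]+\)/g, "$1");
      current.phrases.push(phrase);
    }
  }
  return { invocation, intents };
}

const md = fs.readFileSync("data/nlu.md", "utf8");
fs.writeFileSync(
  "models/en-US.json",
  JSON.stringify(rasaMarkdownToJovoModel(md, "my test app"), null, 2)
);
```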

As a minimum, I imagine a JOVO server process that injects an intent message into the Hermod message flow and waits for a tts/say or speaker/play message before responding.
There is a picture at the top of the GitHub README showing the flow of a dialog through the Hermod services.

I see in the release notes that JOVO3 is promoting an RIDR Pipeline that flexibly integrates various machine learning components.
In contrast to your amazing existing documentation, what is there around the pipeline is insufficient for me to get any real idea of what your approach will be.
I am very keen to see how you are thinking about integrating the various machine learning components used in voice applications.
In both locally hosted and online services there is a clear division into ASR, NLU and TTS components.
I see you have noted SLU integrations where those divisions don’t necessarily occur.
RASA throws in a curveball by providing machine-learning-based routing, which perhaps deserves consideration when thinking about pipelines.

Any food for thought, by way of pipeline documentation or feedback, would be much appreciated.

cheers
Steve


#2

Hey @Steve_Ryan, welcome to the Jovo Community :wave:

Sounds exciting! We discovered Hermod quite a while ago and are impressed with your work (apparently I was the 9th person to star the repo, haha). Thank you for reaching out.

Not too much, unfortunately. The RIDR pipeline is a simplification of the Jovo middleware architecture, which can be found here: https://www.jovo.tech/docs/architecture (this really needs to get updated and described in more detail, too!)

Each Jovo plugin can hook into one or more of our middlewares. For example, if you want to take a closer look, it might be helpful to look into some of the integrations and see which middlewares they use.
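To give a rough idea, a plugin hooks in something like this (a simplified sketch; the exact Plugin interface and middleware names are best checked against jovo-core and one of the existing integrations):

```typescript
// Simplified sketch of a Jovo v3 plugin hooking into a middleware.
// Check jovo-core and an existing integration for the exact interface.
import { BaseApp, HandleRequest, Plugin } from "jovo-core";

export class MyNluPlugin implements Plugin {
  install(app: BaseApp) {
    // Run after the platform has parsed the request, before routing.
    app.middleware("nlu")!.use(this.nlu.bind(this));
  }

  uninstall(app: BaseApp) {}

  async nlu(handleRequest: HandleRequest) {
    // Call an external NLU service here and attach the result,
    // so the router/handler middlewares can work with the intent.
  }
}
```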

That’s super cool! We’re also working on a few examples that use our web client.

Would love to brainstorm a little more about this. Could you elaborate a little more on how the process would ideally look for you? Where would Hermod and Jovo hand off information to each other? Would Jovo integrate with Rasa and hand off information to Rasa, or the other way around?

Thanks,
Jan


#3

Hey Jan,
Right. I was under the impression that JOVO (in line with Alexa and Google) started at NLU and ended with text plus whatever media metadata the target platform supports.
Having poked around I now understand how the middleware flow allows for the possibility of audio data to initiate a request and possibly use TTS on the way out.

Few thoughts,

  • ASR streaming makes a huge difference to the user experience in terms of responsiveness. The current JOVO API doesn’t allow for this possibility.
    [node-gyp gave me no end of grief trying to get the ASR streaming libraries working in Node, especially on ARM. It was the final straw that pushed me to migrate to Python ;)]
  • it would be nice to have an integration hooked in at the router/handler stage to use RASA core routing.
  • using EventBrokers to communicate with RASA is more immediate, potentially saves memory, and is more efficient because you don’t need to run their HTTP server.

In terms of Hermod, I’m thinking that a JOVO service can send and receive MQTT messages so that an Alexa skill can be backed by the same action code as a website or a Hermod-based standalone device using RASA stories.

A request hits a JOVO endpoint, the integration sends an intent message and waits until it hears a dialog/end or dialog/continue message from Hermod before sending an HTTP response.

The intent message has a body including parsed NLU.

Hermod sends multiple tts/say messages which can include metadata describing buttons and images. These would need to be collated by the JOVO endpoint integration into a format suitable for a single response from the platform.output middleware phase.

To the extent that it is possible on any given platform, the output should restart the microphone for further input if a dialog/continue message is heard.
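Roughly, the bridge I have in mind would look something like this sketch (using mqtt.js; the topic layout and payloads are illustrative, not a final Hermod schema):

```typescript
// Sketch of the Jovo <-> Hermod bridge described above (mqtt.js).
// Topic layout and payloads are illustrative, not the final Hermod schema.
import mqtt from "mqtt";

const client = mqtt.connect("mqtt://localhost:1883");

interface CollectedOutput {
  texts: string[];
  meta: any[]; // buttons/images attached to tts/say messages
}

// Send the parsed NLU into Hermod and resolve once dialog/end or
// dialog/continue is heard, handing back the collated tts/say output.
function bridgeRequest(site: string, intent: any): Promise<CollectedOutput & { continueSession: boolean }> {
  return new Promise((resolve) => {
    const collected: CollectedOutput = { texts: [], meta: [] };

    const onMessage = (topic: string, payload: Buffer) => {
      const body = JSON.parse(payload.toString() || "{}");
      if (topic === `hermod/${site}/tts/say`) {
        collected.texts.push(body.text);
        if (body.meta) collected.meta.push(body.meta);
      } else if (topic === `hermod/${site}/dialog/end` || topic === `hermod/${site}/dialog/continue`) {
        client.removeListener("message", onMessage);
        resolve({ ...collected, continueSession: topic.endsWith("continue") });
      }
    };

    client.on("message", onMessage);
    client.subscribe([`hermod/${site}/tts/say`, `hermod/${site}/dialog/end`, `hermod/${site}/dialog/continue`]);
    client.publish(`hermod/${site}/nlu/intent`, JSON.stringify(intent));
  });
}
```

The platform.output phase would then map the collected texts and metadata to a single platform response and re-open the microphone when continueSession is true.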

I’ll chew on that and see what I can do.

cheers

Steve


#4

Completely agree, this is something we need to work on a little more. Our web client supports the Web Speech API which offers streaming capabilities from speech to text. Definitely lots of room for improvement!
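For reference, the streaming part of the Web Speech API looks roughly like this (browser support varies, so treat it as a sketch):

```typescript
// Minimal Web Speech API sketch: interim results stream in while the
// user is still speaking, which is what makes the UI feel responsive.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = "en-US";

recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  if (result.isFinal) {
    // Hand the final transcript to the NLU step.
    console.log("final:", result[0].transcript);
  } else {
    // Show partial transcripts immediately.
    console.log("interim:", result[0].transcript);
  }
};

recognition.start();
```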

Yes! Have been thinking about this but haven’t gotten around to it yet

Interesting. We’ll take a look! cc @AlexSwe

Btw: Just slightly edited the thread title and moved to “Feedback & Feature Requests” :slight_smile:

Thanks for all your ideas! Excited to explore further


#5

Hi @Steve_Ryan, just published a more in-depth introduction to RIDR. Would love to get your thoughts on this:


#6

Hey Jan,

Some diverse thoughts that jump out at me after reading your article:

  • asynchronicity is necessarily part of a distributed environment serving various ML models as inputs and outputs
    For example, with RASA the action processing involves staged asynchronous requests to the ML core routing model and the HTTP action endpoint.

  • some of the concerns raised could be addressed by external integrations

    • Using the example of output generation/display, ML models could generate a text/button/image response AND the action layer could send additional display messages.

      • a standard schema for most output options could serve diverse standard connectors (chat, voice, display), with the core software delivering custom output per platform (see the interface sketch after this list)

        • text
        • images
        • buttons
        • scheduled sounds
      • the action layer could provide for additional asynchronous outputs.
        For example, a web application (optionally via Chromecast to a TV) is used as an additional display and listens for MQTT messages. Action code on the server, triggered by the fill_crossword intent, sends JSON-structured messages that the web application interprets as a trigger to fill the answer on the screen and then send a message to trigger a verbal confirmation, or to issue a warning if the fill action is disallowed. In this way, the logic of filling the crossword and generating responses only exists on the web client.

  • some interaction types/inputs

    • spoken input
    • button clicks
      • current user
      • client application state
    • speaker identification
    • mood detection
    • visual face id
    • point at on display device
    • face commands - blink, turn, nod, shake head
    • gestures
    • bluetooth device status (heartrate, …)
    • GPS location
    • friend tracking
    • phone movement detection (Just Dance)
    • VR inputs
    • Chromecast/FireTV messaging (eg from TV remote pause message)
    • hassio state
    • current input devices
    • input devices used in request
    • most recent output

The relevance of many of these inputs is restricted to the action layer or externalised.
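The kind of output schema I mean in the bullet above is nothing fancy; roughly something like this (field names are purely illustrative):

```typescript
// Illustrative shape for a platform-neutral output message. Each
// connector (Alexa, Google, web, speaker) renders what it can and
// ignores the rest.
interface StandardOutput {
  text?: string;                                // spoken and/or displayed text
  images?: { url: string; alt?: string }[];
  buttons?: { title: string; payload: string }[];
  sounds?: { url: string; delayMs?: number }[]; // scheduled sounds
  raw?: Record<string, unknown>;                // escape hatch for platform-specific output
}
```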


In Hermod, the pipeline is based on asynchronous messaging between a suite of services potentially distributed over a network.
A dialog manager service collates the output of previous steps in the pipeline and triggers next steps until completion.

Services need to implement a simple asynchronous messaging API but are free to determine the implementation. Multiple implementations exist for the ASR (speech recognition) and TTS (text to speech) steps.
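A service boils down to subscribing to its input topic and publishing the next message when it is done. Schematically (topic names and payload fields are simplified, and the NLU call is a placeholder):

```typescript
// Schematic Hermod-style service: listen for a transcript, run NLU,
// publish the parsed intent so the dialog manager can take the next step.
// Topics and payload fields are simplified for illustration.
import mqtt from "mqtt";

const client = mqtt.connect("mqtt://localhost:1883");

client.subscribe("hermod/+/asr/text");

client.on("message", async (topic, payload) => {
  const site = topic.split("/")[1];
  const { text } = JSON.parse(payload.toString());

  const intent = await parseWithNlu(text); // e.g. a call out to RASA NLU

  client.publish(`hermod/${site}/nlu/intent`, JSON.stringify(intent));
});

// Placeholder for whatever NLU implementation the service wraps.
async function parseWithNlu(text: string): Promise<{ intent: string; entities: any[] }> {
  return { intent: "fallback", entities: [] };
}
```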

The dialog flow and the services are forgiving about how the pipeline is triggered. The entrypoint of a dialog could be

  • starting the hotword service
  • or triggering with a transcript
  • or triggering with a parsed intent and entities
  • or triggering an action end point
  • and/or multiple messages directly to the tts or speaker endpoints

My first multimodal example featured a browser based crossword application interacting with a voice model.

There is the potential to build low power voice clients on ESP32/ARM that rely on a central messaging/dialog server.

There is the potential to build voice flows with multiple asynchronous requirements. For example a custom dialog manager requiring visual and voice id as well as parsed intent data before proceeding.

There is the potential to build voice flows of arbitrary duration rather than relying on external scheduling, because there is no assumption that a single request fulfils the pipeline.

There is the potential to build voice flows that allow for input from multiple listening devices by collating and selecting the loudest.

A messaging based approach to exposing a service API involves the extra overhead of running a messaging server and extra complexity in the conceptual model :frowning:

It’s horses for courses.
If you want to build something that supports standalone fulfilment of a dialog process with network distribution of hardware-heavy services, Hermod is useful.
If you want to build an Alexa/Google skill that is based on a single request/response cycle, it’s perhaps unnecessary overhead.
To build a browser application you may be best off implementing ASR directly against a commercial API and using HTTP to a RASA or JOVO endpoint. Although if you want to enable dynamic cooperation between a web UI and a voice backend, messaging starts to look good.

Being browser focussed, I’m interested in offloading the server costs of NLU and other ML models back into the browser with TensorFlow.js, at least as an experiment. It’s hard to keep up with the team of NLU/ML experts at RASA.

Integrating RASA into the “Interpretation” and “Dialog and Logic” steps would bring ML-based routing, but it does need consideration of how to generate multiple compatible models for Google/Alexa/RASA.

The JOVO RIDR cycle could be made more forgiving about its pipeline entrypoint and achieve many of the same benefits.


J: Right now, this only covers user-initiated (pull) request-response interactions. What if the system starts (push)? Could sensory data be used as a trigger?
S: I see that intent and action (and other) endpoints should be exposed so external processes can trigger the pipeline.

J: Should interpretation and dialog/logic be tied together more closely? How about dialog/logic and response? Rasa’s TED policy is one example where the interpretation step is doing some dialog and response work.
S: If you are using RASA retrieval intents, the output generation step is easier :). Could work with fallbacks to db/text systems.

J: Are there use cases where this abstraction doesn’t work at all? Do (currently still experimental) new models like GPT-3 work with this?
S: You could run an NLU parse in parallel, triggering intents but using the GPT-3 output as response text.
Could also allow for intent/action free triggering of responses.
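In rough TypeScript, that last idea is just running the two in parallel (all three helpers below are hypothetical stand-ins, not real APIs):

```typescript
// Hypothetical sketch: run intent parsing and text generation in parallel;
// use the intent for routing/actions and the generated text as the reply.
// The helpers below are stand-ins, not real APIs.
async function parseIntent(text: string): Promise<{ intent: string; entities: any[] }> {
  return { intent: "ask_question", entities: [] }; // imagine RASA NLU here
}
async function generateResponse(text: string): Promise<string> {
  return `Generated answer for: ${text}`; // imagine a GPT-3-style model here
}
function triggerAction(intent: string, entities: any[]): void {
  // imagine the action layer here
}

async function handleUtterance(text: string): Promise<string> {
  const [nlu, generated] = await Promise.all([parseIntent(text), generateResponse(text)]);
  triggerAction(nlu.intent, nlu.entities); // routing still driven by the intent
  return generated;                        // wording comes from the language model
}
```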

cheers

Steve


#7

Thank you @Steve_Ryan, this is amazing feedback! Will have to do some more digging as I’m not super experienced with some of the ML terminology, but this is extremely helpful.

Some thoughts:

Ah yes, this makes sense and I should have made that clearer in the post. I also believe that RIDR shouldn’t be one synchronous process and that there’s probably some back and forth etc. involved.

I’ll also have to play more with Rasa to understand the architecture better. Would love to work on a deeper integration.

This is a great list! Have you experimented with anything beyond speech or text input? I think stuff like “point at device” is very interesting, I haven’t come across an example yet (need to spend more time searching)

Yes, this makes sense. Right now, this works with Jovo as some platforms trigger with completed NLU data, some only send over audio etc.

Yes, this is how we got started (Alexa and Google Assistant only) and it’s now a fine line between supporting more complex scenarios and not adding too much overhead for simpler applications.

That makes sense!

I’ll take a look at Rasa retrieval intents, thanks :+1:

Interesting, thanks for all your thoughts!