RIDR pipeline: Integrating Jovo with Hermod


#1

Hey All,
For the one line readers,

Is there any documentation or sample code detailing how the RIDR pipeline feature of JOVO3 works?

Just read Jan’s article on context-first com, alexa-please-send-this-to-my-screen.

I’ve been noticing Google struggle with voice ecosystem collaboration. With my old phone, when I asked a question, the Google Home in the next room would answer instead; a new phone fixed it.
They use speaker identification to solve a bunch of problems, and I think they have arbitration in place to decide which device is closest and should respond when several hear a request. Alexa answers everywhere. Home Assistant uses GPS logging from a phone to place users in space.
It’s complicated.

Chromecast is a great multimodal device. Miracast proved that screens need to support streaming media without draining the battery of a portable device. WebSockets can be used to synchronise a Chromecast with a phone web application for UI control, via a shared server that also hosts an Alexa app for voice control.
I used this approach in a very early Dialogflow application, Meeka Music, which could ask Google Home to stream music to my phone (if it was logged into a web page; I don’t think it was possible to stream directly to the Google Home at that stage). Just Dance is possibly the premier commercial example of multiple devices collaborating around a single Chromecast.
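To make the WebSocket relay idea concrete, here is a minimal sketch of a shared server that pairs a Chromecast receiver page with a phone web app by a session id and relays UI-control messages between them. It uses the `ws` npm package; the session handling and message shape are just illustrative, not what Meeka Music actually did.

```typescript
// Minimal WebSocket relay: clients that join the same session id
// receive each other's UI-control messages (hypothetical message shape).
import { WebSocketServer, WebSocket } from 'ws';

const sessions = new Map<string, Set<WebSocket>>();
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket, request) => {
  // e.g. ws://host:8080/?session=abc123 from both the Chromecast page and the phone
  const session =
    new URL(request.url ?? '/', 'http://localhost').searchParams.get('session') ?? 'default';
  if (!sessions.has(session)) sessions.set(session, new Set());
  sessions.get(session)!.add(socket);

  socket.on('message', (data) => {
    // Relay e.g. {"action": "play", "track": "..."} to every other device in the session
    for (const peer of sessions.get(session)!) {
      if (peer !== socket && peer.readyState === WebSocket.OPEN) {
        peer.send(data.toString());
      }
    }
  });

  socket.on('close', () => sessions.get(session)?.delete(socket));
});
```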

I haven’t built anything with JOVO, as I only discovered it after building a couple of apps; by then I had found Snips, which captivated me with the possibility of a standalone voice recognition system.
By the time I found Mycroft, I was well down the path of building my own voice dialog manager, modeled on Snips, using a central MQTT server to coordinate flow between a suite of machine learning services. As of last week it has ended up reasonably polished as Hermod: https://github.com/syntithenai/hermod.
My focus is integrating voice into websites using RASA NLU and routing. With authenticated MQTT, web clients are first-class citizens.

A demo uses voice to fill in crosswords and ask questions of Wikipedia, YouTube, and Unsplash: https://edison.syntithenai.com.

I’m interested in integrating JOVO to allow a RASA actions backend to serve Alexa, Google, web applications, or standalone devices.
Obviously, format conversion of training data in build scripts (RASA -> JOVO -> *) is required.
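As a rough illustration of the kind of build-script conversion I mean (the field names on both the RASA and JOVO sides are simplified from memory, so treat them as assumptions rather than exact schemas):

```typescript
// Sketch: convert Rasa-style NLU examples to a Jovo-style language model.
// Field names approximate both formats; check the real schemas before using.

interface RasaIntent { intent: string; examples: string[] }

interface JovoModel {
  invocation: string;
  intents: { name: string; phrases: string[]; inputs?: { name: string; type: string }[] }[];
}

function rasaToJovo(rasaIntents: RasaIntent[], invocation: string): JovoModel {
  return {
    invocation,
    intents: rasaIntents.map(({ intent, examples }) => {
      const inputs = new Map<string, { name: string; type: string }>();
      const phrases = examples.map((example) =>
        // Rewrite Rasa entity markup "[jazz](genre)" as a Jovo-style slot "{genre}"
        example.replace(/\[([^\]]+)\]\(([^)]+)\)/g, (_match, _value, entity) => {
          inputs.set(entity, { name: entity, type: entity.toUpperCase() });
          return `{${entity}}`;
        })
      );
      return { name: intent, phrases, inputs: [...inputs.values()] };
    }),
  };
}

// Example:
// rasaToJovo([{ intent: 'play_music', examples: ['play some [jazz](genre)'] }], 'meeka music')
```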

As a minimum, I imagine a JOVO server process that injects an intent message into the Hermod message flow and waits for a tts/say or speaker/play message before responding.
There is a picture showing the flow of a dialog through the Hermod services at the top of the GitHub README.
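To sketch that minimum (using the mqtt npm package; the topic names and payloads below are placeholders guessed from the diagram, not Hermod’s actual contract):

```typescript
// Sketch only: inject an intent into the Hermod MQTT flow and wait for speech output.
// Topics like `hermod/<site>/nlu/intent` are placeholders, not Hermod's real topic names.
import mqtt from 'mqtt';

function askHermod(siteId: string, intent: object, timeoutMs = 8000): Promise<string> {
  return new Promise((resolve, reject) => {
    const client = mqtt.connect('mqtt://localhost:1883');
    const timer = setTimeout(() => { client.end(); reject(new Error('Hermod timeout')); }, timeoutMs);

    client.on('connect', () => {
      client.subscribe([`hermod/${siteId}/tts/say`, `hermod/${siteId}/speaker/play`]);
      client.publish(`hermod/${siteId}/nlu/intent`, JSON.stringify(intent));
    });

    client.on('message', (_topic, payload) => {
      // The first tts/say or speaker/play message ends the wait; a fuller version
      // would keep collating messages until dialog/end or dialog/continue arrives.
      clearTimeout(timer);
      client.end();
      resolve(payload.toString());
    });
  });
}
```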

I see in the release notes that JOVO3 is promoting an RIDR pipeline that flexibly integrates various machine learning components.
In contrast to your otherwise excellent documentation, the documentation around the pipeline isn’t enough for me to get a real idea of what your approach will be.
I am very keen to see how you are thinking about integrating the various machine learning components used in voice applications.
In both locally hosted and online services there is a clear division into ASR, NLU, and TTS components.
I see you have noted SLU integrations, where those divisions don’t necessarily occur.
RASA adds a twist by providing machine-learning-based routing, which perhaps deserves consideration when thinking about pipelines.

Any food for thought by way of pipeline documentation or feedback would be much appreciated.

cheers
Steve


#2

Hey @Steve_Ryan, welcome to the Jovo Community :wave:

Sounds exciting! We discovered Hermod quite a while ago and are impressed with your work (apparently I was the 9th person to star the repo, haha). Thank you for reaching out.

Not too much, unfortunately. The RIDR pipeline is a simplification of the Jovo middleware architecture, which can be found here: https://www.jovo.tech/docs/architecture (this really needs to get updated and described in more detail, too!)

Each Jovo plugin can hook into one or more of our middlewares. If you want to take a closer look, it might be helpful to look into some of the existing integrations and see which middlewares they use.
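Roughly, a plugin registers functions on named middlewares. A simplified sketch (not copied from a real integration, so check an actual plugin for the exact types and config handling):

```typescript
// Simplified sketch of a Jovo v3 plugin hooking into middlewares.
// Middleware names follow the architecture docs linked above; types are trimmed down.
import { BaseApp, HandleRequest, Plugin } from 'jovo-core';

export class HermodPlugin implements Plugin {
  install(app: BaseApp) {
    // Run after the platform has parsed the request, before routing
    app.middleware('nlu')!.use(this.nlu.bind(this));
    // Run while the platform response is being built
    app.middleware('platform.output')!.use(this.output.bind(this));
  }

  async nlu(handleRequest: HandleRequest) {
    // e.g. send the raw text to an external NLU and attach the parsed result
  }

  async output(handleRequest: HandleRequest) {
    // e.g. post-process speech or add visual output before the response is sent
  }
}
```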

That’s super cool! We’re also working on a few examples that use our web client.

Would love to brainstorm a little more about this. Could you elaborate on how the process would ideally look for you? Where would Hermod and Jovo hand off information to each other? Would Jovo integrate with Rasa and hand off information to it, or the other way around?

Thanks,
Jan


#3

Hey Jan,
Right. I was under the impression that JOVO (in line with Alexa and Google) started at NLU and ended with text plus whatever media metadata the target platform supports.
Having poked around, I now understand how the middleware flow allows for audio data to initiate a request and possibly for TTS to be applied on the way out.

A few thoughts:

  • ASR streaming makes a huge difference to the user experience in terms of responsiveness. The current JOVO API doesn’t allow for that possibility.
    [node-gyp gave me no end of grief trying to get the ASR streaming libraries working in Node, especially on ARM. It was the final straw that pushed me to migrate to Python ;)]
  • It would be nice to have an integration hooked in at the router/handler stage to use RASA Core routing (see the sketch after this list).
  • Using event brokers to communicate with RASA is more immediate, and it potentially saves memory and is more efficient because you don’t need to run their HTTP server.
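For the router/handler idea above, a rough sketch of handing routing to RASA via its REST channel — the simplest option to illustrate, even though, as noted, an event-broker setup would avoid running the HTTP server:

```typescript
// Sketch: forward the user message to Rasa and let Rasa's dialogue policies
// pick the next action/response. Uses Rasa's REST channel
// (POST /webhooks/rest/webhook, enabled via the `rest:` entry in credentials.yml).
// Requires Node 18+ for the global fetch.

interface RasaMessage {
  recipient_id: string;
  text?: string;
  image?: string;
}

async function routeWithRasa(senderId: string, text: string): Promise<RasaMessage[]> {
  const res = await fetch('http://localhost:5005/webhooks/rest/webhook', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ sender: senderId, message: text }),
  });
  return res.json();
}

// A plugin hooked into the 'router' or 'handler' middleware could call this
// and map the returned messages onto the platform response.
```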

In terms of Hermod, I’m thinking that a JOVO service could send and receive MQTT messages, so that an Alexa skill can be backed by the same action code as a website or a Hermod-based standalone device using RASA stories.

A request hits a JOVO endpoint; the integration sends an intent message and waits until it hears a dialog/end or dialog/continue message from Hermod before sending an HTTP response.

The intent message has a body including parsed NLU.

Hermod sends multiple tts/say messages, which can include metadata describing buttons and images. These would need to be collated by the JOVO endpoint integration into a format suitable for a single response in the platform.output middleware phase.

To the extent that it is possible on any given platform, the output should restart the microphone for further input if a dialog/continue message is heard.
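Something like this is what I mean by collating (the topic and payload shapes are placeholders for illustration, not Hermod’s exact messages):

```typescript
// Collate the messages heard for one request into a single platform response.
// Topic suffixes and payload fields are placeholders, not Hermod's real contract.

interface HermodMessage {
  topic: string;
  payload: { text?: string; buttons?: string[]; image?: string };
}

interface CollatedResponse {
  speech: string;
  buttons: string[];
  images: string[];
  listenAgain: boolean;
}

function collate(messages: HermodMessage[]): CollatedResponse {
  const out: CollatedResponse = { speech: '', buttons: [], images: [], listenAgain: false };
  for (const { topic, payload } of messages) {
    if (topic.endsWith('/tts/say')) {
      out.speech += (out.speech ? ' ' : '') + (payload.text ?? '');
      out.buttons.push(...(payload.buttons ?? []));
      if (payload.image) out.images.push(payload.image);
    } else if (topic.endsWith('/dialog/continue')) {
      out.listenAgain = true; // keep the microphone open where the platform allows it
    } else if (topic.endsWith('/dialog/end')) {
      out.listenAgain = false;
    }
  }
  return out;
}

// In platform.output, speech would become the spoken response and listenAgain would
// decide whether the session stays open (ask) or ends (tell).
```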

I’ll chew on that and see what I can do.

cheers

Steve


#4

Completely agree, this is something we need to work on a little more. Our web client supports the Web Speech API, which offers streaming speech-to-text. Definitely lots of room for improvement!
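For reference, a bare-bones browser example of streaming interim results with the Web Speech API (standalone, not the actual web client code):

```typescript
// Standalone browser example: partial (streaming) transcripts via the Web Speech API.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.interimResults = true; // emit partial transcripts while the user is speaking
recognition.continuous = false;

recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  const transcript = result[0].transcript;
  if (result.isFinal) {
    console.log('final:', transcript);   // hand off to NLU here
  } else {
    console.log('interim:', transcript); // update the UI as the user speaks
  }
};

recognition.start();
```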

Yes! Have been thinking about this but haven’t gotten around to it yet

Interesting. We’ll take a look! cc @AlexSwe

Btw: Just slightly edited the thread title and moved to “Feedback & Feature Requests” :slight_smile:

Thanks for all your ideas! Excited to explore further