For the one-line readers:
Is there any documentation or sample code detailing how the RIDR pipeline feature of JOVO3 works?
I just read Jan’s article on context-first.com, alexa-please-send-this-to-my-screen.
I’ve been noticing Google struggle with collaboration across its voice ecosystem. With my old phone, when I asked a question, the Google Home in the next room would answer. A new phone fixed that.
They use speaker identification to solve a bunch of problems. I think they have something in place to decide which device is closest and should respond when several hear a request. Alexa answers everywhere. Home Assistant uses GPS logging from a phone to place users in space.
Chromecast is a great multimodal device. Miracast proved that screens need to support streaming media without draining power on a portable device. Websockets can be used to synchronise a Chromecast with a phone web application for UI control, via a shared server that also hosts an Alexa app for voice control.
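To make the shared-server idea concrete, here is a minimal sketch of the relay logic only. It is not code from any of my apps: the `ControlHub` name, the room ids and the command shape are all made up for illustration, and in-memory queues stand in for the real websocket connections.

```python
import asyncio
import json


class ControlHub:
    """Hypothetical relay: every client that joins a room receives every
    command broadcast to that room. Queues stand in for websockets."""

    def __init__(self):
        self._rooms = {}  # room id -> set of per-client queues

    def join(self, room):
        # Called when a Chromecast page or phone page connects.
        queue = asyncio.Queue()
        self._rooms.setdefault(room, set()).add(queue)
        return queue

    def leave(self, room, queue):
        # Called when a client disconnects.
        self._rooms.get(room, set()).discard(queue)

    def broadcast(self, room, command):
        # Called by the phone UI or by the voice app's intent handler.
        message = json.dumps(command)
        for queue in self._rooms.get(room, ()):
            queue.put_nowait(message)


async def demo():
    hub = ControlHub()
    cast = hub.join("lounge")    # the Chromecast receiver page
    phone = hub.join("lounge")   # the phone web application
    hub.broadcast("lounge", {"action": "pause"})  # e.g. from a voice intent
    return json.loads(await cast.get()), json.loads(await phone.get())
```

In a real server each queue would be drained by a task that writes to that client's websocket, and the voice handler would call `broadcast` when an intent arrives.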
I used this approach in a very early DialogFlow application, Meeka Music, which could ask Google Home to stream music to my phone if the phone was logged into a web page. (I don’t think it was possible to stream directly to the Google Home at that stage.) Just Dance is possibly the premier commercial example of multiple devices collaborating around a single Chromecast.
I haven’t built anything with JOVO, as I only discovered it after building a couple of apps. I then found Snips, which captivated me with the possibility of a standalone voice recognition system.
By the time I found Mycroft, I was well down the path of building my own voice dialog manager modeled on Snips, using a central MQTT server to coordinate flow between a suite of machine learning services. As of last week it has ended up reasonably polished as Hermod: https://github.com/syntithenai/hermod.
My focus is integrating voice into websites using RASA NLU and routing. With authenticated MQTT, web clients are first-class citizens.
A demo at https://edison.syntithenai.com uses voice to fill in crosswords and to ask questions of Wikipedia, YouTube and Unsplash.
I’m interested in integrating JOVO to allow a RASA actions backend to serve Alexa, Google, Web applications or standalone devices.
Obviously, the build scripts would need to convert the training data between formats: RASA -> JOVO -> *.
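A rough sketch of the first hop of that conversion, under stated assumptions: the input is RASA-style examples already loaded into a dict, with entities annotated as `[text](entity)`, and the output follows my understanding of a JOVO-style language model (`invocation`, `intents`, `phrases`, `inputs`). The `rasa_to_jovo` helper is hypothetical, and reusing the entity name as the input type is a placeholder that would need a real mapping per platform.

```python
import re

# Matches RASA-style inline annotations such as "[Paris](topic)".
ENTITY = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def rasa_to_jovo(rasa_intents, invocation="my app"):
    """Convert {intent_name: [annotated examples]} into a JOVO-style model.

    Assumption: field names follow my reading of the JOVO model layout and
    should be checked against the JOVO documentation before use.
    """
    intents = []
    for name, examples in rasa_intents.items():
        phrases, inputs = [], []
        for example in examples:
            # Collect each entity name once, in order of first appearance.
            for _text, entity in ENTITY.findall(example):
                if entity not in inputs:
                    inputs.append(entity)
            # "[Paris](topic)" becomes the placeholder "{topic}".
            phrase = ENTITY.sub(lambda m: "{" + m.group(2) + "}", example)
            if phrase not in phrases:
                phrases.append(phrase)
        intent = {"name": name, "phrases": phrases}
        if inputs:
            # Placeholder: the entity name doubles as the input type.
            intent["inputs"] = [{"name": e, "type": e} for e in inputs]
        intents.append(intent)
    return {"invocation": invocation, "intents": intents}
```

Calling `rasa_to_jovo({"search": ["tell me about [Paris](topic)"]})` yields one intent with the phrase `tell me about {topic}` and a single `topic` input. The onward JOVO -> * step would then be left to JOVO's own build tooling.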
As a minimum, I imagine a JOVO server process that injects an intent message into the hermod message flow and waits for a tts / say or speaker / play message before responding.
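Roughly what I have in mind, as a sketch rather than working Hermod or JOVO code: the topic layout `hermod/<site>/nlu/intent`, `hermod/<site>/tts/say` and `hermod/<site>/speaker/play`, the payload shapes and the `HermodBridge` class are all assumptions for illustration. The MQTT client is abstracted to a `publish` callable plus an `on_message` callback so the logic can be read on its own.

```python
import asyncio
import json


class HermodBridge:
    """Hypothetical bridge: publish an intent, then wait for the first
    tts/say or speaker/play message for the same site."""

    def __init__(self, publish, timeout=5.0):
        self._publish = publish   # callable(topic, payload), e.g. an MQTT client
        self._timeout = timeout   # seconds to wait before giving up
        self._pending = {}        # site id -> Future awaiting a reply

    async def handle_intent(self, site, intent, slots=None):
        future = asyncio.get_running_loop().create_future()
        self._pending[site] = future
        payload = json.dumps({"intent": intent, "slots": slots or {}})
        self._publish(f"hermod/{site}/nlu/intent", payload)  # assumed topic
        try:
            return await asyncio.wait_for(future, self._timeout)
        except asyncio.TimeoutError:
            return {"type": "timeout", "payload": None}
        finally:
            self._pending.pop(site, None)

    def on_message(self, topic, payload):
        # Wire this to the MQTT client's message callback.
        parts = topic.split("/")
        if len(parts) != 4 or parts[0] != "hermod":
            return
        _, site, service, action = parts
        if (service, action) not in {("tts", "say"), ("speaker", "play")}:
            return
        future = self._pending.get(site)
        if future is not None and not future.done():
            future.set_result({"type": action, "payload": json.loads(payload)})
```

The voice platform handler would await `handle_intent` and turn a `say` result into speech output or a `play` result into an audio response. Keying pending requests by site id only allows one request in flight per site, which a real bridge would probably replace with a per-request id.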
There is a picture at the top of the GitHub README showing the flow of a dialog through the Hermod services.
I see in the release notes that JOVO3 is promoting an RIDR Pipeline that flexibly integrates various machine learning components.
In contrast to your amazing existing documentation, the documentation around the pipeline is insufficient for me to get any real idea of what your approach will be.
I am very keen to see how you are thinking about integrating the various machine learning components used in voice applications.
In both locally hosted and online services there is a clear division into ASR, NLU and TTS components.
I see you have noted SLU integrations where those divisions don’t necessarily occur.
RASA adds a twist by providing machine-learning-based routing, which perhaps deserves consideration when thinking about pipelines.
Any food for thought, by way of pipeline documentation or feedback, would be much appreciated.
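To illustrate the kind of flexibility I am asking about (this is my own toy sketch, not JOVO's design or API), a pipeline can be a list of stages that each enrich a shared context. An SLU stage then simply replaces the separate ASR and NLU stages, and routing is just another stage that could be rule-based or learned. Every function name here is a made-up stand-in.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def run_pipeline(stages: List[Stage], context: Context) -> Context:
    # Each stage reads what it needs and returns an enriched context.
    for stage in stages:
        context = stage(context)
    return context


def fake_asr(ctx: Context) -> Context:
    # Stand-in for speech recognition: audio bytes -> text.
    return {**ctx, "text": ctx["audio"].decode("utf-8")}


def fake_nlu(ctx: Context) -> Context:
    # Stand-in for intent classification: text -> intent.
    return {**ctx, "intent": "greet" if "hello" in ctx["text"] else "fallback"}


def fake_slu(ctx: Context) -> Context:
    # Stand-in for SLU: audio -> intent with no intermediate transcript.
    heard = ctx["audio"].decode("utf-8")
    return {**ctx, "intent": "greet" if "hello" in heard else "fallback"}


def fake_router(ctx: Context) -> Context:
    # Stand-in for routing; a learned policy could sit here instead.
    replies = {"greet": "Hi there", "fallback": "Sorry?"}
    return {**ctx, "reply": replies[ctx["intent"]]}


def fake_tts(ctx: Context) -> Context:
    # Stand-in for speech synthesis: reply text -> audio bytes.
    return {**ctx, "speech": ctx["reply"].encode("utf-8")}


split = run_pipeline([fake_asr, fake_nlu, fake_router, fake_tts], {"audio": b"hello"})
fused = run_pipeline([fake_slu, fake_router, fake_tts], {"audio": b"hello"})
```

Both runs end with the same `speech`, but only the split pipeline leaves a `text` transcript in the context, which is exactly the kind of difference downstream stages would need to tolerate.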