Alexa Conversations - More Amazon Lock-in?



After Alexa Live and all the hype about Alexa Conversations, I’m wondering what people in the Jovo community who’ve been working on cross-platform skills think about it. While I like the idea of having the technology facilitate a more natural conversational style with my skill, I’m concerned that implementing Alexa Conversations might be locking me into parts of the Amazon ecosystem that will make my voice app harder to deploy across platforms.

To use Alexa Conversations, you need to train a model that seems to reside in a black box. If you are reliant on the capabilities of this model (which basically sits outside of your control), will that make it more difficult to deploy the same feature set on other platforms?

Just wondering if anyone else has thought about this.


I 100% agree. This will increase platform lock-in. Debugging and troubleshooting language models is always limited to the tools the vendor provides (which are quite poor for both Google and Alexa at the moment). Analytics (accuracy, other intents that were considered, etc.) should be included in every JSON response so they're at hand for long-term improvement with real-life data.

So far I haven't seen an Alexa Conversations demo that I couldn't have built with the existing capabilities. It would have taken real effort in terms of voice design (thinking through a lot of repair cases), but that's our job as voice developers, I guess. The happy path is only 20% of the work.

I'll do a deep dive into Alexa Conversations to see if it holds up against expectations and let you know!

I hope it's not another feature like the Dialog Directive, which no one uses anymore (correct me if I'm wrong) due to its many shortcomings.


Thanks for your views, Dominik. I'm interested in exploring the use of Rasa for a machine-learning-based approach to dialogue management. But I actually listened to a podcast with Alan Nichol today (the Voice Tech Podcast from Carl Robinson), and Alan said he wasn't sure if Amazon was OK with certifying skills that use a 3rd party NLU. Do you have any knowledge of that?

I suppose we could pursue Rasa and then take that learning to retrofit it into the Alexa platform if we had to (via Alexa Conversations or sophisticated use of their Dialogue Management capability). I just think that for what we want to do going forward with our smart storytelling AI platform, it would be great to have total control of the NLU layer.


Thank you for kicking off this interesting discussion @Talks2Bots!

I agree with you and @dominik-meissner. It makes complete sense for Amazon to want more control over the UX, but this comes with a lot of potential drawbacks for third-party developers.

I think this depends. I know some Alexa Skills that make extensive use of a "catch all" custom slot type and forward the raw value to their own NLU. However, Amazon doesn't seem to like it and recommends that partners stick to the Alexa NLU. Also, the Alexa transcriptions aren't the greatest, which makes it a bit difficult to plug into tools like Rasa (which, afaik, rely on raw text right now).
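To make the "catch all" trick concrete, here's a minimal sketch of the forwarding step. The intent and slot names are hypothetical, and the skill's interaction model is assumed to funnel most utterances into a single catch-all intent; the only real-world shape used is Rasa's REST payload for `POST /model/parse`, which expects `{ "text": "..." }`.

```javascript
// Sketch: pull Alexa's best-guess transcription out of a hypothetical
// CatchAllIntent's "text" slot and build the body for Rasa's
// POST /model/parse endpoint. Error handling is minimal on purpose.
function buildRasaParseRequest(alexaRequest) {
  const slots = alexaRequest.request.intent.slots || {};
  const rawText = slots.text && slots.text.value;
  if (!rawText) return null; // nothing usable came through the catch-all slot
  return { text: rawText }; // Rasa's /model/parse payload shape
}

// Heavily trimmed example of an Alexa request envelope
const alexaRequest = {
  request: {
    type: 'IntentRequest',
    intent: {
      name: 'CatchAllIntent',
      slots: { text: { name: 'text', value: 'tell me a scary story' } },
    },
  },
};

console.log(buildRasaParseRequest(alexaRequest)); // { text: 'tell me a scary story' }
```

Note that what you forward here is already Alexa's interpretation of the audio, not a raw transcript, which is exactly the "pre-filtered, pre-interpreted" limitation discussed below.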

We are exploring a Rasa Core (dialog management) integration with the Jovo Framework. I think it makes a lot of sense, as more and more voice/chat experiences will need to move away from a pure rules-based approach.


From my perspective, Alexa Conversations is not a solution for every intent. I think about it more like Dialog Management ++. As I try to focus on cross-platform implementations, I would have to do more work to include Alexa Conversations and then still handle it the current way for Google and Bixby.


We want to add a slot filling abstraction into Jovo. Maybe we'll find a way that this can be translated into the Alexa Conversations format while still working with states etc. for other platforms.
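One way such an abstraction could look, purely as a hypothetical sketch (none of these names exist in Jovo): declare the slots a task needs, then keep asking for the next missing one until all are filled. On Alexa this declarative shape could in principle be compiled to Conversations or dialog-directive config, while on other platforms it runs as plain state logic.

```javascript
// Hypothetical platform-agnostic slot filling: return the prompt for the
// next unfilled required slot, or null when everything is collected.
function nextSlotPrompt(requiredSlots, filledSlots) {
  for (const slot of requiredSlots) {
    if (filledSlots[slot.name] == null) {
      return { slot: slot.name, prompt: slot.prompt }; // ask for this one next
    }
  }
  return null; // all slots filled; hand off to the actual intent handler
}

const required = [
  { name: 'city', prompt: 'Which city are you flying to?' },
  { name: 'date', prompt: 'What day do you want to travel?' },
];

console.log(nextSlotPrompt(required, { city: 'Berlin' }));
// { slot: 'date', prompt: 'What day do you want to travel?' }
```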


So here’s a specific challenge we’re looking at.

How do we best handle this situation: someone is listening to one of our virtual storytellers tell a story, the storyteller asks a question related to the story and expects a certain narrow range of responses, but the listener says something completely unrelated to the story question ("What time is it?", "What can I buy?", etc.). The only real solution right now is to wire in the expectation for these "off topic" intents. But with something like Rasa (or Alexa Conversations?), could we train the system to adapt to these off-topic requests by creating a list of sample dialogues that contain a bunch of these random responses, along with how to handle them?
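For context, here's roughly what the current "wire it in" approach looks like, as a simplified sketch. The state, intent, and handler names are illustrative, not real Jovo APIs: while the storyteller waits for an answer, a small set of expected intents is routed normally and everything else falls through to one off-topic repair handler.

```javascript
// Sketch: explicit, state-scoped routing with a single off-topic fallback.
// While we're waiting for a story answer, only these intents are expected;
// anything else (time, shopping, ...) lands in Unhandled.
const storyQuestionState = {
  AnswerIntent: () => 'Great choice! The story continues...',
  RepeatIntent: () => 'Sure, let me ask that again.',
  Unhandled: () => "Let's get back to the story. Here's the question again: ...",
};

function route(state, intentName) {
  const handler = state[intentName] || state.Unhandled;
  return handler();
}

console.log(route(storyQuestionState, 'WhatTimeIsItIntent'));
// "Let's get back to the story. Here's the question again: ..."
```

The ML-based alternative would replace this hand-written table with a policy learned from sample dialogues that include those random off-topic turns.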

What do you guys think? Is there a simpler way to elegantly deal with these situations?


I personally think there is no benefit in plugging in Rasa (or any other NLU) as the NLU for an Alexa Skill. The "catch all" intent approach mentioned here is a hacky solution: you get pre-filtered, pre-interpreted utterances. For certain narrow use cases this might work well enough, but I doubt you couldn't achieve the same level with the built-in Alexa NLU. We have dealt with interactive stories as well (which in my opinion carry the most complexity because they need a lot of out-of-domain intents) and have extensive experience building robust models with a couple of hundred intents. There is a point where every model starts to smell and turns upside down. You need solid tooling and a lot of experience to make sure you are actually improving the model and not making everything worse (e.g., adding a hundred more sample utterances per intent has never improved the results).


I feel like I've seen this kind of "magic logic generator" technology pop up from time to time (from desktop app development in the '90s to web development after that). In almost all the cases I can think of, the automated code generation was useful to casual developers but not to professional ones. At least part of the problem is verifiability: how do we confirm it solved the problem well enough if we haven't identified and tested the edge cases? And if we have identified the edge cases, we can probably handle them better ourselves than the "magic logic generator" can. That said, maybe there's a place for it as a supplemental handler to catch whatever cases are left over after you've explicitly handled the known ones?


I wonder if something similar to Rasa Core / Alexa Conversations will be offered as a standalone AWS service. They already offer TTS, ASR, NLU, and other machine learning tools standalone.

From a quick scan of the Conversations doco, it looks like Rasa Core offers more features (e.g. forms/slot filling). No doubt Amazon will improve. It will be interesting to see how the tools differ. Does anyone know if Google has anything similar?

I have a strong bias to open source tools and managing my own data. More work though :wink:

I shared some thoughts about integrating Rasa with Jovo here in the context of my voice stack, Hermod.
Using the Jovo middleware hooks from "router" through "response", Rasa Core could be used as the handler instead of the explicit Jovo context routing.
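To illustrate the idea of swapping the router step, here's a toy pipeline that mimics the hook mechanism in spirit only; it is not the real Jovo middleware API, and the "Rasa" prediction is a hard-coded stub where a real integration would call Rasa Core.

```javascript
// Toy middleware pipeline: each step mutates a shared context object.
function runPipeline(steps, ctx) {
  for (const step of steps) step(ctx);
  return ctx;
}

// Default: explicit, rules-based context routing (state + intent -> handler)
const rulesRouter = (ctx) => { ctx.handler = ctx.state + '.' + ctx.intent; };

// Swap-in: a dialogue-management model picks the next action instead.
// Stubbed here; a real version would send the conversation tracker to Rasa.
const mlRouter = (ctx) => { ctx.handler = 'utter_continue_story'; };

const respond = (ctx) => { ctx.response = 'handled by ' + ctx.handler; };

const withRules = runPipeline([rulesRouter, respond], { state: 'StoryState', intent: 'AnswerIntent' });
const withModel = runPipeline([mlRouter, respond], { state: 'StoryState', intent: 'AnswerIntent' });

console.log(withRules.response); // handled by StoryState.AnswerIntent
console.log(withModel.response); // handled by utter_continue_story
```

The point is that everything downstream of the router stays unchanged; only the routing decision moves from explicit rules to a learned policy.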

As I understand it, Alexa devices do not deliver a transcript to a skill endpoint, only a parsed NLU structure so it’s not possible to inject a custom NLU. This means we are stuck with converting training data for multiple NLU engines and trying to deal with any differences.

Whether machine learning based routing is worth the effort really depends on what you are trying to accomplish. Explicit routing works great for simple flattish cases. Once pathways start getting messy (like talking to real people), ML models can work wonders given enough data.


I think Alexa Conversations is a little immature: it's really focused on tightly constrained conversations and slot filling, and you have to be imaginative with how you use it. I find "catch all" can work for some things using states in Jovo, because you control the specific context. One more thing with Conversations: it's really time-consuming to run training every time you update something.