Alexa Conversations - Game changer or niche feature?

amazon-alexa

#1

You certainly didn’t miss yesterday’s announcement of Alexa Conversations at re:MARS 2019, and I think this is a great place to talk about it in some more detail.
I’ll start off by reviewing what we know up to this point, and then I’m looking forward to hearing your thoughts on it - speculation, possible use cases, and criticism are explicitly welcome! :smiley:

:speaking_head: So, here’s how Rohit Prasad, VP and Head Scientist of Alexa AI, presented Alexa Conversations:

The key to making Alexa useful for our customers is to make it more natural to discover and use her functionality.
To provide more utility we envision a world where customers will converse naturally with Alexa: seamlessly transitioning between topics, asking questions, making choices, and speaking the same way that you would with a friend, or family member.
Today, I am excited to announce the private preview of Alexa Conversations, a deep learning-based approach for creating natural voice experiences on Alexa with less effort, fewer lines of code, and less training data than ever before. Hand coding of the dialog flow is replaced by a recurrent neural network that automatically models the dialog flow from developer provided input. With Alexa Conversations, it’s easier for developers to construct dialogue flows for their skills.

Source: Transcription of the presentation, 9:48 am - 9:50 am

So, that seems to be Alexa Conversations in the narrow sense: A tool to build Skills by training an AI on conversations, instead of manually coding the flow of the conversation. Before we move on, let’s look at this in some more detail.
:scroll: The Alexa Developer Blog post gets into some more depth:

Alexa Conversations combines an AI-driven dialog manager with an advanced dialog simulation engine that automatically generates synthetic training data. You provide API(s), annotated sample dialogs that include the prompts that you want Alexa to say to the customer, and the actions you expect the customer to take. Alexa Conversations uses this information to generate dialog flows and variations, learning the large number of paths that the dialogs could take.

So, it sounds to me like this brings an architectural change in how voice apps are built: So far, Alexa Skills consist of two parts, the NLU (to which the developer contributes the language model) and the conversation + fulfillment logic (provided entirely by the developer). Apparently Alexa Conversations separates the conversation (structure) logic from the fulfillment (content) logic, with only the latter being provided entirely by the developer, and the conversation logic being trained in a similar way to the language model. Dialog Management, which has been around for a while and which Alexa developers promote as best practice, is already a step in that direction.
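
If I read the blog post correctly, the developer-facing pieces might look roughly like the following. This is a minimal sketch in TypeScript; the actual preview format isn’t public, so every name and shape here is made up:

```typescript
// Hypothetical sketch of the two developer-provided pieces the blog
// post describes: fulfillment APIs plus annotated sample dialogs.
// None of these names come from the (private) preview.

// 1. Fulfillment (content) logic: a plain API that Alexa Conversations
//    could call once it has collected the required slots.
async function findShowtimes(city: string, date: string): Promise<{ title: string; time: string }[]> {
  // ...query your backend; stubbed here for illustration
  return [{ title: 'Example Movie', time: '8:00 PM' }];
}

// 2. Conversation (structure) logic: instead of coding it, you provide
//    annotated sample dialogs (prompts Alexa should say, API calls you
//    expect), from which the simulation engine generates variations.
const sampleDialog = [
  { user: 'What movies are playing in {city} tonight?' },
  { callApi: 'findShowtimes', args: { city: '{city}', date: 'today' } },
  { alexa: 'Here is what is playing tonight in {city}: {results}' },
];
```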

:scroll: The blog article continues:

In the past, developers scripted every potential turn, built an interaction model, managed dialog rules, wrote back-end business logic, and analyzed logs to test and iterate. […]
Now, you provide dialog samples and Alexa Conversations predictively models the dialog path using a deep, recurrent neural network. At runtime this neural network takes the entire session’s dialog history into account and predicts the optimal next action or step in the dialog […].
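
To make that shift concrete: the runtime contract of such a learned dialog manager might boil down to something like this hypothetical sketch (again, not the actual API):

```typescript
// Hypothetical runtime contract of a learned dialog manager: it takes
// the whole session history and predicts the next action, replacing
// hand-written routing code.

interface DialogTurn {
  speaker: 'user' | 'alexa';
  utterance: string;
}

type DialogAction =
  | { type: 'Prompt'; text: string }                                // say something to the user
  | { type: 'CallApi'; name: string; args: Record<string, string> } // invoke developer fulfillment
  | { type: 'EndSession' };

interface LearnedDialogManager {
  // The RNN conditions on the entire history, not just the last turn.
  predictNextAction(history: DialogTurn[]): DialogAction;
}
```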

:scroll: So far it kind of sounds similar to Dialog Management, but this is where it gets promising:

It [the neural network] is trained to interpret dialog context in order to handle multiple user workflows, accommodate natural user input (like out-of-sequence information or corrections), address common business transaction errors, and proactively recommend additional API functionality.

So, the promise is that it can handle user input far beyond the happy path, even up to error cases.
:scroll: And all of that while enabling the developer to focus on the fulfillment:

For example, the Atom Tickets skill used 5,500 lines of code and nearly 800 training examples.
The Atom Tickets skill built with Alexa Conversations shrank almost 70%, to just 1,700 lines of code, and needed only 13 customer dialog samples.

So, it sounds like something big is coming, right? If you want to participate in Amazon’s developer preview, you can apply here.

Of course, this is only Alexa Conversations in the narrow sense, and I will soon add another post with the bigger picture. But I’m already curious to hear your thoughts on this… Are you optimistic about the capability of conversational AI to model your Skill’s conversational structure? Do you think it will make sense for some Skills (maybe what Vasili from Invocable described as integrations: "the brain of this voice app is somewhere else — in a restaurant’s CRM, Uber’s backend, or NYT’s content management system."), but not for others (such as voice games and interactive stories)?
Let me know what you think! :smiley:


#2

Me again! :wave:

So, Alexa Conversations as described above is already a pretty big announcement, insofar as it has the potential to drastically change the way voice developers build Skills. This post is about Alexa Conversations in the broader sense (if you have a good suggestion for what to call it, please let me know!), which has the potential to change the way customers use Alexa.
If you have thoughts on which of them is the bigger or more exciting issue, I’m curious to know!

:speaking_head: Again, let’s start off by examining how Rohit Prasad presented this feature at re:MARS (9:51 am):

[…]customers don’t limit their conversations to a single topic or a single skill. Imagine a customer that begins a conversation by asking Alexa about movie showtimes and tickets, but her true intent is to plan a night out with her family. With today’s AI assistants, she would reach her goal by organizing the tasks across independent, discrete skills. In such a setting all the cognitive burden is with the customer.
Now, we have advanced our machine learning capabilities such that Alexa can predict customer’s true goal from the direction of the dialogue, and proactively enable the conversation flow across skills.

This is where it becomes apparent how this is linked with Alexa Conversations in the narrow sense: Because Alexa Conversations Skills ‘only’ provide the fulfillment logic (via API endpoints), Alexa is not restricted to using a single Skill’s conversation logic and is free to model a continuous conversation across Skills in order to get a more complex job done that involves invoking multiple Skills.

:speaking_head: Rohit Prasad describes it like this (9:56 am, highlights mine):

[Alexa] stitches together the capabilities of multiple providers at once – in this case movie recommendation and ticket booking; restaurant recommendation and reservation; and a rideshare.
The approach is different from other dialogue systems in that it models the entire system end-to-end: the system takes spoken text as its input, and delivers actions as its output.
Each skill that is built through Alexa Conversations capability has a deep recurrent neural network to predict different dialog actions within a skill. We now have another recurrent neural network that acts as a cross-skill predictor. At any given turn, this cross-skill action predictor determines whether it should hand off the dialogue control to another skill or keep the control with the current skill. The cross-skill action predictor is also trained on simulated dialogues just like within-skill dialogue action predictor.
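
In code terms, the two-level prediction could look something like this sketch, where the RNNs are stubbed with a scoring function and all names are hypothetical:

```typescript
// Sketch of the two-level prediction Prasad describes: each skill has
// a within-skill action predictor, and a cross-skill predictor decides
// per turn whether to keep control or hand it off.

interface DialogTurn { speaker: 'user' | 'alexa'; utterance: string }
type DialogAction = { type: 'Prompt'; text: string } | { type: 'CallApi'; name: string };

interface SkillPredictor {
  skillId: string;
  // Within-skill predictor: picks the next action for this skill.
  predictNextAction(history: DialogTurn[]): DialogAction;
  // Stand-in for the cross-skill model's per-skill confidence score.
  ownershipScore(history: DialogTurn[]): number;
}

function nextAction(
  history: DialogTurn[],
  current: SkillPredictor,
  candidates: SkillPredictor[],
): DialogAction {
  // Cross-skill predictor: keep control unless another skill scores higher.
  const owner = [current, ...candidates].reduce((best, skill) =>
    skill.ownershipScore(history) > best.ownershipScore(history) ? skill : best,
  );
  return owner.predictNextAction(history);
}
```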

Here’s the famous demo video by VentureBeat, where you can see the system in action:

Pretty impressive, right? I especially liked how the utterance “Show me the trailer!” at the end was resolved as expected, despite being so far removed (temporally) from the action of booking the movie tickets.

This highlights another feature that the Alexa Developer Blog article covers in more detail: The Alexa Conversations dialog model can be both reactive, as in the case of the user’s questions “How long is it?” or “Show me the trailer!”, and proactive, as in the case of Alexa asking “Will you be eating out…?” or “Would you like a cab to Mott 32?”:


Image source: Alexa Developer Blog

The proactive actions especially are a major feat, since they require a model of the context of the user’s dialog actions so far:

Alexa will predict a customer’s latent goal from the direction of the dialog and proactively enable the conversation flow across topics and skills.

An interesting question is how much of such a richly modelled conversation will be persisted beyond the ongoing session - for example, if the session in the demo video were finished and the user then asked “Alexa, at what time does the movie start?” (or to show the trailer), could the request still be resolved? This has the potential to bring the usefulness of Alexa to an entirely new level.

So, is Alexa Conversations in the broad sense a big deal? In which of your everyday situations would you already wish for Alexa to have such capabilities? And strategically, how do you think the impact of this feature compares to that of Google Duplex?
Looking forward to hearing your thoughts! :smiley:


#3

Great post @Florian!

I think Alexa Conversations will be a good step towards making voice assistants more practical for complex interactions. If the ability to stitch multiple skills together works well, I could imagine a future where you have something that resembles a web of skills. You might start in one skill, but ask a question which links you to another skill, and so on.

At first, this sounds like it would help with the discoverability issue of voice apps, but there are a few questions that come to mind.

If Alexa’s model is generating a continuous conversation across skills, what factors would it use to decide which skills get invoked to fulfill the next step in the conversation? In the restaurant reservation example, Alexa decided to use OpenTable for the reservation without asking the user. Would this just be the default provider for restaurant reservations, leaving other skills that are capable of restaurant reservations with little to no exposure?

Would the providers be selected based on sponsorship deals made with Amazon? This might become a new revenue source as companies bid to be selected as the providers for certain tasks like hotel bookings, plane tickets, and so on.


#4

Hi @Marko_Arezina, thanks for sharing your thoughts on this exciting new technology!

I absolutely agree! What I find particularly intriguing is the term latent goal that they use in the Alexa AI article. In the demo, the latent goal of planning a movie night with dinner is super obvious, but I can think of a number of latent goals that Alexa might detect and work towards, especially with proactive actions:

  • You ask Alexa what sound a dog makes, but your actual goal (instead of refreshing your memory on animal vocalizations :joy:) is to entertain a toddler, and Alexa could work towards it by triggering the Baby Shark song or the Pikachu Talk Skill
  • You ask how the weather is in a different city, but your actual goal is to prepare for a short-notice visit to that city on the same day. Alexa could assist towards that goal by asking whether you’re going by car, train, or bike (maybe even outside a Skill?) and then provide timetables, info on gas prices or traffic congestion, or tell you about events in that city
  • You ask Alexa for a Chuck Norris joke, but your actual goal is to show off Alexa to your friends and have a good time. Alexa could help by triggering a surprising, casual, and fun activity.

Concerning the question of how Alexa would choose the right Skill for the job… Great question! I can think of two approaches:

  • With a manual curation approach, folks at Amazon curate such cross-skill actions, like “Do you need a ride to the event that you just booked?”, and select managed partners for the fulfillment
  • With a machine learning approach, Alexa could spot common sequences of Skills used across all users, and use these for predictions about which Skill would be a good candidate for a proactive or reactive action at each step (see the toy sketch below). Thinking ahead, Amazon could vary which follow-up action to suggest, in order to avoid locking into always suggesting the most popular one.
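
Here’s a toy version of that second approach: counting skill-to-skill transitions across observed sessions and using them to rank candidates for a handoff. Purely illustrative, since Amazon hasn’t described its actual method:

```typescript
// Count how often users invoke skill B right after skill A, then use
// the transition counts to rank candidates for a proactive handoff.

function buildTransitionCounts(sessions: string[][]): Map<string, Map<string, number>> {
  const counts = new Map<string, Map<string, number>>();
  for (const session of sessions) {
    for (let i = 0; i + 1 < session.length; i++) {
      const from = session[i];
      const to = session[i + 1];
      const row = counts.get(from) ?? new Map<string, number>();
      row.set(to, (row.get(to) ?? 0) + 1);
      counts.set(from, row);
    }
  }
  return counts;
}

// Which skills commonly follow 'movie-tickets'?
const counts = buildTransitionCounts([
  ['movie-tickets', 'restaurant-reservation', 'rideshare'],
  ['movie-tickets', 'rideshare'],
]);
console.log(counts.get('movie-tickets'));
// Map { 'restaurant-reservation' => 1, 'rideshare' => 1 }
```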

Regarding discoverability, I agree that this will be super big! Like with CanFulfillIntentRequest, it makes it possible for the user to invoke a Skill without knowing it exists. One new challenge then is to brand the experience and build retention, so that our Skills don’t end up like in the video, where the Skill that made the Chinese restaurant reservation remained anonymous. :sweat_smile:


#5

Great discussion! I agree that this feature is going to be a big step towards better conversational experiences.

One thing I noticed while watching the demo video above, though: This interaction feels so natural because many of Alexa’s responses don’t have real prompts:

  • Alexa: Here are some Chinese restaurants in …
  • User: Tell me more about Mott 32.
  • Alexa: Here is more information about Mott 32.
  • User: That looks good. Get me a table at 6pm.
  • Alexa: [repeating the details]. Should I book it?

Only the explicit confirmation has the Should I book it? prompt; all the others just present information and then let the user decide where to go from there.
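
To make the contrast concrete, here is roughly what the two response styles would look like with the ASK SDK v2 for Node.js (the handler names and restaurant details are just for illustration):

```typescript
import { HandlerInput } from 'ask-sdk-core';
import { Response } from 'ask-sdk-model';

// Directed prompt: ends in a question and supplies a reprompt - the
// pattern certification expects when the microphone stays open.
function confirmBooking(handlerInput: HandlerInput): Response {
  return handlerInput.responseBuilder
    .speak('A table for two at Mott 32 at 6 PM. Should I book it?')
    .reprompt('Should I book the table?')
    .getResponse();
}

// Open-ended response, as in the demo: just present information and
// leave the next move entirely to the user - no question, no real prompt.
function describeRestaurant(handlerInput: HandlerInput): Response {
  return handlerInput.responseBuilder
    .speak('Here is more information about Mott 32.')
    .withShouldEndSession(false)
    .getResponse();
}
```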

Do you think we will see more experiences like this, especially with Alexa Conversations, where the dialog is driven more by the user? I can’t think of an Alexa Skill that currently works like this (it wouldn’t pass certification).


#6

Aha, great observation @jan! Alexa Conversations, both in the broad and in the narrow sense, not only has the potential to provide more value to the user, it also enables the more dynamic way of interaction you pointed out.
Right now, voice developers are taught to build their voice apps to lead the conversation, which in turn trains users to be reactive when using voice apps and to limit what they say to a Skill. Leaving the conversation structure to Alexa and providing only actions and APIs really has the potential to let users interact more naturally. One neat example of that is “That looks good. Get me a table!”, which contains two intents: a confirmation and a request for a reservation.
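
As a toy illustration of why that utterance is hard for today’s one-intent-per-utterance NLU, here is a hand-rolled splitter that mimics what a learned dialog model would do when it emits a sequence of dialog acts per user turn (entirely made up, of course):

```typescript
// "That looks good. Get me a table!" carries two dialog acts; a learned
// dialog model can emit a sequence of acts per user turn. This toy
// splitter only mimics that idea with hard-coded patterns.

type DialogAct =
  | { act: 'Confirm' }
  | { act: 'Request'; intent: string };

function toDialogActs(utterance: string): DialogAct[] {
  const acts: DialogAct[] = [];
  const sentences = utterance.split(/[.!?]+/).map(s => s.trim()).filter(Boolean);
  for (const sentence of sentences) {
    if (/^(that looks good|sounds good|yes)/i.test(sentence)) {
      acts.push({ act: 'Confirm' });
    } else if (/get me a table/i.test(sentence)) {
      acts.push({ act: 'Request', intent: 'BookTable' });
    }
  }
  return acts;
}

console.log(toDialogActs('That looks good. Get me a table!'));
// [ { act: 'Confirm' }, { act: 'Request', intent: 'BookTable' } ]
```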


#7

Really interesting! An important question to me is how Alexa evaluates the performance of a skill inside this chain, because that is what enables machine learning algorithms to decide for one direction or another (or which connections should be stronger/weaker), right? At the moment, the easiest way (like @Florian said) is to look at which skills are used in a sequence. But this behavior can be faked - especially by bots. I am curious which way Amazon will take to get good data, and whether they are able to push this feature out quickly.


#8

When I see an attempted advance on one platform (Alexa) while we’re using a tool like Jovo, engineered to build cross-platform bots: to what extent can a developer take advantage of these features before they have created an app that is no longer cross-platform?


#9

Hi @David_Foster, great question! Right now, Jovo supports all platform-specific features of Alexa and Google Assistant. Many of the developers on our platform use Jovo because of its ease of use and integrations, and see cross-platform support more as an added benefit (see the case study with Gal Shenar, for example).
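
For example, in a Jovo (v2) app the shared logic stays cross-platform, and Alexa-only features sit behind a platform check, roughly like this:

```typescript
// Minimal Jovo v2 sketch: cross-platform handler with an Alexa-only branch.
import { App } from 'jovo-framework';

const app = new App();

app.setHandler({
  LAUNCH() {
    if (this.isAlexaSkill()) {
      // Alexa-specific features (e.g. CanFulfill, APL) would go here.
    }
    // Shared, cross-platform response.
    this.ask('Welcome! What can I do for you?', 'What can I do for you?');
  },
});
```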

As we don’t see Jovo as a “common denominator” tool, we will work hard to offer seamless integrations with the upcoming features that platforms like Alexa are launching. After all, the official Alexa SDK needs to add support as well.