[Docs] Alexa AudioPlayer Skills


Learn more about how to build Alexa AudioPlayer Skills with the Jovo Framework.

AudioPlayer Skills can be used to stream long-form audio files like music or podcasts. The audio file must be hosted at an Internet-accessible HTTPS endpoint. Supported formats include AAC/MP4, MP3, and HLS, at bitrates from 16 kbps to 384 kbps. More information can be found in the official reference by Amazon.
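To make the streaming setup concrete, here is a minimal sketch of the raw `AudioPlayer.Play` directive that a skill returns to start a stream. The helper name `buildPlayDirective` and the example URL are assumptions for illustration, not part of the Jovo API; Jovo normally builds this directive for you.

```typescript
// Shape of the AudioPlayer.Play directive sent in the skill response.
interface PlayDirective {
  type: 'AudioPlayer.Play';
  playBehavior: 'REPLACE_ALL' | 'ENQUEUE' | 'REPLACE_ENQUEUED';
  audioItem: {
    stream: { url: string; token: string; offsetInMilliseconds: number };
  };
}

// Hypothetical helper: rejects non-HTTPS URLs, then builds the directive.
function buildPlayDirective(
  url: string,
  token: string,
  offsetInMilliseconds: number = 0
): PlayDirective {
  if (!/^https:\/\//i.test(url)) {
    throw new Error('AudioPlayer streams must be served over HTTPS');
  }
  return {
    type: 'AudioPlayer.Play',
    playBehavior: 'REPLACE_ALL', // stop the current stream and play this one
    audioItem: { stream: { url, token, offsetInMilliseconds } },
  };
}
```

For example, `buildPlayDirective('https://example.com/episode.mp3', 'episode-1')` yields a directive that starts the file from the beginning.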

This is a companion discussion topic for the original entry at https://www.jovo.tech/docs/amazon-alexa/audioplayer


Hi, can I add lyrics to the music? Like the ones I see when playing songs with Amazon Unlimited. I know I can display text, but how do I sync the text with the sound?


Hi @frienerVogle,

as far as I know, this is not yet possible. You could achieve this with APL and the delay property for APL commands by showing the respective subtitle in the right time slot. However, I would not recommend this, as it can be very complicated and tedious to time it right.
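A rough sketch of the delay-based idea, assuming the APL document already contains a Text component for the lyrics. The names `buildLyricsDirective`, `LyricLine`, and the `componentId`/`token` values are hypothetical. Nothing here is synchronized with the actual audio position, which is why the timing tends to drift in practice.

```typescript
// One lyric line and the time (ms from the start) at which to show it.
interface LyricLine {
  atMs: number;
  text: string;
}

// Builds an Alexa.Presentation.APL.ExecuteCommands directive.
// componentId and token are assumptions: they must match the Text
// component id and the RenderDocument token of your own APL document.
function buildLyricsDirective(lines: LyricLine[], componentId: string, token: string) {
  const commands = lines.map((line) => ({
    type: 'SetValue',
    componentId,
    property: 'text',
    value: line.text,
    delay: line.atMs, // wait this many milliseconds before running the command
  }));
  return {
    type: 'Alexa.Presentation.APL.ExecuteCommands',
    token,
    // Parallel starts all child commands at once, so every delay is
    // measured from the same starting point.
    commands: [{ type: 'Parallel', commands }],
  };
}
```

With a `Sequential` command instead of `Parallel`, the delays would add up, so each entry would need the gap to the previous line rather than the absolute time.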


Yeah, I think the Skill that we tested here uses an embedded video:


Hi @jan, is it possible to initiate the AudioPlayer with, say, a Launch intent, but then still be able to use other intents to, for example, change the track or sign up for updates? From what I have discovered so far, as soon as the AudioPlayer starts playing an audio file, the only intents that Alexa responds to are the built-in intents, and not those coming from the app. Or maybe I’m doing something wrong?


Hi @MarekMis. I ran into the same problem a while back. The only workaround I see at the moment is indirect/name-free invocation of a skill. Currently, that is only available in the en-US locale though.

Here you can find more on the topic: https://developer.amazon.com/docs/custom-skills/understanding-how-users-invoke-custom-skills.html#indirectly-invoke-a-skill-with-name-free-interaction

Please correct me if I am wrong.


Thanks @jonathanmuth. So this is actually an Alexa/Google Assistant platform limitation. That’s fine. A workaround is to use SSML responses and break the audio file down into snippets of at most 240 s each, with added interactivity. Not an ideal solution, but a workable one 🙂
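A minimal sketch of that snippet workaround, assuming the audio has already been cut into parts of at most 240 seconds. The file naming scheme (`part-<index>.mp3`), the helper names, and the prompt wording are all made up for illustration. SSML audio also has its own encoding requirements, so check Amazon's SSML reference before relying on this.

```typescript
const MAX_SNIPPET_SECONDS = 240; // limit mentioned above

// Number of snippets needed for an audio file of the given length.
function snippetCount(totalSeconds: number): number {
  return Math.ceil(totalSeconds / MAX_SNIPPET_SECONDS);
}

// Builds the SSML for one snippet followed by a short prompt, so the
// session stays open and custom intents can be handled between parts.
// Hypothetical naming scheme: <baseUrl>/part-<index>.mp3
function buildSnippetSsml(baseUrl: string, index: number, total: number): string {
  const audio = `<audio src="${baseUrl}/part-${index}.mp3"/>`;
  const prompt =
    index < total - 1
      ? 'Say next to continue, or tell me what you would like to do.'
      : 'That was the last part.';
  return `<speak>${audio} ${prompt}</speak>`;
}
```

A 10-minute file (600 s) would need `snippetCount(600)` parts, which is 3, and each response would play one part before asking the user what to do next.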