AudioPlayer Overview

The Alexa Voice Service (AVS) consists of interfaces that correspond to foundational client-side (or product) functionality, like audio playback, volume control, or text-to-speech (TTS). Typically, these interfaces have a one-to-many relationship with built-in Alexa capabilities and third-party skills developed using the Alexa Skills Kit (ASK). For example, Amazon Music, Flash Briefing, Audible, TuneIn, and audio streaming via Alexa skills all rely on the AudioPlayer Interface to manage, control, and report on streaming audio content.

AVS sends directives to your client instructing it to take action (for example, to play a stream), and expects events to be returned in a specific order as these actions are performed. It's important that you implement the AudioPlayer Interface correctly to ensure that all streaming services that leverage AudioPlayer work as designed, and that you prepare your product to pass media certification. This page provides conceptual information, definitions, and sequence diagrams to help you as you develop, integrate, test, and troubleshoot.

A Simple Example

Let's start with a simple example to illustrate the expected interaction between your client and AVS. Imagine that you're in the kitchen cooking a pasta dinner – hands full, water boiling – and rather than reach for your phone to play some music, you say, "Alexa, play some music." Here's what happens under the hood.

A Recognize event, including a binary audio attachment (captured speech), is sent to AVS. The captured audio is processed and translated by Alexa into a series of directives (and potentially corresponding audio attachments), which are then sent to your client instructing it to take action.

In this scenario, your client receives two directives. The first, a Speak directive, instructs your client to play back Alexa speech. For example, "Shuffling your music". The second, a Play directive, instructs your client to start playback of your music.

Before acting on the Play directive, AVS expects your client to handle the Speak directive and send a series of events to AVS. In this case, a SpeechStarted event is sent when your client starts playback of Alexa speech, and a SpeechFinished event is sent when playback of Alexa speech finishes. At this point, your client begins playback of the stream included in the Play directive.

When playback begins your client sends a series of lifecycle events to AVS:

PlaybackStarted is sent when playback begins. The offsetInMilliseconds sent to AVS should match the offset provided in the Play directive.

PlaybackNearlyFinished is sent when your client is ready to buffer/download the next stream in your playback queue. Many implementations send this event shortly after PlaybackStarted to start buffering and reduce lag between playback of streams.

PlaybackStopped is sent if/when your client receives a Stop directive, and stops playback.

These events notify Alexa that playback has started, request the next stream, and provide progress reporting information to AVS and music service providers.
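The lifecycle events above can be sketched as a small helper that builds the event payloads in the order they should be sent. This is a minimal, illustrative sketch: the event names are from the AudioPlayer interface, but the function shape and parameters are hypothetical, not part of the AVS API.

```python
def lifecycle_events(start_offset_ms, stop_offset_ms=None):
    """Return AudioPlayer lifecycle events for a single stream, in the
    order they should be sent to AVS (illustrative sketch)."""
    events = [
        # Sent when playback begins; must echo the Play directive's offset.
        {"name": "PlaybackStarted", "offsetInMilliseconds": start_offset_ms},
        # Sent when the client is ready to buffer the next enqueued stream;
        # many implementations send this shortly after PlaybackStarted.
        {"name": "PlaybackNearlyFinished", "offsetInMilliseconds": start_offset_ms},
    ]
    if stop_offset_ms is not None:
        # Sent if a Stop directive arrives and playback halts.
        events.append({"name": "PlaybackStopped",
                       "offsetInMilliseconds": stop_offset_ms})
    return events
```

For example, a stream started at offset 0 and stopped 45 seconds in would produce PlaybackStarted, PlaybackNearlyFinished, and PlaybackStopped, in that order.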

In the following sections we'll cover these events and when you must send them. The most important thing for now is that you're finishing up that Bolognese sauce while smooth jazz fills your kitchen.

AudioPlayer Directives

Play: Instructs your client to begin playback of audio originating from the cloud. In addition to providing a URI or audio attachment, each Play directive includes information like playBehavior, offsetInMilliseconds, streamFormat, expiryTime, and progressReport, which tell your client which lifecycle events must be sent to Alexa.

Stop: Instructs your client to stop playback of an audio stream. Your client may receive this directive as the result of a voice request or physical control (see PlaybackController).

ClearQueue: Instructs your client to clear the current playback queue. The ClearQueue directive has two behaviors: CLEAR_ENQUEUED, which clears the queue and continues to play the currently playing stream; and CLEAR_ALL, which clears the entire playback queue and stops the currently playing stream (if applicable).
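The two ClearQueue behaviors can be sketched as a single queue operation. The function and its parameters are hypothetical placeholders for however your client models its local queue; only the two behavior strings come from the directive itself.

```python
def handle_clear_queue(queue, now_playing, clear_behavior):
    """Apply a ClearQueue directive to a local playback queue (sketch).
    `queue` holds enqueued stream tokens; `now_playing` is the token of
    the currently playing stream, or None."""
    queue.clear()  # both behaviors clear the enqueued streams
    if clear_behavior == "CLEAR_ALL":
        now_playing = None  # CLEAR_ALL also stops the active stream
    return queue, now_playing
```

With CLEAR_ENQUEUED the active stream keeps playing; with CLEAR_ALL both the queue and the active stream are cleared.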

Your client should be designed to handle all properties provided by the API and should not break when unexpected fields or properties are encountered. For example, if you are using Jackson (a JSON parser), FAIL_ON_UNKNOWN_PROPERTIES should be set to false.
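The FAIL_ON_UNKNOWN_PROPERTIES setting mentioned above is Jackson-specific (Java). The same principle, sketched in Python, is to extract only the fields you understand and silently ignore the rest; the field whitelist here reflects the stream properties described on this page, and the extra field in the example is purely hypothetical.

```python
import json

KNOWN_STREAM_FIELDS = {"url", "streamFormat", "offsetInMilliseconds",
                       "expiryTime", "progressReport", "token",
                       "expectedPreviousToken"}

def parse_stream(payload_json):
    """Extract only the stream fields the client understands, ignoring
    unknown properties so new API fields never break parsing."""
    raw = json.loads(payload_json)
    return {k: v for k, v in raw.items() if k in KNOWN_STREAM_FIELDS}

# A payload containing a hypothetical future field still parses safely.
stream = parse_stream('{"url": "cid:abc", "token": "t1", "newField": 42}')
```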

Recommended Media Support

Play directives will provide audio in a variety of formats, containers and bitrates. See Recommended Media Support for codecs, containers, streaming formats, and playlists that your product should support to provide a familiar Alexa experience to your customers.

Use the playBehavior in the payload of each Play directive to adjust or maintain your client's queue.

Match the active stream's token with the expectedPreviousToken of the stream being added to the queue.
Note: If the tokens don't match, the stream should be ignored. However, if no expectedPreviousToken is included in the Play directive, the stream should be added to the queue.
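The token-matching rule above can be sketched as a single check. The function name and stream representation are illustrative; only the expectedPreviousToken semantics come from the interface.

```python
def should_enqueue(active_token, stream):
    """Decide whether a stream from an ENQUEUE Play directive joins the
    local queue, per the expectedPreviousToken rule (sketch)."""
    expected = stream.get("expectedPreviousToken")
    if expected is None:
        return True  # no expectation given: always enqueue
    return expected == active_token  # on a mismatch, ignore the stream
```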

Dissecting a Play Directive

Let's return to our simple example. You'll remember that after asking Alexa to play music, a Play directive was returned instructing your client to start playing an audio stream (or binary audio attachment). The directive's payload supplies important information like the stream URL, when the stream URL expires, the expected playback behavior, and progress reporting requirements. In this section we're going to dissect that Play directive.
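For reference, here is a representative Play directive payload, written as a Python dict. The URL, identifiers, and timestamp are illustrative placeholders rather than real AVS values; the field names and nesting follow the descriptions in this section.

```python
play_directive_payload = {
    "playBehavior": "REPLACE_ALL",
    "audioItem": {
        "audioItemId": "audioItem-example-id",        # placeholder
        "stream": {
            "url": "https://example.com/stream.mp3",  # placeholder URL
            "streamFormat": "AUDIO_MPEG",
            "offsetInMilliseconds": 0,
            "expiryTime": "2024-01-01T12:00:00+0000", # ISO 8601, placeholder
            "progressReport": {
                "progressReportDelayInMilliseconds": 20000,
                "progressReportIntervalInMilliseconds": 20000,
            },
            "token": "token-example",                 # placeholder
        },
    },
}
```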

The first thing we encounter is playBehavior, which provides information about how this Play directive impacts your local playback queue. Three behaviors are supported:

REPLACE_ALL: Instructs your client to immediately begin playback of the stream included in the payload and replace any enqueued streams in your local playback queue.

ENQUEUE: Instructs your client to add the stream contained in the Play directive to the end of your current playback queue.

REPLACE_ENQUEUED: Instructs your client to replace all streams in your local playback queue. This does not impact the currently playing stream.

In the sample payload above, playBehavior is set to REPLACE_ALL. As such, your client must clear its local playback queue and immediately start playback of the audio stream included in the payload.

Next is the audioItem object, which includes audioItemId and stream.

audioItemId: an opaque token that identifies the audio stream.

stream: an object that provides specific information about the audio stream, including:

url: identifies the location of the audio content. If the audio content is a binary audio attachment, the value will be a unique identifier for the content formatted with the following prefix: cid:.

streamFormat: identifies the format of the audio stream.

offsetInMilliseconds: identifies the offset from which your client is expected to start playback of the audio stream.

expiryTime: a timestamp for when the stream will become invalid (date and time in ISO 8601 format).

progressReport: an object that contains information about the progress reports required by the content provider. progressReport supports progressReportIntervalInMilliseconds and progressReportDelayInMilliseconds. In this example, both are required.

progressReportDelayInMilliseconds: the offset from the start of the stream at which the initial progress report must be sent. The corresponding event is sent only once, at the offset specified in the Play directive.

progressReportIntervalInMilliseconds: the interval at which progress reports must be sent periodically. The corresponding event is sent each time the interval elapses, measured from the start of the track.

token: an opaque token that represents the current audio stream.

The payload provides your client with all the information needed to successfully handle an audio stream and add it to your local playback queue.

Make sure that you keep track of the offsetInMilliseconds, progressReportDelayInMilliseconds, and progressReportIntervalInMilliseconds. These parameters provide progress reporting information to your client, and may contain values that need to be returned to AVS.

Note: For a complete listing of directives/events and associated behaviors, please see the AudioPlayer Interface.

Progress Reporting

If progressReportDelayInMilliseconds and/or progressReportIntervalInMilliseconds are present in a Play directive's payload, it's the content provider's way of telling your client that progress reporting is required for that specific stream.

When these parameters are present, your client must send the corresponding lifecycle events:

ProgressReportDelayElapsed: The ProgressReportDelayElapsed event must be sent to AVS if progressReportDelayInMilliseconds is present in the Play directive. The event must be sent once at the specified interval from the start of the stream (not from the offsetInMilliseconds). For example, if the Play directive contains progressReportDelayInMilliseconds with a value of 20000, the ProgressReportDelayElapsed event must be sent 20,000 milliseconds from the start of the track. However, if the Play directive contains an offsetInMilliseconds value of 10000 and a progressReportDelayInMilliseconds value of 20000, the event must be sent 10,000 milliseconds into playback. This is because the progress report is sent from the start of a stream, not the Play directive's offset.

ProgressReportIntervalElapsed: The ProgressReportIntervalElapsed event must be sent to AVS if progressReportIntervalInMilliseconds is present in the Play directive. The event must be sent periodically at the specified interval from the start of the stream (not from the offsetInMilliseconds). For example, if the Play directive contains progressReportIntervalInMilliseconds with a value of 20000, the ProgressReportIntervalElapsed event must be sent 20,000 milliseconds from the start of the track, and every 20,000 milliseconds until the stream ends. However, if the Play directive contains an offsetInMilliseconds value of 10000 and a progressReportIntervalInMilliseconds value of 20000, the event must be sent 10,000 milliseconds from the start of playback, and every 20,000 milliseconds after that until the stream ends. This is because the interval specified is from the start of the stream, not the Play directive’s offset.
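The offset arithmetic in the two examples above can be sketched as a small helper that computes, relative to when playback actually begins, when each first report is due. The function shape is hypothetical; the rule it encodes is that both parameters are measured from the start of the stream, so the Play directive's offset is subtracted (and for the delay report, only if it has not already passed).

```python
def first_report_delays(offset_ms, delay_ms=None, interval_ms=None):
    """Compute when the first ProgressReportDelayElapsed and
    ProgressReportIntervalElapsed events are due, in milliseconds
    from the moment playback begins at offset_ms (sketch)."""
    due = {}
    if delay_ms is not None and delay_ms > offset_ms:
        # Delay is measured from the start of the stream, sent once.
        due["ProgressReportDelayElapsed"] = delay_ms - offset_ms
    if interval_ms is not None:
        # Next interval boundary after the offset, also measured from
        # the start of the stream; subsequent reports follow every
        # interval_ms thereafter.
        elapsed_intervals = offset_ms // interval_ms
        next_mark = (elapsed_intervals + 1) * interval_ms
        due["ProgressReportIntervalElapsed"] = next_mark - offset_ms
    return due
```

Matching the worked examples: with an offset of 10000 and both parameters set to 20000, both first events are due 10,000 milliseconds into playback.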

Sequence Diagrams

The following diagrams illustrate lifecycle events your client is expected to send in response to directives sent from Alexa (and subsequently actions taken by your product). In conjunction with logs produced by the AVS Device SDK these diagrams can be used to troubleshoot development and certification issues.

Scenario 1: "Alexa, play rock music from iHeartRadio."

In this scenario, a user makes a request to play rock music from iHeartRadio. The diagram below provides the appropriate sequencing of events sent to and directives expected from AVS.

PLEASE NOTE: In this example, the first stream plays until completion and the client sends a PlaybackFinished event.


Scenario 2: Stop and resume an audio stream

In this scenario the user plays a song, and approximately 45 seconds into playback the user says, "Alexa, stop." Approximately 10 seconds later, the user says, "Alexa, resume." This scenario is used to ensure that your device is sending the correct progress reports from the origination of a stream. It also highlights the use of channels, a concept used to govern how a client should prioritize audio outputs, in this case audio playback and Alexa speech.

PLEASE NOTE: In this example, the user makes a request to stop audio playback. When the user barges in (interrupts audio playback), audio playback on the client is temporarily paused while the Dialog channel is active and in the foreground. When this occurs, your client must send PlaybackPaused. After Alexa has identified your request, StopCapture and Stop directives are sent, which instruct your client to close the microphone and to stop audio playback on the Content channel, respectively. In response to the Stop directive, a PlaybackStopped event must be sent. This is different from the previous example, where a PlaybackFinished event was sent when the stream played to completion.

Important: The difference between PlaybackPaused and PlaybackStopped is critical. PlaybackPaused must only be sent when audio is temporarily paused to accommodate higher priority content. In the example above, the higher priority content is the user request on the Dialog channel. PlaybackStopped is sent in response to a Stop directive.
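The distinction above can be sketched as a single decision function. The cause labels here are hypothetical names for internal client states, not AVS values; only the three event names come from the AudioPlayer interface.

```python
def playback_event_for(cause):
    """Choose the lifecycle event to send when audio halts (sketch).
    `cause` is a hypothetical internal label for why playback stopped."""
    if cause == "higher_priority_channel":  # e.g. Dialog channel barge-in
        return "PlaybackPaused"
    if cause == "stop_directive":           # explicit Stop from AVS
        return "PlaybackStopped"
    if cause == "end_of_stream":            # track played to completion
        return "PlaybackFinished"
    raise ValueError("unhandled cause: " + cause)
```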


Scenario 3-A: Use a physical control to navigate to the next stream in your playback queue

In this scenario the user plays a song, and approximately 15 seconds into playback the user presses the next button located on the device to skip to the next stream.

PLEASE NOTE: This example is for local controls, not actions performed in the Amazon Alexa app.


Scenario 3-B: Use voice to navigate to the next stream in your playback queue

In this scenario the user plays a song, and approximately 15 seconds into playback the user says, "Alexa, next".


Scenario 4: Music playback is interrupted by a sounding alarm

In this scenario, a user asks an AVS device to play music. During playback a previously set alarm goes off, which is then stopped by the user. It highlights the use of channels, a concept used to govern how a client should prioritize audio outputs, in this case audio playback and a sounding alarm.

The diagram below provides the appropriate sequencing of events sent to and directives expected from AVS.


Scenario 5: "Alexa, what movies are playing by me?"

In this scenario, a user makes a request for movies nearby. The diagram below provides the appropriate sequencing of events sent to and directives expected from AVS.