Software apps and online services

The host must support HTTPS with an Amazon-trusted SSL certificate. We used GitHub Pages.

UNIX-like development environment

We used Mac OS X 10.10.5 and installed the software we needed using Homebrew.

Audio processing software

We used FFmpeg.

Story

The Pianist is a skill to assist musicians with their everyday tasks. It can give you a pitch when you need to tune your instrument. For singers, it can provide a vocal warmup that goes as low as C3 (130.81 Hz) and as high as G6 (1568.0 Hz).
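The frequencies quoted above are consistent with twelve-tone equal temperament tuned to A4 = 440 Hz. The skill itself plays recorded piano clips, so the following is only a minimal sketch for checking those numbers; the class name, the MIDI-style note numbering (C4 = 60), and the 440 Hz reference are our assumptions for illustration.

```java
// Hypothetical helper, not part of the skill: computes equal-temperament
// pitch frequencies, assuming A4 = 440 Hz and MIDI numbering (C4 = 60).
final class PitchFrequency {
    private static final double A4_HZ = 440.0; // assumed tuning reference
    private static final int A4_MIDI = 69;     // MIDI note number of A4

    // Each semitone multiplies the frequency by 2^(1/12).
    static double frequencyHz(int midiNote) {
        return A4_HZ * Math.pow(2.0, (midiNote - A4_MIDI) / 12.0);
    }

    public static void main(String[] args) {
        System.out.println("C3: " + frequencyHz(48) + " Hz"); // 21 semitones below A4
        System.out.println("G6: " + frequencyHz(91) + " Hz"); // 22 semitones above A4
    }
}
```

Under these assumptions C3 comes out at roughly 130.81 Hz and G6 at roughly 1568 Hz, matching the range above.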

Lean Approach

We took a lean approach to creating this skill -- using the build, measure, learn, repeat cycle. We delivered the skill in three iterations. The first release of the skill could only play an A and do a simple ascending warmup. An A is sufficient to tune most instruments, and the simple warmup was comprehensive enough for many singers.

To measure the results of each release, we got a quantitative measurement of usage, using AWS CloudWatch metrics for the λ. We also got qualitative feedback on the skill through the reviews in the Alexa app.

After the initial release, we observed that people were willing to use the skill and that it was worthy of further development. Subsequent iterations introduced the ability to play the other 11 notes in a 12-tone chromatic scale and the ability to continue warming up by going higher, lower, back up, back down, and repeating.

Record Sound Clip

The first step is to record a sound clip using the piano and sound recorder. The total playback time of the audio, plus any synthesized speech, cannot exceed 90 seconds. It was for this reason that the warmup had to be split into multiple chunks. This technical limitation ended up making the skill more versatile -- allowing singers to customize their warmup routine.

Upload to web host

The audio file needs to be served over HTTPS from a publicly accessible host with an Amazon-trusted SSL certificate. Again, more details are available in the documentation.

Challenges

We encountered several unexpected challenges while developing the skill.

Pronunciation

First, while handling the boundary cases for the warmups, we noticed that the Echo had trouble pronouncing foreign names like Popoli di Tessaglia and Die Entführung aus dem Serail. Fortunately, the Alexa Skills Kit supports a subset of the International Phonetic Alphabet (IPA). Although it does not support the full set of phones necessary to correctly pronounce the original Italian and German, we managed to synthesize an acceptable American pronunciation of the names using:

po.po.li di tɛ.ˈsɑljə

ɛnt.ˈfuɹʊŋ aus dem sɛˈɹaɪ
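In SSML, an IPA transcription like these is passed to the speech synthesizer through the phoneme element, with alphabet="ipa" and the transcription in the ph attribute. Below is a minimal sketch of how a response string could be assembled; the class and method names are hypothetical and this is not necessarily how our skill builds its output.

```java
// Hypothetical helper for wrapping text in an SSML <phoneme> element.
final class PhonemeSsml {
    // Escape characters that are special in XML attribute values and text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // ipa: the transcription to speak; fallbackText: the written name,
    // shown wherever the phoneme attribute is not used.
    static String phoneme(String ipa, String fallbackText) {
        return "<phoneme alphabet=\"ipa\" ph=\"" + escapeXml(ipa) + "\">"
                + escapeXml(fallbackText) + "</phoneme>";
    }

    // SSML output speech must be wrapped in a <speak> root element.
    static String speak(String body) {
        return "<speak>" + body + "</speak>";
    }
}
```

For example, `PhonemeSsml.speak(PhonemeSsml.phoneme("po.po.li di tɛ.ˈsɑljə", "Popoli di Tessaglia"))` would produce a complete SSML string for the first name.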

Unexpected Slot Values

Finally, by observing invocation errors in the CloudWatch metrics and digging into the CloudWatch logs, we were surprised to find that sometimes the slot values provided to the λ do not exactly match any of the custom slot values we defined. For example, we have a custom slot type that has 21 possible values that represent the 12 notes along with their alternative names (e.g. we have both "D Sharp" and "E Flat"). However, sometimes the slot value included punctuation (e.g. "f."). Sometimes it had invalid note names that did not correspond to any of the defined slot values (e.g. "scale"). Finally, sometimes it had extraneous articles (e.g. "a c" and "a c sharp"). To address these issues, we added unit tests to our suite that replicated these problems and then modified the code accordingly.
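The three failure modes above (punctuation, unknown names, and a stray leading article) can be handled by normalizing the raw slot value before looking it up. The following is a minimal sketch rather than our actual code: the class name is hypothetical, and the list of 21 note names (each letter A to G as natural, sharp, and flat) is an assumption about what the slot type contains.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical normalizer for a note-name slot value.
final class NoteSlotNormalizer {
    // Assumed slot values: 7 letters x {natural, sharp, flat} = 21 names.
    private static final Set<String> NOTES = new HashSet<>();
    static {
        for (char letter = 'a'; letter <= 'g'; letter++) {
            NOTES.add(String.valueOf(letter));
            NOTES.add(letter + " sharp");
            NOTES.add(letter + " flat");
        }
    }

    // Returns the canonical lower-case note name, or empty if unrecognized.
    static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        // Lower-case, drop punctuation such as the "." in "f.", collapse spaces.
        String s = raw.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z ]", "")
                .trim()
                .replaceAll("\\s+", " ");
        if (NOTES.contains(s)) {
            return Optional.of(s); // checked first so "a sharp" stays A sharp
        }
        // Treat a leading "a " as an article only if the rest is a valid note.
        if (s.startsWith("a ") && NOTES.contains(s.substring(2))) {
            return Optional.of(s.substring(2));
        }
        return Optional.empty(); // e.g. "scale": the caller can re-prompt
    }
}
```

Checking the full value before stripping the article matters because "a" is itself a note name: "a sharp" must resolve to A sharp, while "a c sharp" resolves to C sharp.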

Cold Starts

One thing we noticed while testing the skill was that sometimes the Echo would take a long time to respond. Subsequent invocations would be much faster. This was corroborated by the CloudWatch metrics. Although most of the λ invocations completed in under 1 second, there were many outliers that took on the order of 5.5 seconds and one that took up to 7 seconds. This is an unacceptable user experience.

Since we had decided to build the λ in Java, we thought that the culprit might be a combination of copying the archive, decompressing it, and loading classes into memory. To address class loading, we removed unnecessary uses of library classes such as the string utilities and validators from Apache commons-lang. To improve the time to handle the archive, we used ProGuard to shrink the size of the jar by removing unnecessary classes. It took many iterations to get the right classes excluded without breaking essential functionality. In the end, this is the configuration we used:

We managed to shrink the archive from 3.1 MB to 1.8 MB... This had absolutely no effect on the cold start problem. When testing the λ in the AWS console after uploading it, we still observed 5+ second invocations.

Finally, we increased the amount of memory available to the λ, although the maximum memory it ever uses is 35 MB. We raised the available memory from 128 MB to 512 MB. By increasing the memory, we increase the share of the underlying hardware's compute resources allocated to the λ.

This had a tangible effect. In the AWS Console, the initial test invocation of the λ dropped to 2.5 seconds. In addition, the impact was obvious from the CloudWatch metrics as pictured below. Prior to increasing the RAM, there were outlier invocation times from 5.5 seconds to over 7 seconds, and the 6-hour moving average duration peaked at 2.6 seconds. After increasing the RAM, the outlier invocation times dropped to at most 1.7 seconds, and the 6-hour moving average duration dropped to a peak of 674 milliseconds. This makes for a much more seamless experience for the musician.