Group Report: Overcoming Roadblocks in the Quest for Interactive Audio

Participants:

Charles Robinson, Dolby Labs

Jocelyn Daoust, Ubisoft

Stephen Kay, Karma Lab

Karen Collins, University of Waterloo

Nicholas Duveau, Ubisoft

Barry Threw

Guy Whitmore, Microsoft

Tracy Bush, NC Soft

Jennifer Lewis, Relic

Kurt Larson, Slipgate Ironworks

Simon Ashby, AudioKinetic

Scott Snyder, The Edge of Reality

Tom White, MIDI Manufacturers Association

Facilitator: Aaron Higgins

Currently, game audio often does not provide a highly variable, emotionally charged experience that is context-driven and sufficiently integrated into the game...

There was a time when the idea of richly interactive audio was just a flicker in the minds of the gaming community. Visionary composers, sound designers, game designers and players dreamt of a time when the score of an interactive media work could be intimately intertwined with the user’s experience in an endearing and engaging way. Through experiment, innovation and sheer creative will these pioneers lit a path through the dark void of repetition and stasis; they overcame restrictive hardware, non-existent standards, poor funding and a lack of compositional tools to carry the torch forth into this new frontier. We have built frameworks and tools, we have composed breathtaking original works, we have agreed upon standards – indeed, we have come far toward the realization of our dream. It is a good time to look back and feel amazed at the long road we have traveled, but also a good time to look around to ascertain where we are.

When we survey the landscape of game audio today, we find that audio for interactive entertainment and games isn’t taking full advantage of the creative potential offered by the available platforms and hardware. We find many game scores that are only loosely tied to the actions and situations on screen rather than tightly integrated with game-play. We find repetition, we find stasis. With play lengths approaching epic time spans, and plot lines becoming ever more open-ended, it is more important than ever to encourage dynamic interactivity in our game audio. We feel that supporting our games with highly adaptive scores would increase the depth of a game’s immersion, widen the range (and increase the appropriateness) of a game’s emotions, and could help drive the game-play itself. With the dream of our forerunners held firmly in mind, we have identified the largest obstacles blocking our progress and illuminated some possible paths around them. It is our hope that this will reignite efforts to pursue the highly elusive interactive audio score.

The Problems

Game audio has made some large strides in the past decade and a half, but our priorities have diverged from rich interactivity. Specifically, the production values of the audio assets themselves have improved dramatically. This was made possible by CD, DVD and HDD streaming capabilities, more available memory, and faster processor speeds. Gone is the (however comforting) 8-bit audio of old; the full production capabilities used to create film scores are now being brought to bear on game audio, including live orchestras, high-end studios, and the best DAW hardware, software and DSP.

However, game audio has lost something important in this quest for fidelity: adaptability. When it comes to context-driven interactive audio, we’ve actually taken some steps backwards in our drive towards pristine audio assets. Many in our group pine for days past when music could be controlled at the instrument, note, or phrase level, rather than just by the three-to-five minute cue. In recent years sound designers and composers have had to compromise interactivity for the sake of the perceived quality of individual assets, both because of insufficient platform capability and limited funding. This bias toward audio resolution was actually the will of many composers and sound designers, the demand of our producers, and the wish of the people who play our games.

But now, in 2008, the trade-off of tight audio integration for high asset fidelity is no longer necessary. From the late 1990s through to the early 2000s, game platforms, game PCs, and audio tools were in a state in which it was difficult to have your cake (high production values) and eat it too (high interactivity). But PCs and game consoles have evolved to the point where the needed power is under the hood to support both the highest quality assets and the most dynamic gameplay integration. Therefore our message to the industry is ‘no more excuses’! We have noticed many factors impeding the advance of rich interactivity in game audio.

Lack of will in content creators.

There are many composers and sound designers who love creating and integrating audio content for games, and will go the extra mile to make it work well in context. But even for this dedicated group, creating adaptive content is a difficult slog, from convincing management of an idea to working within a suboptimal production pipeline. And let’s face it, there are many in our profession who don’t want to do more than throw audio over the fence to a developer. New paradigms of audio creation are just emerging, and most don’t want to jump on until those paradigms are established, the content becomes more efficient to produce, and there’s money to be made!

Lack of intuitive and comprehensive compositional tools.

Game audio tools have also made tremendous strides in recent years. We may be on the cusp of a golden era of game audio tools, and so much of what we need is already available in third party audio engines, as well as some proprietary solutions. Even so, there are engine features the group defined that aren’t available in these engines (to be discussed later), and many audio designers/composers don’t take full advantage of the features that are there for various reasons. The foundations for the necessary composition technologies have been developed, but a full integration of all the disparate parts has not been achieved.

Lack of interest by game producers and purse holders.

In addition to technological issues, the group also pointed out problems with convincing management of the benefits of more adaptive game scores. Many purse-holders are reluctant to invest in new concepts and technology if they don’t see the immediate monetary value, i.e. “How many more units will this sell?” Without the ability to quantify the value of adaptive audio, many companies fail to recognize the difference that emotionally effective adaptive audio can make to the impact and enjoyment of their game.

Some Solutions

To little surprise, the problems presented are deep and have many facets, any specific portion of which deserves a dedicated working group. We discussed many areas at a high level such as tools, education, prototyping, various interactive audio techniques, game genres and their specific needs, and the anatomy of an audio system.

However, our most promising direction lies in two parts, tool building and education. First, we define and refine the various components of an interactive audio system to reveal the underpinnings and potential of such a system. Second, we take steps to foster a forum and community of audio professionals in which case studies can be shared, and conversation and discussion encouraged. These things we hope will accelerate the evolution and adoption of advanced interactive audio techniques and greater range of expression within this fledgling medium.

As part of the initial discussions, Guy Whitmore diagrammed the potential components and data-flow of a runtime interactive audio system (Figure 1). In the vast majority of today’s systems, the only components are the Audio AI fed by calls from the game, the wave banks, and the mixing and routing section. In the 80s and 90s (and currently on handheld devices, phones, and the Wii) systems commonly used MIDI data with wave-table synthesis. There’s a basic principle that the more granular your content, the higher the potential for adaptability. For example, music broken into 4-measure phrases (as waves) is more granular than a continuous 5-minute cue (wave), and music in the MIDI format is granular to the note and instrument level, and therefore has much more potential for adaptability.

The first question to address is what features our ideal system would contain. Our workgroup spent a great deal of time discussing what functionality we would like to see in an interactive audio system, and delineating these features greatly informed our direction of progress. However, the question of a minimal set of requirements in terms of allocation and resources still remained. A post-workgroup survey of composers by Kurt Larson shed some light on what the minimum required resources would be (Appendix C).

We decided to focus our limited time on what we dubbed the Audio AI portion of the system, as this is sort of the ‘conductor’ and coordinator for the other components. In the simplest terms, the Audio AI receives data calls from the game engine and decides what to do with that data. The most common and basic of these is a cue call (e.g. ‘wood door opens’): the Audio AI determines the proper cue to play, which then triggers an associated wave file. Although the scope of what the Audio AI engine includes is a bit amorphous, we define it primarily as an information-processing entity, as opposed to an audio-processing one.
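The cue-call flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `AudioAI`, the cue table layout, and the wave file names are hypothetical, and the "avoid immediate repetition" policy is one possible decision rule rather than anything the workgroup specified.

```python
import random


class AudioAI:
    """Hypothetical information-processing layer: maps game cue calls to wave files."""

    def __init__(self, cue_table, seed=None):
        # cue_table: cue name -> list of candidate wave file names (illustrative data)
        self.cue_table = cue_table
        self.rng = random.Random(seed)
        self.last_played = {}

    def handle_cue(self, cue_name):
        """Decide which wave file to trigger for a cue call from the game engine."""
        candidates = self.cue_table.get(cue_name)
        if not candidates:
            return None  # unknown cue: nothing to play
        # Assumed policy: avoid repeating the previous variation when another exists.
        previous = self.last_played.get(cue_name)
        choices = [w for w in candidates if w != previous] or candidates
        wave = self.rng.choice(choices)
        self.last_played[cue_name] = wave
        return wave


ai = AudioAI({"wood_door_opens": ["door_wood_01.wav", "door_wood_02.wav"]}, seed=1)
first = ai.handle_cue("wood_door_opens")
second = ai.handle_cue("wood_door_opens")
```

Note that the sketch only returns a file name; actual playback would belong to the wave bank and mixing components, which keeps the Audio AI an information-processing entity.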

The Audio AI Engine, like any artificial intelligence or expert system, is a complex entity that requires a broad range of functionality to work correctly. Although much of this functionality has been achieved in disparate projects, the integration of them in an intuitive and robust way has not yet been achieved. While the task of creating an endearing and musical computer system is a seemingly opaque and insurmountable problem, a breakdown of the potential components will serve the ongoing discussion and development of adaptive audio. While details of implementation are largely outside the scope of this document, the following suggestions of modular components are informed by the desired abilities of the system, and by experiments with prototype systems (Figure 2).

Routing, allocation, and scheduling

The routing system is the interface from the Game Engine to all other components of the Audio AI system, and the interface of the system components to each other. As the point of entry for data from the Game Engine, it is responsible for querying, polling, or updating information on the game state. It also reports information about the audio state to the game engine. It is responsible for the passing of messages, control data, or audio data between other system components. To address these tasks the Routing system must have knowledge of all components and parameters in the game, which necessitates a reporting scheme to delineate the available parameters. Keeping all data routed through a central authority in this manner has the advantage of providing a single store for all parameters.

This component also provides a hardware abstraction layer, to ideally keep compositions transportable across platforms. The allocation system would keep track of the necessary system specific resources to reproduce the audio, and assign these resources where they are most needed. This would also allow a composition to be scalable to less powerful systems, as the piece would be automatically conformed to available memory, storage, and voices.
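One way to picture the allocation idea is a priority-based voice budget. The sketch below is an assumption-laden illustration: the function name `allocate_voices`, the request tuples, and the "highest priority first, skip what does not fit" policy are all hypothetical choices, not a description of any existing engine.

```python
def allocate_voices(requests, max_voices):
    """Grant voices to the highest-priority requests that fit the platform budget.

    requests: list of (name, priority, voices_needed); a higher priority wins.
    Returns (granted_names, voices_left).
    """
    granted = []
    remaining = max_voices
    for name, priority, needed in sorted(requests, key=lambda r: -r[1]):
        if needed <= remaining:
            granted.append(name)
            remaining -= needed
        # Requests that do not fit are dropped, so the piece scales down gracefully.
    return granted, remaining


# Illustrative layers of one composition (names and numbers are made up).
layers = [
    ("ambience", 1, 8),
    ("player_theme", 3, 16),
    ("combat_layer", 2, 24),
]

small_platform = allocate_voices(layers, 32)
large_platform = allocate_voices(layers, 64)
```

On the smaller budget the combat layer is dropped while the theme and ambience survive; on the larger one all three layers play, which is the kind of automatic conforming the paragraph above describes.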

Additionally this foundation level would contain a clocking system to keep all system events in sync. Control, event and audio rate clocks should be present to allow for consistent timing of all game state information.

Conditionals and logic

While seemingly simple, a system for creating and nesting "if-then" and "and, or, not" types of statements and making decisions based on them allows for a very complex set of interactions. A system such as this should allow arbitrary chaining and nesting of logic groups. All parameters of game and audio engine state should be available for decision making, and for action once a decision has been made.
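A minimal sketch of such nestable logic, assuming rules are written as nested tuples and game state is a plain dictionary. The parameter names (`player_weapon_drawn`, `npc_distance`, and so on) and the music state labels are hypothetical, chosen only to make the example concrete.

```python
import operator

# Comparison operators available to leaf conditions.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def evaluate(rule, state):
    """Recursively evaluate a nested rule against a dictionary of parameters."""
    kind = rule[0]
    if kind == "and":
        return all(evaluate(r, state) for r in rule[1:])
    if kind == "or":
        return any(evaluate(r, state) for r in rule[1:])
    if kind == "not":
        return not evaluate(rule[1], state)
    # Leaf condition: ("cmp", parameter_name, operator_symbol, value)
    _, name, op, value = rule
    return OPS[op](state[name], value)


# Hypothetical rule: enter tension when the player has drawn and the NPC has
# either drawn or is close by, unless the sheriff is already on the scene.
tension_rule = (
    "and",
    ("cmp", "player_weapon_drawn", "==", True),
    ("or",
     ("cmp", "npc_weapon_drawn", "==", True),
     ("cmp", "npc_distance", "<", 5.0)),
    ("not", ("cmp", "sheriff_present", "==", True)),
)


def choose_music_state(state):
    return "tension" if evaluate(tension_rule, state) else "ambient"


standoff = {"player_weapon_drawn": True, "npc_weapon_drawn": True,
            "npc_distance": 20.0, "sheriff_present": False}
rescued = dict(standoff, sheriff_present=True)
```

Because rules are data rather than code, an authoring tool could let a composer build and nest them without programmer involvement.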

Algorithmic processes

Algorithmic generators would create streams of audio or control information that could be used for decisions, note generation, or mixing. Processes such as random generators, fractals, attractors, Markov chains, flocking simulators, particle generators or physical models could generate a wide range of different input data and help fend off pure repetition in compositional sequences.
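As one example of the processes listed above, a first-order Markov chain can generate note sequences that vary on every run while staying within composer-authored tendencies. The transition table below is invented for illustration, and the function name `generate_phrase` is hypothetical.

```python
import random

# Hypothetical transition table: current note -> {next note: probability weight}.
TRANSITIONS = {
    "C": {"D": 0.5, "E": 0.3, "G": 0.2},
    "D": {"C": 0.4, "E": 0.6},
    "E": {"D": 0.3, "G": 0.5, "C": 0.2},
    "G": {"E": 0.6, "C": 0.4},
}


def generate_phrase(start, length, seed=None):
    """Walk the Markov chain from `start`, producing `length` note names."""
    rng = random.Random(seed)
    phrase = [start]
    while len(phrase) < length:
        options = TRANSITIONS[phrase[-1]]
        next_note = rng.choices(list(options), weights=list(options.values()))[0]
        phrase.append(next_note)
    return phrase


phrase = generate_phrase("C", 8, seed=42)
```

Passing a seed makes a phrase reproducible for testing; leaving it as `None` yields a different phrase each time, which is the property that helps fend off repetition.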

Detection and information retrieval

With such a huge range of data available to the system, it is helpful to interpret sets of data so that smarter decisions can be made based on metadata. Components such as beat tracking, phrase matching, pitch matching, and harmony and key matching can be useful compositionally to make decisions based on current and prior musical results.
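To make key matching concrete, here is a deliberately simplistic sketch: it scores each of the twelve major keys by how many recently played notes fall inside its scale. This is an assumption made for illustration, not a robust key-finding method; it ignores minor keys, note durations, and metric weight, and the function name `estimate_major_key` is hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = {0, 2, 4, 5, 7, 9, 11}  # semitone offsets of a major scale


def estimate_major_key(midi_notes):
    """Return the major key whose scale contains the most of the given MIDI notes."""
    best_root, best_score = 0, -1
    for root in range(12):
        score = sum(1 for n in midi_notes if (n - root) % 12 in MAJOR_STEPS)
        if score > best_score:  # ties keep the lowest-numbered root
            best_root, best_score = root, score
    return NOTE_NAMES[best_root]


c_major_run = [60, 62, 64, 65, 67, 69, 71]  # C D E F G A B
g_major_run = [67, 69, 71, 72, 74, 76, 78]  # G A B C D E F#
key_a = estimate_major_key(c_major_run)
key_b = estimate_major_key(g_major_run)
```

A result like this could then feed the logic layer, for instance to choose a transposition for an overlaid theme.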

Prediction

Prediction is the most difficult of the components: the system should have the ability to intelligently interpret data on its own, based on previous compositional structure and current data. Systems such as neural nets and fuzzy logic systems can be trained over the course of a game to make decisions that keep the musical material vibrant over the course of a many-hour game. If built correctly, these systems could also add indeterminate but intelligent aspects to the composition without having to specify every interaction in a decision or logic tree.

UI and storage

An intuitive way to author all of these components, save the composition, and reload it on any conforming implementation in the future is necessary to give a system such as this a long life, and to allow compositions to retain their integrity over time.

This proved to be a very challenging exercise despite the amount of audio expertise in the work group. It may be that thinking and speaking of this functionality in the abstract is difficult and awkward. It may also be that much of what we were trying to define hasn’t yet been defined! After going around and around with this we decided to take a different tack. Rather than begin with the abstracted concepts, we’d start with practical real/virtual world game-play scenarios, then create interactive audio designs based on those scenarios. And from those designs, the needed features and functionality of the audio engine would develop.

Having made some progress in terms of the compositional tools problem, we turned our focus toward the issue of composer and producer education. We resolved to use an approach from a previous Bar-B-Q workgroup, the Adaptive Audio Now initiative, and set up a web presence to collect and aggregate interactive audio case studies, post mortems, blogs, and other articles. This site will be set up to encourage dialogue and community among audio professionals, and over time the number of articles will grow until the site becomes an invaluable destination for those looking to learn about interactive audio. Its home can currently be found at http://www.iasig.org/wiki/index.php?title=Adaptive_Audio_Now%21_Case_Studies. To help get the site content ball rolling we created two case studies, Appendices A and B.

The philosophy here is one of emergence. That is, the evolution of game audio will largely come from the bottom up, rather than the top down. ‘Bottom up’ in this context refers to thousands of composers and sound designers who will experiment with ways of implementing adaptive audio, with the most successful techniques surviving and being elaborated upon. ‘Top down’ would be a group (like this work group) deciding what functionality future game audio systems would use and pushing those ideas on the industry. The Audio AI web presence is meant to foster and facilitate a natural bottom up approach, with the goal of improving game audio, its tools, and the environment we work in.

Having defined our optimal tools, and created a forum for further discussion on the issues at hand, we feel we have made good first steps toward the furthering of our goals. Certainly many of the issues we have outlined need to be covered in more depth by additional workgroups and organizations. However, outlining our obstacles is the first step toward overcoming them, and we now feel the path ahead is shown with much greater clarity.

Figure 1: Block Diagram of an Interactive Audio Studio

Figure 2: Block Diagram of Audio AI Engine Functionality

Appendix A: Case study for music and SFX interactivity in an Action Game
by Jennifer Lewis

The next step of the process is to create potential adaptive audio solutions based on the following scenario, both creative and technical.

The frame:

We’re in a video game, in an active village. One character approaches and passes another. They are cowboys. One is the player, the other a non-player character, an NPC. There is a third non-player character which may enter the scenario, the sheriff.

Also known:

There is music playing. One or more audio streams with multiple chords and key changes.

There are 4 scenarios.

#1 – the player walks past the NPC. No interaction occurs.

On approach, pass-by and departure, however, we would like to play a theme for that character over the top of the existing music, in time with it, using a specific instrument assigned to the NPC’s personality (as it’s an important part of the story). At other times the theme needs to play differently for varied contexts. The theme must be variable (responding to control sources) but always fit in perfectly with whatever music happens to be playing at the moment.

#2 – the PLAYER decides to turn and unholster their weapon. The NPC draws its weapon too.

Crows stop cawing; the townsfolk stop chatting. The town quiets. Overall, the microphone goes to shotgun mode: all sounds but those of the player and NPC drop away, and those two come into extreme focus.
Meanwhile, the music enters a “hold your breath” mode. The orchestration changes seamlessly, perhaps to a minor key. The music doesn’t resolve and stays in a variable, input-awaiting state. Suddenly the Player decides not to pick a fight and holsters their weapon. The tension continues, several instruments in the music subtly increasing their complexity as we wait for the NPC to make its call.

#3 – REWIND: the PLAYER decides to turn and unholster their weapon. The NPC draws its weapon too.

The same music effects occur as above but with variation, as the sound engine is aware that this has occurred before. Still minor and still tense.
Only sounds of the PLAYER AND NPC are heard when suddenly, a shot rings out of nowhere. It’s the sheriff! The music seamlessly enters a resolution mode and the sheriff’s theme is added on top of the resolution bed.

#4 – REWIND AGAIN: the PLAYER AND NPC have weapons drawn. The same music effects have occurred, with variation.

This time the PLAYER shoots and a fire fight begins. Tension mode transitions immediately into combat mode. The tempo jumps up naturally, different instruments and tracks come and go, perhaps the key changes. The arrangement re-arranges itself. As events happen in the fire fight, say for example each time the player fires a bullet, a trail of musically correct notes is triggered in an ascending sequence.
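The bullet-triggered trail of "musically correct" notes could be approached by restricting the triggered pitches to the current scale and stepping upward on each shot. The sketch below assumes MIDI note numbers, an A natural minor key, and a two-octave range; the class name `BulletTrail` and all parameter values are hypothetical.

```python
NATURAL_MINOR = (0, 2, 3, 5, 7, 8, 10)  # semitone offsets from the root


def scale_notes(root, intervals, low, high):
    """All MIDI notes in [low, high] that belong to the given scale."""
    return [n for n in range(low, high + 1) if (n - root) % 12 in intervals]


class BulletTrail:
    """Each shot triggers the next note up the current scale (wrapping at the top)."""

    def __init__(self, root=57, intervals=NATURAL_MINOR, low=57, high=81):
        # Assumed defaults: A natural minor, from A3 (57) up to A5 (81).
        self.notes = scale_notes(root, intervals, low, high)
        self.index = 0

    def on_shot(self):
        note = self.notes[self.index]
        self.index = (self.index + 1) % len(self.notes)
        return note


trail = BulletTrail()
first_four = [trail.on_shot() for _ in range(4)]  # A3, B3, C4, D4
```

A fuller design would also quantize each note to the beat and rebuild the note list when the key changes; both are left out here to keep the sketch short.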

Appendix B: Case study for music and SFX interactivity in a Massively-Multiplayer game
by Kurt Larson

This is a description of the main forms of audio interactivity we should reasonably expect to see in a current-generation massively-multiplayer game. It provides accounts of a single player's experiences and how the audio interactivity supports those experiences. A case is made that live-rendered music will serve the needs of the game better than pre-recorded music.

Our player logs in to her favorite massively-multiplayer game. There is some musical support for the game intro and character select screen, but the in-game music will be the focus of this study. She selects her favorite avatar named "Thalya". Thalya emerges into the game world inside her own private house. She walks over to her in-home music-playing system and turns it on. Music from places in the world to which she has been begins to play in a non-interactive form. She switches to another track she likes more. After a few minutes, she leaves the house, leaving the home-music system playing. As she emerges into the game world outside her home, the in-home music fades out. A streamed track of outdoor ambience sounds plays, and also, some 3-D-positioned sound emitters add more to that track. After a few moments, the game world music begins.

The world music is constructed in real time ("Live-rendered") during gameplay. As such, the individual sounds can be randomly assembled to completely eliminate looping repetition. Short instrumental themes are heard as often as was desired by the composer, but the music is generalized, amorphous. It evokes a mood, like a Brian Eno ambient album, rather than telling a musical story, like a movie soundtrack. Since it is being live-rendered, changes to the music can be made on the fly in response to various game states. Some of the changes possible are:

For example, Thalya walks from her house on the edge of town towards the center of town to visit the markets and meet with friends. Along the way, she passes a building where a group of NPCs is playing music. This music is representative of their culture, as opposed to the world music, which is presenting the world to the player. Since these two musics do not sound good together, the world music is faded out as the player approaches the NPC music. Unlike the world music, the NPC music is 3-D positioned and so appears to be emanating from the building in which the NPCs are playing. As she walks past and away from this building, the NPC music fades. At a certain point, the world music slowly returns. At first, only one sound (instrument) is heard, and over a time (10 to 100 seconds) the other sounds join in.

Our player reaches the central market square of the town. Since this is a special location, the world music fades out slowly and is replaced by a specific-location theme for the market. This music is more linear and traditional in its structure. The market music represents the culture of the inhabitants of the town. (This is especially important in a game which represents multiple cultures in different locations.) The background SFX track is replaced by a new one representing the sounds of the market: People talking, vehicles traveling by, birds, other animals, children playing, etc. On top of the streamed SFX track, 3-D-positioned sounds emanate from a handful of specific locations to provide a more immersive experience.

Thalya conducts her business in the market. She buys things from NPC merchants, meets up (physically) with a few friends doing the same, and chats with friends both near and far. After about 5 minutes, the market music reaches its end. Rather than repeat, it simply concludes. For the remainder of her stay in the market, no music is heard, allowing Thalya to experience the sounds of the market itself.

Finally Thalya and the three friends she met in the market decide to travel to another city to obtain a quest. They head off to the north.

As they leave the market, the world music slowly returns, as does the default SFX background for the town area. As they get to the outskirts of town, they set off over late-summer wheat fields towards a distant forest. Once they are a hundred meters outside the city, the world music and the ambient SFX change to an entirely new set of assets. The SFX track crossfades seamlessly without a gap. Shortly thereafter, new 3-D positioned sound emitters add immersion to the wheat-fields, just like in the town. The music, however, fades out before reaching the SFX transition point. There is a 50-meter gap between the end of one music set and the beginning of another. (Alternately, there could be a much larger gap, so that a longer break from the world music is experienced.) The group enters the new music zone and the new music assets begin to be rendered to them. Although equally pleasant-sounding, the wheat-fields music sounds more open and free-form, representing the spacious feel of the area, as opposed to the densely-populated town.

After a time of walking towards the distant forest, Thalya notices an easily defeat-able enemy ("mob", short for "mobile object") a few dozen meters off the path. Since she knows this particular creature often produces a valuable item as loot, she casually tosses a fireball spell at it, and it politely keels over and dies. The music and background sounds do not respond in any way. Why? Because the music of this game, unlike a movie, is not telling Thalya how to feel. It is attempting to intelligently support how we expect that she WILL feel. Although she technically engaged in "combat", we as composers and game designers know that she did not experience any serious threat or excitement. This was a casual kill done speculatively in the hopes of a minor reward. The music system does, however, make note of the fact that she has engaged in a minimal level of combat. More on that soon.

Along the way, one of Thalya's friends needs to take a phone call in real life. (IRL!) He could set his character to auto-follow Thalya, but that is somewhat unreliable, so the group decides to stop and chat for a while during his absence. Combined with the somewhat-long walking time through the wheat-fields area, our group is now spending a significant amount of time listening to the same music. Luckily for them, the audio designers and composers of this game were well-versed in the needs of MMO game audio. The music is being live-rendered on the fly, and so never repeats. It always sounds consistent with the area's look and feel, but never do the players' minds experience that little warning that says "All of this has happened before, and it is happening again!" regarding the music.

Thalya's friend returns, and our group heads on towards the not-so-distant forest. They are not attacked by any of the roving potential enemies because our game system classifies these player-characters as too high-level to be attacked by enemies of the relatively low level found in the wheat fields. Since the players refrain from engaging them in combat, as they are intent on reaching their destination, the music continues in the same state, as do the ambient SFX.

As the group approaches the forest, at about 200 meters out, the music fades out to silence. Upon running under the first trees, the background sounds change to forest sounds. Again, there is a streamed track and also 3-D-positioned sounds to supplement it. After about 100 meters, the forest music begins to gradually take form. Like the town music and wheat fields music, it is live-rendered and randomized to avoid repetition. Again, individual sounds are played, rather than entire pieces of pre-recorded music.
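The silent gap between music zones can be expressed as two gain curves driven by the player's distance from the zone boundary. The sketch below is one possible reading of the scenario: the 200-meter and 100-meter figures come from the text, while the 50-meter linear fade length, the function name `zone_music_gains`, and the decision to treat 200 meters as the point where silence is reached are all assumptions.

```python
def clamp01(value):
    return max(0.0, min(1.0, value))


def zone_music_gains(x, silent_before=200.0, start_after=100.0, fade_len=50.0):
    """Return (old_zone_gain, new_zone_gain) for a player near a zone boundary.

    x is the signed distance in meters past the boundary: negative while the
    player is still approaching it, positive once inside the new zone.
    """
    # Old music is silent from `silent_before` meters out, after a linear fade.
    old_gain = clamp01((-x - silent_before) / fade_len)
    # New music starts `start_after` meters into the zone and fades in linearly.
    new_gain = clamp01((x - start_after) / fade_len)
    return old_gain, new_gain


far_out = zone_music_gains(-300.0)       # still fully in the old music
mid_fade = zone_music_gains(-225.0)      # halfway through the fade-out
at_boundary = zone_music_gains(0.0)      # inside the silent gap
fading_in = zone_music_gains(125.0)      # new music halfway in
```

Because the two curves never overlap, only ambient SFX is heard across the boundary, matching the break from music that the scenario describes.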

In the forest, somewhat higher-level enemies abound. Some attack the group. Together, the group is vastly more powerful than their adversaries, and so the enemies are dispatched with minimal effort. The game's music AI system keeps track of their combat encounters. Every time the group engages an enemy, each player's combat tally is raised. The amount raised depends on the toughness of the enemy. For example, a single player engaging an enemy of the same level as the player will increase the tally by one. An enemy 5 levels above the player may raise that tally by two or more. Every few minutes or so, that tally is decreased by one. If the tally reaches a high-enough number, (that number being authored by the game's music director) the music begins to respond. Since at this point, our group's combat-tally number is fairly low, we do not expect that our players are in grave danger, but rather are simply eliminating enemies who briefly stand in their way. Therefore we only alter the music to a small degree. We increase the tempo a bit. A new part is brought in, high-pitched and strident, but quietly triumphant. Our group is merely doing what is needed to get to their destination, and the music acknowledges merely that they are taking care of business.
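The combat tally described above maps naturally onto a small state object. In this sketch the "+1 for a same-level enemy" and "+2 for an enemy 5 levels above" figures follow the text; the half point for weaker enemies, the 180-second decay interval, the threshold value, and all names (`CombatTally`, `music_state`, the state labels) are assumptions made for illustration.

```python
class CombatTally:
    """Per-player combat intensity counter that decays over time."""

    def __init__(self, threshold=5, decay_interval=180.0):
        self.threshold = threshold            # authored by the music director
        self.decay_interval = decay_interval  # seconds per one-point decay (assumed)
        self.tally = 0
        self._since_decay = 0.0

    def on_engage(self, player_level, enemy_level):
        """Raise the tally according to how tough the enemy is."""
        diff = enemy_level - player_level
        if diff >= 5:
            gain = 2
        elif diff >= 0:
            gain = 1
        else:
            gain = 0.5  # assumption: weaker enemies still register a little
        self.tally += gain

    def update(self, dt):
        """Advance time by dt seconds, decaying the tally once per interval."""
        self._since_decay += dt
        while self._since_decay >= self.decay_interval:
            self._since_decay -= self.decay_interval
            self.tally = max(0, self.tally - 1)

    def music_state(self):
        return "combat_light" if self.tally >= self.threshold else "default"


tally = CombatTally(threshold=3, decay_interval=180.0)
for _ in range(3):
    tally.on_engage(player_level=10, enemy_level=10)
state_after_fights = tally.music_state()
tally.update(180.0)  # three quiet minutes pass
state_after_rest = tally.music_state()
```

A real system would likely use several thresholds so that the music can escalate in stages rather than switch between only two states.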

Towards the other side of the forest, the enemies thin out, and the group goes for several minutes without killing an enemy. Their combat-tally number drops below the threshold and the music returns to its default state. Again, about 200 meters before reaching the outer edge of the forest area, the music fades to silence.

Emerging from the forest, our group encounters another open expanse, this time of rocky hills. Again, the SFX ambience shifts from forest sounds to open-area windy sounds. Again, after traveling about 100 meters into the zone, new music appropriate to this zone begins to play. By this time, the sun is setting. As dusk approaches, the music very slowly fades to near silence. The ambient SFX begin to gradually change from wind and bird cries to crickets and the occasional howling wolf. As our group continues walking, darkness falls, and the music returns to full force, but with significant changes. The music's pitch has dropped, and a more melancholy tone has been adopted. Occasional short melodic themes, which have been prevalent since Thalya began playing, are now absent. A continuous, soothing droning undertone is present, and the tempo has become very slow.

Walking along under the moon, the travelers encounter an inn. They decide to go inside to buy certain items and to sell loot they accumulated in the forest. As they enter the inn, the world music and outdoor SFX ambience fade out quickly. Inside, NPC music is playing, positioned in 3-D. A reverb is applied to all SFX. Voices and clinking glasses replace the outdoor crickets and wolves.

At this point, Thalya realizes she is late for work and must log out. She leaves her friends to continue on to the second town.

Appendix C: Survey of composers on minimum resources for live-rendered music
by Kurt Larson

Sixty composers were asked what they felt would be the minimum resources required in a live-rendered, "GigaStudio-In-A-Box" system built into a gaming machine such that they could compose music which sounded not identical to, but AS GOOD AS fully-mixed dead streams. About 20 responded. Here is a summary of their responses:

1. Minimum RAM/Storage required?

Average: 568 MB
Range: 50 - 1000 MB
Recommendation: 256 MB
Comments: Composers seemed to fall into two groups: those who have had experience creating small sample sets and those who have not. The latter group basically just described their current setups for making linear streamed music. They tended to suggest 1-2 gigs. The former group ranged from 50 megs to 256 megs. I believe the former group's recommendations are more relevant to this topic.

2. Minimum number of tracks required?

Average: 49
Range: 15 - 75
Recommendation: 48
Comments: None

3. Minimum number of voices required?

Average: 156
Range: 64 - 256
Recommendation: 256
Comments: With the tendency to virtualize voices, this may not be an issue. It is nonetheless good to have this specification, as we cannot predict what form our music engine would take.

Other notable responses:
- To some extent, we may be able to use instrument modeling to replace some amount of sampled sounds.
- We may run into a significant problem with sample licensing issues. In theory, you'd end up shipping your licensed sounds in a game, and we all know what a legal headache that can be.
- Streaming sounds off the HD like GigaStudio would be best. If HD streaming during gameplay is possible (Doubtful IMO), we may want to measure storage on disc rather than RAM.
- There could be a need for a hardware MP3 (or other compression like OGG) decoder.