Introduction

Bird watchers must be good bird listeners, for birds are usually heard before they are seen, if they are seen at all.

The sounds that an animal makes often reveal its sex, age, health, species, emotional state, individual identity, and where it is from. But to the human ear, those sounds may seem little more than "quack" or "bark". We are working to change that.

For an experienced birder listening to a common local bird, audio identification might be straightforward, but to those new to bird watching, this task is difficult. The task is also difficult for software, because of the great complexity and plasticity in bird song. Take a robin and ask him to sing the same song repeatedly, and every single occurrence will differ a bit from every other one. Take two robins and ask them to sing the same song, and they can't do it. Their songs unavoidably reveal their age, health, individual identity, and regional accent. The human ear may think they sound the same, but their recorded songs reveal many differences.

Each bird speaks with its own personal dialect within a regional dialect for the species, allowing other birds to identify it individually from voice alone. Such individual variations might make it easier for one bird to identify another, but they make it more difficult for software to identify the bird at even the species level.

In addition, a bird's song or call can convey information about where it is from, its age, sex, health, and other qualities. Each quality produces variations that may escape human notice, are obviously important to birds, and confuse recognition software.

While the words might all sound like "honk" or "quack", in fact each bird has dozens of "words" with very particular meanings. A Blue Jay's announcement call denotes whether it has found food, a fox, or a snake. Each of these calls sounds different to other jays, different to the trained ear, and different to a computer trying to identify the sound.

Birds process auditory information far faster than humans can, and so can "speak faster" -- packing more information into a few seconds. A five-second recording can contain as many as 5,000 meaningful elements.

Birds combine the many sounds they can make in variable sequences, often inventing small variations in these sounds as they construct songs. Algorithms must be able to handle small differences and still find similarities.

Although researchers have been trying to identify species from their vocalizations for years, products benefiting from their insights are usually unable to make correct identifications.

Our project is code-named AudiOh!™[1], and its mission is to disentangle the information embedded in a quack or bark, and reveal it. AudiOh's design solves the problems that developers in this area have faced.

Database: Brawny and Clever

To ensure accuracy, a database must include all species that might occur where a recording is made. AudiOh's database covers 1,502 species in the U.S., 1,862 in North America, and 7,258 worldwide. Other products try to identify only 30-50 species.

To address the problems of individual and regional variation, the reference database must include many recordings of each species it will successfully identify. AudiOh draws from a database of more than 100,000 quality recordings, an average of 14 recordings per species.

To increase the chances of correct identification, algorithms must assume that birds most likely found where the recording is made are those most likely to have made the sound. AudiOh uses a database of 250 million observations, and focuses on birds found near the observer at the current time of year.

When a bird makes a song, it can choose from a number of "words" to construct "phrases", as when a chickadee sings "Chicka dee dee dee" or, in the case of greater danger, "Chicka dee dee dee dee dee". Successful song matching requires a database of words or syllables, not merely a database of phrases or songs. AudiOh draws on such a reference database of millions of analyzed segments.

All of this data would bury a cell phone, as it does with one product that uses 233 MB on your phone to try to identify just 50 species. AudiOh's identification database is 2.25 GB in size, but that database resides at the server, not in your cell phone. In your cell phone, AudiOh needs only 12.8 MB.

Algorithms: Layered and Sophisticated

Analysis begins by segmenting the recording into syllables, using the same algorithms that produced the reference segments in the database. Recordings often contain background noise, and identifying meaningful syllables takes effort.

Different filters are then used to efficiently eliminate unlikely candidates. For instance, species not found near where the recording was made, at the time of year it was made, are eliminated first. (Proximity information draws on tables with 250 million observations.) Then test segments are compared with reference segments in repeated passes, each pass eliminating some candidates and examining more detail than the last.
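The first filtering pass can be illustrated with a small sketch. The function and data layout below are hypothetical, not AudiOh's actual code: it assumes observation data has been reduced to encounter probabilities keyed by species, a rounded latitude/longitude cell, and month.

```python
# Illustrative sketch of the proximity filter: keep only species
# plausibly present at the recording's place and month.
# Names, the cell scheme, and the threshold are all assumptions.

def filter_by_proximity(candidates, observations, lat, lon, month,
                        min_probability=0.01):
    """Keep species whose probability of occurring near (lat, lon)
    in the given month exceeds a threshold, most likely first."""
    cell = (round(lat), round(lon))  # coarse one-degree grid cell
    survivors = []
    for species in candidates:
        p = observations.get((species, cell, month), 0.0)
        if p >= min_probability:
            survivors.append((species, p))
    # Most likely species first, so later passes examine them sooner.
    return sorted(survivors, key=lambda sp: sp[1], reverse=True)


# Toy example with made-up probabilities for April in Maryland:
obs = {
    ("American Crow", (39, -77), 4): 0.8,
    ("Eastern Bluebird", (39, -77), 4): 0.55,
    ("Snowy Owl", (39, -77), 4): 0.001,  # vanishingly rare here in April
}
result = filter_by_proximity(
    ["American Crow", "Eastern Bluebird", "Snowy Owl"],
    obs, lat=39.2, lon=-77.4, month=4)
```

Ordering the survivors by probability matters: it lets the later, more expensive comparison passes reach a confident answer sooner.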

Finally, the user is presented with a short list of species that might have made the sound, ordered by probability. Links are provided so that the user can review more information about each suggested singer.

The algorithms are CPU-intensive, and require far more horsepower than will be found in any smart phone. So they run at the server, near the database, in a very fast machine. AudiOh's analyzer can be extended to multiple copies of the code, running on multiple machines, as demand for AudiOh grows.

Satisfying Results

Once a list of likely singers is identified, users will want to look up any unfamiliar birds. AudiOh links seamlessly to ZipcodeZoo's 60,000 pages on birds, with info on vernacular names, identification, behavior, diet, reproduction, habitat and ecology, taxonomy, distribution, conservation, and more. On these pages of bird info you'll find 220,000 audio recordings, 104,000 photos, and 8,000 videos. With all this information, you'll be confident about who you just recorded.

Results are often delivered faster than you can play back the recording, and along the way, you'll get status reports.

You'll also get a copy of your recording and your results, sent to your email address, so you can review what you have on your desktop computer.

Because the database, the algorithms, and the bird information all reside on servers, the part of AudiOh in the smart phone never needs updating.

The AudiOh Approach

Two cell phone applications that purport to identify North American birds by sound can do everything but make correct identifications:

Bird Song ID: USA is a cell phone application offered by IsoPerla. This product claims to identify 30 birds. Its identification consists of a list of 26 guesses, ordered by probability. In one of our tests, the correct bird appeared on the list in the 12th position; in all other tests, the correct bird did not appear in the list of results at all.

Twigle Birds is a cell phone application offered by Avelgood Apps which claims to identify 50 North American birds. Its identification consists of a list of 5 guesses, ordered by probability. In none of our tests did the correct bird appear in the list of 5 guesses.

AudiOh takes a very different approach than others have taken, and our patent application has been filed. All prior approaches appear to suffer from a number of common problems:

Most have begun by trying to distinguish between a few mono-syllabic vocalizations, only to find that performance degrades as the number of species or the complexity of the vocalizations increases. Bird vocalizations range from narrowband whistles with few distinctive spectral properties to broadband sounds with complex spectral structure. Vocalizations may range from 100 Hz to 10,000 Hz, and may last several seconds or just a fraction of a second. A truly useful automated identifier must be able to handle any wildlife sound, no matter how complex.

Many researchers have been lured by the purity of the Macaulay Library at the Cornell Ornithology Lab. The recordings in this library are correctly identified, and most contain no background noise and no sounds of other species. Some researchers have chosen other libraries, only to discover that those libraries were themselves built from Macaulay samples. When their algorithms encounter real-world recordings of birds – faint and cluttered with natural sounds such as wind, other birds, passing airplanes, and people talking – the algorithms break down. Human speech recognition assumes that the speaker talks directly into a microphone; a bird recording attempts to focus on a subject at some distance. What begins as a perfect signal at the bird's syrinx is muted, muffled, and masked by intervening trees and these other sounds. Effective algorithms must devote considerable attention to distinguishing signal from noise.

Algorithms must use the correct unit of analysis, the smallest unit that is meaningful to a bird. If a bird can combine the words or syllables “A”, “B”, and “C” into phrases like “AABC”, “ABBCCC”, and “ACABC”, then the unit of analysis must be those words, not those phrases. If a bird has freedom in a song to combine known words into new combinations, and to extend the song to any length, the song cannot be the unit of analysis as Chu and Blumstein (2011) have demonstrated. This is most true for “plastic songs”, but careful examination suggests that most instances of “stereotyped songs” show some plasticity.

Most algorithms do not seem able to handle variations within a species. Anyone who lives with a small flock of birds as pets can distinguish individuals from the way they pronounce various words, which Da Silva et al. (2006) believe is "evidence of vocal learning and creative capacity." Birds use sensory feedback to adjust their song in different contexts, for instance singing louder in noisy urban settings. And all birds appear to have subtle regional dialects; during migration, the geese feeding in a Virginia pasture in winter might include some that summer in Canada, some non-migratory locals from Virginia, and some that breed anywhere in between. So if there are two ways to pronounce the word "A", then the database should contain samples of each, and the algorithms must not insist that a test word perfectly match either one. And since there will likely be hundreds of slightly different ways to pronounce "A", the algorithms must be able to combine partial matches to produce a confident conclusion.

The recordings used for the reference database should be of the highest possible quality, made with a high quality parabolic microphone, rather than an omnidirectional microphone.

The reference database must be generated by fully automated processes if it is to correctly identify all sounds made by all species under all circumstances. Many of the approaches in the literature depend on a manually generated database, something that is not imaginable if we are to identify the sounds of the 1.5 million species of birds, mammals, amphibians, and insects that still exist.

Implementations of recognition software for a cell phone must not expect that the cellphone has adequate processing power to evaluate a recording, or adequate capacity to store the reference database. If field recordings are to be identified in real-time, they will either need a reasonably capable self-contained laptop or they will need to use a cellphone or other recording device coupled to a remote server.

Despite great efforts, great algorithms are not enough to completely substitute for the expert birder or wildlife biologist in identifying species by sound. All cases of machine identification should be available for public review and comment. That feedback should be used to focus on weaknesses in the algorithms, and, when machine and citizen scientists agree on an identification, the submitted sample can become an additional reference sound.

AudiOh! running in a portable Android device is used to record a bird song. The user plays it back with the app, and trims it to their satisfaction, so that it provides what they consider a reasonably representative sample of the species of interest. A click uploads the recording to an AudiOh! server, along with geolocation info. The App invokes a web page in which the user optionally provides Family and appearance information. This, too, is submitted. The analysis, all done at the server, begins.

Segmentation

Bird song consists of syllables: spectrally discrete sound elements within a song, lasting 5 ms or more and separated by a minimum of 5 ms of silence. At the server, the recording is broken into segments. Segments are the basis of analysis, and amount to a phrase of a song, or a specific short call or word. Such words may be combined by a bird in almost any way, and may be repeated any number of times, so our focus is simplification: identify the source of the words, and the source of the phrases will become evident.

The process of segmenting involves complex algorithms that determine what is noise, and what is signal, within a recording.
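A much-simplified segmenter can illustrate the idea. The sketch below is an assumption-laden stand-in for the real algorithms: it gates on raw amplitude with a fixed threshold, whereas a production segmenter would work on spectrograms and adapt to the noise floor. The 5 ms minimums come from the syllable definition above.

```python
# Illustrative amplitude-gate segmenter. Threshold and parameter
# names are assumptions; real segmentation is far more involved.

def segment(samples, rate, threshold=0.05,
            min_syllable_s=0.005, min_gap_s=0.005):
    """Split a mono waveform into (start, end) sample-index pairs
    for regions whose amplitude exceeds a noise threshold,
    honoring minimum syllable and minimum silence durations."""
    min_len = int(min_syllable_s * rate)
    min_gap = int(min_gap_s * rate)
    segments = []
    start, gap = None, 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i          # syllable begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # silence long enough: close syllable
                end = i - gap + 1
                if end - start >= min_len:
                    segments.append((start, end))
                start, gap = None, 0
    if start is not None and len(samples) - start >= min_len:
        segments.append((start, len(samples)))  # syllable ran to the end
    return segments


# Toy waveform at 1 kHz: 10 ms of sound, 10 ms of silence, 8 ms of sound.
toy = [0.5] * 10 + [0.0] * 10 + [0.5] * 8
found = segment(toy, rate=1000)
```

Even this toy version shows why segmentation takes effort: the threshold that separates signal from noise in one recording will be wrong for the next.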

In most animal sounds, segments may occur in various sequences. Whether singing or speaking, a bird may change the sequence of segments used from time to time.

Segmenting proves useful in identifying the various singers in the springtime morning chorus.

Segmentation has created well over 8 million reference segments from our 100,000 reference recordings.

Reference Sounds

Reference sounds are those that have been identified by trusted experts, such as the Cornell Ornithology Lab. All reference recordings must contain a minimum of background noise, and particularly no other animal sounds. Our collection of reference sounds includes over 100,000 records of over 8,000 species.

Each reference recording, typically of 30 second duration or less, is decomposed into segments. For a quick comparison with a submitted sound, a statistical comparison is made of each of the sound's segments against information in a reference segments table with over 8 million records. For more thorough comparison with the most likely candidate species, the comparison is made against data files on a solid state drive.

Statistical Summary

A variety of algorithms are applied both to the segments submitted by an AudiOh! user and to our reference segments. Elements analyzed include amplitude, frequency, duration, and the like. A number of new statistical approaches were developed to create a summary profile or "fingerprint" of each segment.

To identify a test sound, analysis compares the fingerprint of its segments with fingerprints stored in the reference segments table.
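A toy version of fingerprinting and comparison might look like the sketch below. The features and weights are illustrative assumptions, not AudiOh's actual statistics; zero-crossing rate stands in for the real frequency analysis.

```python
# Illustrative fingerprint: duration, peak amplitude, RMS energy,
# and zero-crossing rate. Feature choice and weights are assumptions.
import math

def fingerprint(segment_samples, rate):
    """Summarize a segment as a small feature vector."""
    n = len(segment_samples)
    duration = n / rate
    peak = max(abs(s) for s in segment_samples)
    rms = math.sqrt(sum(s * s for s in segment_samples) / n)
    crossings = sum(
        1 for a, b in zip(segment_samples, segment_samples[1:])
        if (a < 0) != (b < 0))
    zcr = crossings / duration   # crossings/sec, roughly 2x frequency
    return (duration, peak, rms, zcr)

def distance(fp_a, fp_b, weights=(1.0, 1.0, 1.0, 0.001)):
    """Weighted Euclidean distance; smaller means more similar.
    The small weight on zcr keeps its large scale from dominating."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, fp_a, fp_b)))


# One second of a 440 Hz whistle, sampled at 8 kHz:
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
fp = fingerprint(tone, rate=8000)
```

The real fingerprints are richer, but the principle is the same: a few numbers per segment make the first comparison pass against millions of reference records cheap.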

Analytic Stages

Analysis consists of these stages:

Select a subset of reference species based on where and when the test recording was made. Portable devices can all provide latitude, longitude, and date/time information, and the database already has tables showing the probability of encountering any species in any location in any month.

Further refine the subset of reference segments to be used for comparison, using statistical properties such as frequency, duration, and variability.

Using this reduced subset, take the segments of the test sample, one at a time, and compare them with the segments of the selected reference samples. Bayesian inference is used: the initial comparisons will cast a wide net, but as each comparison changes the probabilities, the focus sharpens. Example: Assume a test segment is of a crow. The test segment might be compared with some number of samples of bluebirds, crows, and mockingbirds, based on the probability of finding these 3 birds at the time and place of the recording. But no bluebird sample will have a fingerprint suggesting a crow, most crow fingerprints will bear some similarity, and some mockingbird segments will as well. So the statistical approach rules out bluebirds, and focuses on the match between the test segments and the crow and mockingbird reference samples. As soon as a reasonable certainty is reached, the comparisons may be concluded and the user notified of the conclusions.

Review user feedback on the conclusions. Recompute an overall accuracy score. Mark that species for further research (perhaps more reference samples are needed, etc.)
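The Bayesian stage described above can be sketched in a few lines. The numbers below echo the crow example but are invented; in practice the per-segment likelihoods would come from fingerprint comparisons.

```python
# Illustrative Bayesian updating over candidate species.
# Priors, likelihoods, and the certainty threshold are assumptions.

def bayes_update(prior, likelihoods):
    """One pass: multiply each species' prior by the likelihood that
    it produced the observed segment, then renormalize."""
    posterior = {sp: prior[sp] * likelihoods.get(sp, 0.0) for sp in prior}
    total = sum(posterior.values())
    return {sp: p / total for sp, p in posterior.items()}

def identify(prior, segment_likelihoods, certainty=0.95):
    """Update beliefs one segment at a time; stop early once one
    species dominates, so the user can be notified sooner."""
    belief = dict(prior)
    for likes in segment_likelihoods:
        belief = bayes_update(belief, likes)
        if max(belief.values()) >= certainty:
            break
    return belief


# Location-based priors for three plausible candidates:
prior = {"Eastern Bluebird": 0.3, "American Crow": 0.4,
         "Northern Mockingbird": 0.3}
# Made-up per-segment match likelihoods for a crow recording:
per_segment = [
    {"Eastern Bluebird": 0.01, "American Crow": 0.8,
     "Northern Mockingbird": 0.3},
    {"Eastern Bluebird": 0.01, "American Crow": 0.9,
     "Northern Mockingbird": 0.2},
]
belief = identify(prior, per_segment)
```

As in the crow example, the bluebird is ruled out almost immediately, the mockingbird's partial matches fade with each pass, and the crow's probability climbs until a conclusion can be reported.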

Performance is important: a user will want to know in seconds, not minutes or hours. With the current version of the back-end code, a full statistical summary of an MP3 file can often be completed in less time than it takes to play the recording. A statistical summary of a single segment takes less than a second. Extraction and fingerprinting of the first few segments can be completed within one or two seconds of receipt of the MP3.

Design Considerations

Minimal product updates. The smart phone application need only record and transmit a captured sound, then display the results. Product revision occurs at the server, in the lab. Such revisions in the lab include changes to the database, to algorithms used in analysis, and to the interface that the app's user has after submitting a sample.

Independence of processes. To prevent backlogs in the processing of either new reference samples or new test samples, most processes have been designed to be independent. Multiple staff are able to update the reference database while multiple processes (many threads, many instances of the application, running on many machines) examine test samples and multiple processes make comparisons.

Protection of intellectual property. All important IP remains in the lab.

Additional precision possible with identification. In some cases, the analysis should be able to determine the age, sex, or geographic dialect of the voice. (Yes, birds from New Jersey sound different than those from California.) And in most cases, the identification will be able to determine whether the recording is song or call, along with other qualities that a birder might know (e.g., "warning call in flight" or "twitter calls of chicks").

Easy re-examination of reference database in the event that problems are found in any analytic stage.

Get Yours Now!

We spent years developing, testing, and revising our identification algorithms, and building and rebuilding our databases. All is now ready for release, although database development continues aggressively. The result meets our goals: it is light enough to run in a cell phone, clever enough to accurately identify bird sounds, and nimble enough to do this in real time. Along the way, we hope to contribute to the science of animal sounds. Finally, you can retrieve your own copy for your Android device at the Google Play Store now.