Audio processing pipeline

Introduction

Red Hen Lab's Summer of Code 2015 students worked mainly on audio. Graduate student Owen He has now assembled several of the contributions into an integrated audio processing pipeline, to process the entire NewsScape dataset. This is a description of the current pipeline along with some design instructions; for the code itself, see our GitHub account.

The first three are in production; the task is to create the fourth. Dr. Jungseock Joo and graduate student Weixin Li have started on the fifth. The
new audio pipeline is largely based on the audio parsing work done
over the summer, but Owen He has also added some new code. The pipeline may be extended to allow video analysis and
text to contribute to the results. The data is multimodal, so we're
aiming to eventually develop fully multimodal pipelines.

Audio pipeline design

Candidate extensions

Temporal windows could usefully be combined with speech to text. For some of our video files, especially those digitized from tapes, the transcript is very poor. We could use a dictionary to count the proportion of valid words, and run speech to text on passages where the proportion falls below a certain level.
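A minimal sketch of that check, assuming the transcript is already split into passages (transcript_segments, the word-list path, and the threshold below are all placeholders):

import re

def valid_word_ratio(text, dictionary):
    """Fraction of tokens in a passage that appear in the dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in dictionary for t in tokens) / float(len(tokens))

with open("/usr/share/dict/words") as f:  # any plain-text word list will do
    dictionary = set(w.strip().lower() for w in f)

THRESHOLD = 0.8  # placeholder; tune against transcripts of known quality
low_quality = [p for p in transcript_segments
               if valid_word_ratio(p["text"], dictionary) < THRESHOLD]
# low_quality now holds the passages that should be re-run through speech to text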

Current implementation by Owen He

1. Python Wrapper: All the audio processing tools from last summer are wrapped into a Python module called "AudioPipe". In addition, I fixed an almost undetectable bug in the diarization code, yielding a roughly threefold efficiency improvement.

2. Shared Preprocessing: the preprocessing part of the pipeline (media format conversion, feature extraction, etc.) is also wrapped as Python modules (features, utils) in "AudioPipe".
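For the media format conversion step, one common approach (a sketch only; AudioPipe's own utils may do this differently) is to call ffmpeg, which is already loaded on the cluster, to pull a mono 16 kHz WAV track out of each video:

import subprocess

def video_to_wav(video_path, wav_path, sample_rate=16000):
    """Extract a mono WAV track from a video file with ffmpeg."""
    subprocess.run([
        "ffmpeg", "-y",           # overwrite any existing output
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample for speech models
        wav_path,
    ], check=True)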

3. Data Storage: Data output is stored in this folder (https://github.com/RedHenLab/Audio/tree/master/Pipeline/Data), where you can also find the results from testing the pipeline on a sample video (media files are .gitignored because they are too large). Note that the speaker recognition algorithm is now able to detect impostors (tagged as "Others"). The subfolder "Model" is the Model Zoo, where future machine learning algorithms should store their model configurations and README files. The result data are stored in Red Hen format (but the metadata for computation are in .json).

4. Data Managing: For manipulating the data, we use the abstract syntax specified in the data management module. The places where data are stored are abstracted as a "Node", and the computational processes are abstracted as a "Flow" from one Node to another.

5. Main Script: As you can see in the main script (https://github.com/RedHenLab/Audio/blob/master/Pipeline/Main.ipynb), by deploying the data management module, the syntax becomes so concise that every step in the pipeline boils down to just two lines of code. This makes the audio pipeline very convenient for non-developers to use.

Design targets

The new audio processing pipeline will be implemented on Case Western
Reserve University's High-Performance Computing Cluster. Design
elements:

Archived videos -- around 330,000, or 250,000 hours (processing will take months to complete)

Extensible architecture that facilitates the addition of new functions, perhaps in the form of conceptors and classifiers

The pipeline should have a really clear design, with an overall functional structure that emphasizes core shared functions and a set of discrete modules. For instance, we could think of a core system that ingests the videos and extracts the features needed by the different modules, or a 'digestive system' approach where each stage contributes to the subsequent stage.
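One possible shape for the core-plus-modules variant, purely as an illustration (the class and function names are invented for this sketch and are not part of the existing code):

class AudioModule:
    """Interface each analysis module would implement."""
    required_features = ()  # feature names this module needs from the shared core

    def process(self, features):
        """Return annotations for one video, given the shared features."""
        raise NotImplementedError

def run_pipeline(video_path, modules, extract_features):
    """Ingest one video, compute the union of required features once,
    and hand those features to every module."""
    needed = set()
    for m in modules:
        needed.update(m.required_features)
    features = extract_features(video_path, needed)  # shared preprocessing, done once
    return {type(m).__name__: m.process(features) for m in modules}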

The primary focus for the first version of the pipeline is an automated
system that ingests all of our videos and texts and processes them in
ways that yield acceptable quality output with no further training or
user feedback. There shouldn't be any major problems completing this core
task, as the code is largely written and it's a matter of creating a
good processing architecture.

Audio pipeline modules

The audio processing pipeline should tentatively have at least these modules, using the code from our GSoC 2015:

Forced alignment (Gentle, using Kaldi)

Speaker diarization (Karan Singla)

Gender detection (Owen He)

Speaker identification (Owen He -- a pilot sample and a clear procedure for adding more people)

Paralinguistic signal detection (Sri Harsha -- two or three examples)

Emotion detection and identification (pilot sample of a few very clear emotions)

Acoustic fingerprinting

The video, audio and feature files are not pushed to GitHub due to their large sizes, but they are part of the pipeline outputs as well.

An alternative for audio fingerprinting is dejavu, an open-source tool on GitHub.
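Its basic usage follows the pattern in the dejavu README; the import path, database settings, and method signatures have shifted between releases, so treat this as an orientation sketch rather than copy-and-paste code:

from dejavu import Dejavu
from dejavu.recognize import FileRecognizer  # location differs in newer releases

config = {"database": {"host": "127.0.0.1", "user": "root",
                       "passwd": "secret", "db": "dejavu"}}  # local MySQL settings

djv = Dejavu(config)
djv.fingerprint_directory("Data/Audio", [".wav"])  # fingerprint our extracted audio
match = djv.recognize(FileRecognizer, "Data/Audio/sample.wav")
print(match)  # best-matching fingerprinted file, if any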

Training

We may have to do some training to complete the sample modules. It would
be very useful if you could identify what is still needed to complete a
small number of classifiers for modules 4-6, so that we can recruit
students to generate the datasets. We can use Elan, the video coding interface developed at the MPI in Nijmegen, to code some emotions (see Red Hen's integrated research workflow).

We have several thousand tpt files, and I suggest we use them to build a library of trained models for recurring speakers. The tpt files must first be aligned; they inherit their timestamps from the txt files, so those timestamps are inaccurate. We can then

read the tpt file for speaker boundaries

extract the speech segments for every speaker

concatenate the segments from the same speaker, so that we have at least 2 minutes of training data for everyone

feed these training data to the speaker recognition algorithm to get the models we want

This way, the entire
training process can be automated.
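A sketch of that automated flow, assuming an aligned tpt file, a hypothetical parse_tpt() helper that yields (start_seconds, end_seconds, speaker) turns, and a hypothetical normalize_name() helper for folding name variants together (discussed below); ffmpeg does the cutting and concatenation:

import re
import subprocess
from collections import defaultdict

MIN_TRAINING_SECONDS = 120  # at least 2 minutes of speech per speaker

def collect_segments(tpt_files):
    """Group (audio_path, start, end) turns by speaker across many shows."""
    by_speaker = defaultdict(list)
    for tpt_path, audio_path in tpt_files:
        for start, end, speaker in parse_tpt(tpt_path):  # hypothetical tpt parser
            by_speaker[normalize_name(speaker)].append((audio_path, start, end))
    return by_speaker

def build_training_wav(speaker, segments, out_path):
    """Cut each segment with ffmpeg and concatenate them into one training WAV."""
    if sum(end - start for _, start, end in segments) < MIN_TRAINING_SECONDS:
        return None  # not enough material to train a model for this speaker
    key = re.sub(r"\W+", "_", speaker)
    parts = []
    for i, (audio, start, end) in enumerate(segments):
        part = "/tmp/%s_%d.wav" % (key, i)
        subprocess.run(["ffmpeg", "-y", "-i", audio, "-ss", str(start), "-to", str(end),
                        "-ac", "1", "-ar", "16000", part], check=True)
        parts.append(part)
    list_file = "/tmp/%s_parts.txt" % key
    with open(list_file, "w") as f:
        f.writelines("file '%s'\n" % p for p in parts)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file,
                    "-c", "copy", out_path], check=True)
    return out_path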

A simple, automated method to select which speakers to train for would be to extract the unique speakers from each tpt file and then count how often they recur. I did this in the script cartago:/usr/local/bin/speaker-list; it generates this output:

So the /tmp/$YEAR-Recurring-Speakers.tpt files list how many
shows a person appears in, by year. If we want more granularity,
we could run this by month instead, to track who moves in and
out of the news. The script tries to clean up the output a bit,
though we may want to do more.
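The same counting step in Python terms, reusing the hypothetical parse_tpt() and normalize_name() helpers from the sketch above and assuming the tpt files are grouped by year:

from collections import Counter, defaultdict

def recurring_speakers(tpt_files_by_year):
    """For each year, count the number of shows in which each speaker appears."""
    counts = defaultdict(Counter)
    for year, paths in tpt_files_by_year.items():
        for path in paths:
            # a set, so a speaker is counted once per show rather than once per turn
            speakers = set(normalize_name(s) for _, _, s in parse_tpt(path))
            counts[year].update(speakers)
    return counts

# counts[2016].most_common(100) would list that year's hundred most frequent speakers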

If you look at the top speakers so far in 2016, it's a
fascinating list:

You see Sanders as Sen. Bernie Sanders (I-vt), Sen. Bernie Sanders (Vt-i), Sen. Bernie Sanders, and SANDERS, and Trump as Donald Trump and TRUMP, so let's include multiple names for the same person when extracting the training data.
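One simple way to implement the normalize_name() helper assumed in the sketches above is a hand-curated alias table plus some cleanup of party/state suffixes (the table entries here are illustrative):

import re

ALIASES = {
    "sanders": "Bernie Sanders",
    "sen. bernie sanders": "Bernie Sanders",
    "bernie sanders": "Bernie Sanders",
    "trump": "Donald Trump",
    "donald trump": "Donald Trump",
}

def normalize_name(raw):
    """Map a raw tpt speaker label to a canonical name."""
    name = raw.strip().lower()
    name = re.sub(r"\s*\([^)]*\)", "", name)  # drop "(I-Vt)"-style suffixes
    name = re.sub(r"\s+", " ", name).strip()
    return ALIASES.get(name, raw.strip().title())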

For systematic disambiguation, it may be possible to use the Library of Congress Name Authority File (LCNAF). It contains 8.2 million name authority records (6 million personal, 1.4 million corporate, 180,000 meeting, and 120,000 geographic names, plus 0.5 million titles). As a publicly supported U.S. Government institution, the Library generally does not own rights in its collections and what is posted on its website. "Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. The Library also reserves the right to terminate programs that require more than 24 hours to complete." For an example record, see:

The virtue of using this database is that it is likely accurate. However, the records are impoverished relative to Wikipedia; arguably, it's Wikipedia that should be linking into LCNAF. It is also unclear whether the LCNAF has an API that facilitates machine searches; see LoC SRW for leads.

Pete Broadwell writes on 17 April 2016,

In brief, I think the best way to disambiguate named persons would be to set up our own local DBpedia Spotlight service:

We’ve discussed Spotlight briefly in the past; it’s
trivial to set up a basic local install via apt, but (similar to
Gisgraphy), I think more work will be necessary to download and
integrate the larger data sets that would let us tap into the
full potential of the software.

In any case, this is something Martin and I have planned
to do for the library for quite some time now. I suggest that we first
try installing it on babylon, with the data set and index files stored
on the Isilon (which is what we do for Gisgraphy)
— we could move it somewhere else if babylon is unable to handle the
load. We can also see how well it does matching organizations and places
(the latter could help us refine the Gisgraphy matches), though of
course places and organizations don’t speak.

I share your suspicion that the LCNAF isn’t necessarily
any more extensive, accurate or up-to-date than DBpedia/Wikipedia,
especially for people who are in the news. It also doesn’t have its own
API as far as I can tell; the suggested approach
is to download the entire file as RDF triples and set up our own Apache
Jena service (http://jena.apache.org/) to index them. Installing Spotlight likely would be a better use of our time.
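Once a local Spotlight instance of the kind Pete describes is running, querying it is a single HTTP call. The sketch below assumes the stock REST annotate endpoint and its default port, both of which are configurable in the install:

import requests

SPOTLIGHT_URL = "http://localhost:2222/rest/annotate"  # adjust host/port to the actual install

def annotate(text, confidence=0.5):
    """Return the DBpedia resource URIs Spotlight finds in the text."""
    resp = requests.get(SPOTLIGHT_URL,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"},
                        timeout=10)
    resp.raise_for_status()
    return [r["@URI"] for r in resp.json().get("Resources", [])]

# annotate("Sen. Bernie Sanders spoke in Vermont.") should map the name to
# http://dbpedia.org/resource/Bernie_Sanders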

If we set the cutoff at speakers who have appeared in at least 32 shows, we would get a list of a hundred common
speakers. But it may be useful to go much further. Even people who appear in
a couple of shows could be of interest; I recognize a lot of the
names. That would give us thousands of speakers:

It's likely 2007 is high simply because we have a lot of tpt files
from that year. Give some thought to this; the first step is to
get the alignment going. Once we have that, we should have a large
database of recurring speakers we can train with.

Efficiency coding

The PyCASP
project makes an interesting distinction between efficiency coding and
application coding. We have a bunch of applications; your task is to
integrate them and make them run efficiently in an HPC environment.
PyCASP is installed at the Case HPC if you would like to use it. The
Berkeley team at ICSI who developed it also have some related projects;
we have good contacts with this team:

Please assess whether the efficiency coding framework could be useful in the pipeline design. It's important to bear in mind that we want this design to be clear, transparent, and easy to maintain; it's possible that introducing the PyCASP infrastructure will make the pipeline more difficult to extend, in which case we should not use it.

Integrating the training stage

To the extent there's time, I'd also like us to consider a somewhat more
ambitious project that integrates the training stage. Could you for
instance sketch an outline of how we might create a processing
architecture that integrates deep learning for some tasks and conceptors for others?
There are a lot of machine learning tools out there; RHSoC2015 used SciPy
and Kaldi. Consider Google's project TensorFlow -- this is a candidate deep learning approach for integrated multimodal
data. We see this as a longer-term project.

Red Hen Audio Processing Pipeline Guide

Dependencies:

In order to run the main processing script, the following modules should be loaded on the HPC cluster:

module load boost/1_58_0

module load cuda/7.0.28

module load pycasp

module load hdf5

module load ffmpeg

Python-related modules are installed in a virtual environment, which can be activated with the following command:

. /home/hxx124/myPython/virtualenv-1.9/ENV/bin/activate

Python Wrapper:

Although we welcome audio processing tools implemented in any language, to make it easy to integrate several audio tools into one unified pipeline, we strongly recommend that developers wrap their code as a Python module, so that the pipeline can include their work simply by importing the corresponding module. All audio-related work from GSoC 2015 has been wrapped into a Python module called "AudioPipe". See below for some examples:

To specify a new pipeline, one can use the abstract syntax defined in the Data Management Module, where a Node is a place to store data and a Flow is a computational process that transforms input data to output data.
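For example, converting one video file to audio might look roughly like the following two lines. This is a reconstruction of the intent rather than the exact API (Main.ipynb has the authoritative syntax); dm stands for the Data Management Module, video and name are assumed to be defined earlier in the script, and the [sample_rate] list stands in for whatever arguments video2audio requires:

audio = dm.Node("Data/Audio/")                       # a place to store the audio outputs
audio.Flow(video, video2audio, name, [sample_rate])  # pull one file from the video node through video2audio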

This first creates a node for audio outputs, and then flows a specific file from the video node to the audio node through the computational process called video2audio. The name argument specifies exactly which file is going to be processed; it is also used as the output file name. The last argument of .Flow() is the list of arguments required by the computational process (video2audio in this case).

The pipeline takes the video and stores its fingerprints; takes the audio and produces diarization results; identifies the gender of each speaker from the audio, using the boundary information provided by the diarization results; and similarly recognizes the speakers from the audio.