Community Tutorials

Google Cloud Platform Community tutorials are submitted by the community and do
not represent official Google Cloud Platform product documentation.

Oftentimes, the raw data you've gathered is not in a form that is directly
explorable using the data exploration tools at your disposal. Making it usable
may require converting the format, extracting the information type you're
seeking, or adding metadata to further structure the data.

In this tutorial, you'll write several functions that transform raw audio files
and extract structured, queryable data from them. These functions can then be
combined into a reusable data ingestion pipeline, as described in the
preprocessing tutorial.

Objectives

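This data extraction pipeline example can be described as a series of discrete
steps: converting the file, transcribing the audio using the Speech API, and
analyzing the syntax of the resulting text.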

Convert the file

The file as downloaded from LibriVox is in a zip archive, whose contents we'll
need to convert into a format that the Speech API accepts. To do this, we'll
first define a function that unzips the archive, producing each file in turn.
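A minimal sketch of such an unzipping function, using only Python's standard `zipfile` module. The name `iter_zip_members` and the placeholder file name in the demo are illustrative assumptions, not names taken from the tutorial's own code.

```python
import io
import zipfile


def iter_zip_members(archive):
    """Yield (name, data) for each regular file in a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, they carry no audio
            yield info.filename, zf.read(info)


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs without a download.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("example_track.mp3", b"placeholder bytes")
    buf.seek(0)
    for name, data in iter_zip_members(buf):
        print(name, len(data))
```

Because the function is a generator, only one member's bytes are held in memory at a time, which matters when an archive holds many long recordings.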

Now that we have the mp3 contents of the zip archive, we must transcode them
to a format the API accepts. Currently, for audio longer than 1 minute, the
audio must be in raw, monaural, 16-bit little-endian format. We'll also make an
attempt to preserve the original sample rate.
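One way to sketch this transcoding step is to shell out to `ffmpeg`. This assumes `ffmpeg` is installed and on the `PATH`, and it is not necessarily the tool the tutorial's own code uses. The function names are hypothetical. Leaving out `-ar` lets ffmpeg keep the source sample rate, and the small duration helper shows the arithmetic behind the one-minute limit for raw mono 16-bit audio.

```python
import subprocess


def build_transcode_command(src_path, dst_path, sample_rate=None):
    """Build an ffmpeg command that writes raw, mono, 16-bit little-endian PCM."""
    cmd = [
        "ffmpeg",
        "-y",                    # overwrite dst_path if it already exists
        "-i", src_path,          # input file, for example an mp3
        "-ac", "1",              # downmix to a single (monaural) channel
        "-f", "s16le",           # headerless signed 16-bit little-endian output
        "-acodec", "pcm_s16le",  # matching PCM codec
    ]
    if sample_rate is not None:
        cmd += ["-ar", str(sample_rate)]  # resample only when explicitly asked
    cmd.append(dst_path)
    return cmd


def transcode_to_raw(src_path, dst_path, sample_rate=None):
    """Run ffmpeg; raises subprocess.CalledProcessError if it fails."""
    subprocess.run(build_transcode_command(src_path, dst_path, sample_rate), check=True)


def raw_duration_seconds(size_bytes, sample_rate):
    """Duration of raw mono 16-bit audio: bytes / (2 bytes per sample * rate)."""
    return size_bytes / (2 * sample_rate)
```

For instance, a raw file of 3214080 bytes at 24000 Hz works out to roughly 67 seconds, which is past the one-minute threshold.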

We've now processed our source data into a format ready to be consumed by the
Speech API.

Transcribe the audio using the Speech API

To extract text data from our prepared audio file, we issue an asynchronous
request to the Google Cloud Speech API, then poll the API until it
finishes transcribing the file.
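The request-then-poll pattern can be sketched independently of any particular API. In this hypothetical helper, `check` stands in for whatever call asks the service whether the operation has finished. It is assumed to return `None` while work is still in progress. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def poll_until_done(check, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a non-None result or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result  # the operation has finished
        if clock() >= deadline:
            raise TimeoutError("operation did not finish in time")
        sleep(interval)  # wait before asking again
```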

Upload the audio file to Google Cloud Storage

Because the audio we're transcribing is longer than a minute, we must
first upload the raw audio file to Cloud Storage, so the Speech API
can access it asynchronously. We could use the
gsutil tool to do this manually, or we could
do it programmatically from our code. Because we'd like to eventually
automate this process in a pipeline,
we'll do this in code.
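A minimal sketch of the upload, assuming the `google-cloud-storage` Python package and Application Default Credentials. The function name and the optional `client` parameter are illustrative assumptions. Passing a client in makes the function easy to exercise without network access.

```python
def upload_to_gcs(local_path, bucket_name, blob_name, client=None):
    """Upload a local file to Cloud Storage and return its gs:// URI."""
    if client is None:
        # Imported lazily so this module loads even without the package.
        from google.cloud import storage
        client = storage.Client()  # picks up Application Default Credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"
```

The returned `gs://` URI is the form the Speech API expects when it reads audio from Cloud Storage.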

Make the Speech API call

All calls to the Speech API must be authenticated, so make sure you've set up
your service account correctly, as mentioned in the
prerequisites. In the following code, we'll use the API's
[client library][speech-client] to create an authenticated service object, which
we'll use to make the API call.
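As a rough sketch of this call, the following uses the current `google-cloud-speech` Python package, which may differ from the client library version the complete `transcribe.py` file relies on. The function names, the `en-US` default, and the timeout value are assumptions for illustration. Here `operation.result()` waits for the long-running operation on our behalf.

```python
def transcribe_gcs(gcs_uri, sample_rate, language_code="en-US", timeout=900):
    """Transcribe raw LINEAR16 audio stored in Cloud Storage.

    Returns a list of (confidence, transcript) pairs, one per result.
    """
    from google.cloud import speech  # lazy import; needs google-cloud-speech

    client = speech.SpeechClient()  # authenticates with default credentials
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        language_code=language_code,
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=timeout)  # blocks until done
    return [
        (result.alternatives[0].confidence, result.alternatives[0].transcript)
        for result in response.results
        if result.alternatives
    ]


def format_transcripts(pairs):
    """Render (confidence, transcript) pairs as '(confidence): transcript'."""
    return [f"({confidence}): {transcript}" for confidence, transcript in pairs]
```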

You can find the complete file [here][transcribe.py]. Running it produces:

$ python transcribe.py --rate=24000 gs://data-science-getting-started/fables_01_02_aesop_64kb.raw --size=3214080
(0.982679188251): this is a LibriVox recording all LibriVox recordings are in the public domain for more information or to volunteer please visit librivox.org
(0.950583994389): Aesop's Fables the goose that laid the golden egg
(0.941175699234): a man and his wife had the Good Fortune to possessive which laid the golden egg everyday lucky though they were they soon begin to think that they were not getting rich fast and and Imagining the bird must be made of gold inside they decided to kill it in order to secure the whole store of precious metal at 1 but when they cut it open a found it was just like any other Goose this thing either got rich all at once as they had hoped you enjoyed any longer the daily addition to their well much once more and Luther
(0.80675303936): Inns of the goose that laid the golden egg

Analyze the syntax

A text transcription of audio is fine and good, but it's hard to extract
meaningful insight from natural language, since it's difficult for machines to
discern its structure. For this, we can use the
Cloud Natural Language API to extract the syntax from the
text.

With the Natural Language API, parsing the syntax of the text is a simple API
call:
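A hedged sketch of that call, assuming the `google-cloud-language` Python package and its `language_v1` module, which may not match the client the tutorial's own code uses. The helper `tokens_to_rows` is a hypothetical addition that flattens the returned tokens into plain dictionaries, a shape that is convenient to load into a queryable store.

```python
def analyze_syntax(text):
    """Return the syntax tokens the Natural Language API finds in `text`."""
    from google.cloud import language_v1  # lazy import; needs google-cloud-language

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_syntax(
        request={
            "document": document,
            "encoding_type": language_v1.EncodingType.UTF8,
        }
    )
    return response.tokens


def tokens_to_rows(tokens):
    """Flatten tokens into dictionaries with text, lemma, tag and head index."""
    rows = []
    for index, token in enumerate(tokens):
        tag = token.part_of_speech.tag
        rows.append({
            "index": index,
            "text": token.text.content,
            "lemma": token.lemma,
            "pos": getattr(tag, "name", str(tag)),  # enum name when available
            "head": token.dependency_edge.head_token_index,
        })
    return rows
```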