Experimenting

An important part of the learning experience associated with this course (and 40% of the grade) comes from experimenting with the algorithms presented in class. This page describes what is expected from the students. Feel free to ask questions below.

The objectives are the following:

get hands-on experience with some of the algorithms presented in the course

practice the writing of an experimental journal (e.g. on a blog dedicated to your experiments for this course), describing your ideas, experimental plans, experimental results, and discussions of potential conclusions (i.e., the stuff that eventually ends up in papers)

practice the use of collaborative tools for writing code, using a repository dedicated to your experimental work (e.g., with github)

the work of each student (in the code repository and in the blog) is available to the others to build upon, thus speeding up the overall rate of progress of the group

each student is encouraged to re-use the ideas, results, tricks, and code of other students but MUST properly cite and acknowledge these inputs (re-use without citation is plagiarism and will be severely punished)

each student competes to obtain good results on common benchmarks, but can take advantage of the good ideas of the others — hence the name "collaborative competition".

An important part of the grade will come from having been the first to do something useful and publicize it on your blog (possibly posting announcements here with links to the blog). The more a contribution helps advance everyone else's progress, the more points it will earn. This should provide an incentive to quickly do things that may otherwise look boring but could be useful to others.

For now we will get started by playing with the TIMIT dataset and use it to experiment with the task of speech synthesis, i.e., mapping a sequence of symbols (phonemes or words) to an acoustic sequence (e.g. audio samples). Information about the speaker could also be used (so that eventually we could use such a model to imitate someone's voice and make him or her say something other than what is available in a recording).

More information about the dataset will soon be added here. For now you can find a page that gives information about the data and previous papers there:

As a starting exercise, I suggest just training a simple model (linear, or a feedforward neural net) with squared error and a single scalar output (the next acoustic sample), given a fixed window of past inputs (the acoustic samples that precede it).
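For the linear case, this exercise can be sketched in a few lines of NumPy. This is only an illustration of the setup, not a prescribed solution: the sine wave stands in for real TIMIT audio, and the window size of 16 is an arbitrary choice.

```python
import numpy as np

# Sketch of the starting exercise: a linear predictor of the next acoustic
# sample from a fixed window of past samples, trained with squared error.
# A noisy sine wave stands in for real TIMIT audio (assumption for the demo).
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 0.01 * np.arange(2000)) + 0.01 * rng.standard_normal(2000)

window = 16  # number of past samples fed to the model (arbitrary choice)

# Build (X, y): each row of X is a window of past samples, y is the next one.
X = np.stack([signal[i : i + window] for i in range(len(signal) - window)])
y = signal[window:]

# Closed-form least-squares fit; minimizing squared error, the same objective
# a linear model trained by gradient descent would converge to.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

pred = X @ w
mse = np.mean((pred - y) ** 2)
print(f"train MSE: {mse:.6f}")
```

Swapping the closed-form fit for stochastic gradient descent on the same squared error gives the feedforward-neural-net variant of the exercise; the data preparation is identical.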

Hi, I’m trying to isolate data for training and testing a model. I analysed both your numpy file and the (readable) raw wave file. Can you specify what transform you applied to the data? For example, the sample amplitudes in the numpy array and in the raw data are not the same. Also, to fit all the samples (which I believe have varying durations depending on the sentence) into a matrix of fixed shape, what origin time point did you use for shifting each wave?

I would add that one important feature of a representation, for the application of speech synthesis, is that there exists a simple and invertible mapping that allows us to recover the acoustic signal from the representation, with small enough loss.

Ideally, a good representation is also one that is ‘compressible’, so that it can be ‘controlled’ by generating fewer real numbers per second than the acoustic signal itself.
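As a small concrete illustration of these two properties (my choice of example, not one prescribed above), mu-law companding is a classic invertible mapping used for speech: it compresses samples in [-1, 1] down to 8-bit codes, and the inverse mapping recovers the waveform with small loss.

```python
import numpy as np

# Mu-law companding: a simple invertible, compressive representation.
# Encoding maps waveform samples in [-1, 1] to integer codes in [0, 255];
# decoding inverts the mapping with a small quantization loss.
MU = 255  # standard mu-law constant (8-bit telephony)

def mu_law_encode(x, mu=MU):
    """Map samples in [-1, 1] to integer codes in [0, mu]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * mu).astype(np.int32)

def mu_law_decode(codes, mu=MU):
    """Inverse mapping: integer codes back to samples in [-1, 1]."""
    y = 2 * codes.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.sin(2 * np.pi * 0.01 * np.arange(1000))  # stand-in waveform
rec = mu_law_decode(mu_law_encode(x))
print("max reconstruction error:", np.max(np.abs(rec - x)))
```

The round trip is lossy only through the 8-bit quantization; the logarithmic warping spends more of those 8 bits on small amplitudes, which is where speech carries most of its detail.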

An interesting option that I would like to consider is to *learn* a representation: we still produce the acoustic signal directly, but we design an “output layer” that somehow maps a more compact internal representation to the acoustic signal.
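One way to read this idea (my sketch under assumed dimensions, not the author's design) is a network whose per-frame internal code is low-dimensional, followed by a linear output layer that expands that code into a frame of raw samples. The forward pass below only shows the shapes of such a mapping; the weights are random and untrained.

```python
import numpy as np

# Sketch of a "learned output layer": a compact internal code per frame,
# linearly expanded to raw audio samples. Dimensions are illustrative.
rng = np.random.default_rng(0)

n_in, n_hidden, n_code, frame_len = 64, 128, 8, 160

# Random (untrained) weights, just to show the structure of the mapping.
W1 = rng.standard_normal((n_in, n_hidden)) * 0.01
W2 = rng.standard_normal((n_hidden, n_code)) * 0.01
W_out = rng.standard_normal((n_code, frame_len)) * 0.01  # the "output layer"

def synthesize_frame(x):
    """Map one input window to a frame of audio via a compact code."""
    h = np.tanh(x @ W1)      # hidden layer
    code = np.tanh(h @ W2)   # compact representation: n_code numbers per frame
    return code @ W_out      # linear expansion back to frame_len samples

frame = synthesize_frame(rng.standard_normal(n_in))
print(frame.shape)
```

Here 160 output samples per frame are controlled by only 8 code values, so the learned code is exactly the kind of compressed, controllable representation discussed above; training end-to-end with squared error on the samples would shape both the code and the output layer jointly.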

I mentioned during last class the Blizzard Challenge, which is a yearly speech synthesis challenge. Their datasets are available for free for non-commercial use. This is the Challenge website: http://www.festvox.org/blizzard/

Data and tools can be downloaded from this site: http://www.cstr.ed.ac.uk/projects/blizzard/ (you have to accept their license to be able to create an account). I did not check all the datasets but the “roger” voice has phone labels and hand-annotated prosodic labels (which can be used to generate intonation/stress in sentences).

Performance is measured through listening tests, where subjects evaluate synthesizers by rating synthesized speech on different scales (overall quality, naturalness, pauses, pleasantness, intonation, emotion, …). The results are published as a paper for each year the challenge took place (the first paper in each year’s page in the first link I posted).