Description

Please scroll to the bottom for the Chinese-language summary and the introduction to subtask 1: a cappella Mandarin Chinese pop songs.

An automatic lyrics-to-audio alignment algorithm can be useful for karaoke lyrics display and for aligning lyrics with music videos. It is also a pre-processing step for singing voice synthesis and for joint analysis of audio and lyrics [Fujihara, H., & Goto, M. (2012)]. Most previous works use forced-alignment techniques stemming from the field of Automatic Speech Recognition [Loscos, A. et al. (1999), Mesaros, A. and Virtanen, T. (2008), Fujihara, H. et al. (2011)]. To improve alignment accuracy, many works also use additional musical side information extracted from the musical score, such as chord information [Mauch, M. et al. (2012)], note durations [Iskandar, D. (2006)] and syllable/phoneme durations [Kruspe, A. (2015), Dzhambazov, G. and Serra, X. (2015), Gong, R. et al. (2015), Pons, J. (2017)]. However, an open-source, fully automatic alignment system that uses no musical side information has not yet been realized. Possible reasons include:

The lack of a large, annotated, publicly available singing voice dataset

Interference from the instrumental accompaniment

The complexity of musical structure and the lack of clear singing phrase boundaries

The goal of the MIREX automatic lyrics-to-audio alignment task is to synchronize an audio recording of singing (a cappella or mixed) with its corresponding lyrics (at the syllable or word level). The onset and offset timestamps of a lyric unit can be estimated at different granularities, such as phoneme, syllable, word or phrase. For this task, syllable- or word-level alignment is required.

This task contains two subtasks:

A cappella Mandarin Chinese pop songs

Mixed English pop songs

Participants can submit their algorithms for either one of the subtasks or both, according to their interest and available time. Participants can use any external training dataset, and may modify or augment the datasets/annotations provided below.

A subtask 1 algorithm receives two inputs, the a cappella singing audio and its corresponding lyrics in pinyin format, and outputs the onset and offset timestamps (in seconds) of each pinyin syllable. Due to time constraints, we are not able to provide word-level annotations or a lexicon in simplified or traditional Chinese characters for the training datasets. Thus, we do not accept submissions that take the lyrics input in simplified or traditional Chinese characters. If you are willing to help verify the word-level annotation and build the lexicon, please check the Ask for contribution section.

Training Datasets (subtask 1)

MIR-1k Dataset

The original MIR-1k dataset can be downloaded here. It contains 1000 song clips, in which the musical accompaniment and the clean singing voice are recorded in the left and right channels, respectively. The duration of each clip ranges from 4 to 13 seconds, and the total length of the dataset is 133 minutes. The original dataset also contains the corresponding lyrics in traditional Chinese characters. We automatically converted the lyrics into pinyin format; here is the link. The pinyin lyrics have been manually corrected.
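The conversion tool is not specified here; as an illustration only, a minimal sketch of such an automatic character-to-pinyin conversion, assuming the open-source pypinyin library (not necessarily the tool actually used):

# Sketch of automatic Chinese-to-pinyin lyrics conversion, assuming the
# open-source pypinyin library (an assumption, not the organizers' tool).
from pypinyin import lazy_pinyin

def lyrics_to_pinyin(line):
    # lazy_pinyin returns one tone-less pinyin syllable per Chinese character
    return " ".join(lazy_pinyin(line))

print(lyrics_to_pinyin("我爱你"))  # -> wo ai ni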

Jingju a cappella singing Dataset

Jingju (also known as Peking or Beijing opera) is a form of Chinese opera which combines music, vocal performance, mime, dance and acrobatics. The language used in jingju is a combination of Beijing Mandarin and the Jiangsu, Anhui and Hubei dialects. The jingju a cappella singing dataset has 3 parts, each containing phrase-level annotations (annotation_txt files) in pinyin format.

Evaluation Datasets (subtask 1)

The dataset contains 10 Mandarin Chinese pop songs collected at the same time as the MIREX 2018 Mandarin pop song training dataset. Five songs are sung by amateur singers, and the other five are source-separated from mixed recordings.

Phonetization

We provide a syllable-level pinyin lexicon and a phoneme lexicon corresponding to the lyrics annotations of the training datasets.

Training Dataset (subtask 2)

The DAMP dataset contains a large number (34,000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app under different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, consisting of 20 performances of each of 300 songs, was created by Kruspe (2016). Here is the list of recordings.

Evaluation Datasets (subtask 2)

Hansen's Dataset

The dataset contains 9 English pop songs with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (they are copies of the next word's beginning timestamp) and are not used in the evaluation. Sentence-level annotations are also provided.
The audio comes in two versions: the original with instrumental accompaniment, and an a cappella version containing the singing voice only. An example song can be seen here.
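Because each ending timestamp is a copy of the next word's beginning timestamp, it can be reconstructed from the onsets alone; a minimal sketch (the function name and the final-word end time are illustrative assumptions):

# Sketch: derive the convenience ending timestamps from word onsets, as
# described above (each offset is a copy of the next word's onset).
def add_offsets(onsets, last_word_end):
    # Pair each word onset with the next word's onset; the final word
    # gets the supplied end time.
    return list(zip(onsets, onsets[1:] + [last_word_end]))

print(add_offsets([0.0, 5.223, 15.101], 20.334))
# -> [(0.0, 5.223), (5.223, 15.101), (15.101, 20.334)]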

Mauch's Dataset

The dataset contains 20 English pop songs with annotations of the beginning timestamp of each word. Non-vocal sections are not explicitly annotated (they remain included in the last preceding word). We prefer to leave it this way to enable comparison with previous work evaluated on this dataset.
The audio has instrumental accompaniment. An example song can be seen here.

Audio Format

The audio files are in wav format, 44.1 kHz (subtask 1: mono; subtask 2: stereo).

Evaluation

For both subtasks, the submitted algorithms will be evaluated at the word boundaries of the original mixed songs (a cappella singing + instrumental accompaniment). Evaluation metrics on the a cappella singing alone will also be reported, to give insight into the impact of the instrumental accompaniment on each algorithm, but will not be considered for the ranking.

Average absolute error/deviation. Initially used by Mesaros and Virtanen (2008), the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and end of each lyrical unit. The error is then averaged over all individual errors. A drawback of an absolute error is that an error of the same duration can be perceived differently depending on the tempo of the song.
Here is a test of using this metric.
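The linked test shows the reference usage; as a rough illustration only, a minimal sketch of the metric, assuming order-aligned lists of reference and estimated boundary timestamps:

# Sketch of the average absolute error metric (Mesaros & Virtanen, 2008):
# mean time displacement between reference and estimated boundaries.
def average_absolute_error(ref_times, est_times):
    # ref_times, est_times: equal-length lists of boundary timestamps (sec)
    assert len(ref_times) == len(est_times)
    return sum(abs(r - e) for r, e in zip(ref_times, est_times)) / len(ref_times)

print(average_absolute_error([0.0, 5.223, 15.101], [0.1, 5.0, 15.4]))  # ~0.207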

Percentage of correct segments. The perceptual dependence on tempo is mitigated by measuring the total length of correctly labeled segments as a proportion of the total duration of the song. This metric is suggested by Fujihara et al. (2011), Figure 9.
Here is a test of using this metric.
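Again as a rough illustration only, a minimal sketch, assuming order-aligned reference and estimated (onset, offset) segments so that correctness reduces to the overlap of each segment pair:

# Sketch of the percentage-of-correct-segments metric (Fujihara et al., 2011):
# fraction of the song's duration in which the estimated label matches the
# reference label, computed here from order-aligned segments.
def percentage_correct(ref_segs, est_segs, song_duration):
    # Total duration labeled correctly, as a fraction of the song's duration
    overlap = sum(max(0.0, min(r_off, e_off) - max(r_on, e_on))
                  for (r_on, r_off), (e_on, e_off) in zip(ref_segs, est_segs))
    return overlap / song_duration

ref = [(0.0, 5.223), (5.223, 15.101), (15.101, 20.334)]
est = [(0.0, 5.0), (5.0, 15.5), (15.5, 20.334)]
print(percentage_correct(ref, est, 20.334))  # ~0.97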

Submission Format

Submissions to this task will have to conform to the specified format detailed below. Submissions should be packaged and contain at least two files: the algorithm itself and a README containing contact information and describing, in full, the use of the algorithm.

Input Data

Participating algorithms will have to receive the following input format:

Audio in wav format, 44.1 kHz (subtask 1: mono; subtask 2: stereo).

Lyrics in a .txt file, where each word is separated by a space and each lyric phrase is separated by a line break (\n); see the parsing sketch below.
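For illustration, this lyrics format can be read into a list of phrases (each a list of words) with a few lines of Python; a minimal sketch:

# Sketch: parse the lyrics .txt input (words separated by spaces, phrases
# separated by line breaks) into a list of phrases, each a list of words.
def load_lyrics(path):
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]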

Output Data

Participating algorithms should write the alignment to the output file with one lyrical unit per line, in the following format:

<onset (sec)>\t<offset (sec)>\t<label>\n

where \t denotes a tab and \n denotes the end of the line. The < and > characters are not included. An example output file would look something like:

0.000 5.223 label1
5.223 15.101 label2
15.101 20.334 label3

where the label is a Mandarin pinyin syllable for subtask 1 and an English word for subtask 2.

NOTE: the offset timestamps column is used only by the percentage of correct segments metric. Skipping the second column is therefore acceptable, and would degrade performance on that metric only.
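A minimal sketch of writing this output format, assuming the alignment result is held as (onset, offset, label) tuples:

# Sketch: write alignment results in the required tab-separated output format.
def write_alignment(path, segments):
    # segments: list of (onset_sec, offset_sec, label) tuples
    with open(path, "w", encoding="utf-8") as f:
        for onset, offset, label in segments:
            f.write("%.3f\t%.3f\t%s\n" % (onset, offset, label))

write_alignment("output.txt", [(0.0, 5.223, "label1"), (5.223, 15.101, "label2")])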

Command line calling format

The submitted algorithm must take as arguments a .wav file, a .txt file, and the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input_audio, the lyrics .txt file as %input_txt, and the output file path and name as %output, a program called foobar would be called from the command line as follows:

foobar %input_audio %input_txt %output
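For a Python submission, this calling convention could be handled with the standard argparse module; a minimal sketch, where align() is a hypothetical placeholder for the participant's own algorithm:

# Sketch: command-line wrapper matching the required calling format
#   foobar %input_audio %input_txt %output
import argparse

def main():
    parser = argparse.ArgumentParser(description="Lyrics-to-audio alignment")
    parser.add_argument("input_audio")  # %input_audio: path of the .wav file
    parser.add_argument("input_txt")    # %input_txt: path of the lyrics .txt file
    parser.add_argument("output")       # %output: output file path and name
    args = parser.parse_args()
    # align() is a hypothetical placeholder; it should return a list of
    # (onset_sec, offset_sec, label) tuples.
    segments = align(args.input_audio, args.input_txt)
    with open(args.output, "w", encoding="utf-8") as f:
        for onset, offset, label in segments:
            f.write("%.3f\t%.3f\t%s\n" % (onset, offset, label))

if __name__ == "__main__":
    main()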

README File

A README file accompanying each submission should state which subtask(s) the submission participates in and contain explicit instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.

Packaging submissions

Please provide submissions as a binary or source code.

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, a hard limit of 24 hours will be imposed on the analysis time of each submission. Submissions exceeding this limit may not receive a result.

Submission opening and closing dates

Closing date: August 11th 2018

Ask for contribution

Your contribution will help make a better task next year. Two types of contribution come to mind:

Annotation

You can help us verify and correct the word segmentation in the MIR-1k and MIREX 2018 Mandarin pop song datasets.

You can help us add missing words to existing open-source Mandarin lexicons, such as thchs and Aishell.

Dataset

You can provide any training or evaluation singing voice dataset that can be made publicly available.

Another idea for contribution? Please open an issue in the GitHub repo or send us an email.