深層学習を用いた画像変換に基づく会話からの音声抽出

Speech extraction from conversation based on image-to-image translation using deep neural networks

シンソウ ガクシュウ オ モチイタ ガゾウ ヘンカン ニ モトズク カイワ カラノ オンセイ チュウシュツ

Author

Name

高市 晃佑

Reading

タカイチ コウスケ

Alternate name

TAKAICHI Kosuke

Name

片上 敬雄

Reading

カタガミ ヨシオ

Alternate name

KATAGAMI Yoshio

Name

黒澤 義明

Reading

クロサワ ヨシアキ

Alternate name

KUROSAWA Yoshiaki

Name

目良 和也

Reading

メラ カズヤ

Alternate name

MERA Kazuya

Name

竹澤 寿幸

Reading

タケザワ トシユキ

Alternate name

TAKEZAWA Toshiyuki

Abstract

We aim to separate sound sources with deep neural networks, which have attracted much attention in recent years. Specifically, we attempt to extract a particular human voice from ordinary conversation using such networks. We focus on an image-to-image translation method, pix2pix. Because the pix2pix algorithm is based purely on image-processing procedures, an additional step is required: we first convert the voice into a spectrogram. We then train the networks to separate human voices, paying particular attention to separation between same-sex and opposite-sex speakers. From this point of view, we conducted two experiments in this paper using audio in which two voices were overlapped. The structural similarity (SSIM) index and a color-map representation were used as evaluation criteria. As a result, we confirmed that a female voice was extracted well from a mixture of male and female voices. However, we could not extract a female voice from a same-sex mixture. Although we concluded that same-sex separation did not work well, the generated voice sounded natural when played back. Because this is a subjective judgment, objective evaluation of the generated voice remains future work.
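Two of the steps the abstract relies on — converting a voice into a spectrogram so that pix2pix can treat it as an image, and scoring a pair of spectrograms with SSIM — can be sketched as follows. This is a minimal NumPy sketch, not the authors' pipeline: the FFT size, hop length, and the simplified single-window (global) form of SSIM are illustrative assumptions; standard SSIM is computed over local windows and averaged.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    # Rows: frequency bins; columns: time frames (image-like layout).
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

def ssim(x, y, c1=1e-4, c2=9e-4):
    """Simplified global SSIM between two equal-sized arrays.

    Uses one window covering the whole image; c1 and c2 are small
    stabilizing constants, as in the standard SSIM definition.
    """
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2) /
            ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

# Toy example: a pure tone ("target voice") vs. the same tone with a
# second, lower tone mixed in ("overlapped conversation").
t = np.arange(8000) / 8000.0
target = np.sin(2 * np.pi * 220 * t)
mixed = target + 0.5 * np.sin(2 * np.pi * 110 * t)

s_target = spectrogram(target)
s_mixed = spectrogram(mixed)

print(ssim(s_target, s_target))  # identical inputs: 1 (up to rounding)
print(ssim(s_target, s_mixed))   # mixture differs: strictly below 1
```

In the paper's setting, the extracted spectrogram produced by the network would take the place of `s_mixed`, and a higher SSIM against the ground-truth spectrogram would indicate better extraction.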