Identifying the active speaker in a video of a distributed meeting can help
remote participants understand the dynamics of the meeting. A
straightforward application of such analysis is to stream a high-resolution video
of the speaker to the remote participants. In this paper, we present the
challenges we met while designing a speaker detector for the Microsoft RoundTable
distributed meeting device, and propose a novel boosting-based multimodal speaker
detection (BMSD) algorithm. Instead of separately performing sound source
localization (SSL) and multi-person detection (MPD) and
subsequently fusing their individual results, the proposed algorithm fuses audio
and visual information at feature level by using boosting to select features from
a combined pool of both audio and visual features simultaneously. The result is a
very accurate and highly efficient speaker detector. In experiments
that include hundreds of real-world meetings, the proposed BMSD algorithm
reduces the error rate of the SSL-only approach by 24.6{\%}, and that of the SSL and
MPD fusion approach by 20.9{\%}. To the best of our knowledge, this is the first
real-time multimodal speaker detection algorithm that has been deployed in
commercial products.
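
The feature-level fusion idea can be illustrated with a minimal sketch. It is not the BMSD implementation: it uses plain discrete AdaBoost with decision stumps, and the feature names, toy data, and labelling rule (a window counts as the speaker only when both the audio and the visual cue are high) are assumptions made for illustration. What it shows is a single booster choosing weak classifiers from one pool that mixes audio and visual features, rather than training one detector per modality and fusing their outputs afterwards.

```python
import math

# Hypothetical feature pool: one audio and one visual feature per candidate window.
FEATURES = ["audio:ssl_score", "visual:motion"]

# Toy data (assumed): each row is one candidate window, columns follow FEATURES.
X = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
y = [-1, -1, -1, 1]  # +1 = active speaker, -1 = not the speaker


def stump_predict(row, j, t, p):
    """Decision stump: vote p if feature j exceeds threshold t, else -p."""
    return p if row[j] > t else -p


def train_stump(X, y, w):
    """Search every feature in the combined pool for the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # A threshold below the minimum gives a constant stump (acts as a bias).
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            for p in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, p) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, p)
    return best


def adaboost(X, y, rounds):
    """Discrete AdaBoost; returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, t, p = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, p))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, p))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of all selected stumps."""
    score = sum(a * stump_predict(row, j, t, p) for a, j, t, p in model)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    for alpha, j, t, p in adaboost(X, y, rounds=3):
        print(f"{FEATURES[j]:16s} threshold={t:+.2f} polarity={p:+d} alpha={alpha:.3f}")
```

On this toy data, three rounds are enough to classify all four windows correctly, and the selected stumps draw on both the audio and the visual feature. Neither feature alone separates the classes here, which is the motivation for fusing at the feature level.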