A persistent controversy in language evolution research has been whether language
emerged in the gestural-visual or in the vocal-auditory modality. A “dialectic” solution
to this age-old debate has recently been gaining ground: language was fully multimodal
from the start, and remains so to this day. In this paper, we show this solution to be too
simplistic and outline a more specific theoretical proposal, which we designate as
pantomime-first. To decide between the multimodal-first and pantomime-first
alternatives, we review several lines of interdisciplinary evidence and complement it
with a cognitive-semiotic experiment. In the study, participants watched recordings of
short transitive events enacted by four actors and then matched them to hand-drawn
images. The events were enacted in two conditions: visual (body movement only) and
multimodal (body movement accompanied by nonlinguistic vocalization). Significantly,
matching accuracy was greater in the visual than in the multimodal condition, though a follow-up
experiment revealed that the emotional profiles of the events enacted in the multimodal
condition could be reliably detected from the sound alone. We see these results as
supporting the proposed pantomime-first scenario.