Abstract: Seeing the articulatory gestures of the speaker significantly enhances auditory speech perception. A key issue is whether cross-modal speech interactions depend only on the well-known auditory and visual modalities or might also be triggered by other sensory sources less common in speech communication. The present electroencephalographic (EEG) and functional magnetic resonance imaging (fMRI) studies aimed to investigate cross-modal interactions between auditory, haptic, visuo-facial and visuo-lingual speech signals during the perception of others’ and our own production. In a first EEG study (n=16), auditory evoked potentials were compared during auditory, audio-visual and audio-haptic speech perception through natural dyadic interactions between a listener and a speaker. Shortened latencies and reduced amplitudes of early auditory evoked potentials were observed during both audio-visual and audio-haptic speech perception compared to auditory speech perception, providing evidence for early integrative mechanisms between auditory, visual and haptic information. In a second fMRI study (n=12), the neural substrates of cross-modal binding were determined during auditory, visual and audio-visual speech perception in relation to either facial or tongue movements of a speaker, recorded by a camera and an ultrasound system, respectively. In line with a sensorimotor nature of speech perception, overlapping activity was observed for both facial and tongue-related speech stimuli in the posterior part of the superior temporal gyrus-sulcus as well as in the premotor cortex and the inferior frontal gyrus. In a third EEG study (n=17), auditory evoked potentials were compared during the perception of auditory, visual and audio-visual stimuli related to our own speech gestures or those of a stranger.
Apart from a reduced amplitude of early auditory evoked potentials during audio-visual compared to auditory and visual speech perception, a self-advantage was also observed, with shortened latencies of early auditory evoked potentials for self-related speech stimuli. Altogether, our results provide evidence for bimodal interactions between auditory, haptic, visuo-facial and visuo-lingual speech signals. They further emphasize the multimodal nature of speech perception and demonstrate that multisensory speech perception is partly driven by sensory predictability and by the listener’s knowledge of speech production.