I agree with earlier correspondents that much of the confusion comes
from thinking of signals in Fourier terms. The ear performs a
spectral analysis, but that analysis is not properly represented
by a windowed FFT or a spectrogram.

The cochlea performs a wavelet transform, which is better simulated
with an auditory filterbank (e.g. Unoki et al., 2006). The output of
each filter is encoded by auditory nerve fibres that phase lock at
speech frequencies, so there is no question that phase information
gets into the auditory system. A summary of the early literature is
presented in Patterson (1987). That paper summarizes our
understanding of monaural phase perception and provides new data
supporting earlier theories, which basically say:

The auditory system preserves phase changes that alter the envelope
of the wave coming out of an individual auditory filter
(within-channel changes). Reverberation can produce this kind of
change in a speech signal.

The auditory system loses most of the phase information that defines
time delays between channels (across-channel changes). These global
phase shifts are encountered in signal transmission.

So one answer is to assess the phase changes you are concerned about
by passing the signals through an auditory filterbank and checking
whether there are within-channel differences that MFCCs do not
preserve.
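
To make that concrete, here is a minimal Python sketch of the kind of
check I mean. It is my own illustration, not code from any of the
papers above: the gammatone filter order, the ERB bandwidth formula,
the channel centre frequencies and the toy harmonic complex standing
in for speech are all assumptions chosen to keep the example
self-contained.

import numpy as np
from scipy.signal import fftconvolve, hilbert

def gammatone_ir(fc, fs, order=4, dur=0.05):
    # Impulse response of a gammatone filter centred at fc (Hz),
    # using the Glasberg & Moore ERB formula for the bandwidth.
    t = np.arange(0.0, dur, 1.0 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))          # unit-energy channels

def channel_envelopes(x, fs, centre_freqs):
    # Hilbert envelope of x at the output of each filterbank channel.
    return np.array([
        np.abs(hilbert(fftconvolve(x, gammatone_ir(fc, fs), mode="same")))
        for fc in centre_freqs
    ])

fs = 16000
t = np.arange(0.0, 0.5, 1.0 / fs)
f0, n_harm = 100.0, 30
rng = np.random.default_rng(0)

# Toy 'vowel-like' signal: harmonic complex in cosine phase, and the
# same magnitude spectrum with the component phases randomized.
x_cos = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, n_harm + 1))
x_rnd = sum(np.cos(2 * np.pi * k * f0 * t + rng.uniform(0, 2 * np.pi))
            for k in range(1, n_harm + 1))

cfs = np.geomspace(100.0, 3000.0, 24)           # channel centre frequencies
env_cos = channel_envelopes(x_cos, fs, cfs)
env_rnd = channel_envelopes(x_rnd, fs, cfs)

# Relative per-channel envelope difference: the 'within-channel
# difference' referred to above, which magnitude-only MFCCs ignore.
diff = np.mean(np.abs(env_cos - env_rnd), axis=1) / np.mean(env_cos, axis=1)
for fc, d in zip(cfs, diff):
    print(f"{fc:7.1f} Hz   relative envelope change: {d:.2f}")

In channels wide enough to pass several harmonics the random-phase
version shows a large envelope change, while channels that resolve a
single harmonic barely change; it is the former, within-channel kind
of difference that the ear can hear and that MFCCs discard.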

Subsequent experiments, like that of Gockel et al. (2002), suggest,
as Laszlo Toth intuited, that phase changes which disrupt glottal
pulse integrity reduce detectability in noise, and that the effect
is greater when the glottal pulse rate is lower.

I've been confused about the role of "phase" information in the
sound (e.g. speech) signal in speech recognition and, more
generally, in human perception of audio signals. I've been reading
conflicting arguments and publications regarding the extent of the
importance of phase information. If there is a border between
short-term and long-term phase information that clarifies this
extent of importance, can anybody please point me to a convincing
reference in that respect? In summary, I just want to know what the
consensus in the community is about the role of phase in speech
recognition, if indeed there is any consensus at all.