The mechanisms responsible for integrating sensory information from different modalities have become a topic of intense interest in psychophysics and neuroscience. Many authors now claim that early, sensory-based cross-modal convergence improves performance in detection tasks. An important strand of evidence supporting this claim rests on statistical models such as the Pythagorean model or the probabilistic summation model. These models establish statistical benchmarks representing the best performance predicted under the assumption that there are no interactions between the two processing paths. Following this logic, when observed detection performance surpasses the predictions of these models, the improvement is often taken to indicate early cross-modal convergence. We present a theoretical analysis scrutinizing some of these models and the statistical criteria most frequently used to infer early cross-modal interactions in detection tasks. Our analysis shows how common misinterpretations of these models lead to their inadequate use and, in turn, to contradictory results and misleading conclusions. To further illustrate this point, we introduce a model that accounts for performance in multimodal detection tasks but in which surpassing the Pythagorean or probabilistic summation benchmark can be explained without resorting to early sensory interactions. Finally, we report three experiments that put our theoretical interpretation to the test and further clarify how to adequately measure multimodal interactions in audio-tactile detection tasks.
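The two no-interaction benchmarks mentioned above can be computed directly from unimodal performance. A minimal sketch, assuming the standard textbook forms of these benchmarks (probabilistic summation on the hit-rate scale; the Pythagorean rule on the d' sensitivity scale); the function names and example values are our own, purely illustrative:

```python
import math

def probabilistic_summation(p_a, p_t):
    """No-interaction benchmark on the probability scale: the chance
    that at least one of two independent channels detects the signal."""
    return p_a + p_t - p_a * p_t

def pythagorean_dprime(d_a, d_t):
    """No-interaction benchmark on the d' scale: sensitivity of an
    optimal combination of two independent Gaussian channels."""
    return math.sqrt(d_a ** 2 + d_t ** 2)

# Hypothetical unimodal values for an audio (a) and tactile (t) channel
p_benchmark = probabilistic_summation(0.6, 0.5)  # = 0.8
d_benchmark = pythagorean_dprime(1.0, 1.0)       # = sqrt(2) ≈ 1.414
```

Under the inferential logic the abstract describes, observed bimodal performance above `p_benchmark` (or `d_benchmark`) would be taken as evidence of early cross-modal interaction; the analysis in the paper questions exactly that inference.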