Faces and voices of familiar people are mutually informative, i.e. hearing a familiar person's voice allows the observer to infer the speaker's face and vice versa. Development of this cross-modal knowledge may be due to simple associative pairing or may represent a specialized process in which faces and voices are bound into an ‘identity’. Here, we present two experiments suggesting that binding into an identity is essential to efficiently learning face-voice pairs. In both experiments we compared how well people learned to match faces and voices across three types of face-voice pairs: when the faces and voices were recorded from the same individual (‘True Voice’), when they belonged to different individuals of the same gender (‘Gender Matched’), and when they belonged to individuals of different genders (‘Gender Mismatched’). In Experiment 1, where the faces and voices were presented statically, subjects showed much better performance in the Gender Matched than in the Gender Mismatched condition, as well as a smaller advantage for the True Voice over the Gender Matched condition. These results suggest that when faces and voices are congruent, and are thus likely to be bound into an identity, learning is improved relative to when they are incongruent. In Experiment 2, we introduced a dynamic condition, where the audio of the false voices (both Gender Matched and Gender Mismatched) was dubbed onto the video of the paired face. Performance for the Gender Mismatched pairs showed strong improvement in the dynamic condition relative to the static condition. No such difference between static and dynamic conditions was found for the other, congruent, face-voice pair conditions. These results suggest that the dubbing of the incongruent face-voice pairs ‘forced’ them to be bound into an identity, improving learning. We conclude that binding into an identity is a critical factor in developing cross-modal knowledge of faces and voices.