Visual information is transformed from early sensory formats into increasingly abstract representations of its content. We probed this abstraction by examining the convergence of neural responses to pictures of objects and to their spoken names, taking a broad look at several major semantic divisions between object categories. Our aim was to identify which neural regions show reliable responses that are specific to pictures, specific to auditory words, or common to both modalities, using a data-driven clustering approach. Using fMRI, we measured neural response patterns to objects from 18 broad semantic categories, presented as both pictures and auditory words, in 16 participants. We used a clustering technique to group voxels by the similarity of their response profiles over these 18 categories, in a way that is agnostic to where the voxels are located and to whether they reflect responses from the visual or auditory modality. The key advantage of this procedure is that it simultaneously discovers common and unique structure across both modalities without presupposing any regional boundaries. This analysis identified several regions with similar response profiles for pictures and words (parahippocampal cortex, transverse occipital sulcus, retrosplenial cortex), which primarily showed a response preference for inanimate object categories. In contrast, other neural regions were reliably modulated only by pictorial stimuli (lateral occipital, fusiform), and these regions primarily showed a response preference for pictures of animate entities. Taken together, these results demonstrate a surprising and currently unexplained link between the neural organization of broad object domains and their activation by different modalities.
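
As a rough illustration of the pooling idea described above, the following sketch clusters 18-category response profiles from both modalities in a single pool and then checks which modalities each cluster contains. It is a toy example under stated assumptions, not the analysis used in the study: the abstract does not name the clustering algorithm, so plain k-means stands in for it; the data are synthetic; and all names (`inanimate_pref`, `animate_pref`, `pooled`, `composition`, and so on) are hypothetical.

```python
import random
from collections import Counter

random.seed(0)
N_CATEGORIES = 18  # number of semantic categories, as in the text

def noisy(proto, sigma=0.1):
    """Return a prototype profile with Gaussian noise added."""
    return [v + random.gauss(0, sigma) for v in proto]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(profiles):
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def kmeans(profiles, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [profiles[0]]
    while len(centroids) < k:
        # Next seed: the profile farthest from its nearest existing centroid.
        centroids.append(
            max(profiles, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in profiles]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            updated.append(mean_profile(members) if members else centroids[j])
        if updated == centroids:  # assignments have stopped changing
            break
        centroids = updated
    return labels, centroids

# Synthetic prototypes (assumed): the first 9 categories play the role of
# "inanimate" and the last 9 the role of "animate".
inanimate_pref = [1.0] * 9 + [0.0] * 9
animate_pref = [0.0] * 9 + [1.0] * 9
flat = [0.0] * N_CATEGORIES  # no reliable category modulation

# Pool picture and word profiles; the clustering never sees the tags.
pooled, tags = [], []
for voxel in range(40):
    if voxel < 20:   # toy voxels with the same profile in both modalities
        pic, word = noisy(inanimate_pref), noisy(inanimate_pref)
    else:            # toy voxels modulated by pictures only
        pic, word = noisy(animate_pref), noisy(flat)
    pooled += [pic, word]
    tags += [(voxel, "picture"), (voxel, "word")]

labels, centroids = kmeans(pooled, k=3)

# Only after clustering, ask which modalities ended up in each cluster.
composition = {j: Counter() for j in range(3)}
for (voxel, modality), lab in zip(tags, labels):
    composition[lab][modality] += 1

for j, counts in composition.items():
    print(j, dict(counts))
```

In this toy setup, a cluster that contains both picture and word profiles corresponds to a response profile shared across modalities, whereas a cluster containing profiles from only one modality corresponds to modality-specific structure. Real data would additionally require a reliability criterion and a principled choice of the number of clusters, neither of which is modeled here.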