Feedforward models of visual processing achieve human-level object-recognition performance and provide state-of-the-art predictions of temporally averaged neural responses. However, the primate visual system processes information through dynamic recurrent signaling. Here we characterize and model the representational dynamics of visual processing along multiple areas of the human ventral stream by combining source-reconstructed magnetoencephalography data and deep learning. Our analyses of the empirical data revealed neural responses that traverse distinct encoding schemes across time and space, in line with signatures of recurrent signaling. Next, we estimated the ability of different deep network architectures to capture the neural dynamics by using neural representational trajectories as space- and time-varying target functions. Feedforward models with units that ramp up their activity over time predicted nonlinear representational dynamics, but failed to account for the neural effects. Recurrent models of matched parametric complexity explained the held-out data significantly better. We next optimized the recurrent networks for a classification objective alone. While these networks performed significantly better than random networks, the variance they explained fell short of the architecture’s capacity. This motivates the search for additional objectives that the ventral stream may optimize, including category-orthogonal properties, robustness to noise and occlusion, manipulability, and semantics.
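The architectural contrast at the heart of the model comparison — feedforward units whose responses ramp up over time versus recurrent units that feed their own state back — can be sketched in a few lines of NumPy. This is a minimal illustration of the two update rules, not the paper's actual models; all dimensions, weights, and gain schedules here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                        # illustrative number of units
x = rng.normal(size=n)                       # a fixed input pattern
W = rng.normal(size=(n, n)) / np.sqrt(n)     # input weights (assumed random)
U = rng.normal(size=(n, n)) / np.sqrt(n)     # recurrent weights (assumed random)

def relu(z):
    return np.maximum(z, 0.0)

def cos(a, b):
    """Cosine similarity; small epsilon guards against zero vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Feedforward model with ramping units: each unit's response to the fixed
# input grows with its own time-varying gain, so the population pattern can
# change nonlinearly over time even though no signal is fed back.
base = relu(W @ x)
gains = rng.uniform(0.1, 1.0, size=(3, n)).cumsum(axis=0)  # per-unit ramps
ff = [g * base for g in gains]

# Recurrent model: the hidden state re-enters the computation at every step,
# so the representational trajectory is shaped by feedback, not just gain.
h = np.zeros(n)
rec = []
for _ in range(3):
    h = relu(W @ x + U @ h)
    rec.append(h.copy())

# Compare how much the population pattern rotates from the first to the
# last time step under each scheme (values are seed-dependent).
print("feedforward drift:", cos(ff[0], ff[-1]))
print("recurrent drift:  ", cos(rec[0], rec[-1]))
```

Both schemes produce time-varying population patterns, which is why ramping feedforward models can mimic some representational dynamics; the difference tested in the paper is whether feedback of the network's own state is needed to capture the measured neural trajectories.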