We present a novel hierarchical, distributed model for unsupervised learning of invariant spatio-temporal features from video. Our approach builds on previous deep learning methods and uses the convolutional restricted Boltzmann machine (CRBM) as its basic processing unit. Our model, called the Space-Time Deep Belief Network (ST-DBN), alternates between aggregating spatial and temporal information, so that higher layers capture longer-range statistical dependencies in both space and time. Our experiments show that, compared with convolutional deep belief networks (CDBNs) applied on a per-frame basis, the ST-DBN achieves superior performance on both discriminative and generative tasks, including action recognition and video denoising. The ST-DBN also exhibits better feature invariance than CDBNs and can integrate information from both space and time to fill in missing data in video.
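The claim that alternating spatial and temporal aggregation lets higher layers capture longer-range dependencies can be made concrete with receptive-field arithmetic. The sketch below is illustrative only. It assumes each layer convolves with stride 1 along a single axis (space within a frame, or time across frames) and then pools over non-overlapping blocks along that same axis. It also treats the spatial extent as square. The `Layer` class, the `receptive_field` function, and all filter and pool sizes are hypothetical and are not values reported for the ST-DBN.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    """Hypothetical layer that convolves and pools along a single axis."""
    axis: str         # "space" (within each frame) or "time" (across frames)
    filter_size: int  # convolution filter width along that axis (stride 1 assumed)
    pool_size: int    # width of non-overlapping pooling blocks along that axis


def receptive_field(layers: List[Layer]) -> List[Tuple[int, int]]:
    """Return (spatial extent in pixels, temporal extent in frames) after each layer."""
    rf = {"space": 1, "time": 1}    # input region seen by one unit, per axis
    jump = {"space": 1, "time": 1}  # input distance between adjacent units, per axis
    history = []
    for layer in layers:
        a = layer.axis
        if a not in rf:
            raise ValueError(f"unknown axis: {a!r}")
        # A stride-1 convolution widens the field by (k - 1) * jump.
        rf[a] += (layer.filter_size - 1) * jump[a]
        # Non-overlapping pooling of width p widens it by (p - 1) * jump
        # and multiplies the spacing between adjacent units by p.
        rf[a] += (layer.pool_size - 1) * jump[a]
        jump[a] *= layer.pool_size
        history.append((rf["space"], rf["time"]))
    return history


# Hypothetical alternating stack: space, time, space, time.
stack = [
    Layer("space", filter_size=10, pool_size=3),
    Layer("time", filter_size=6, pool_size=3),
    Layer("space", filter_size=10, pool_size=3),
    Layer("time", filter_size=6, pool_size=3),
]

for depth, (s, t) in enumerate(receptive_field(stack), start=1):
    print(f"layer {depth}: {s}x{s} pixels, {t} frames")
```

In this sketch, each spatial layer enlarges only the spatial extent and each temporal layer enlarges only the temporal extent, so alternating them lets both grow with depth. A CDBN applied frame by frame would, by construction, keep the temporal extent at a single frame.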