Abstract

The visual systems of humans and animals respond very quickly to a wide range of actions. This fast and efficient process includes action detection, recognition, and classification. Action recognition has been studied extensively in machine learning and computer vision, where a key issue is to find efficient encoding units of natural actions. Current global encoding schemes depend heavily on video segmentation, while local encoding schemes lack descriptive power. In this work, we propose natural action structures (NAS). NAS are multi-size, multi-scale, spatial-temporal concatenations of local features and function as the basic encoding units of actions. Our approach includes patch sampling, independent component analysis, Gabor fitting and clustering, feature space mapping, and NAS construction. The approach makes two improvements over an earlier model. First, for sampling a large number of sequences of circular patches at multiple spatial-temporal scales, we developed a machine learning approach that selects interest points based on spatial-temporal features. Second, we developed another machine learning approach, with cross-validation, that selects informative NAS for each action. On several widely used datasets, the performance of this NAS-based model of action recognition was better than that of state-of-the-art models, including a biologically motivated system. In conclusion, the proposed NAS are a set of good encoding units of natural actions, and the NAS-based action recognition scheme provides important insights into natural action understanding.

Key Words: Action recognition, Natural Action Structures, Action encoding