Photo aesthetics assessment is challenging. Deep
convolutional neural network (ConvNet) methods have recently shown promising
results for aesthetics assessment. The performance of these deep ConvNet
methods, however, is often compromised by the constraint that the network
accepts only fixed-size inputs. To meet this requirement, input images must
be transformed via cropping, scaling, or padding, which often damages image
composition, reduces image resolution, or introduces distortion, thus
degrading the aesthetics of the original images. In this
paper, we present a composition-preserving deep ConvNet method that directly
learns aesthetics features from the original input images without any image
transformations. Specifically, our method adds an adaptive spatial pooling
layer on top of the regular convolution and pooling layers to directly handle
input images at their original sizes and aspect ratios (sketched below). To
allow for multi-scale
feature extraction, we develop the Multi-Net Adaptive Spatial Pooling
ConvNet architecture, which consists of multiple sub-networks with different
adaptive spatial pooling sizes, and leverage a scene-based aggregation layer
to effectively combine the sub-networks' predictions (also sketched below). Our
experiments on the large-scale aesthetics assessment benchmark (AVA)
demonstrate that our method significantly improves on the state-of-the-art
results in photo aesthetics assessment.
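
To make the adaptive spatial pooling layer concrete, here is a minimal
PyTorch sketch. The small backbone, the 4x4 pooling grid, and the use of
nn.AdaptiveAvgPool2d are our own illustrative assumptions rather than the
paper's exact architecture; the point is only that adaptive pooling maps a
feature map of any height and width to a fixed-size grid, so the network
needs no cropping, scaling, or padding.

# Illustrative sketch (assumed details, not the authors' exact network):
# adaptive spatial pooling lets one network handle arbitrary input sizes.
import torch
import torch.nn as nn

class AdaptivePoolAestheticsNet(nn.Module):
    def __init__(self, pool_size=4, num_classes=2):
        super().__init__()
        # Regular convolution and pooling layers; input size is unconstrained.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # Adaptive spatial pooling: any HxW feature map -> pool_size x pool_size.
        self.adaptive_pool = nn.AdaptiveAvgPool2d((pool_size, pool_size))
        self.classifier = nn.Linear(64 * pool_size * pool_size, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.adaptive_pool(x)  # fixed-size output regardless of input size
        return self.classifier(torch.flatten(x, 1))

# Images with different sizes and aspect ratios pass through the same network.
net = AdaptivePoolAestheticsNet()
for h, w in [(240, 320), (375, 500)]:
    print(net(torch.randn(1, 3, h, w)).shape)  # torch.Size([1, 2]) both times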
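
The multi-net architecture and its scene-based aggregation can be sketched
in the same spirit, reusing AdaptivePoolAestheticsNet from the sketch above.
The scene descriptor, the number of sub-networks, and the softmax weighting
are hypothetical stand-ins for the paper's aggregation layer: each sub-network
uses a different adaptive pooling size, and scene-conditioned weights combine
their predictions.

# Illustrative sketch (assumed details): sub-networks with different adaptive
# pooling sizes, combined by weights predicted from a scene descriptor.
class MultiNetAggregator(nn.Module):
    def __init__(self, pool_sizes=(2, 4, 6), num_scenes=8, num_classes=2):
        super().__init__()
        self.subnets = nn.ModuleList(
            AdaptivePoolAestheticsNet(pool_size=s, num_classes=num_classes)
            for s in pool_sizes
        )
        # Scene-based aggregation: scene posterior -> per-sub-network weights.
        self.scene_to_weights = nn.Linear(num_scenes, len(pool_sizes))

    def forward(self, image, scene_probs):
        preds = torch.stack([net(image) for net in self.subnets], dim=1)    # (B, S, C)
        weights = torch.softmax(self.scene_to_weights(scene_probs), dim=1)  # (B, S)
        return (weights.unsqueeze(-1) * preds).sum(dim=1)  # weighted prediction

agg = MultiNetAggregator()
image = torch.randn(1, 3, 300, 450)                # arbitrary size/aspect ratio
scene_probs = torch.softmax(torch.randn(1, 8), 1)  # e.g., from a scene classifier
print(agg(image, scene_probs).shape)               # torch.Size([1, 2])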