Data Compression Conference (DCC), pp.341, 4-7 April 2017

Computational analyses of the growing corpus of three-dimensional (3D)
structures of proteins have revealed a limited set of recurrent substructural
themes, termed super-secondary structures. Knowledge of super-secondary
structures is important for the study of protein evolution and for the
modeling of proteins with unknown structures. Characterizing a comprehensive
dictionary of these super-secondary structures has remained an open
computational challenge in protein structural studies. This paper presents an
unsupervised method for learning such a comprehensive dictionary using the
statistical framework of lossless compression on a database composed of
concise geometric representations of protein 3D folding patterns. The best
dictionary is defined as the one that yields the most compression of the
database. Here we describe the inference methodology and the statistical
models used to estimate the encoding lengths. An interactive website for
this dictionary is available at
[lcb.infotech.monash.edu.au/proteinConcepts/scop100/dictionary.html].
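
The selection criterion stated above (the best dictionary is the one that
compresses the database most) can be illustrated with a minimal sketch. It is
not the paper's method: the geometric representations and statistical models
are replaced here by an assumed toy setting in which each "structure" is a
string over a three-letter alphabet, dictionary entries are substrings, and
all code lengths come from uniform codes. Every name below (`parse`,
`two_part_length`, `database`, `candidates`) is hypothetical.

```python
import math

def parse(s, dictionary):
    """Greedily split s into tokens: the longest matching dictionary entry
    at each position, otherwise a single literal symbol.
    Assumes every dictionary entry is a non-empty string."""
    entries = sorted(dictionary, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(s):
        for e in entries:
            if s.startswith(e, i):
                tokens.append(e)
                i += len(e)
                break
        else:
            tokens.append(s[i])  # no entry matched: emit one literal
            i += 1
    return tokens

def two_part_length(database, dictionary, alphabet_size):
    """Total bits = bits to state the dictionary + bits to state the data
    given the dictionary (a crude two-part lossless encoding)."""
    bits_per_symbol = math.log2(alphabet_size)
    # Part 1: spell out each entry, charging one extra symbol as a terminator.
    model_bits = sum(len(e) + 1 for e in dictionary) * bits_per_symbol
    # Part 2: each token is one of (entries + literals), coded uniformly.
    bits_per_token = math.log2(len(dictionary) + alphabet_size)
    n_tokens = sum(len(parse(s, dictionary)) for s in database)
    return model_bits + n_tokens * bits_per_token

# Toy database over the assumed alphabet {H, E, L}.
database = ["HHHHEEEE", "HHHHLLEEEE", "EEEEHHHH"]
alphabet_size = 3

# Candidate dictionaries; the empty one is the "no dictionary" baseline.
candidates = [(), ("HH",), ("HHHH", "EEEE")]

best = min(candidates, key=lambda d: two_part_length(database, d, alphabet_size))
for d in candidates:
    print(d, round(two_part_length(database, d, alphabet_size), 2))
print("best:", best)
```

In this toy setting a dictionary is only worth keeping if the bits it saves
on the data exceed the bits needed to state it, which is why the short entry
`"HH"` loses to the empty baseline while the longer recurrent entries win.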