Bottom Line:
Here, we systematically discover several determinants of specificity (DoS) and experimentally validate three of them, named the αC1, αC3, and APE-7 residues. We demonstrate that DoS form sparse networks of non-conserved residues spanning distant regions. Our results reveal a likely role for inter-residue allostery in specificity and an evolutionary decoupling of kinase activity and specificity, which appear to be loaded on independent groups of residues.

fig2: Overview of the KINspect Algorithm. The KINspect workflow is designed to identify the specificity mask that best describes the importance of the different residues for specificity. Different combinations of contributions to specificity by different kinase domain residues are collected as specificity masks (top left), where each position within the kinase domain is given a score between 0 and 1. The specificity masks are initialized with random values and then subjected to a machine-learning procedure that selects for and optimizes the masks with the highest predictive power toward specificity. This procedure, known as a learning classifier system, is divided into three steps.
In step 1, for each specificity mask, the system loops over all query kinases and, using a kinase domain alignment, compares the query kinase at the sequence level to all other kinases (except those belonging to the same kinase family, which are excluded only at this stage to avoid over-fitting), generating a similarity vector. This vector is combined with the specificity mask, so that similarity in high-scoring positions of the mask is reinforced and similarity in low-scoring positions of the mask is silenced, effectively producing a mask-weighted similarity vector and sum score for each kinase. These values are subsequently used to integrate the different observed PSSMs into a combined predicted PSSM for the query kinase (as further explained by the equations and text in the Supplemental Experimental Procedures section and in Zhang et al., 2009).
In step 2, after a predicted PSSM has been generated for all the kinases in our set, fitness is computed as the median of all the differences between the predicted and the experimentally determined PSSMs for all the kinases obtained from the NetPhorest repository (Miller et al., 2008).
In step 3, the best-performing specificity masks are kept (“elite”), and new ones are generated by mutation (changing the value of a given position in the mask) and cross-over of the elite sequences (combining two segments of two other masks), as typically done in genetic algorithms. Once a new set of masks has been generated, the whole procedure (prediction, fitness evaluation, and generation of new masks) is repeated iteratively until fitness (defined as the median error between predicted and observed specificity profiles) cannot be improved any further (i.e., convergence is reached). Residues scoring high in the optimized specificity masks are considered candidate DoS. For further details on this procedure, please refer to Supplemental Experimental Procedures.
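
The three steps in the caption can be sketched in a few dozen lines of Python. This is only an illustrative sketch, not the KINspect implementation: the exact weighting and error equations are in the Supplemental Experimental Procedures and are not reproduced here. The sketch assumes a similarity-weighted average of PSSMs, a mean absolute difference as the per-kinase error, PSSMs flattened to lists of floats, and a fixed number of generations in place of a convergence test. All function names, parameters, and the data layout (`kinases` as a dict of `family`, `seq`, `pssm`) are hypothetical.

```python
import random
from statistics import median

def weighted_similarity(query, other, mask):
    """Step 1: sum of mask scores at aligned positions where residues match (gaps ignored)."""
    return sum(w for a, b, w in zip(query, other, mask) if a == b and a != "-")

def predict_pssm(query_id, kinases, mask):
    """Step 1: combine the observed PSSMs of kinases outside the query's family.
    Assumption: a similarity-weighted average stands in for the paper's equations."""
    q = kinases[query_id]
    weights, pssms = [], []
    for kid, k in kinases.items():
        if kid == query_id or k["family"] == q["family"]:
            continue  # same-family kinases are excluded to avoid over-fitting
        weights.append(weighted_similarity(q["seq"], k["seq"], mask))
        pssms.append(k["pssm"])
    if not pssms:
        raise ValueError("no kinases outside the query's family")
    total = sum(weights)
    if total == 0:  # no weighted similarity at all: fall back to a plain average
        weights, total = [1.0] * len(pssms), float(len(pssms))
    n = len(q["pssm"])
    return [sum(w * p[i] for w, p in zip(weights, pssms)) / total for i in range(n)]

def fitness(mask, kinases):
    """Step 2: median over kinases of the prediction error (lower is better).
    Assumption: error is the mean absolute difference between PSSM entries."""
    errors = []
    for kid, k in kinases.items():
        pred = predict_pssm(kid, kinases, mask)
        errors.append(sum(abs(a - b) for a, b in zip(pred, k["pssm"])) / len(pred))
    return median(errors)

def evolve(kinases, length, pop_size=20, n_elite=5, generations=30, seed=0):
    """Step 3: keep the elite masks, refill the population by cross-over and mutation.
    Assumption: a fixed generation count replaces the convergence criterion."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, kinases))
        elite = population[:n_elite]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, length)             # cross-over point
            child = a[:cut] + b[cut:]                  # combine two segments
            child[rng.randrange(length)] = rng.random()  # mutate one position
            children.append(child)
        population = elite + children
    return min(population, key=lambda m: fitness(m, kinases))

# Toy data (invented for illustration): position 0 tracks the PSSM, position 1 does not.
toy = {
    "K1": {"family": "A", "seq": "AC", "pssm": [1.0, 0.0]},
    "K2": {"family": "B", "seq": "AD", "pssm": [1.0, 0.0]},
    "K3": {"family": "C", "seq": "GC", "pssm": [0.0, 1.0]},
}
best_mask = evolve(toy, length=2, pop_size=6, n_elite=2, generations=5)
```

On the toy data, a mask that scores only position 0 predicts the profiles better than one that scores only position 1, which is the kind of signal the selection step rewards. Because the elite masks are carried over unchanged, the best fitness in the population can never get worse from one generation to the next.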

Mentions:
When investigating the relationship between kinases at the level of domain primary sequence similarity and at the level of substrate sequence motif similarity (using specificity profiles or PSSMs derived from Positional Scanning Peptide Library or PSPL experiments, see Experimental Procedures and Figure S1), it is apparent that, when the domain is considered in its entirety, no strong linear correlation exists between the two (Figure S1). We hypothesized that this lack of correlation could indicate that substrate specificity is not encoded by the domain as a whole. Instead, we hypothesized that a limited number of residues contribute to specificity, and that those that do contribute are likely to do so to different degrees. To capture this principle, we introduced the specificity mask as a fundamental entity in our approach. As depicted in Figures 1B and 2 (small box), a specificity mask is defined as a particular combination of contributions to specificity from the different residues in the kinase domain. For example, an extreme hypothesis where all residues within the kinase domain contribute equally to specificity would be represented by a mask in which all entries have the same score (e.g., 0.5). Conversely, a situation where a single residue, X, drives specificity would be represented by all entries scoring 0.0 except position X, which scores 1.0.
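
The two extreme hypotheses described above can be written out as concrete masks. The snippet below is a minimal sketch with hypothetical helper names and made-up four-residue sequences. It assumes a mask is simply a list of per-position scores between 0 and 1, and that similarity is the mask-weighted fraction of matching aligned positions, which is an illustrative choice rather than the paper's formula.

```python
def uniform_mask(length, score=0.5):
    """All residues contribute equally to specificity."""
    return [score] * length

def single_residue_mask(length, position):
    """Only one residue (0-based index `position`) drives specificity."""
    return [1.0 if i == position else 0.0 for i in range(length)]

def masked_identity(seq_a, seq_b, mask):
    """Mask-weighted fraction of matching positions between two aligned sequences."""
    matched = sum(w for a, b, w in zip(seq_a, seq_b, mask) if a == b)
    return matched / sum(mask)

# Two invented aligned fragments that differ only at the last position.
seq_a, seq_b = "ACDE", "ACDF"

whole_domain = masked_identity(seq_a, seq_b, uniform_mask(4))            # 0.75
last_only = masked_identity(seq_a, seq_b, single_residue_mask(4, 3))     # 0.0
first_only = masked_identity(seq_a, seq_b, single_residue_mask(4, 0))    # 1.0
```

Under the uniform mask the score reduces to plain sequence identity, whereas a single-residue mask makes the same pair look either identical or unrelated depending on which position is weighted. This illustrates how whole-domain similarity can fail to track specificity when only a few positions matter.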
