Authors:Hamid R. Marateb; Roya Kelishadi; Mohammad Reza Mohebian; Shaghayegh Haghjooy Javanmard; Amir Ali Tavallaei; Mohammad Hasan Tajadini; Motahar Heidari-Beni; Miguel Angel Mañanas; Mohammad Esmaeil Motlagh; Ramin Heshmat; Marjan MansourianAbstract: Publication date: Available online 2 March 2018 Source:Computational and Structural Biotechnology Journal Author(s): Hamid R. Marateb, Roya Kelishadi, Mohammad Reza Mohebian, Shaghayegh Haghjooy Javanmard, Amir Ali Tavallaei, Mohammad Hasan Tajadini, Motahar Heidari-Beni, Miguel Angel Mañanas, Mohammad Esmaeil Motlagh, Ramin Heshmat, Marjan Mansourian Dyslipidemia, a disorder of lipoprotein metabolism resulting in a high lipid profile, is an important modifiable risk factor for coronary heart diseases (CHDs). It is associated with more than four million deaths worldwide per year. Half of the children with dyslipidemia have hyperlipidemia during adulthood, and its prediction and screening are thus critical. We designed a new dyslipidemia diagnosis system. A sample of 725 subjects (age 14.66 ± 2.61 years; 48% male; dyslipidemia prevalence of 42%) was selected by multistage random cluster sampling in Iran. Single nucleotide polymorphisms (rs1801177, rs708272, rs320, rs328, rs2066718, rs2230808, rs5880, rs5128, rs2893157, rs662799, and Apolipoprotein-E2/E3/E4), as well as anthropometric and life-style attributes and family history of diseases, were analyzed. A framework for classifying mixed-type data in imbalanced datasets was proposed. It included internal feature mapping and selection, re-sampling, an optimized group method of data handling using convex and stochastic optimizations, a new cost function for imbalanced data, and an internal validation. Its performance was assessed using hold-out and 4-fold cross-validation. Four other classifiers, namely support vector machines, decision tree, multilayer perceptron neural network, and multiple logistic regression, were also used. 
The average sensitivity, specificity, precision and accuracy of the proposed system were 93%, 94%, 94% and 92%, respectively, in cross-validation. It significantly outperformed the other classifiers and also showed excellent agreement and high correlation with the gold standard. A non-invasive, economical version of the algorithm, suitable for low- and middle-income countries, was also implemented. It is thus a promising new tool for the prediction of dyslipidemia.
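For readers unfamiliar with the four performance measures quoted above, the following minimal Python sketch shows how they are conventionally derived from the counts of a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return the four standard measures from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives, respectively.
    """
    return {
        "sensitivity": tp / (tp + fn),            # recall of the positive class
        "specificity": tn / (tn + fp),            # recall of the negative class
        "precision": tp / (tp + fp),              # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only (not data from the paper).
metrics = binary_metrics(tp=90, fp=10, tn=95, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In an imbalanced dataset, accuracy alone can be misleading, which is one reason sensitivity and specificity are usually reported alongside it.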

Authors:Wipawadee Suwannapan; Pramote Chumnanpuen; Teerasak E-kobonAbstract: Publication date: Available online 2 March 2018 Source:Computational and Structural Biotechnology Journal Author(s): Wipawadee Suwannapan, Pramote Chumnanpuen, Teerasak E-kobon This study aimed to investigate the conserved FAD-binding region of the L-amino acid oxidase (LAAO) genes in twelve gastropod genera commonly found in Thailand compared to those in other organisms using molecular cloning, nucleotide sequencing and bioinformatics analysis. Genomic DNA of gastropods and other invertebrates was extracted and screened using primers specific to the conserved FAD-binding region of LAAO. The amplified 143-bp fragments were cloned and sequenced. The obtained nucleotide sequences of 21 samples were aligned and phylogenetically compared to the LAAO-conserved FAD-binding regions of 210 other organisms from the NCBI database. Translated amino acid sequences of these samples were used in phylogenetics and pattern analyses. The phylogenetic trees showed clear separation of the conserved regions in fungi, invertebrates, and vertebrates. Alignment of the conserved 47-amino-acid FAD-binding region of the LAAOs showed 150 unique sequences among the 231 samples and these patterns were different from those of other flavoproteins in the amine oxidase family. An amino acid pattern analysis of five sub-regions (bFAD, FAD, FAD-GG, GG, and aGG) within the FAD-binding sequence showed high variation at the FAD-GG sub-region. Pattern analysis of secondary structures indicated the aGG sub-region as having the highest structural variation. Cluster analysis of these patterns revealed two major clusters representing the mollusc clade and the vertebrate clade. 
Thus, molecular phylogenetics and pattern analyses of sequence and structural variations could reflect evolutionary relatedness and possible structural conservation to maintain specific function within the FAD-binding region of the LAAOs in gastropods compared to other organisms.

Authors:Aloysius Wong; Xuechen Tian; Chris Gehring; Claudius MarondedzeAbstract: Publication date: Available online 27 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Aloysius Wong, Xuechen Tian, Chris Gehring, Claudius Marondedze Plants are constantly exposed to environmental stresses and in part due to their sessile nature, they have evolved signal perception and adaptive strategies that are distinct from those of other eukaryotes. This is reflected at the cellular level where receptors and signalling molecules cannot be identified using standard homology-based searches querying with proteins from prokaryotes and other eukaryotes. One of the reasons for this is the complex domain architecture of receptor molecules. In order to discover hidden plant signalling molecules, we have developed a motif-based approach designed specifically for the identification of functional centers in plant molecules. This has made possible the discovery of novel components involved in signalling and stimulus-response pathways; the molecules include cyclic nucleotide cyclases, a nitric oxide sensor and a novel target for the hormone abscisic acid. Here, we describe the major steps of the method and illustrate it with recent and experimentally confirmed molecules as examples. We foresee that carefully curated search motifs supported by structural and bioinformatic assessments will uncover many more structural and functional aspects, particularly of signalling molecules.

Authors:Jessica D. Forbes; Natalie C. Knox; Christy-Lynn Peterson; Aleisha R. ReimerAbstract: Publication date: Available online 27 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Jessica D. Forbes, Natalie C. Knox, Christy-Lynn Peterson, Aleisha R. Reimer Clinical metagenomics (CMg) is the discipline that refers to the sequencing of all nucleic acid material present within a clinical specimen with the intent to recover clinically relevant microbial information. From a diagnostic perspective, next-generation sequencing (NGS) offers the ability to rapidly identify putative pathogens and predict their antimicrobial resistance profiles to optimize targeted treatment regimens. Since the introduction of metagenomics nearly a decade ago, numerous reports have described successful applications in an increasing variety of biological specimens, such as respiratory secretions, cerebrospinal fluid, stool, blood and tissue. Considerable advancements in sequencing and computational technologies in recent years have made CMg a promising tool in clinical microbiology laboratories. Moreover, costs per sample and turnaround time from specimen receipt to clinical management continue to decrease, making the prospect of CMg more feasible. Many difficulties, however, are associated with CMg and warrant further improvements such as the informatics infrastructure and analytical pipelines. Thus, the current review focuses on comprehensively assessing applications of CMg for diagnostic and subtyping purposes.

Authors:Sen Liang; Anjun Ma; Sen Yang; Yan Wang; Qin MaAbstract: Publication date: Available online 25 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Sen Liang, Anjun Ma, Sen Yang, Yan Wang, Qin Ma With the rapid accumulation of gene expression data from various technologies, e.g., microarray, RNA-sequencing (RNA-seq), and single-cell RNA-seq, it is necessary to carry out dimensionality reduction and feature (signature gene) selection to make sense of such high-dimensional data. These computational methods significantly facilitate further data analysis and interpretation, such as gene function enrichment analysis, cancer biomarker detection, and drug target identification in precision medicine. Although numerous methods have been developed for feature selection in bioinformatics, it is still a challenge to choose the appropriate methods for a specific problem and to find the most reasonable feature ranking. Meanwhile, paired gene expression data under a matched case-control design (MCCD) are becoming increasingly popular; they have often been used in multi-omics integration studies and may increase feature selection efficiency by offsetting similar distributions of confounding features. Feature selection methods specifically designed for paired data, termed matched-pairs feature selection (MPFS), however, have not matured in parallel. In this review, we compare the performance of 10 feature-selection methods (eight MPFS methods and two traditional unpaired methods) on two real datasets by applying three classification methods, and analyze the algorithmic complexity of these methods by running their programs. This review aims to summarize and comprehensively present MPFS in such a way that readers can easily understand its characteristics and gain guidance in selecting the appropriate methods for their analyses.
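To make the idea of matched-pairs feature selection concrete, here is a minimal Python sketch that ranks genes by a paired t-statistic computed on case-minus-control differences. The paired t-test is used here only as a familiar illustration; the abstract does not state which methods were compared, and the function names and expression values below are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import mean, stdev


def paired_t_statistic(case, control):
    """Paired t-statistic for one feature measured in matched case/control pairs."""
    diffs = [a - b for a, b in zip(case, control)]
    if len(diffs) < 2:
        raise ValueError("at least two matched pairs are required")
    d_mean = mean(diffs)
    d_sd = stdev(diffs)
    if d_sd == 0:
        # All differences identical: zero if no shift, otherwise unbounded.
        return 0.0 if d_mean == 0 else copysign(inf, d_mean)
    return d_mean / (d_sd / sqrt(len(diffs)))


def rank_features(case, control):
    """Rank features (dict: name -> values) by absolute paired t-statistic."""
    scores = {g: paired_t_statistic(case[g], control[g]) for g in case}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical expression values for four matched pairs (illustration only).
case = {"geneA": [5.1, 6.0, 5.8, 6.3], "geneB": [2.0, 2.1, 1.9, 2.2]}
control = {"geneA": [4.0, 4.9, 4.9, 5.0], "geneB": [2.1, 2.0, 2.0, 2.1]}
ranked = rank_features(case, control)
print(ranked)
```

Because each difference is taken within a matched pair, variation shared by both members of the pair cancels out, which is the efficiency gain the abstract attributes to the matched design.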

Authors:Dimitrios Zafeiris; Sergio Rutella; Graham Roy BallAbstract: Publication date: Available online 21 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Dimitrios Zafeiris, Sergio Rutella, Graham Roy Ball The field of machine learning has allowed researchers to generate and analyse vast amounts of data using a wide variety of methodologies. Artificial Neural Networks (ANN) are some of the most commonly used statistical models and have been successful in biomarker discovery studies in multiple disease types. This review seeks to explore and evaluate an integrated ANN pipeline for biomarker discovery and validation in Alzheimer's disease, the most common form of dementia worldwide with no proven cause and no available cure. The proposed pipeline consists of analysing public data with a categorical and continuous stepwise algorithm and further examination through network inference to predict gene interactions. This methodology can reliably generate novel markers and further examine known ones and can be used to guide future research in Alzheimer's disease.

Authors:Susanna K.P. Lau; Jade L.L. Teng; Tsz Ho Chiu; Elaine Chan; Alan K.L. Tsang; Gianni Panagiotou; Shao-Lun Zhai; Patrick C.Y. WooAbstract: Publication date: Available online 15 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Susanna K.P. Lau, Jade L.L. Teng, Tsz Ho Chiu, Elaine Chan, Alan K.L. Tsang, Gianni Panagiotou, Shao-Lun Zhai, Patrick C.Y. Woo In Hong Kong, cattle were traditionally raised by farmers as draft animals to plough rice fields. Due to urbanization in the 20th century, they were gradually abandoned and became wild cattle straying in suburban Hong Kong. Recently, these cattle were observed to have become omnivorous by eating leftover barbeque food waste in country parks. Microbiome analysis was performed on fecal samples of the omnivorous cattle using deep sequencing and the resulting microbiome was compared with that of traditional herbivorous cattle in Southern China. A more diverse gut microbiome was observed in the omnivorous cattle, suggesting that microbiota diversity increases as diet variation increases. At the genus level, the relative abundance of Anaeroplasma, Anaerovorax, Bacillus, Coprobacillus and Solibacillus significantly increased and that of Anaerofustis, Butyricimonas, Campylobacter, Coprococcus, Dehalobacterium, Phascolarctobacterium, rc4.4, RFN20, Succinivibrio and Turicibacter significantly decreased in the omnivorous group. The increase in the abundance of Bacillus and Anaerovorax is likely attributable to the inclusion of meat in the diet, while the decrease in relative abundance of Coprococcus, Butyricimonas, Succinivibrio, Campylobacter and Phascolarctobacterium may reflect the reduction in grass intake. Furthermore, an increased consumption of resistant starch likely resulted in the increase in abundance of Anaeroplasma. 
In conclusion, a significant change in the gut microbial community was observed in the omnivorous cattle, suggesting that diet may be one of the factors that may signal an adaptation response by the cattle to maintain feed efficiency as a consequence of the change in environment.
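As an illustration of how microbiome diversity can be quantified, the following minimal Python sketch computes the Shannon diversity index from per-taxon read counts. The abstract does not say which diversity measure the authors used, so the choice of the Shannon index is an assumption, and the counts are hypothetical.

```python
from math import log


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


# Hypothetical genus-level read counts (illustration only).
even_community = [25, 25, 25, 25]      # four equally abundant genera
skewed_community = [97, 1, 1, 1]       # one dominant genus

print(shannon_index(even_community))
print(shannon_index(skewed_community))
```

A community whose reads are spread evenly across taxa scores higher than one dominated by a single taxon, which is the sense in which a more varied diet could be associated with a "more diverse" microbiome.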

Authors:Pearl Chang; Moloya Gohain; Ming-Ren-Yen; Pao-Yang ChenAbstract: Publication date: Available online 15 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Pearl Chang, Moloya Gohain, Ming-Ren-Yen, Pao-Yang Chen The hierarchical organization of chromatin is known to associate with diverse cellular functions; however, the precise mechanisms and the 3D structure remain to be determined. With recent advances in high-throughput next generation sequencing (NGS) techniques, genome-wide profiling of chromatin structures is made possible. Here, we provide a comprehensive overview of NGS-based methods for profiling “higher-order” and “primary-order” chromatin structures from both experimental and computational aspects. Experimental requirements and considerations specific for each method were highlighted. For computational analysis, we summarized a common analysis strategy for both levels of chromatin assessment, focusing on the characteristic computing steps and the tools. The recently developed single-cell level techniques based on Hi-C and ATAC-seq present great potential to reveal cell-to-cell variability in chromosome architecture. A brief discussion on these methods in terms of experimental and data analysis features is included. We also touch upon the biological relevance of chromatin organization and how the combination with other techniques uncovers the underlying mechanisms. We conclude with a summary and our prospects on necessary improvements of currently available methods in order to advance understanding of chromatin hierarchy. Our review brings together the analyses of both higher- and primary-order chromatin structures, and serves as a roadmap when choosing appropriate experimental and computational methods for assessing chromatin hierarchy.

Authors:Daisuke Komura; Shumpei IshikawaAbstract: Publication date: Available online 9 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Daisuke Komura, Shumpei Ishikawa Abundant accumulation of digital histopathological images has led to the increased demand for their analysis, such as computer-aided diagnosis using machine learning techniques. However, digital pathological images and related tasks have some issues to be considered. In this mini-review, we introduce the application of digital pathological image analysis using machine learning algorithms, address some problems specific to such analysis, and propose possible solutions.

Authors:Chang XuAbstract: Publication date: Available online 6 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Chang Xu Detection of somatic mutations holds great potential in cancer treatment and has been a very active research field in the past few years, especially since the breakthrough of next-generation sequencing technology. A collection of variant calling pipelines has been developed with different underlying models, filters, input data requirements, and targeted applications. This review aims to enumerate these unique features of the state-of-the-art variant callers, in the hope of providing a practical guide for selecting the appropriate pipeline for specific applications. We will focus on the detection of somatic single nucleotide variants, ranging from traditional variant callers based on whole genome or exome sequencing of paired tumor-normal samples to recent low-frequency variant callers designed for targeted sequencing protocols with unique molecular identifiers. The variant callers have been extensively benchmarked, with inconsistent performance across these studies. We will review the reference materials, datasets, and performance metrics that have been used in the benchmarking studies. In the end, we will discuss emerging trends and future directions of the variant calling algorithms.

Authors:Oliver Buß; Jens Rudat; Katrin OchsenreitherAbstract: Publication date: Available online 3 February 2018 Source:Computational and Structural Biotechnology Journal Author(s): Oliver Buß, Jens Rudat, Katrin Ochsenreither Improving protein stability is an important goal for basic research as well as for clinical and industrial applications, but no commonly accepted and widely used strategy for efficient engineering is known. Besides random approaches like error-prone PCR or physical techniques to stabilize proteins, e.g. by immobilization, in silico approaches are gaining more attention as a means of applying target-oriented mutagenesis. In this review, different algorithms for the prediction of beneficial mutation sites to enhance protein stability are summarized and the advantages and disadvantages of FoldX are highlighted. The question of whether the prediction of mutation sites by the algorithm FoldX is more accurate than random-based approaches is addressed.

Authors:Sailen BarikAbstract: Publication date: Available online 30 December 2017 Source:Computational and Structural Biotechnology Journal Author(s): Sailen Barik The two classical immunophilin families, found in essentially all living cells, are cyclophilin (CYN) and FK506-binding protein (FKBP). We previously reported a novel class of immunophilins that are natural chimeras of these two, which we named dual-family immunophilin (DFI). The DFIs were found in either of two conformations: CYN-linker-FKBP (CFBP) or FKBP-3TPR-CYN (FCBP). While the 3TPR domain can serve as a flexible linker between the FKBP and CYN modules in the FCBP-type DFI, the linker sequences in the CFBP-type DFIs are relatively short, diverse in sequence, and contain no discernible motif or signature. Here, I present several lines of computational evidence that, regardless of their primary structure, these CFBP linkers are intrinsically disordered. This report provides the first molecular foundation for the model that the CFBP linker acts as an unstructured, flexible loop, allowing the two flanking chaperone modules to function independently while linked in cis, likely to assist in the folding of multisubunit client complexes.

Authors:Jerzy Krupinski; Caty Carrera; Elena Muiño; Nuria Torres; Raid Al-Baradie; Natalia Cullell; Israel Fernandez-CadenasAbstract: Publication date: Available online 9 December 2017 Source:Computational and Structural Biotechnology Journal Author(s): Jerzy Krupinski, Caty Carrera, Elena Muiño, Nuria Torres, Raid Al-Baradie, Natalia Cullell, Israel Fernandez-Cadenas Epigenetic modifications are heritable and modifiable factors that do not alter the DNA sequence. These epigenetic factors include DNA methylation, acetylation of histones and non-coding RNAs. Epigenetic factors have mainly been associated with cancer but also with other diseases and conditions such as diabetes or obesity. In addition, epigenetic modifications could play an important role in cardiovascular diseases, including stroke. We review the latest advances in stroke epigenetics, focusing on DNA methylation studies and the future perspectives in this field.

Authors:Mary Q. Yang; Dan Li; William Yang; Yifan Zhang; Jun Liu; Weida TongAbstract: Publication date: Available online 10 October 2017 Source:Computational and Structural Biotechnology Journal Author(s): Mary Q. Yang, Dan Li, William Yang, Yifan Zhang, Jun Liu, Weida Tong Clear cell renal cell carcinoma (ccRCC) is the most common and most aggressive form of renal cell cancer (RCC). The incidence of RCC has increased steadily in recent years. The pathogenesis of renal cell cancer remains poorly understood. Many of the tumor suppressor genes, oncogenes, and dysregulated pathways in ccRCC need to be revealed for improvement of the overall clinical outlook of the disease. Here, we developed a systems biology approach to prioritize the somatic mutated genes that lead to dysregulation of pathways in ccRCC. The method integrated multi-layer information to infer causative mutations and disease genes. First, we identified differential gene modules in ccRCC by coupling transcriptome and protein-protein interactions. Each of these modules consisted of interacting genes that were involved in similar biological processes and their combined expression alterations were significantly associated with disease type. Then, subsequent gene module-based eQTL analysis revealed somatic mutated genes that had driven the expression alterations of differential gene modules. Our study yielded a list of candidate disease genes, including several known ccRCC causative genes such as BAP1 and PBRM1, as well as novel genes such as NOD2, RRM1, CSRNP1, SLC4A2, TTLL1 and CNTN1. The differential gene modules and their driver genes revealed by our study provided a new perspective for understanding the molecular mechanisms underlying the disease. Moreover, we validated the results in an independent ccRCC patient dataset. Our study provided a new method for prioritizing disease genes and pathways.

Authors:Pía Francesca Loren Reyes; Tom Michoel; Anagha Joshi; Guillaume DevaillyAbstract: Publication date: Available online 26 August 2017 Source:Computational and Structural Biotechnology Journal Author(s): Pía Francesca Loren Reyes, Tom Michoel, Anagha Joshi, Guillaume Devailly Functional annotation transfer across multi-gene family orthologs can lead to functional misannotations. We hypothesised that co-expression networks will help predict functional orthologs amongst complex homologous gene families. To explore the use of transcriptomic data available in the public domain to identify functionally equivalent genes among all predicted orthologs, we collected genome-wide expression data in mouse and rat liver from over 1500 experiments with varied treatments. We used a hyper-graph clustering method to identify clusters of orthologous genes co-expressed in both mouse and rat. We validated these clusters by analysing expression profiles in each species separately, and demonstrating a high overlap. We then focused on genes in 18 homology groups with one-to-many or many-to-many relationships between the two species, to discriminate between functionally equivalent and non-equivalent orthologs. Finally, we further applied our method by collecting heart transcriptomic data (over 1400 experiments) in rat and mouse to validate the method in an independent tissue.

Authors:Andrian Yang; Michael Troup; Joshua W.K. HoAbstract: Publication date: Available online 20 July 2017 Source:Computational and Structural Biotechnology Journal Author(s): Andrian Yang, Michael Troup, Joshua W.K. Ho This review examines two important aspects that are central to modern big data bioinformatics analysis – software scalability and validity. We argue that not only are the issues of scalability and validation common to all big data bioinformatics analyses, they can be tackled by conceptually related methodological approaches, namely divide-and-conquer (scalability) and multiple executions (validation). Scalability is defined as the ability for a program to scale based on workload. It has always been an important consideration when developing bioinformatics algorithms and programs. Nonetheless the surge of volume and variety of biological and biomedical data has posed new challenges. We discuss how modern cloud computing and big data programming frameworks such as MapReduce and Spark are being used to effectively implement divide-and-conquer in a distributed computing environment. Validation of software is another important issue in big data bioinformatics that is often ignored. Software validation is the process of determining whether the program under test fulfils the task for which it was designed. Determining the correctness of the computational output of big data bioinformatics software is especially difficult due to the large input space and complex algorithms involved. We discuss how state-of-the-art software testing techniques that are based on the idea of multiple executions, such as metamorphic testing, can be used to implement an effective bioinformatics quality assurance strategy. We hope this review will raise awareness of these critical issues in bioinformatics.
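The idea of validating software through multiple executions can be illustrated with a small metamorphic test. In the Python sketch below, the program under test is a toy GC-content function, and the metamorphic relation is that a sequence and its reverse complement must have the same GC content. Both functions and the random test inputs are hypothetical illustrations and do not come from the review.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(seq):
    """Program under test: fraction of G and C bases in a DNA sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq):
    """Input transformation used by the metamorphic relation."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def satisfies_relation(seq):
    """Metamorphic relation: GC content is unchanged by reverse-complementing.

    No expected output value is needed; only the relation between
    two executions of the program is checked.
    """
    return gc_content(seq) == gc_content(reverse_complement(seq))


# Run the program on many random source inputs and their follow-up inputs.
rng = random.Random(0)
sequences = ["".join(rng.choice("ACGT") for _ in range(50)) for _ in range(100)]
violations = [s for s in sequences if not satisfies_relation(s)]
print(f"{len(violations)} violations in {len(sequences)} test cases")
```

The appeal of this style of testing for large-scale analyses is that it sidesteps the need for a known correct answer for each input: a violated relation signals a fault even when the true output is unknown.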

Authors:Raunaq Malhotra; Manjari Jha; Mary Poss; Raj AcharyaAbstract: Publication date: Available online 19 July 2017 Source:Computational and Structural Biotechnology Journal Author(s): Raunaq Malhotra, Manjari Jha, Mary Poss, Raj Acharya We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method utilizes counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes has 4 to 500 times fewer false-positive k-mer predictions compared to other methods, which is essential for accurate estimation of viral population diversity and their de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false-positive SNPs, while detecting a higher number of rare variants compared to other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.
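The input to such a classifier is a set of k-mer counts at several values of k. The minimal Python sketch below shows only that counting step, under the assumption that reads are plain strings; it is not the MultiRes implementation, and the function names and example reads are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multi_k_counts(reads, ks):
    """Return a dict mapping each k to its k-mer count table."""
    return {k: kmer_counts(reads, k) for k in ks}


# Hypothetical reads for illustration only.
reads = ["ACGTAC", "CGTACG"]
counts = multi_k_counts(reads, ks=(3, 4))
print(counts[3])
print(counts[4])
```

Counts at several k values give a classifier more than one view of the same position: a k-mer that is rare at one length but well supported at a shorter length is a different kind of evidence from one that is rare at every length.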

Authors:John R. Stevens; Todd R. Jones; Michael Lefevre; Balasubramanian Ganesan; Bart C. WeimerAbstract: Publication date: Available online 6 July 2017 Source:Computational and Structural Biotechnology Journal Author(s): John R. Stevens, Todd R. Jones, Michael Lefevre, Balasubramanian Ganesan, Bart C. Weimer Microbial community analysis experiments to assess the effect of a treatment intervention (or environmental change) on the relative abundance levels of multiple related microbial species (or operational taxonomic units) simultaneously using high throughput genomics are becoming increasingly common. Within the framework of the evolutionary phylogeny of all species considered in the experiment, this translates to a statistical need to identify the phylogenetic branches that exhibit a significant consensus response (in terms of operational taxonomic unit abundance) to the intervention. We present the R software package SigTree, a collection of flexible tools that make use of meta-analysis methods and regular expressions to identify and visualize significantly responsive branches in a phylogenetic tree, while appropriately adjusting for multiple comparisons.
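To show what combining evidence across the tips of a branch can look like, here is a minimal Python sketch of Stouffer's method, one standard meta-analysis technique for merging one-sided p-values into a single value. The abstract does not specify which meta-analysis methods SigTree applies, and this is not the package's API (SigTree is an R package); the function name and the p-values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(p_values):
    """Combine one-sided p-values (each strictly between 0 and 1) with Stouffer's method."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    std_normal = NormalDist()
    # Convert each p-value to a z-score, then average on the z scale.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - std_normal.cdf(combined_z)


# Hypothetical p-values for the taxa under one branch (illustration only).
branch_p_values = [0.01, 0.04, 0.03]
print(stouffer_combined_p(branch_p_values))
```

Several individually modest p-values pointing in the same direction yield a smaller combined p-value, which is the sense in which a branch can show a significant consensus response. A multiple-comparison adjustment across branches would still be needed afterwards.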

Authors:Anne GroveAbstract: Publication date: Available online 16 June 2017 Source:Computational and Structural Biotechnology Journal Author(s): Anne Grove Bacteria have evolved sophisticated mechanisms for regulation of metabolic pathways. Such regulatory circuits ensure that anabolic pathways remain repressed unless final products are in short supply and that catabolic enzymes are not produced in absence of their substrates. The precisely tuned gene activity underlying such circuits is in the purview of transcription factors that may bind pathway intermediates, which in turn modulate transcription factor function and therefore gene expression. This review focuses on the role of ligand-responsive MarR family transcription factors in controlling expression of genes encoding metabolic enzymes and the mechanisms by which such control is exerted. Prospects for exploiting these transcription factors for optimization of gene expression for metabolic engineering and for the development of biosensors are considered.

Authors:Jan Zaucha; Jonathan HeddleAbstract: Publication date: Available online 30 May 2017 Source:Computational and Structural Biotechnology Journal Author(s): Jan Zaucha, Jonathan Heddle Biological molecules, like organisms themselves, are subject to genetic drift and may even become “extinct”. Molecules that are no longer extant in living systems are of high interest for several reasons including insight into how existing life forms evolved and the possibility that they may have new and useful properties no longer available in currently functioning molecules. Predicting the sequence/structure of such molecules and synthesizing them so that their properties can be tested is the basis of “molecular resurrection” and may lead not only to a deeper understanding of evolution, but to production of artificial proteins with novel properties and even to insight into how life itself began.

Authors:Amornpan Klanchui; Supapon Cheevadhanarak; Peerada Prommeenate; Asawin MeechaiAbstract: Publication date: Available online 25 May 2017 Source:Computational and Structural Biotechnology Journal Author(s): Amornpan Klanchui, Supapon Cheevadhanarak, Peerada Prommeenate, Asawin Meechai In cyanobacteria, the CO2-concentrating mechanism (CCM) is a vital biological process that provides effective photosynthetic CO2 fixation by elevating the CO2 level near the active site of Rubisco. This process enables the adaptation of cyanobacteria to various habitats, particularly CO2-limited environments. Although the CCMs of freshwater and marine cyanobacteria are well studied, there is limited information on the CCM of cyanobacteria living in alkaline environments. Here, we aimed to explore the molecular components of the CCM in 12 alkaliphilic cyanobacteria through genome-based analysis. These cyanobacteria included 6 moderate alkaliphiles (Pleurocapsa sp. PCC 7327, Synechococcus spp., Cyanobacterium spp., Spirulina subsalsa PCC 9445) and 6 strong alkaliphiles (i.e. Arthrospira spp.). The results showed that both groups belong to β-cyanobacteria based on β-carboxysome shell proteins with form 1B of Rubisco. They also contained the standard ccmKLMNO gene cluster, which is essential for β-carboxysome formation. Most strains did not have the high-affinity Na+/HCO3− symporter SbtA and the medium-affinity ATP-dependent HCO3− transporter BCT1. Specifically, all strong alkaliphiles appeared to lack BCT1. Besides the transport systems, the carboxysomal β-CA, CcaA, was absent in all alkaliphiles, except for three moderate alkaliphiles: Pleurocapsa sp. PCC 7327, Cyanobacterium stanieri PCC 7202, and Spirulina subsalsa PCC 9445. 
Furthermore, comparative analysis of the CCM components among freshwater, marine, and alkaliphilic β-cyanobacteria revealed that the basic molecular components of the CCM in the alkaliphilic cyanobacteria seemed to share more degrees of similarity with freshwater than marine cyanobacteria. These findings provide a relationship between the CCM components of cyanobacteria and their habitats.

Authors:Toshihiko Sugiki; Naohiro Kobayashi; Toshimichi FujiwaraAbstract: Publication date: Available online 13 April 2017 Source:Computational and Structural Biotechnology Journal Author(s): Toshihiko Sugiki, Naohiro Kobayashi, Toshimichi Fujiwara Nuclear magnetic resonance (NMR) spectroscopy is a powerful technique for structural studies of chemical compounds and biomolecules such as DNA and proteins. Since the NMR signal sensitively reflects the chemical environment and the dynamics of a nuclear spin, NMR experiments provide a wealth of structural and dynamic information about the molecule of interest at atomic resolution. In general, structural biology studies using NMR spectroscopy still require a reasonable understanding of the theory behind the technique and experience in how to record NMR data. Owing to the remarkable progress in the past decade, we can easily access suitable and popular analytical resources for NMR structure determination of proteins with high accuracy. Here, we describe the practical aspects, workflow and key points of modern NMR techniques used for solution structure determination of proteins. This review should aid NMR specialists aiming to develop new methods that accelerate the structure determination process, and open avenues for non-specialists and life scientists interested in using NMR spectroscopy to solve protein structures.

Authors:Francesco CardarelliAbstract: Publication date: Available online 4 April 2017 Source:Computational and Structural Biotechnology Journal Author(s): Francesco Cardarelli Molecules are continuously shuttling across the nuclear envelope barrier that separates the nucleus from the cytoplasm. Instead of being just a barrier to diffusion, the nuclear envelope is rather a complex filter that provides eukaryotes with an elaborate spatiotemporal regulation of fundamental molecular processes, such as gene expression and protein translation. Given the highly dynamic nature of nucleocytoplasmic transport, during the past few decades large efforts were devoted to the development and application of time resolved, fluorescence-based, biophysical methods to capture the details of molecular motion across the nuclear envelope. These methods are here divided into three major classes, according to the differences in the way they report on the molecular process of nucleocytoplasmic transport. In detail, the first class encompasses those methods based on the perturbation of the fluorescence signal, also known as ensemble-averaging methods, which average the behavior of many molecules (across many pores). The second class comprises those methods based on the localization of single fluorescently-labelled molecules and tracking of their position in space and time, potentially across single pores. Finally, the third class encompasses methods based on the statistical analysis of spontaneous fluorescence fluctuations out of the equilibrium or stationary state of the system. In this case, the behavior of single molecules is probed in presence of many similarly-labelled molecules, without dwelling on any of them. Here these three classes, with their respective pros and cons as well as their main applications to nucleocytoplasmic shuttling will be briefly reviewed and discussed.

Authors:Martina Audagnotto; Matteo Dal PeraroAbstract: Publication date: Available online 31 March 2017 Source:Computational and Structural Biotechnology Journal Author(s): Martina Audagnotto, Matteo Dal Peraro Post-translational modifications (PTMs) occur in almost all proteins and play an important role in numerous biological processes by significantly affecting protein structure and dynamics. Several computational approaches have been developed to study PTMs (e.g., phosphorylation, sumoylation or palmitoylation), showing the importance of these techniques in predicting modified sites that can be further investigated with experimental approaches. In this review, we summarize some of the available online platforms and their contribution to the study of PTMs. Moreover, we discuss the emerging capabilities of molecular modeling and simulation that are able to complement these bioinformatics methods, providing deeper molecular insights into the biological function of post-translationally modified proteins.

Authors:Stefano Rensi; Russ B. AltmanAbstract: Publication date: Available online 24 March 2017 Source:Computational and Structural Biotechnology Journal Author(s): Stefano Rensi, Russ B. Altman Studying analog series to find structural transformations that enhance the activity and ADME properties of lead compounds is an important part of drug development. Matched molecular pair (MMP) search is a powerful tool for analog analysis that imitates researchers' ability to select pairs of compounds that differ only by small well-defined transformations. Abstraction is a challenge for existing MMP search algorithms, which can result in the omission of relevant, inexact MMPs, and inclusion of irrelevant, contextually dissimilar MMPs. In this work, we present a new method for MMP search that returns approximate results and enables flexible control over abstraction of contextual information. We illustrate the concepts and mechanics of our method with a series of exemplar MMP queries, and then benchmark search accuracy using MMPs found by fragment indexing. We show that we can search for MMPs in a context-dependent manner, and accurately approximate context-independent, fragment-index-based MMP search over a range of fingerprint and dataset conditions. Our method can be used to search for pairwise correspondences among analog sets and bolster MMP datasets where data are missing or incomplete.

Authors:Seanna Hewitt; Benjamin Kilian; Ramyya Hari; Tyson Koepke; Richard Sharpe; Amit DhingraAbstract: Publication date: Available online 18 March 2017 Source:Computational and Structural Biotechnology Journal Author(s): Seanna Hewitt, Benjamin Kilian, Ramyya Hari, Tyson Koepke, Richard Sharpe, Amit Dhingra Identification of genetic polymorphisms and subsequent development of molecular markers is important for marker-assisted breeding of superior cultivars of economically important species. Sweet cherry (Prunus avium L.) is an economically important non-climacteric tree fruit crop in the Rosaceae family and has undergone a genetic bottleneck due to breeding, resulting in limited genetic diversity in the germplasm that is utilized for breeding new cultivars. Therefore, it is critical to recognize the best platforms for identifying genome-wide polymorphisms that can help identify, and consequently preserve, the diversity in a genetically constrained species. Five closely related genotypes of sweet cherry were used to evaluate a gel-based approach (TRAP), reduced representation sequencing (TRAPseq), a 6k cherry SNP array, and whole genome sequencing (WGS) for the identification of genome-wide polymorphisms. All platforms facilitated detection of polymorphisms among the genotypes, with variable efficiency. In assessing multiple SNP detection platforms, this study has demonstrated that a combination of appropriate approaches is necessary for efficient polymorphism identification, especially between closely related cultivars of a species. The information generated in this study provides a valuable resource for future genetic and genomic studies in sweet cherry, and the insights gained from the evaluation of multiple approaches can be utilized for other closely related species with limited genetic diversity in the breeding germplasm.

Authors:Erling Mellerup; Gert Lykke MøllerAbstract: Publication date: Available online 10 March 2017 Source:Computational and Structural Biotechnology Journal Author(s): Erling Mellerup, Gert Lykke Møller In studies of polygenic disorders, scanning the genetic variants can be used to identify variant combinations. Combinations that are found exclusively in patients can be separated from those occurring in control persons. Statistical analyses can be performed to determine whether the combinations that occur exclusively among patients are significantly associated with the investigated disorder. This research strategy has been applied to material from various polygenic disorders, identifying clusters of patient-specific genetic variant combinations that are significantly associated with the investigated disorders. Combinations from these clusters are found in the genomes of up to 55% of investigated patients, and are not present in the genomes of any control persons.

Authors:Alexey V. Sulimov; Dmitry A. Zheltkov; Igor V. Oferkin; Danil C. Kutov; Ekaterina V. Katkova; Eugene E. Tyrtyshnikov; Vladimir B. SulimovAbstract: Publication date: Available online 3 March 2017 Source:Computational and Structural Biotechnology Journal Author(s): Alexey V. Sulimov, Dmitry A. Zheltkov, Igor V. Oferkin, Danil C. Kutov, Ekaterina V. Katkova, Eugene E. Tyrtyshnikov, Vladimir B. Sulimov We present a novel docking algorithm based on the Tensor Train decomposition and the TT-Cross global optimization, applied to the docking problem with a flexible ligand and moveable protein atoms. The energy of the protein-ligand complex is calculated within the MMFF94 force field in vacuum. The conformation space of the system coordinates is formed by translations and rotations of the ligand as a whole, by the ligand torsions, and by Cartesian coordinates of selected target-protein atoms. Mobility of protein and ligand atoms is taken into account in the docking process simultaneously and equally. The algorithm is implemented in the novel parallel docking program SOL-P, and results of its performance for a set of 30 protein-ligand complexes are presented. Docking positioning quality is investigated as a function of the docking algorithm parameters as well as the number of moveable protein atoms, and it is shown that mobility of protein atoms improves docking positioning quality. The program is able to dock a flexible ligand into the active site of the target protein with several dozen moveable protein atoms: up to 160 degrees of freedom. For example, the docking time of the native ligand (7 torsions) into the target protein (PDB ID 3CEN) with 26 moveable protein atoms is 1212 CPU hours on the Lomonosov supercomputer.

Authors:Ashok Palaniappan; Eric JakobssonAbstract: Publication date: Available online 22 February 2017 Source:Computational and Structural Biotechnology Journal Author(s): Ashok Palaniappan, Eric Jakobsson Residue conservation is a common observation in alignments of protein families, underscoring positions important in protein structure and function. Though many methods measure the level of conservation of particular residue positions, we currently do not have a way to study spatial oscillations occurring in protein conservation patterns. It is known that hydrophobicity shows spatial oscillations in proteins, which are characterized by computing the hydrophobic moment of the protein domains. Here, we advance the study of moments of conservation of protein families to determine whether there might exist spatial asymmetry in the conservation patterns of regular secondary structures. Analogous to the hydrophobic moment, the conservation moment is defined as the modulus of the Fourier transform of the conservation function of an alignment of related proteins, where the conservation function is the vector of conservation values at each column of the alignment. The profile of the conservation moment is useful in ascertaining any periodicity of conservation, which might correlate with the period of the secondary structure. To demonstrate the concept, conservation in the family of potassium ion channel proteins was analyzed using moments. It was shown that the pore helix of the potassium channel showed oscillations in the moment of conservation matching the period of the α-helix. This implied that one side of the pore helix was evolutionarily conserved in contrast to its opposite side. In addition, the method of conservation moments correctly identified the disposition of the voltage sensor of voltage-gated potassium channels to form a 3₁₀ helix in the membrane.
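The definition above (modulus of the Fourier transform of the per-column conservation vector) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the toy conservation values, and the choice to subtract the mean before transforming (so that the constant offset does not dominate at small angles) are assumptions made here.

```python
import cmath
import math


def conservation_moment(conservation, angle_deg):
    """Modulus of the Fourier sum of conservation values at one angle per residue.

    Assumption: the mean is subtracted first so the constant offset
    does not swamp the periodic component.
    """
    mean = sum(conservation) / len(conservation)
    omega = math.radians(angle_deg)
    total = sum((c - mean) * cmath.exp(1j * k * omega)
                for k, c in enumerate(conservation))
    return abs(total)


def moment_profile(conservation, max_angle=180):
    """Conservation moment evaluated at every whole-degree angle from 0 to max_angle."""
    return {a: conservation_moment(conservation, a) for a in range(max_angle + 1)}


# Hypothetical conservation vector oscillating every 3.6 columns
# (100 degrees per residue, the period of an ideal alpha-helix).
toy = [0.5 + 0.5 * math.cos(math.radians(100 * k)) for k in range(18)]
profile = moment_profile(toy)
peak_angle = max(profile, key=profile.get)
print(peak_angle)  # expected to lie close to 100 degrees
```

A peak near 100 degrees per residue would be read as α-helical periodicity, and a peak near 120 degrees as the period of a 3₁₀ helix.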

Authors:Rosanna W. Peeling; Debrah I. Boeras; John NkengasongAbstract: Publication date: Available online 21 February 2017 Source:Computational and Structural Biotechnology Journal Author(s): Rosanna W. Peeling, Debrah I. Boeras, John Nkengasong Neglected Tropical Diseases (NTDs) affect an estimated 1 billion people in 149 countries. The World Health Organization (WHO) prioritised 17 NTDs for control and elimination by 2020 and defined a Road Map to help countries reach these goals. Improved diagnostics for NTDs are essential for guiding treatment strategies at different thresholds of control, interruption of transmission, elimination and post-elimination surveillance. While substantial progress has been made in the last decade with chemotherapy, the same cannot be said of diagnostics, largely due to the perceived lack of a commercially viable market for NTD diagnostics. New sample-in, answer-out nucleic acid amplification technologies that can be performed at the point of care offer improved performance over current technologies and the potential to test for multiple pathogens using a single specimen. Finding commonalities for different NTDs in terms of geographic overlap, sentinel populations and treatment strategy will allow NTD programs to leverage these innovations to build cost-effective multiplex surveillance platforms. Connectivity solutions linking data from diagnostic laboratories and point-of-care (POC) test readers/devices provide opportunities for automated surveillance systems to make health systems more efficient, improving patient outcomes and assessing the impact of interventions in real time. New models of public-private product development partnerships are critical in leveraging diagnostic innovation in other priority areas for better diagnosis, control and elimination of NTDs.

Authors:Kathrin Ungru; Xiaoyi JiangAbstract: Publication date: Available online 16 February 2017 Source:Computational and Structural Biotechnology Journal Author(s): Kathrin Ungru, Xiaoyi Jiang Many applications in biomedical imaging require automatic detection of lines, contours, or boundaries of bones, organs, vessels, and cells. The aim is to support expert decisions in interactive applications or to include such detection as part of a processing pipeline for automatic image analysis. Biomedical images often suffer from noisy data and fuzzy edges. Therefore, there is a need for robust methods for contour and line detection. Dynamic programming is a popular technique that satisfies these requirements in many ways. This work gives a brief overview of approaches and applications that utilize dynamic programming to solve problems in the challenging field of biomedical imaging.
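As a rough illustration of how dynamic programming can trace a line through a noisy image, the sketch below finds the cheapest left-to-right path through a small cost grid, where low cost stands for strong edge evidence. It is a generic textbook formulation written for this listing, not a method taken from the review; the function name, the cost grid, and the rule that the path may move at most one row per column are assumptions.

```python
def min_cost_path(cost):
    """Return (total_cost, rows) of the cheapest left-to-right path.

    cost[r][c] is the cost of passing through row r at column c;
    rows[c] is the row chosen at column c. The path may shift by at
    most one row between neighbouring columns (assumed smoothness rule).
    """
    n_rows, n_cols = len(cost), len(cost[0])
    acc = [[0] * n_cols for _ in range(n_rows)]   # accumulated cost
    back = [[0] * n_cols for _ in range(n_rows)]  # best predecessor row
    for r in range(n_rows):
        acc[r][0] = cost[r][0]
    for c in range(1, n_cols):
        for r in range(n_rows):
            candidates = [p for p in (r - 1, r, r + 1) if 0 <= p < n_rows]
            best = min(candidates, key=lambda p: acc[p][c - 1])
            acc[r][c] = cost[r][c] + acc[best][c - 1]
            back[r][c] = best
    # Backtrack from the cheapest end point in the last column.
    end = min(range(n_rows), key=lambda r: acc[r][n_cols - 1])
    rows = [end]
    for c in range(n_cols - 1, 0, -1):
        rows.append(back[rows[-1]][c])
    rows.reverse()
    return acc[end][n_cols - 1], rows


# Hypothetical 3x4 cost grid: the cheap cells form a slightly wavy line.
grid = [
    [9, 9, 1, 9],
    [1, 1, 9, 1],
    [9, 9, 9, 9],
]
total, path = min_cost_path(grid)
print(total, path)  # 4 [1, 1, 0, 1]
```

Because each column only looks back at three candidate rows, the run time grows linearly with the number of pixels, which is one reason the technique suits interactive use.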

Authors:Seyed Morteza Najibi; Mehdi Maadooliat; Lan Zhou; Jianhua Z. Huang; Xin GaoAbstract: Publication date: Available online 8 February 2017 Source:Computational and Structural Biotechnology Journal Author(s): Seyed Morteza Najibi, Mehdi Maadooliat, Lan Zhou, Jianhua Z. Huang, Xin Gao Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric spline which is more efficient compared to existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective to two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.

Authors:C. Marks; C.M. DeaneAbstract: Publication date: Available online 1 February 2017 Source:Computational and Structural Biotechnology Journal Author(s): C. Marks, C.M. Deane Antibodies are proteins of the immune system that are able to bind to a huge variety of different substances, making them attractive candidates for therapeutic applications. Antibody structures have the potential to be useful during drug development, allowing the implementation of rational design procedures. The most challenging part of the antibody structure to experimentally determine or model is the H3 loop, which in addition is often the most important region in an antibody’s binding site. This review summarises the approaches used so far in the pursuit of accurate computational H3 structure prediction.

Authors:Katja Venko; Amrita Roy Choudhury; Marjana NovičAbstract: Publication date: Available online 31 January 2017 Source:Computational and Structural Biotechnology Journal Author(s): Katja Venko, Amrita Roy Choudhury, Marjana Novič The structural and functional details of transmembrane proteins are vastly underexplored, mostly due to experimental difficulties regarding their solubility and stability. Currently, the majority of transmembrane protein structures are still unknown, and this presents a huge experimental and computational challenge. Thanks to X-ray crystallography and NMR spectroscopy, over 3000 structures of membrane proteins have been solved, but only a few hundred of them are unique. Due to the vast biological and pharmaceutical interest in the elucidation of the structure and the functional mechanisms of transmembrane proteins, several computational methods have been developed to overcome the experimental gap. If combined with experimental data, the computational information enables rapid, low-cost and successful predictions of the molecular structure of unsolved proteins. The reliability of the predictions depends on the availability and accuracy of experimental data associated with structural information. In this review, the following methods are proposed for in silico structure elucidation: sequence-dependent predictions of transmembrane regions, predictions of transmembrane helix-helix interactions, helix arrangements in membrane models, and testing their stability with molecular dynamics simulations. We also demonstrate the usage of the computational methods listed above by proposing a model for the molecular structure of the transmembrane protein bilitranslocase. Bilitranslocase is a bilirubin membrane transporter, which shares similar tissue distribution and functional properties with some of the members of the Organic Anion Transporter family and is the only member classified in the Bilirubin Transporter Family. Given its unique properties, bilitranslocase is a potentially interesting drug target.

Authors:Paul W. Bible; Hong-Wei Sun; Maria I. Morasso; Rasiah Loganantharaj; Lai WeiAbstract: Publication date: Available online 30 January 2017 Source:Computational and Structural Biotechnology Journal Author(s): Paul W. Bible, Hong-Wei Sun, Maria I. Morasso, Rasiah Loganantharaj, Lai Wei The structured vocabulary that describes gene function, the gene ontology (GO), serves as a powerful tool in biological research. One application of GO in computational biology calculates semantic similarity between two concepts to make inferences about the functional similarity of genes. A class of term similarity algorithms explicitly calculates the shared information (SI) between concepts then substitutes this calculation into traditional term similarity measures such as Resnik, Lin, and Jiang-Conrath. Alternative SI approaches, when combined with ontology choice and term similarity type, lead to many gene-to-gene similarity measures. No thorough investigation has been made into the behavior, complexity, and performance of semantic methods derived from distinct SI approaches. We apply bootstrapping to compare the generalized performance of 57 gene-to-gene semantic measures across six benchmarks. Considering the number of measures, we additionally consider whether these methods can be leveraged through ensemble machine learning to improve prediction performance. Results showed that the choice of ontology type most strongly influenced performance across all evaluations. Combining measures into an ensemble classifier reduces cross-validation error beyond any individual measure for protein interaction prediction. This improvement resulted from information gained through the combination of ontology types as ensemble methods within each GO type offered no improvement. These results demonstrate that multiple SI measures can be leveraged for machine learning tasks such as automated gene function prediction by incorporating methods from across the ontologies. 
To facilitate future research in this area, we developed the GO Graph Tool Kit (GGTK), an open source C++ library with Python (github.com/paulbible/ggtk).
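The three term similarity measures named above share one ingredient, the shared information (SI) between two terms. The sketch below shows how SI plugs into the standard Resnik, Lin and Jiang-Conrath formulas on a tiny made-up ontology. It does not use or reproduce the GGTK API; the term names, annotation probabilities and function names are hypothetical, and SI is computed here the classic way, as the information content of the most informative common ancestor.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
}
# Hypothetical annotation probabilities for each term.
PROB = {"root": 1.0, "A": 0.5, "B": 0.4, "A1": 0.1, "A2": 0.2}


def ancestors(term):
    """The term itself plus every term reachable through parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: negative log of the annotation probability."""
    return -math.log(PROB[term])


def shared_information(a, b):
    """SI as the IC of the most informative common ancestor (one of several SI choices)."""
    return max(ic(t) for t in ancestors(a) & ancestors(b))


def resnik(a, b):
    return shared_information(a, b)


def lin(a, b):
    denom = ic(a) + ic(b)
    return 2 * shared_information(a, b) / denom if denom > 0 else 0.0


def jiang_conrath_distance(a, b):
    return ic(a) + ic(b) - 2 * shared_information(a, b)


print(resnik("A1", "A2"), lin("A1", "A2"), jiang_conrath_distance("A1", "A2"))
```

Swapping in a different `shared_information` function while keeping the three formulas unchanged is what produces a family of related measures.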

Authors:Sarah Galleguillos; David Ruckerbauer; Matthias P. Gerstl; Nicole Borth; Michael Hanscho; Jürgen ZanghelliniAbstract: Publication date: Available online 28 January 2017 Source:Computational and Structural Biotechnology Journal Author(s): Sarah Galleguillos, David Ruckerbauer, Matthias P. Gerstl, Nicole Borth, Michael Hanscho, Jürgen Zanghellini Chinese hamster ovary cells have been in the spotlight for process optimization in recent years, due to being the major, long established cell factory for the production of recombinant proteins. A deep, quantitative understanding of CHO metabolism and mechanisms involved in protein glycosylation has proven to be attainable through the development of high throughput technologies. Here we review the most notable accomplishments in the field of modelling CHO metabolism and protein glycosylation.

Authors:Zied Gaieb; Dimitrios MorikisAbstract: Publication date: Available online 14 January 2017 Source:Computational and Structural Biotechnology Journal Author(s): Zied Gaieb, Dimitrios Morikis Structure and dynamics are essential elements of protein function. Protein structure is constantly fluctuating and undergoing conformational changes, which are captured by molecular dynamics (MD) simulations. We introduce a computational framework that provides a compact representation of the dynamic conformational space of biomolecular simulations. This method presents a systematic approach designed to reduce large spatiotemporal MD simulation datasets into a manageable set in order to guide our understanding of how protein mechanics emerge from side chain organization and dynamic reorganization. We focus on the detection of side chain interactions that undergo rearrangements mediating global domain motions and vice versa. Side chain rearrangements are extracted from side chain interactions that undergo well-defined abrupt and persistent changes in distance time series using Gaussian mixture models, whereas global domain motions are detected using dynamic cross-correlation. Both side chain rearrangements and global domain motions represent the dynamic components of the protein MD simulation, and both are mapped into a network where they are connected based on their degree of coupling. This method allows for the study of allosteric communication in proteins by mapping out the protein dynamics into an intramolecular network, reducing the large simulation data into a manageable set of communities composed of coupled side chain rearrangements and global domain motions. This computational framework is suitable for the study of tightly packed proteins, such as G protein-coupled receptors, and we present an application to a seven-microsecond MD trajectory of CC chemokine receptor 7 (CCR7) bound to its ligand CCL21.
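Dynamic cross-correlation, mentioned above for detecting global domain motions, is commonly computed as the normalized covariance of atomic displacements from their mean positions. The following sketch uses that standard definition on hand-made coordinates; it is not the authors' code, and the function name and toy trajectories are assumptions.

```python
import math


def dynamic_cross_correlation(traj_i, traj_j):
    """Normalized correlation of displacements of two atoms over a trajectory.

    traj_i and traj_j are equal-length lists of (x, y, z) positions, one
    per frame. Returns a value in [-1, 1]: +1 for fully correlated motion,
    -1 for anti-correlated motion, 0 for uncorrelated or orthogonal motion.
    """
    n = len(traj_i)
    mean_i = [sum(frame[k] for frame in traj_i) / n for k in range(3)]
    mean_j = [sum(frame[k] for frame in traj_j) / n for k in range(3)]
    dot = sq_i = sq_j = 0.0
    for fi, fj in zip(traj_i, traj_j):
        di = [fi[k] - mean_i[k] for k in range(3)]
        dj = [fj[k] - mean_j[k] for k in range(3)]
        dot += sum(a * b for a, b in zip(di, dj))
        sq_i += sum(a * a for a in di)
        sq_j += sum(b * b for b in dj)
    return dot / math.sqrt(sq_i * sq_j)


# Hypothetical three-frame trajectories.
atom_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]   # moves along +x
atom_b = [(5, 1, 1), (6, 1, 1), (7, 1, 1)]   # moves along +x as well
atom_c = [(2, 0, 0), (1, 0, 0), (0, 0, 0)]   # moves along -x
print(dynamic_cross_correlation(atom_a, atom_b))
print(dynamic_cross_correlation(atom_a, atom_c))
```

Evaluating this for every pair of atoms or residues yields the correlation matrix from which coupled domains are usually read off.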

Authors:Jorge Duitama; Lina Kafuri; Daniel Tello; Ana María Leiva; Bernhard Hofinger; Sneha Datta; Zaida Lentini; Ericson Aranzales; Bradley Till; Hernán CeballosAbstract: Publication date: Available online 14 January 2017 Source:Computational and Structural Biotechnology Journal Author(s): Jorge Duitama, Lina Kafuri, Daniel Tello, Ana María Leiva, Bernhard Hofinger, Sneha Datta, Zaida Lentini, Ericson Aranzales, Bradley Till, Hernán Ceballos Cassava is one of the most important food security crops in tropical countries, and a competitive resource for the starch, food, feed and ethanol industries. However, genomics research in this crop is much less developed compared to other economically important crops such as rice or maize. The International Center for Tropical Agriculture (CIAT) maintains the largest cassava germplasm collection in the world. Unfortunately, the genetic potential of this diversity for breeding programs remains underexploited due to the difficulties in phenotypic screening and lack of deep genomic information about the different accessions. A chromosome-level assembly of the cassava reference genome was released this year and only a handful of studies have been made, mainly to find quantitative trait loci (QTL) on breeding populations with limited variability. This work presents the results of pooled targeted resequencing of more than 1500 cassava accessions from the CIAT germplasm collection to obtain a dataset of more than 2000 variants within genes related to starch functional properties and herbicide tolerance. Results of twelve bioinformatic pipelines for variant detection in pooled samples were compared to ensure the quality of the variant calling process. Predictions of functional impact were performed using two separate methods to prioritize interesting variation for genotyping and cultivar selection. 
Targeted resequencing, either by pooled samples or by similar approaches such as Ecotilling or capture, emerges as a cost-effective alternative to whole genome sequencing for identifying interesting alleles of genes related to relevant traits within large germplasm collections.