Navigating the dynamic landscape of long noncoding RNA and protein-coding gene annotations in GENCODE

Our understanding of the transcriptional potential of the genome and its functional consequences has changed significantly in the last decade. This change has been driven largely by improvements in technology, which have made it possible to annotate, and in many cases functionally characterize, a number of novel gene loci in the human genome. Keeping pace with advances in this dynamic field while systematically annotating a compendium of genes and transcripts is a formidable task. Of the many databases that have attempted to systematically annotate the genome, GENCODE has emerged as one of the largest and most popular compendia of human genome annotations.

An analysis of the successive versions of GENCODE by researchers at CSIR-IGIB revealed that transcripts for both protein-coding genes and long noncoding RNAs (lncRNAs) were constantly revised, leading to conflicting annotations. GENCODE version 24 annotates 4.18 % of the human genome as transcribed, an increase of 1.58 % from its first version. Of the 251,614 transcripts annotated across GENCODE versions, only 21.7 % were annotated consistently. The researchers also found that, of the 70 biotypes into which the GENCODE consortium categorized transcripts, only 17 remained stable throughout.
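The kind of cross-version comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the researchers' actual method: it assumes that a transcript counts as "consistent" when it is present in every version with the same biotype, that transcript IDs have already been stripped of any version suffix, and that the annotations have been loaded into a plain dictionary. The function name `consistent_fraction` and the toy data are hypothetical.

```python
def consistent_fraction(annotations):
    """Return the fraction of transcripts annotated identically in every version.

    `annotations` maps a version label to a dict of transcript ID -> biotype.
    Assumption: "consistent" means present in all versions with one biotype.
    """
    # Collect every transcript ID seen in any version.
    all_ids = set()
    for table in annotations.values():
        all_ids.update(table)
    if not all_ids:
        return 0.0

    consistent = 0
    for tid in all_ids:
        # A missing transcript yields None, which marks it as inconsistent.
        biotypes = {table.get(tid) for table in annotations.values()}
        if len(biotypes) == 1 and None not in biotypes:
            consistent += 1
    return consistent / len(all_ids)


# Toy data (hypothetical, not taken from GENCODE).
toy = {
    "V1": {"T1": "protein_coding", "T2": "lincRNA", "T3": "pseudogene"},
    "V2": {"T1": "protein_coding", "T2": "antisense", "T4": "lincRNA"},
}
print(consistent_fraction(toy))
```

In the toy data only T1 keeps the same biotype in both versions; T2 changes biotype, and T3 and T4 are each missing from one version.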

In this report, the researchers review how this dynamism has affected gene annotations, specifically lncRNA annotations, in GENCODE over the years. Their analysis suggests significant dynamism in gene annotations, reflecting the evolution of, and consensus on, gene nomenclature. While progressive changes in annotation and the timely release of updates make the resource reliable for the community, the changes in each release pose unique challenges to its users. Taking cues from other experiments in biocuration, they propose potential avenues and methods to bridge the gap.

A Sankey diagram depicting the dynamicity of GENCODE biotypes across all versions (V1 to V24)

The vertical lines represent the different versions, as labeled at the top. The horizontal lines represent individual transcripts belonging to any of the 72 biotypes. The biotypes are labeled with numbers, as detailed in Table 4. The NA class, represented by zero (0), comprises the transcripts that were deleted in, or do not exist in, a given version. The thickness of a horizontal line represents the number of transcripts with that biotype in an individual version.
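The link widths of such a diagram could be derived roughly as follows. This is a hedged sketch rather than the authors' pipeline: it assumes biotypes are already encoded as integers, uses 0 for the NA class as in the caption, and counts how many transcripts move from one biotype to another between consecutive versions. The function name `sankey_flows`, the biotype codes, and the toy data are hypothetical and do not correspond to the actual codes in Table 4.

```python
from collections import Counter

NA = 0  # class for transcripts absent from a version, as in the caption


def sankey_flows(versions, annotations):
    """Count (source biotype, target biotype) pairs between consecutive versions.

    `versions` is an ordered list of version labels; `annotations` maps each
    label to a dict of transcript ID -> integer biotype code.
    """
    # Track every transcript ever seen so absences can be assigned to NA.
    all_ids = set()
    for version in versions:
        all_ids.update(annotations[version])

    flows = {}
    for prev, curr in zip(versions, versions[1:]):
        counts = Counter()
        for tid in all_ids:
            src = annotations[prev].get(tid, NA)
            dst = annotations[curr].get(tid, NA)
            counts[(src, dst)] += 1  # link width = number of transcripts
        flows[(prev, curr)] = counts
    return flows


# Toy data with made-up biotype codes 1 and 2.
toy_versions = ["V1", "V2", "V3"]
toy_annotations = {
    "V1": {"T1": 1, "T2": 2},
    "V2": {"T1": 1, "T2": 1, "T3": 2},
    "V3": {"T1": 1, "T3": 2},
}
toy_flows = sankey_flows(toy_versions, toy_annotations)
for pair, counts in toy_flows.items():
    print(pair, dict(counts))
```

Each counter can be passed to a plotting library as the source, target, and value of the links between two adjacent version columns.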