Pseudogenes are functionless relatives of genes that have lost their gene expression in the cell or their ability to code protein.[1] Pseudogenes often result from the accumulation of multiple mutations within a gene whose product is not required for the survival of the organism. Although not protein-coding, the DNA of pseudogenes may be functional,[2] similar to other kinds of noncoding DNA which can have a regulatory role.

Although some pseudogenes do not have introns or a promoter (these pseudogenes are copied from messenger RNA and incorporated into the chromosome and are called processed pseudogenes),[3] most have some gene-like features such as promoters, CpG islands, and splice sites. They are different from normal genes due to a lack of protein-coding ability resulting from a variety of disabling mutations (e.g. premature stop codons or frameshifts), a lack of transcription, or their inability to encode RNA (such as with ribosomal RNA pseudogenes). The term was coined in 1977 by Jacq et al.[4]

Because pseudogenes are generally thought of as the last stop for genomic material that is to be removed from the genome,[5] they are often labeled as junk DNA. A pseudogene can be operationally defined as a fragment of nucleotide sequence that resembles a known protein's domains but with stop codons or frameshifts mid-domain. Nonetheless, pseudogenes contain biological and evolutionary histories within their sequences. This is due to a pseudogene's shared ancestry with a functional gene: in the same way that Darwin thought of two species as possibly having a shared common ancestry followed by millions of years of evolutionary divergence, a pseudogene and its associated functional gene also share a common ancestor and have diverged as separate genetic entities over millions of years.

Pseudogenes are characterized by a combination of homology to a known gene and nonfunctionality. That is, although every pseudogene has a DNA sequence that is similar to some functional gene, it is nonetheless unable to produce a functional final protein product.[6] Pseudogenes are sometimes difficult to identify and characterize in genomes, because the two requirements of homology and nonfunctionality are usually inferred from sequence alignments rather than demonstrated biologically.

Homology is implied by sequence identity between the DNA sequences of the pseudogene and parent gene. After aligning the two sequences, the percentage of identical base pairs is computed. A high sequence identity (usually between 40% and 100%) means that it is highly likely that these two sequences diverged from a common ancestral sequence (are homologous), and highly unlikely that these two sequences have evolved independently (see Convergent evolution).
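The identity calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: it assumes the two sequences have already been aligned to equal length with `-` marking gaps, and it assumes that gap columns are excluded from the denominator (other conventions exist). The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of ungapped alignment columns that carry identical bases.

    Assumes both inputs are rows of the same pairwise alignment,
    with '-' marking gaps (an illustrative convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if base_a == base_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy example: three of four aligned bases are identical.
print(percent_identity("ACGT", "ACGA"))
```

Whether gaps count as mismatches noticeably changes the reported identity for sequences with many insertions or deletions, so the convention should be stated alongside any threshold.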

Nonfunctionality can manifest itself in many ways. Normally, a gene must go through several steps to yield a fully functional protein: transcription, pre-mRNA processing, translation, and protein folding are all required parts of this process. If any of these steps fails, then the sequence may be considered nonfunctional. In high-throughput pseudogene identification, the most commonly identified disablements are premature stop codons and frameshifts, which almost universally prevent the translation of a functional protein product.
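As a rough illustration of how such disablements might be flagged computationally, the sketch below scans a candidate coding sequence in frame for a stop codon that appears before the final codon, and applies a crude length check as a stand-in for frameshift detection. It assumes the standard genetic code and a sequence that starts in frame; the function names are hypothetical, and real pipelines typically detect frameshifts by aligning the candidate to its parent gene rather than by length alone.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds: str):
    """Return the 0-based codon index of the first in-frame stop codon
    that occurs before the last complete codon, or None if there is none.

    Assumes the sequence begins at the first base of the reading frame.
    """
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The last complete codon is excluded: a stop there is the normal terminator.
    for i in range(n_codons - 1):
        if seq[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None


def length_suggests_frameshift(cds: str) -> bool:
    """Crude proxy: a coding sequence whose length is not a multiple of 3
    cannot be read as whole codons from start to end."""
    return len(cds) % 3 != 0


# Toy example: codons ATG AAA TAG CCC TAA, with an early TAG at index 2.
print(first_premature_stop("ATGAAATAGCCCTAA"))
```

A sequence that passes both checks is not thereby shown to be functional; the checks only cover the two disablements named above.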

Pseudogenes for RNA genes are usually more difficult to discover as they do not need to be translated and thus do not have "reading frames".

Processed (or retrotransposed) pseudogenes. In higher eukaryotes, particularly mammals, retrotransposition is a fairly common event that has had a huge impact on the composition of the genome. For example, between 30% and 44% of the human genome consists of repetitive elements such as SINEs and LINEs (see retrotransposons).[7][8] In the process of retrotransposition, a portion of the mRNA transcript of a gene is spontaneously reverse transcribed back into DNA and inserted into chromosomal DNA. Although retrotransposons usually create copies of themselves, it has been shown in an in vitro system that they can create retrotransposed copies of random genes, too.[9] Once these pseudogenes are inserted back into the genome, they usually contain a poly-A tail, and usually have had their introns spliced out; these are both hallmark features of cDNAs. However, because they are derived from a mature mRNA product, processed pseudogenes also lack the upstream promoters of normal genes; thus, they are considered "dead on arrival", becoming non-functional pseudogenes immediately upon the retrotransposition event.[10] Nevertheless, these insertions occasionally contribute exons to existing genes, usually via alternatively spliced transcripts.[11] A further characteristic of processed pseudogenes is common truncation of the 5' end relative to the parent sequence, which is a result of the relatively non-processive retrotransposition mechanism that creates processed pseudogenes.[12] Processed pseudogenes are continuously being created in primates.[13] In human populations, for example, individuals differ in the sets of processed pseudogenes they carry.[14]

Non-processed (or duplicated) pseudogenes. Gene duplication is another common and important process in the evolution of genomes. A copy of a functional gene may arise as a result of a gene duplication event and subsequently acquire mutations that cause it to become nonfunctional. Duplicated pseudogenes usually have all the same characteristics as genes, including an intact exon-intron structure and promoter sequences. The loss of a duplicated gene's functionality usually has little effect on an organism's fitness, since an intact functional copy still exists. According to some evolutionary models, shared duplicated pseudogenes indicate the evolutionary relatedness of humans and the other primates.[15] If pseudogenization is due to gene duplication, it usually occurs in the first few million years after the gene duplication, provided the gene has not been subjected to any selection pressure.[16] Gene duplication generates functional redundancy, and it is not normally advantageous to carry two identical genes. Mutations that disrupt either the structure or the function of one of the two genes are not deleterious and will not be removed by selection. As a result, the mutated gene gradually becomes a pseudogene and will be either unexpressed or functionless. This kind of evolutionary fate is shown by population genetic modeling[17][18] and also by genome analysis.[16][19] Depending on the evolutionary context, these pseudogenes will either be deleted or become so distinct from the parental genes that they are no longer identifiable. Relatively young pseudogenes can be recognized due to their sequence similarity.[20]

Unitary pseudogenes. Various mutations can stop a gene from being successfully transcribed or translated, and a gene may become nonfunctional or deactivated if such a mutation becomes fixed in the population. This is the same mechanism by which non-processed pseudogenes become deactivated, but the difference in this case is that the gene was not duplicated before becoming disabled. Normally, such gene deactivation would be unlikely to become fixed in a population, but various population effects, such as genetic drift, a population bottleneck, or in some cases, natural selection, can lead to fixation. The classic example of a unitary pseudogene is the gene that presumably coded the enzyme L-gulono-γ-lactone oxidase (GULO) in primates. In all mammals studied other than primates and guinea pigs, GULO aids in the biosynthesis of ascorbic acid (vitamin C), but it exists as a disabled gene (GULOP) in humans and other primates.[21][22] Another, more recent example of a disabled gene links the deactivation of the caspase 12 gene (through a nonsense mutation) to positive selection in humans.[23]

Pseudogenes can complicate molecular genetic studies. For example, amplification of a gene by PCR may simultaneously amplify a pseudogene that shares similar sequences. This is known as PCR bias or amplification bias. Similarly, pseudogenes are sometimes annotated as genes in genome sequences.

Processed pseudogenes often pose a problem for gene prediction programs, which can misidentify them as real genes or exons. It has been proposed that identification of processed pseudogenes can help improve the accuracy of gene prediction methods.[24]

It has also been shown that the parent sequences that give rise to processed pseudogenes lose their coding potential faster than those giving rise to non-processed pseudogenes.[5]

By definition, pseudogenes lack a functioning gene product. However, classification of pseudogenes generally relies on computational analysis of genomic sequences.[25] This has led to the misclassification of some coding genes as pseudogenes. Examples include:

The Drosophila jingwei gene, a functional, chimeric gene which was once thought to be a processed pseudogene.[26]

siRNAs. Some endogenous siRNAs appear to be derived from pseudogenes, and thus some pseudogenes play a role in regulating protein-coding transcripts.[32][33]

piRNAs. Some Piwi-interacting RNAs (piRNAs) are derived from pseudogenes located in piRNA clusters. Those pseudogenes regulate their founding source genes via the piRNA pathway in mammalian testes.

PTENP1 and KRAS1P (KRASP1). The mRNA levels of the tumour suppressor PTEN and oncogenic KRAS are affected by their homologous pseudogenes PTENP1 and KRASP1. A miRNA decoy function for these pseudogenes in cancer has been identified.[34]

The PTEN pseudogene, PTENP1, is a processed pseudogene that is very similar in its genetic sequence to the wild-type PTEN gene, a known tumor suppressor. However, PTENP1 has a missense mutation in the initiator methionine codon that blocks translation of its transcript and consequently prevents production of the PTEN protein (Poliseno et al., 2010). Although PTENP1 cannot be translated, it may still play a role in development. Its 3' UTR functions as a decoy for PTEN-targeting miRNAs due to its similarity to the PTEN gene, and overexpression of the 3' UTR resulted in an increase in PTEN protein level.[35] That is, overexpression of the PTENP1 3' UTR leads to increased regulation and suppression of cancerous tumors.

- The High-Mobility Group-1 family is composed of 4 proteins (HMGA1a, HMGA1b, HMGA1c, and HMGA2), which are involved in cell development. HMGA proteins are expressed at low levels in adult tissues and at high levels in cells going through embryogenesis. Their importance was further confirmed by a knockout study in mice: when HMGA1 was knocked out, development of the mice was severely affected (DeMartino et al., 2016). With the importance of HMGA1 in development established, 8 HMGA1 pseudogenes were identified and studied: HMGA1P1, HMGA1P2, HMGA1P3, HMGA1P4, HMGA1P5, HMGA1P6, HMGA1P7, and HMGA1-p. All 8 of the pseudogenes are processed pseudogenes and are found in humans only (DeMartino et al., 2016).

- The HMGA1P1 gene has a mutation at a Protein Kinase C (PKC) phosphorylation site, which results in a significant reduction in DNA binding affinity. HMGA1P2 also affects binding affinity, as well as protein-protein affinity, due to an arginine 59 mutation that allows HMGA1P2 to be methylated by PRMT6, the product of another protein-encoding gene. Because of these mutations, the pseudogenes can code for proteins that compete with the product of the wild-type HMGA1 gene. Modification of HMGA1 expression results in modified chromatin remodeling and protein-protein interactions (DeMartino et al., 2016). HMGA1P3 lacks the C-terminal acidic tail, which is found in most functional HMGA proteins. The C-terminal acidic tail may play a role in regulating protein-protein interactions and transcription factor activity (DeMartino et al., 2016). Since the C-terminal acidic tail is important in the HMGA protein, HMGA1P3 has reduced function.

- Unlike the HMGA1 pseudogenes described above, pseudogenes HMGA1P4 and HMGA1P5 have low homology with the HMGA1 gene and are unrelated to the gene itself. HMGA1P4 is wholly untranslatable and can be called a "dead gene", while HMGA1P5 codes for a peptide completely unrelated to the HMGA1 protein (DeMartino et al., 2016).

- HMGA1P6 contains a mutation in the stop codon, so the coding sequence is extended by several amino acids and the resulting mRNA cannot be translated. The HMGA1P7 mRNA also cannot be translated, in this case because the pseudogene has a missense mutation at the start methionine codon (DeMartino et al., 2016). Both HMGA1P6 and HMGA1P7 function as decoys for HMGA1-targeting miRNAs. By acting as decoys for the wild-type gene, overexpressed HMGA1P6 and HMGA1P7 also increase the expression levels of HMGA1 mRNA and protein (Fusco et al., 2016). Conversely, knockdown of the HMGA1P6 and HMGA1P7 pseudogenes leads to decreased levels of HMGA1 mRNA and protein (Esposito et al., 2014).

- Because HMGA1 plays an integral part in cell development and is most highly expressed during embryogenesis, its overexpression leads to excessive cell proliferation. Overexpression of HMGA1P6 and HMGA1P7 results in overexpression of HMGA1, which has been seen to induce increased migration, invasiveness, and faster division (Fusco et al., 2016). When transgenic mice overexpressing HMGA1P6 and HMGA1P7 were bred, the mice were observed to have significant cancerous activity, indicating that the HMGA1 pseudogenes are indeed functional and that their overexpression has clear cancerous outcomes. Mouse embryonic fibroblasts (MEFs) derived from mice with HMGA1P6 and HMGA1P7 overexpression grew much faster than MEFs derived from wild-type mice (Fusco et al., 2016).

There are an estimated 20,000 pseudogenes in mammalian genomes. A genome-wide survey has discovered functional pseudogenes that are conserved in more than one species.[36]

A bioinformatics analysis has shown that processed pseudogenes can be inserted into introns of annotated genes and be incorporated into alternatively spliced transcripts.[11] This analysis showed strong evidence for transcription of 726 such retrogenes. However, their function was not studied experimentally.

Quite a few pseudogenes can be transcribed, either because their own promoter is still intact or, in some cases, by using the promoter of a nearby gene; this expression of pseudogenes may be tissue-specific.[5] In the bacterium Mycobacterium leprae, 43% of its 1,133 pseudogenes are transcribed (as opposed to 49% overall and 57% of its ORFs[37]). However, that does not make them "functional" in the sense that these genes or proteins have an activity that benefits the organism.

Duplicated pseudogenic DNA can, in certain cases, be resurrected to encode a functional protein as a rare or occasional evolutionary event, which may enable sampling of more sequence space for a protein or protein family.[20] Pseudogenes or parts of pseudogenes may be re-utilized after they have drifted randomly, without being subjected to selection pressure, for a certain period of evolution. Koch was the first to postulate such "untranslatable intermediates" in the evolution of proteins.[38]

The repair of lesions could be achieved by the reinsertion of a deleted segment, the removal (in frame) of an inserted segment, or other events that are likely to be improbable, such as gene conversion.[39]

The large group of pseudogenes for olfactory receptors (ORs) in metazoans, where 60% of the ORs in the human genome are pseudogenic, may be resurrectable through gene conversion events. In a cluster of ORs on chromosome 17 that contains 16 OR genes and 6 OR pseudogenes, gene conversion events may help to bring diversity in binding capability at the odorant binding site.[41] A pseudogene in the chemosensory ionotropic glutamate receptor Ir75a of Drosophila sechellia bears a premature termination codon (PTC). However, the D. sechellia Ir75a locus produces a functional receptor, owing to efficient translational read-through of the PTC. Read-through is detected only in neurons and depends on the sequence downstream of the PTC.[42]