Background

- Annotation and analysis of a large cuticular protein family with the R&R Consensus in Anopheles gambiae

Arthropod cuticle consists predominantly of chitin fibers embedded in a protein matrix [1]. While chitin is a simple polymer of N-acetylglucosamine, there are many different cuticular proteins (see [2,3]
for review). The vast majority of cuticular protein sequences presently
available belong to a family with the R&R Consensus, first
identified by Rebers and Riddiford [4]. An extended version of the original Consensus has been shown to bind to chitin [5,6], and the conformation it may adopt has been modeled [7,8].
Throughout this paper, we use the term R&R Consensus to
refer to the extended Consensus, and CPR to refer to the family of
genes/proteins with this Consensus. The Consensus, with about 64 amino
acids, almost always begins near a triad of aromatic residues
(Y/F-x-Y/F/W-x-Y/F) and terminates shortly after a uniformly conserved
G-F/Y (Figure 1).
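
As an illustration only, the two landmarks just described can be written as regular expressions and used to bracket candidate Consensus regions in a protein sequence. This is a minimal sketch and not the annotation procedure used in this study: the function name `find_candidate_consensus` and the length window (`min_len`, `max_len`) are hypothetical choices built around the approximate 64-residue length, and because the Consensus only begins "near" the triad and ends "shortly after" the G-F/Y, the returned spans are rough brackets rather than exact boundaries.

```python
import re

# Aromatic triad as written in the text: Y/F-x-Y/F/W-x-Y/F
AROMATIC_TRIAD = re.compile(r"[YF].[YFW].[YF]")
# Conserved dipeptide near the C-terminal end of the Consensus: G-F/Y
TERMINAL_GF = re.compile(r"G[FY]")


def find_candidate_consensus(seq, min_len=55, max_len=85):
    """Return (start, end) spans running from an aromatic triad to a G-F/Y.

    min_len and max_len are assumed bounds around the ~64-residue
    Consensus; they are illustrative, not values from the paper.
    """
    spans = []
    # Test every position so that overlapping triads are not missed.
    for start in range(len(seq)):
        if not AROMATIC_TRIAD.match(seq, start):
            continue
        # Look for G-F/Y dipeptides downstream of the five-residue triad.
        for gf in TERMINAL_GF.finditer(seq, start + 5):
            length = gf.end() - start
            if min_len <= length <= max_len:
                spans.append((start, gf.end()))
    return spans


# Synthetic, made-up sequence: a triad and a G-F placed 64 residues apart.
toy = "AAAA" + "YAYAF" + "A" * 57 + "GF" + "AAA"
print(find_candidate_consensus(toy))
```

A motif scan of this kind would be expected to produce false positives on real proteins, since short aromatic patterns are common; it is useful mainly for making the landmark positions concrete.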

Figure 1. Features of the R&R Consensus in An. gambiae and their relationship to the pfam00379 motif. The longest region that generally could be aligned across An. gambiae CPRs
is shown beginning with a proline or glycine two or three sites prior
to the aromatic triad and ending with a proline or glycine eight
positions C-terminal to the final invariant glycine. Our alignment of
this region was 65 positions long for aligned RR-2 genes, but the
complete alignment was 83 positions long due to length variation in
RR-1 genes. The pfam00379 alignment includes seven additional positions
N-terminal to our alignment. The region used in the phylogeny extended
further in the N-terminal and C-terminal directions as described in the
text. The shaded region on the top line was double-weighted because
this encompasses the principal features of the R&R Consensus that
are present in virtually all An. gambiae CPRs.

While the R&R Consensus is conserved across
arthropods, its location within a cuticular protein and the nature of
the regions that flank it are highly variable. Understanding of the
role of these proteins in forming the insect exoskeleton and other
cuticular structures will be facilitated by defining all of the
cuticular proteins of a single species. Accounts of the cuticular
proteins with the R&R Consensus have now been published for 28
proteins from Apis mellifera [9] and for 101 from Drosophila melanogaster [10]. In addition, 102 CPR proteins have been identified in the genome of Tribolium castaneum (Beeman
and Willis, unpublished observations). These annotations depended in
large part on computerized genome annotation and were not
systematically verified at the mRNA or protein level.

In the present study, we have carried out an exhaustive manual annotation of the CPR family of An. gambiae based on the whole genome sequence of the PEST strain [11]. These annotations are being facilitated and verified by a proteomics analysis of cuticles [12; He, unpublished observations] and accompanied by an analysis of gene expression with real-time RT-PCR [13].
In addition, ambiguous gene models have been confirmed or revised by
sequencing RT-PCR or RACE products. This work has identified 156 genes
coding for CPR proteins. Hence, over 1% of the genes of An. gambiae are devoted to just this one family of cuticular proteins.

An investigation of cuticular proteins in An. gambiae carried out prior to whole genome sequencing was particularly informative for the present annotation study. Dotson et al. [14]
sequenced a 17.4 kb insert in a genomic library constructed from the
Sua strain. This region had three CPR genes that were at least 98%
identical in their coding regions, yet differed in 5' and 3' UTRs as
well as their introns. The lesson learned was that virtually
identical genes can reside in compact tandem arrays, yet can be
recognized as distinct genes rather than assembly artefacts because of
the differences in the non-coding regions associated with them.

CPR proteins can be divided into groups according to which variant
of the extended R&R Consensus they possess. Two major groups have
been named RR-1 and RR-2, while a third group (RR-3) has been identified
from only a small number of sequences [15,16].
It is unclear whether RR-3s are an evolutionarily distinct group; for
the present analysis we include RR-3 genes within the RR-1 class. A
Hidden Markov Model can be employed at the cuticle DB web server [17] to assign proteins as RR-1 or RR-2 [10]. Our analysis confirms that the bulk of RR-1 and RR-2 proteins form non-overlapping clades in An. gambiae,
separated by a small set of long-branch RR-1, RR-2, and RR-3 proteins
that are probably an artificial group. In addition to assembling
information that supports annotation, we have analyzed the structure of
these clades, examined patterns of molecular evolution, compared the
amino acid composition of the different proteins and identified
characteristics of each group. We now have further appreciation of the
complexity of the insect cuticle and clues about the need for so many
CPR genes.