Results of the PCA of whole chromosome sequences from Dm (○), Dy (△) and Dp (+). Chromosomes are colour-coded, as follows (according to the Dm numbering: black = X, yellow = 2, blue = 3 and red = 4). L and R stand for the left and right arms of the metacentric chromosomes, respectively. (A) Score plot (R2 = 0.87) of the non-normalised hexamer analysis. (B) Loading plot of the analysis in (A). (C) Score plot (R2 = 0.84) of the normalised tetramer analysis. (D) Loading plot of the analysis in (C). The colouring of the hexamers in (B) and (D) is proportional to the A/T content. Pink is all A/T and blue is all G/C.

Table 1

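| Species | Chromosome | Original sequence length | % N | % A/T | % removed by Tandem Repeats Finder | % A/T after Tandem Repeats Finder masking | % removed by RepeatMasker | % A/T after RepeatMasker masking |
|---|---|---|---|---|---|---|---|---|
| Dm | X | 21780003 | 0.10 | 57.42 | 2.12 | 57.34 | 8.80 | 56.55 |
| Dm | 2L | 22217931 | 0.01 | 58.08 | 0.81 | 58.07 | 6.59 | 57.54 |
| Dm | 2R | 20302755 | 0.02 | 56.55 | 0.91 | 56.54 | 7.88 | 56.05 |
| Dm | 3L | 23352213 | 0.05 | 57.92 | 0.87 | 57.93 | 6.77 | 57.36 |
| Dm | 3R | 27890790 | 0.00 | 57.08 | 0.68 | 57.07 | 5.33 | 56.60 |
| Dm | 4(F) | 1237870 | 0.08 | 64.71 | 1.17 | 64.53 | 26.70 | 64.58 |
| Dy | X | 21591847 | 3.28 | 56.73 | 3.45 | 56.58 | | |
| Dy | 2L | 22678881 | 1.31 | 57.19 | 1.56 | 57.18 | | |
| Dy | 2R | 21288905 | 1.39 | 56.74 | 1.33 | 56.75 | | |
| Dy | 3L | 24977971 | 2.19 | 57.57 | 1.57 | 57.57 | | |
| Dy | 3R | 29717196 | 1.88 | 56.81 | 1.47 | 56.82 | | |
| Dy | 4(F) | 1395135 | 2.16 | 64.53 | 3.38 | 64.51 | | |
| Dp | XL | 24630256 | 4.10 | 54.22 | 2.86 | 54.33 | | |
| Dp | 4 | 26108043 | 3.64 | 55.99 | 2.31 | 56.02 | | |
| Dp | 3 | 19738113 | 3.41 | 53.52 | 1.50 | 53.61 | | |
| Dp | XR | 24186629 | 3.87 | 53.76 | 3.63 | 53.96 | | |
| Dp | 2 | 25998849 | 3.13 | 55.10 | 1.86 | 55.15 | | |
| Dp | 5(F) | 849497 | 25.81 | 61.45 | 1.02 | 61.42 | | |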

The length, number of N and A/T content of all chromosomes used in this study.

Analysis of the sequence motifs shows that the F-element separation is no longer solely explained by A/T motifs (Figure 1D). In the analyses using penta- and hexa-mer motifs the Dp F-element is more similar to the non F-element chromosomes and the Dm/Dy F-elements separate more from each other (data not shown). The reason for this became clear when the different genomes were analysed separately. In all three species, the F-element separated from the other chromosomes along the first component, regardless of the motif length used (results of the hexamer analysis are shown in Figure 2). In Dm/Dy the X chromosome was separated from the other chromosomes by the second component, although less markedly than the F-element. Interestingly, the left arm of chromosome X in Dp separates in the second component while the right arm clusters closer to the other chromosomes. This is in agreement with the hypothesis that the right arm of Dp X is a later addition [15]. The left arms of Dm X, Dy X and the Dp X are separated by the same hexamers. Many of the motifs causing the strong separation of the F-elements are the same in all three species. The top-scoring penta- and hexa-mers can easily be aligned into longer motifs (Figure 3 shows results from the Dm hexamer analysis), all of which are supported by hexamers in both sense and anti-sense orientation.

Figure 2

Results of the separate, normalised, whole chromosome PCA of the three genomes using hexamers. Chromosomes are colour-coded, as follows (according to the Dm numbering: black = X, yellow = 2, blue = 3 and red = 4). L and R stand for the left and right arms of the metacentric chromosomes, respectively. (A) Score plot (R2 = 0.97) of the Dm analysis. (B) Loading plot of the analysis in (A). (C) Score plot (R2 = 0.99) of the Dy analysis. (D) Loading plot of the analysis in (C). (E) Score plot (R2 = 0.92) of the Dp analysis. (F) Loading plot of the analysis in (E). The colouring of the hexamers in (B), (D) and (F) is proportional to the A/T content. Pink is all A/T and blue is all G/C.

Figure 3

Graph showing the 50 hexamers with the highest loadings in the normalised Dm PCA. The combination of the eight hexamers into the nonamer is shown. The two hexamers not included in the motif are indicated by open boxes.

One of the atypical features of the Dm F-element is the specific binding of the protein POF. To determine whether the nonamers or nonamer pairs are correlated with the binding of POF to the F-element, we mapped POF binding sites on polytene chromosomes (Figure 4A,B). It is difficult to map polytene bands beyond cytological position 102E5, so we limited this analysis to the region 102A-102E5. Comparison of the sequence positions of the nonamer pairs (Figure 4C) with the staining pattern of POF protein on the polytene F-element (Figure 4B) showed that regions with few or no pairs correlate well with regions lacking POF binding. The genomic sequence corresponding to the cytological regions that do not bind POF comprises 59% of the sequence from positions 1 to 830,000. Nevertheless, 79% of the nonamer pairs and 61% of the nonamers are located outside these regions. We tested the significance of these results in a simulation, repeated 10 million times, in which we randomised the positions of the nonamers and the nonamer pairs. In all of these simulations the number of nonamers or pairs in the POF-binding regions was lower than the observed number.
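The randomisation test described in the preceding paragraph can be illustrated with a minimal sketch. It assumes that motif positions are drawn uniformly along the sequence and that significance is judged by counting simulations that reach the observed number of hits inside the binding regions; the function names, the toy region coordinates and the toy counts are hypothetical and are not taken from the study.

```python
import numpy as np


def simulate_counts(seq_length, binding_regions, n_motifs, n_sim, seed=0):
    """Place n_motifs uniformly at random n_sim times and count how many
    fall inside the (inclusive, 1-based) binding regions each time."""
    rng = np.random.default_rng(seed)
    starts = np.array([s for s, _ in binding_regions])
    ends = np.array([e for _, e in binding_regions])
    counts = np.empty(n_sim, dtype=int)
    for i in range(n_sim):
        pos = rng.integers(1, seq_length + 1, size=n_motifs)
        # A position is "inside" if it lies within any one of the regions.
        inside = ((pos[:, None] >= starts) & (pos[:, None] <= ends)).any(axis=1)
        counts[i] = inside.sum()
    return counts


def empirical_p(observed, simulated):
    """Add-one corrected fraction of simulations with a count >= observed."""
    return (np.sum(simulated >= observed) + 1) / (len(simulated) + 1)


# Toy example (assumed numbers): regions cover 40% of a 1000 bp sequence,
# and 40 of 50 motifs are "observed" inside them.
regions = [(1, 200), (601, 800)]
simulated = simulate_counts(1000, regions, n_motifs=50, n_sim=2000)
p_value = empirical_p(40, simulated)
```

With no simulated count reaching the observed one, the empirical p-value is bounded by one over the number of simulations plus one, which is why a large number of repetitions is needed to support a very small p-value.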

The separation of the X chromosome seen in Figure 2A,C,E is due to simple sequences such as An, Tn, C/An and G/Tn repeats in both the non-normalised and the normalised analysis. This finding is in agreement with in situ hybridization data showing that C/An and G/Tn repeats are common on the X chromosome [19]. Positions of the hexamers that separate chromosome X show no clear correlation to the binding sites of the MSL complex defined by Demakova et al. [20] (data not shown).

In an effort to determine the origin of the sequences causing the chromosomal separation in Dm seen in both the non-normalised and normalised PCA, we repeated the analysis on three additional data sets. To evaluate the contribution of simple sequence repeats we masked the genome using Tandem Repeats Finder [17], and to evaluate the contribution of both simple and more complex repeats we used RepeatMasker [18]. We also merged all exon sequences of the different chromosomes. We then analysed the four datasets simultaneously, both with and without normalisation (Figure 5). The resulting plots show that the enrichment of simple A/T rich sequences on the F-element (seen in the non-normalised PCA, Figure 1B) cannot be explained by differences in repetitive elements. These sequence signatures were not removed by masking simple or more complex repetitive elements, implying that they are present in all non-exon sequences on the F-element (Figure 5A). Interestingly, the F-element exons do not share these sequences, but they still clearly separate from the exon sequences of the other chromosomes. Furthermore, the simple sequences that separate the X chromosome from the others are distributed all over the non-exon sequences. In the PCA in which we accounted for differences in nucleotide composition, the separation was similar to that in the non-normalised analysis, except that the exons of the X chromosome separated from the exons of the other chromosomes (Figure 5B). It should be noted that the first component distinguishes between the exon sequences and the other sequences. The second component, however, separates all types of F-element sequences from the other chromosomal sequences. We conclude that the overrepresentation of some sequence signatures on the F-element cannot be attributed to either the high A/T content or the enrichment of repeated elements and that they are present in both exon and non-exon sequences. The general patterns we see are clearly not dependent on the type of sequence studied or differences in base composition.

In addition, we note that in the normalised PCA the RepeatMasked F-element separates more clearly from the original F-element sequence (Figure 5B) than in the non-normalised analysis. Many sequence signatures are shared by the F-elements in all four datasets. Examination of the top-scoring sequence motifs clearly shows that the RepeatMasked F-element lacks the nonamer motif described above (data not shown). We therefore studied the output file from RepeatMasker in further detail. According to RepeatMasker, 95.3% of the nonamer motifs reside within DINE-1 elements, and thus seem to be closely linked to them. The DINE-1 element has previously been shown by in situ hybridisation to be enriched on the Dm F-element [21]. We also note that in the DINE-1 sequence defined in the Repbase Update [22,23] there is a duplication of approximately 60 base pairs, each copy of which contains a nonamer pair, and in both pairs the individual nonamers are separated by 29 base pairs.

We also masked the genomes of Dy and Dp using Tandem Repeats Finder. Since these genomes have not yet been annotated we could not use exons or RepeatMasker. The PCA results of the original sequences and the masked sequences in these species are virtually identical (data not shown).

Fragment analysis

In the whole chromosome analysis we identified sequence signatures that are enriched on different chromosomes, but we did not investigate their linear organisation along the chromosomes. Therefore, to find sequence signatures evenly distributed over the chromosomes that are capable of distinguishing one chromosome from the others, we fragmented each of the Dm, Dy and Dp genomes into 100 kb fragments. We then scored the positions of all possible di-, tri-, tetra-, penta- and hexa-mers in the 100 kb fragments of all chromosomes from each of the genomes. The first component of a PCA of these data mainly reflects differences in nucleotide composition between the fragments. Since the nucleotide composition can vary both between chromosomes and within single chromosomes, we need to remove this variation from the dataset. One possibility would be to exclude the first component, but some of the variation caused by A/T skewing could still remain in the higher order components. To specifically remove the influence of variations in the base composition we created a Partial Least Squares (PLS) model using the non-normalised hexamer scores and the A/T content as a single response. We then used the residual matrix, after removing the variance described by the first component, for subsequent PCA. The residual matrix is a normalised scoring matrix in which the variance in the data related to the base composition of the target sequence has been removed. The performance of the normalisation was evaluated by plotting the score values of the first component against the base composition of the fragments. As expected, the scores showed an almost perfect correlation with the base composition of the fragments (data not shown).

PCA of the approximately 3600 fragments from all three species showed that the 33 F-element fragments cluster, and separate with minor overlaps from the other chromosomal fragments in the second component (Figure 6 shows results from the hexamer analysis). In the tri- and tetra-mer analyses, the overlap with other chromosomes was more extensive than in the di-, penta- and hexa-mer analyses (data not shown). In the first component of the hexamer PCA, roughly a third of the Dp fragments cluster separately from other chromosomal fragments. The third component separates many of the Dm/Dy X chromosomal fragments from the others, but only when using penta- and hexa-mers (data not shown). The sequence signatures responsible for the separation of the F-element are not the same as in the whole chromosome analysis and cannot easily be combined into longer motifs. For a full listing of the loadings for all 4096 hexamers for the first two components in the PCA see Additional file 1. In conclusion, the fragment analysis showed the existence of F-element-specific sequences that not only have been conserved for approximately 54.9 Myr, but also are linearly distributed along the sequenced part of the F-elements in Dm, Dy and Dp. Based on this conservation we speculate that there are sequence signatures that have a function for F-element identity.

When we plotted the scores from the second component (which separates the F-elements) against the chromosomal position, we found that on average the Dp fragments are shifted towards the F-element fragments (Figure 7 shows results from the hexamer analysis). The centromere proximal regions of the non F-element chromosomes in all species are shifted towards the F-element fragments and the distal regions in the opposite direction. This pattern is not as clear in Dp as in Dm and Dy.

Figure 7

Scores from the second component in Figure 6 plotted against the linear order of the 100 kb fragments on the individual chromosomes from Dm (○), Dy (△) and Dp (+). Chromosomes are colour-coded according to the Dm/Dy numbering (black = X, yellow = 2, blue = 3 and red = 4). This should be noted when examining the Dp fragments. Proximal regions of Dm and Dy chromosomes are indicated by arrows.

In the same way as for the whole chromosome study, we repeated the fragment analysis on chromosomes from the three species after masking them with Tandem Repeats Finder. The results from this masked dataset did not differ in any significant way from the prior analysis (data not shown). For Dm, we also masked the fragmented genome using RepeatMasker. A combined PCA with the original data, Tandem Repeats Finder masked data and RepeatMasker masked data showed that the F-element signatures distributed over the entire chromosome are not connected to either simple or complex sequence repeats (Figure 8 shows results from the hexamer analysis). In this analysis many X chromosomal fragments separated from the other fragments.

Figure 8

The combined PCA of 100 kb fragments (n = 3399) of three Dm data sets based on hexamers. The three data sets used are: the original sequence (□), the Tandem Repeats Finder masked sequence (▽) and the RepeatMasker masked sequence (*). Chromosomal origins of the fragments are indicated by colour (black = X, yellow = 2, blue = 3 and red = 4).

Discussion

Sequence signature analysis

In this work, we separately counted all di- (16), tri- (64), tetra- (256), penta- (1024) and hexa-mers (4096) and studied their distribution in the chromosomes of three Drosophila genomes using PCA. Short motifs (up to tetramers) can be rapidly scored and analysed. However, the frequencies of such short motifs are strongly influenced by the abundance of simple sequence repeats. Motifs longer than tetramers are less affected by simple sequence repeats, but are computationally more demanding to analyse. Sometimes, when a group of sufficiently long sequences, e.g. hexamers, are found to be overrepresented in a genomic sequence, they overlap and form longer sequences with higher discriminative power, thus increasing the chance of identifying longer and more complex sequences than if shorter sequences, e.g. trimers, are used.

The frequency of a sequence motif depends on both biological and stochastic factors. The expected frequency of a specific motif depends on the base composition of the chromosome. If the four nucleotides do not have equal frequencies in all chromosomes, the results from a non-normalised analysis will reflect the effects of a mixture of biological and stochastic factors. It is often difficult to isolate the effects of such factors, but a large part of the stochastic component can be removed by dividing all motif frequencies by the expected frequencies in a normalisation step. Otherwise, biologically interesting motifs may be masked by motifs that are common solely by chance. In this study, we used relatively basic normalisation procedures to account for differences in base composition. However, our multivariate approach could easily be extended to account for differences related to sequence complexity [see e.g. [25]] or any kind of prior knowledge about the target sequence.

Whole chromosome analysis

In many respects, the F-element in Dm (the 4th chromosome) is an atypical chromosome. It has an overall length of ~5 Mb, 3–4 Mb of which consists of simple satellite repeats and does not contain any known genes [26]. The remaining portion (1.23 Mb) has been sequenced and covers the cytogenetic bands 101E-102F on polytene salivary gland chromosomes. However, the banded portion appears to be a mosaic of unique DNA interspersed with moderate and low copy repetitive DNA [21,27-30]. The F-element is largely heterochromatic in nature. The heterochromatic protein HP1 and the modified histone, methylated H3Lys9, have been found to be associated with most of the F-element [31,32]. In accordance with its heterochromatic nature, the F-element has a higher A/T content than the other chromosomes. A high density of transposable elements (approximately six times higher than in the other chromosomes) is found in the Dm F-element [33]. Another interesting feature of the F-element is that it is decorated by the chromosome-specific protein, POF (Painting of fourth), which specifically "paints" the entire chromosome [6]. The F-element is an atypical autosome and has been suggested to have a closer kinship with the X chromosome than with the other autosomes [16,34]. The F-element has been suggested, partly on the basis of studies of the distant relative D. busckii, to originate from the X chromosome [35,36]. The binding of POF to the F-element is reminiscent of the binding of the Drosophila dosage compensation complex to the male X chromosome, which mediates its hypertranscription [reviewed by [4,5]]. In D. busckii, POF binds to the male X, further supporting the suggested relationship between the X chromosome and the F-element [6].

All chromosomes differ to some extent in nucleotide frequencies, with the F-element being extreme in this respect, having a high A/T content in all three species studied. When the raw data were analysed, the F-elements in all three species separated collectively from the other chromosomes (Figure 1), due to differences in their contents of simple sequences containing only A and T. In Dm we performed the analysis on four datasets: the original sequence, the sequence after masking simple sequence repeats, the sequence after masking both simple and more complex repeats, and the exon sequences alone. The results show that the simple A/T sequences, which separate the F-element in the original data, are distributed throughout the non-exon F-element sequences and cannot be attributed to microsatellites and transposable elements. It should also be noted that the F-element exons separate equally well from the exons of other chromosomes. The X chromosome also separates from the other chromosomes, albeit to a lesser extent, due to differences in their simple sequences. The same chromosomal separation is seen regardless of the motif length used. As shown in Figure 1, all of the Dp chromosomes are shifted relative to the Dm/Dy chromosomes, suggesting the presence of Dp-specific signatures in addition to the chromosome-specific signatures studied here.

To detect more complex and potentially functional motifs hidden by the skewed base composition, we normalised our scores according to the base composition of each chromosome analysed. As shown in Figure 1C, the resulting separation was nearly identical to that seen in the non-normalised analysis (Figure 1A). The Dm F-element was clearly separated even after removal of repeated elements from the genome (Figure 5B). It should be noted that the first component in this PCA (Figure 5B) distinguishes the exons from non-exon sequences. In the second component, however, all F-element sequences, including the exon sequences, cluster together. We conclude that the F-element exons also contain F-element signatures.

The separate analysis of the three species showed that the pentamer and hexamer motifs that are most important for distinguishing the F-element can be aligned into longer sequences. Examination of the top-scoring hexamers clearly shows that they are part of a nonamer in Dm and Dy, and of a decamer in Dp. These sequences are strongly enriched in the respective F-elements (Table 2), although the individual hexamers in Dm are not enriched in the non-normalised analysis. Since Dm is the only annotated species, we concentrated our investigation on the Dm/Dy nonamers. Plotting the positions of these nonamers in the Dm F-element showed that they commonly occur in pairs, separated by no more than 146 bp, all but one of which consist of two sense or two anti-sense nonamers. The individual nonamers are enriched roughly four-fold in the F-element, while the pairs are enriched about 15-fold. The nonamers and decamers are also organised in pairs in Dy and Dp respectively (Table 2). We conclude that even though the method is based on relatively short sequence motifs, it still provides a potent means for finding longer and more complex sequence motifs.
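As an illustration of how such pairs could be located, the following minimal sketch finds occurrences of a motif in sense and anti-sense orientation and pairs neighbouring occurrences. Several things here are assumptions: the nonamer used is a hypothetical placeholder and not the motif from the study, the 146 bp limit is applied to the distance between start positions, and only consecutive occurrences are paired (so one occurrence can belong to two pairs).

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_hits(sequence, motif):
    """Return sorted (start, strand) tuples for the motif ('+') and its
    reverse complement ('-'); overlapping occurrences are included."""
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)
    return sorted(hits)


def find_pairs(hits, max_distance=146):
    """Pair each hit with the next one when their start positions differ by
    at most max_distance; the flag tells whether both share an orientation."""
    pairs = []
    for (pos_a, strand_a), (pos_b, strand_b) in zip(hits, hits[1:]):
        if pos_b - pos_a <= max_distance:
            pairs.append((pos_a, pos_b, strand_a == strand_b))
    return pairs


# Toy sequence with a hypothetical nonamer: two sense copies 38 bp apart
# (start to start) and one distant anti-sense copy.
motif = "ACGTTGCAA"
sequence = ("G" * 20 + motif + "G" * 29 + motif + "G" * 300
            + reverse_complement(motif) + "G" * 10)
hits = find_hits(sequence, motif)
pairs = find_pairs(hits)
```

In this toy case the two sense copies form one same-orientation pair, while the anti-sense copy lies too far away to pair with either.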

Since POF is a protein that specifically paints the Dm F-element, we tested the possibility that the nonamers or nonamer pairs may be correlated with POF-binding sites. For this purpose, we stained polytene chromosome preparations using POF antibodies. After carefully mapping the banded regions, we compared the positions of nonamer pairs to the POF staining pattern. The genomic regions with few or no pairs correlate well with regions on the F-element that do not bind POF (Figure 4). We hypothesise that the nonamer pairs have a function and are directly or indirectly involved in POF binding to the F-element in Dm. However, this hypothesis needs to be verified experimentally. Since POF will not bind to a translocated Dm F-element [6], the nonamer pairs are not sufficient by themselves for recruiting POF. If the pairs have a function, it is possible that some variation is allowed within the nonamer and that there are motifs of differing strength. According to our RepeatMasker analysis of the F-element, 95.3% of the nonamers are located within DINE-1 elements. As shown in Figure 2B, the hexamers forming the nonamer are important for the separation of the F-element. Nevertheless, after removing virtually all of the nonamers using RepeatMasker (Figure 5B) F-element separation was retained, indicating that other signatures, apart from the nonamer, help distinguish the Dm F-element.

Fragment analysis

In the whole chromosome analysis we identified sequence signatures that are overrepresented in different chromosomes, but we did not study the linear organisation of the sequence signatures along the chromosomes. Instead, we divided each of the Dm, Dy and Dp chromosomes into 100 kb fragments to check for the presence of sequence signatures that can distinguish fragments of specific chromosomes from those of other chromosomes, especially signatures distributed over the whole chromosome. For such an analysis it is important to remove all variation connected to differences in nucleotide composition. Using a Partial Least Squares (PLS) model with A/T composition of every fragment as a single response we removed this bias. Strikingly, when the approximately 3600 fragments from all three species were analysed using PCA based on di-, penta- and hexa-mers the 33 F-element fragments clustered together (Figure 6). The motifs responsible for this separation were not the same as in the whole chromosome analysis. Nevertheless, this demonstrates the existence of sequence signatures that are capable of separating all F-element fragments from the three different species. Based on the relationship of these species we conclude that these signatures have been conserved for at least 54.9 Myr [15]. These conserved motifs are also linearly distributed along the sequenced part of the F-elements (Figure 6). The F-elements from the three species have high A/T contents and are probably all enriched in mobile and repeated elements. However, the motifs separating the F-element fragments are not connected to simple sequence repeats since masking such repeats did not alter the results. In addition, the Dm F-element fragments clustered together when the original sequence was analysed together with sequences in which both simple and complex repeated elements had been masked (Figure 8). Therefore, the collective separation of F-element fragments in the three species cannot be attributed to any known repeated elements, and we speculate that the signatures we identified have a role in F-element identification. The X chromosomal fragments of Dm/Dy, but not Dp, can also be separated to some degree using penta- and hexa-mers.

As shown in Figure 7, some non F-element fragments are more similar to the F-element fragments. These non F-element fragments are the centromere proximal regions of Dm/Dy chromosomes 2 and 3. The heterochromatic nature of the F-element in Dm is well established, e.g. by its enrichment of HP1 and H3K9 methylation [31]. In our analysis, the proximal regions of chromosomes 2 and 3 in Dm/Dy showed similarity to the F-element. It is interesting that an anti-metH3K9 antibody decorates the proximal regions of chromosomes 2 and 3 as well as the F-element in Dm. The proximal region of X is also stained, but to a much lesser extent using this antibody (JL unpublished results). We note that the same pattern is present in Figure 7. We must consider the possibility that chromatin similarities cause the partial overlap of the F-element and the proximal regions of chromosomes 2 and 3 (and that the heterochromatic nature of the F-element caused its observed separation from the other chromosomes). It is difficult to fully separate chromosome- and chromatin-specific effects. Sequences that have high A/T contents and are enriched in repetitive elements tend to be heterochromatic. As shown in Figures 5 and 8, the F-element separation was retained after normalising for differences in A/T content. Furthermore, the results were not significantly different when simple sequence repeats were removed using Tandem Repeats Finder, or when simple sequence repeats and repetitive elements were removed using RepeatMasker. The findings even apply to the exon sequences. Thus, we conclude that our methodology is capable of detecting chromosome-specific sequences.

Conclusion

We have shown that the F-elements of three species that separated roughly 55 Myr ago share sequences that are distributed over the entire chromosomes. These sequences are not related to their unusually high A/T contents or any known repeated elements. In conclusion, our results support the existence of sequence signatures that confer chromosome-specific integrity in Drosophila.

Methods

Hexamer scoring

We scored all positions of all possible di- (16), tri- (64), tetra- (256), penta- (1024) and hexa-mers (4096) in the genome sequences of Dm, Dy and Dp. Every motif was counted in each target sequence. Full-length chromosomes and 100 kb fragments were used as targets. Scoring was done by a sliding window approach, sliding one nucleotide at a time. The scoring function gives a two-dimensional data matrix with target sequences as objects (rows) and the total score for each motif as variables (columns). By dividing each element in the matrix by the length of its target sequence a relative score is obtained. Prior to analysis all data were mean-centred, i.e. each value was adjusted by subtracting the average value for the corresponding variable. All scoring and data normalisation procedures were performed using custom software developed in C, Java and Perl. The software can be obtained, on request, from the corresponding author.
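The scoring procedure can be sketched as follows. This is a minimal Python illustration and not the authors' C, Java and Perl software; the function names are hypothetical, and skipping windows that contain characters other than A, C, G and T (such as N) is an assumption of the sketch.

```python
from itertools import product

import numpy as np


def all_motifs(k):
    """Return all 4**k motifs of length k in a fixed order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def score_targets(targets, k):
    """Count every k-mer in every target with a window sliding one base at a time.

    Returns (raw, relative, centred); rows are targets, columns are motifs.
    Windows containing anything other than A, C, G or T are skipped (assumption).
    """
    motifs = all_motifs(k)
    column = {m: j for j, m in enumerate(motifs)}
    raw = np.zeros((len(targets), len(motifs)))
    for row, seq in enumerate(targets):
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            j = column.get(seq[start:start + k])
            if j is not None:
                raw[row, j] += 1
    lengths = np.array([len(s) for s in targets], dtype=float)
    relative = raw / lengths[:, None]            # relative score per target
    centred = relative - relative.mean(axis=0)   # mean-centre each motif column
    return raw, relative, centred


# Toy targets standing in for chromosomes or 100 kb fragments.
raw, relative, centred = score_targets(["ACGTACGT", "AAAAAAAA", "ACGNACGT"], k=2)
```

A pure-Python loop like this is only practical for short toy sequences; whole chromosomes would need a faster implementation, which is consistent with the use of compiled code in the study.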

Multivariate analysis

Principal Component Analysis – PCA

The central idea of PCA is to extract a few so-called principal components describing most of the variation present in the data. The principal components are linear combinations of the original variables and are uncorrelated with each other.

$$X = \sum_{a=1}^{A} t_a p_a' + E$$

where X is the data matrix, t are the scores, p the loadings, A is the number of principal components and E is the residual matrix. The principal components can be determined using the NIPALS algorithm [38] or by Singular Value Decomposition (SVD) [39]. The scores (t) show how the objects and experiments relate to each other. The loadings (p) reveal variables that have an important influence on the patterns seen in the score plot.
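A minimal sketch of this decomposition, computed by SVD with NumPy, is given below. The function name and the random toy matrix are assumptions made for illustration; the sketch mean-centres the columns and does not scale them.

```python
import numpy as np


def pca_svd(X, n_components):
    """Decompose column-centred X into scores T, loadings P and residual E."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores, one column per component
    P = Vt[:n_components].T                      # loadings, one column per component
    E = Xc - T @ P.T                             # residual matrix
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return T, P, E, explained


# Toy matrix: 6 objects (rows) by 10 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))
T, P, E, explained = pca_svd(X, n_components=2)
```

Here `explained` is the fraction of the total variance captured by the retained components. Plotting the columns of `T` against each other gives a score plot, and plotting the columns of `P` gives the corresponding loading plot.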

Data normalisation

Probability normalisation

The probability of successfully aligning a motif to a target depends on the base composition of the motif sequence and the target sequence. For example, the chance of finding a given A/T-rich motif is relatively high in an A/T-rich target due to their similarity in base composition. Probability normalisation removes this systematic bias from the data. Each value is normalised by dividing the observed number of hits by the expected number of hits. The initial scoring is performed as described above, except that the scores are not divided by the target sequence length. The number of expected hits was calculated as follows:

$$\text{Expected hits} = N \prod_{i} f(i)^{n_i}$$

where N is the target sequence length, i ∈ {G, A, T, C}, f(i) is the frequency of base i in the target sequence and n_i is the count of base i in the hexamer.
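As a worked illustration of the quantities defined above, the sketch below computes an expected hit count as the target length multiplied by the product of the base frequencies raised to the base counts of the motif, and then divides an observed count by it. That this product form matches the original calculation exactly is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def expected_hits(motif, target):
    """Expected number of hits of motif in target, assuming independent bases:
    N * prod_i f(i)**n_i, with f(i) the frequency of base i in the target
    and n_i the count of base i in the motif."""
    target = target.upper()
    n_total = len(target)
    freq = {base: target.count(base) / n_total for base in "GATC"}
    expected = float(n_total)
    for base, n_i in Counter(motif.upper()).items():
        expected *= freq[base] ** n_i
    return expected


def probability_normalise(observed, motif, target):
    """Observed hits divided by expected hits (fails if the expectation is zero)."""
    return observed / expected_hits(motif, target)


# Toy target with equal base frequencies (0.25 each) and length 100.
target = "ACGT" * 25
expected_at = expected_hits("AT", target)               # 100 * 0.25 * 0.25
normalised_at = probability_normalise(5, "AT", target)  # 5 / 6.25
```

A normalised value above 1 indicates that the motif occurs more often than the base composition alone would predict, and a value below 1 indicates that it occurs less often.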

Fragment normalisation

To remove all variance in the scoring matrix obtained from the 100 kb fragment analysis that was solely related to the base composition of the target sequences, a different normalisation was applied, in which we created a PLS model with the base composition of every fragment as a single y-response and the scoring matrix as an x-matrix. By removing the variance explained by the first component a residual matrix was obtained, in which all variation caused by differences in base composition amongst the fragments had been removed. The residual matrix E was calculated as follows:

E = x - tp'

where x is the hexamer scoring matrix, t = PLS-scores for the 1st component and p' = PLS-loadings for the 1st component.

The normalised data were then used for PCA analysis of the fragmented genome.
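The removal of the first PLS component can be sketched with NumPy as follows. This is a minimal single-response PLS step written for illustration; the function name, the use of mean-centring without scaling, and the toy data are assumptions and may differ from the software actually used.

```python
import numpy as np


def pls_residual(X, y):
    """Remove the first PLS component of X with respect to a single response y.

    Returns (E, t, p) with E = Xc - t p', where Xc is the column-centred X.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)       # weights: direction of maximal covariance with y
    t = Xc @ w                   # PLS scores for the 1st component
    p = Xc.T @ t / (t @ t)       # PLS loadings for the 1st component
    E = Xc - np.outer(t, p)      # residual matrix
    return E, t, p


# Toy data (not from the study): 20 "fragments" and 8 "motif" columns whose
# values are driven almost entirely by a simulated A/T content.
rng = np.random.default_rng(1)
at_content = rng.uniform(0.50, 0.65, size=20)
X = np.outer(at_content, rng.normal(size=8)) + 0.001 * rng.normal(size=(20, 8))
E, t, p = pls_residual(X, at_content)
```

By construction the residual matrix is orthogonal to the removed scores, so a subsequent PCA of `E` cannot recover that component. In the toy data nearly all of the variation disappears, because it was generated from the simulated A/T content alone.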
