Supplementary Data.

List of genes localised to the SCL-like Ets/Ets/GATA conserved clusters. [PDF]

Information on transcription factor binding sites.
TFBS name: ETS
IUPAC code: GGAW
Bound by: Winged helix-turn-helix transcription factor family members, including Elf-1 (ETS related transcription factor-1), Fli-1 (Friend Leukemia Integration factor-1) and PU.1 (a.k.a. Spi-1).
Function: ETS family members have important roles in haematopoiesis (Sharrocks and coworkers 1997), binding critically important regulatory elements in vitro and within haematopoietic progenitor cells (Gottgens and coworkers 2002). PU.1 is required for macrophage development and for other myeloid and lymphocytic lineages (Warren and Rothenberg 2003).
Ref: Based on the core consensus sequence detailed by Sharrocks and coworkers (1997) and TRANSFAC(v6) accessions M00032 and M00074.

TFBS name: GATA
IUPAC code: GATA
Bound by: Zinc finger transcription factors GATA1-3.
Function: GATA factors are key regulators of haematopoiesis (Weiss and Orkin 1995). GATA1 has been identified as a component of the SCL binding complex and GATA2 has been shown to contribute to a necessary and sufficient 3' enhancer of the SCL gene (Gottgens and coworkers 2002). GATA-1 is essential in erythroid development and is thought to act in a mutually antagonistic relationship with PU.1 (Warren and Rothenberg 2003).
Ref: The GATA motif is the most widely identified binding sequence of GATA-1 (TRANSFAC(v6) accessions M0278, M00348 and M0349) and GATA-2 (TRANSFAC(v6) accessions M00126, M00127, M00128, M00203, M00346 and M00347). It should be noted that Merika and Orkin (1993) identified variation in the last position of the GATA motif.
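
The IUPAC codes listed above can be turned into a simple sequence scan. The following is a minimal sketch in Python, not the TFBSsearch program itself; the function names (find_sites, iupac_to_regex, reverse_complement) and the 1-based inclusive coordinates are assumptions made for illustration. It expands an IUPAC motif such as GGAW or GATA into a regular expression and reports matches on both strands.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table for IUPAC codes (e.g. R <-> Y, K <-> M).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of an IUPAC motif."""
    return motif.translate(COMPLEMENT)[::-1]


def iupac_to_regex(motif):
    """Compile a motif; the lookahead lets overlapping sites all be found."""
    body = "".join(IUPAC[c] for c in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(sequence, motif):
    """Return sorted (start, end, strand) hits, 1-based inclusive.

    Hypothetical helper: '-' hits are matches of the reverse-complemented
    motif, reported in sense-strand coordinates.
    """
    sequence = sequence.upper()
    hits = []
    for strand, m in (("+", motif), ("-", reverse_complement(motif))):
        for match in iupac_to_regex(m).finditer(sequence):
            start = match.start() + 1
            hits.append((start, start + len(motif) - 1, strand))
    return sorted(hits)


if __name__ == "__main__":
    seq = "TTGGAAGATATC"
    print(find_sites(seq, "GGAW"))
    print(find_sites(seq, "GATA"))
```

Note that a palindromic motif would be reported once per strand at the same position in this sketch.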

Background to the TFBScluster analysis.
TFBScluster was designed to identify clusters of transcription factor binding sites (TFBSs) conserved in mammalian genomes.
Each reported cluster contains a user-specified selection of TFBSs. An additional suite of programs can also provide a list of SWISS-PROT/LocusLink characterised genes to which the clusters are localised. This information may be used directly in the experimental validation of a region. All these programs (PERL scripts) are available on request.

The raw data for TFBScluster are BLASTZ/CHAINNET genome alignments held at
Genome Bioinformatics (UCSC), including human/mouse, human/mouse/rat and human/chicken.
Genome-wide TFBSs are identified using
TFBSsearch (available on our web site) via a script that converts
the downloaded data format to the FASTA format.

The currently implemented alignments include:

June 2003 human assembly (also known as build 34).

Feb. 2003 mouse assembly (also known as MGSCv4 or mm3).

June 2003 rat assembly (also known as rn3).

Feb. 2004 chicken assembly (also known as galGal2).

The result is a set of libraries containing all the putative sites
for different transcription factors. For each TFBS (e.g., EBOX) one
library is created for the core sequence 'CANNTG'. The IUPAC letter 'N' is
allowed to differ between genomes. Libraries are also created to extend the
'core' binding site one to three nucleotides 5' and 3', i.e.,
NCANNTGN, NNCANNTGNN or NNNCANNTGNNN. In these libraries the IUPAC letter N
must be the same in both genomes. By extending the degree of conservation
between the aligned genomes, a more specific and reduced set of TFBSs is created.
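
The library construction described above can be sketched as follows. This is an illustrative Python sketch, not the original PERL code. It assumes a pair of equal-length, upper-case aligned sequences in which gaps are written as '-', and it reads "N must be the same in both genomes" as applying to every N in the extended pattern (flanks and core); both points are assumptions. The names conserved_sites and core_regex are hypothetical.

```python
import re


def core_regex(core):
    """Compile a core motif; only A, C, G, T and N are handled here."""
    return re.compile("".join("[ACGT]" if c == "N" else c for c in core))


def conserved_sites(human, mouse, core, flank=0):
    """Return 0-based alignment offsets of conserved core sites.

    flank=0: the core must match in both genomes; N positions may differ.
    flank=k: additionally, the window extended by k nucleotides on each
    side must be identical in both genomes and free of gaps.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    pattern = core_regex(core)
    size = len(core)
    hits = []
    for i in range(flank, len(human) - size - flank + 1):
        # Gap characters never match [ACGT], so gapped cores are skipped.
        if not pattern.fullmatch(human[i:i + size]):
            continue
        if not pattern.fullmatch(mouse[i:i + size]):
            continue
        if flank:
            h_ext = human[i - flank:i + size + flank]
            m_ext = mouse[i - flank:i + size + flank]
            if h_ext != m_ext or not set(h_ext) <= set("ACGT"):
                continue
        hits.append(i)
    return hits


if __name__ == "__main__":
    human = "TTCAGCTGAA"
    mouse = "TTCATCTGAC"
    print(conserved_sites(human, mouse, "CANNTG"))           # core library
    print(conserved_sites(human, mouse, "CANNTG", flank=1))  # NCANNTGN library
```

In the example, the E-box is present in both sequences but the two genomes differ at one N position, so the site appears in the core library and is excluded from the extended one.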

Selected TFBSs have also been screened using
Regulatory Potential scores (also see the corresponding publication at
PubMed).
For ease of use the 5bp window scores were converted to areas covered by
RP scores >= 0.0002. This is a threshold score determined by analysis
of the haemoglobin beta gene locus. New TFBS library files (TFBS_filtered)
were created to only include those TFBSs present in these areas.
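
The filtering step can be illustrated with a short Python sketch. The input layout is an assumption: each window is given as a (start, score) pair with a 1-based start and a fixed width of 5 bp, and a TFBS is kept only when it lies entirely inside one area. The names rp_areas and filter_sites are hypothetical.

```python
RP_THRESHOLD = 0.0002  # threshold quoted in the text
WINDOW = 5             # window width in bp


def rp_areas(windows, threshold=RP_THRESHOLD, window=WINDOW):
    """Merge passing windows into (start, end) areas, 1-based inclusive.

    windows: iterable of (start, score) pairs (assumed layout).
    Overlapping or directly adjacent passing windows are joined.
    """
    areas = []
    for start, score in sorted(windows):
        if score < threshold:
            continue
        end = start + window - 1
        if areas and start <= areas[-1][1] + 1:
            areas[-1][1] = max(areas[-1][1], end)
        else:
            areas.append([start, end])
    return [tuple(a) for a in areas]


def filter_sites(sites, areas):
    """Keep (start, end) sites that lie entirely within one area."""
    return [
        (s, e) for s, e in sites
        if any(a <= s and e <= b for a, b in areas)
    ]


if __name__ == "__main__":
    windows = [(1, 0.0001), (6, 0.0003), (11, 0.0002), (16, 0.0), (21, 0.001)]
    areas = rp_areas(windows)
    print(areas)
    print(filter_sites([(3, 6), (8, 11), (14, 17), (22, 25)], areas))
```

Whether a site that only partly overlaps an area should be retained is not stated in the text; the sketch takes the stricter reading.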

Information for each TFBS cluster is stored in the
GFF format. The start and
end sites are coordinates of the human genome. The start and
end positions for each TFBS relate to the 'core' sequence, for example
NNGATANN - start = 3 and end = 6. All clusters are reported on the sense
strand, because the individual TFBSs within a cluster may lie on either the
sense or the complementary strand. TFBSs from
selected libraries are formed into clusters of a specified size.
The final length of each cluster may be greater than the specified range, as
overlapping TFBSs are combined to highlight the TFBS-rich region.
The UCSC genome assemblies ('builds') are also used by the
Ensembl project; this
connection allows annotated genes to be localised to the final TFBS clusters.
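
The clustering and GFF reporting described above can be sketched as follows. This Python sketch is one plausible reading and not the TFBScluster implementation: it slides a window of the specified size along the sorted sites, keeps windows holding the required TFBSs, and merges overlapping windows, which is why a final cluster can exceed the specified size. The names core_span, find_clusters and to_gff, and the GFF source and feature labels, are hypothetical.

```python
from collections import Counter


def core_span(flank, core_len):
    """1-based core start and end within an extended site."""
    return flank + 1, flank + core_len


def find_clusters(sites, required, max_len):
    """Return merged (start, end) clusters in human coordinates.

    sites: (start, end, name) tuples, 1-based inclusive.
    required: minimum count per TFBS name, e.g. {"ETS": 2, "GATA": 1}.
    """
    sites = sorted(sites)
    spans = []
    for i, (start, _, _) in enumerate(sites):
        limit = start + max_len - 1
        window = [s for s in sites[i:] if s[1] <= limit]
        counts = Counter(name for _, _, name in window)
        if all(counts[name] >= n for name, n in required.items()):
            spans.append((start, max(end for _, end, _ in window)))
    merged = []
    for start, end in spans:  # spans are already ordered by start
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def to_gff(chrom, start, end):
    """Format one cluster as a nine-column GFF line on the sense strand."""
    fields = [chrom, "TFBScluster", "cluster", start, end, ".", "+", ".", "."]
    return "\t".join(str(f) for f in fields)


if __name__ == "__main__":
    print(core_span(2, 4))  # NNGATANN
    sites = [(100, 103, "ETS"), (110, 113, "ETS"), (120, 123, "GATA"),
             (140, 143, "ETS"), (500, 503, "GATA")]
    for start, end in find_clusters(sites, {"ETS": 2, "GATA": 1}, 35):
        print(to_gff("chr1", start, end))
```

With a specified size of 35 bp, two overlapping windows in the example are combined into a single 44 bp cluster.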

The version of Ensembl used is 19.34a.

All Ensembl annotated transcripts are localised to each cluster when
a cluster is contained in a transcript, or a transcript is located within
100kb of a cluster. As a cluster may be localised to many transcripts
the list is processed to identify one of two scenarios for each cluster:

A cluster is situated in the intron of a transcript.

A cluster is situated 5' to a transcript and/or 3' to a transcript.
The nearest transcript is selected in both situations.
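
The localisation rules above can be sketched as follows. This Python sketch makes several assumptions that the text does not state: transcripts are dictionaries with id, start, end, strand and exons keys; a cluster inside a transcript counts as intronic when it overlaps no exon; distance is measured between the nearest edges; and "5'" and "3'" are taken relative to the transcript's strand. The function name localise is hypothetical.

```python
MAX_DIST = 100_000  # 100kb limit quoted in the text


def localise(cluster, transcripts, max_dist=MAX_DIST):
    """Assign a (start, end) cluster to intronic and flanking transcripts.

    Returns a dict with the ids of transcripts holding the cluster in an
    intron, plus the nearest transcript the cluster is 5' to and the
    nearest transcript it is 3' to (None when there is none in range).
    """
    c_start, c_end = cluster
    introns = []
    best = {}
    for t in transcripts:
        if t["start"] <= c_start and c_end <= t["end"]:
            overlaps_exon = any(
                es <= c_end and c_start <= ee for es, ee in t["exons"]
            )
            if not overlaps_exon:
                introns.append(t["id"])
            continue
        if c_end < t["start"]:
            # Cluster at lower coordinates: 5' of a '+' transcript.
            dist, upstream = t["start"] - c_end, t["strand"] == "+"
        elif c_start > t["end"]:
            # Cluster at higher coordinates: 5' of a '-' transcript.
            dist, upstream = c_start - t["end"], t["strand"] == "-"
        else:
            continue  # partial overlap is not handled in this sketch
        if dist > max_dist:
            continue
        key = "5prime" if upstream else "3prime"
        if key not in best or dist < best[key][0]:
            best[key] = (dist, t["id"])
    return {
        "intron": introns,
        "5prime": best["5prime"][1] if "5prime" in best else None,
        "3prime": best["3prime"][1] if "3prime" in best else None,
    }


if __name__ == "__main__":
    transcripts = [
        {"id": "T1", "start": 1000, "end": 5000, "strand": "+",
         "exons": [(1000, 1200), (4000, 5000)]},
        {"id": "T2", "start": 60000, "end": 70000, "strand": "+",
         "exons": [(60000, 70000)]},
        {"id": "T4", "start": 20000, "end": 30000, "strand": "-",
         "exons": [(20000, 30000)]},
    ]
    print(localise((2000, 2100), transcripts))
```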

In order to identify the function of transcripts localised to
clusters the SWISS-PROT identifier and LocusLink identifier in the Ensembl
annotation are used (where available) to identify genes with characterised
gene products. Anecdotally, more Ensembl genes carry LocusLink
IDs than SWISS-PROT identifiers, but these genes may not have well-defined functions.