Table 5

Known and novel predicted regulatory elements, obtained when applying FastCompare to H. sapiens and M. musculus

(a) Known regulatory sequences

| Sequence | Rank | D_ATG | W_ATG | Orientation | U/C | Experiment | TRANSFAC | Comments |
|---|---|---|---|---|---|---|---|---|
| CCCGCCC | 1 | 256 | - | - | 2.26 | 8(7/1) | Sp1, GC box | Known Sp1 site, transcription from pol II promoter (p < 10^-5) |
| GCCCCGCCC | 2 | 165 | - | - | 4.64 | 9(9/0) | Sp1, GC box | Known Sp1 site, variant from above |
| CCGGAAG | 4 | 160.5 | [0;700] | - | 2.37 | - | Ets1, Elk1 | Known Ets site, RNA metabolism (p < 10^-6) |
| CACGTGAC | 18 | 122.5 | [0;600] | - | 4.90 | - | USF, GBP, SREBP-1 | Known Myc/Max site |
| TGACGTCA | 19 | 107 | [0;1000] | - | 4.24 | - | CREB | Known CREB site |
| CGCATGCG | 24 | 132 | [0;1600] | - | 4.26 | - | - | Known palindromic octamer sequence (POS) |
| CCAATCAG | 37 | 239 | [0;700] | - | 2.85 | 4(0/4) | NF-Y, CCAAT | Known CAAT box and CCAAT enhancer binding protein site |
| CGGAAGTGA | 51 | 94 | [0;1000] | - | 3.96 | - | STAT3 | Known GA-binding protein (GAB) site |
| CCGCCTC | 78 | 632 | [0;500] | - | 4.26 | 9(8/1) | - | Known insulin response element |
| CACGTGG | 82 | 429.5 | [0;300] | - | 2.09 | - | USF, Myc-Max | Known Myc/Max site, different from above |
| TAATCCCAG | 119 | 1258 | [100;2000] | ← (p < 10^-14) | 7.06 | 3(1/2) | - | Similar to Bicoid (Drosophila), RNA processing (p < 10^-5) |
| CACCTGC | 227 | 925 | [0;600] | - | 1.64 | 1(1/0) | E47, Lmo2 | Known ZEB site in vertebrates, Zfh-1 in Drosophila |
| ATTTGCAT | 234 | 729 | [0;300] | - | 1.95 | - | Oct-1 | Known Oct-1 site, chromatin assembly/disassembly (p < 10^-8) |
| CCAAGGTCA | 242 | 801 | [0;1800] | - | 1.59 | - | - | Known HRE site |
| GGAAGTCCC | 253 | 124.5 | [0;300] | - | 2.60 | - | NFκB | Known NFκB site |
| CAGCTGC | 256 | 850 | [0;1600] | - | 1.03 | - | AP-4, HEN1 | Known AP-4, MyoD site |
| TTTCGCGC | 275 | 245 | - | | 2.42 | - | E2F | Known E2F site |

(b) Novel predicted regulatory sequences

| Sequence | Rank | D_ATG | W_ATG | Orientation | U/C | Experiment | TRANSFAC | Comments |
|---|---|---|---|---|---|---|---|---|
| CGCAGGCGC | 6 | 127 | - | - | 2.76 | - | - | Unknown site |
| GCGCCGC | 13 | 311 | [0;1900] | ← (p < 10^-5) | 1.41 | - | - | Unknown site |
| TCTCGCGA | 17 | 116 | [0;1700] | - | 4.45 | - | StuAp | Unknown site, similar to E2F |
| TTAAAAA | 52 | 1142 | [100;2000] | - | 2.19 | 21(0/21) | - | Unknown site |
| CTCCGCCC | 60 | 242.5 | [0;1300] | - | 3.85 | - | - | Unknown site, similar to Sp1 |
| CCCCTCCC | 67 | 563 | [0;500] | → (p < 10^-4) | 5.12 | 1(0/1) | - | Unknown site, regulation of transcription, DNA-dependent (p < 10^-5) |
| AAGATGGCG | 76 | 334 | [0;1300] | - | 1.14 | - | - | Unknown site |
| CTGCGCA | 89 | 199 | [0;300] | - | 3.63 | - | - | Unknown site |
| CCAGCCTGG | 123 | 1245 | [200;2000] | - | 4.42 | - | - | Unknown site |
| CCTGCCC | 162 | 788 | [0;1800] | - | 1.55 | 21(20/1) | E47/Sp1 | Unknown site |
| CCCTTTAAG | 166 | 230 | [0;800] | → (p < 10^-10) | 3.45 | - | - | Unknown site |
| CCCCAGC | 207 | 785 | - | - | 1.42 | 22(22/0) | - | Unknown site |
| TACAACTCC | 225 | 154 | [0;700] | - | 2.51 | - | - | Unknown site |
| GTGAGCCAC | 248 | 1208 | - | → (p < 10^-6) | 6.28 | - | - | Unknown site |

(a) For each known regulatory element, we show the best k-mer, its rank within the set of 284 highest-scoring k-mers, the median distance to ATG (for occurrences upstream of genes within the conserved set), the optimal window, the orientation bias, the corrected ratio of upstream/coding bias, the total (upregulated/downregulated) number of microarray conditions in which the k-mer was found (see Materials and methods), TRANSFAC matches, and the best GO enrichment.

(b) Novel predicted regulatory elements. k-mers shown here were selected from the list of 284 highest-scoring k-mers based on their short median distance to ATG, short optimal window, significant orientation bias, strong over-representation ratio (U/C), presence in upstream regions of over/underexpressed genes in several microarray conditions, palindromicity, or resemblance to known sites in other species.
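
Two notations in this table can be read programmatically: the Experiment entries, written as total(upregulated/downregulated), and the palindromicity criterion mentioned in note (b). The following is a minimal Python sketch, not part of FastCompare. The function names `parse_experiment`, `reverse_complement`, and `is_palindromic` are hypothetical helpers introduced here for illustration, and the sketch assumes that "palindromic" means a k-mer equal to its own reverse complement.

```python
import re

# Assumption: Experiment cells look like "8(7/1)", i.e. total(up/down),
# and "-" marks a k-mer with no microarray condition reported.
_EXPERIMENT = re.compile(r"^(\d+)\((\d+)/(\d+)\)$")

# Watson-Crick complement table for unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_experiment(cell):
    """Return (total, upregulated, downregulated), or None if the cell
    does not match the total(up/down) pattern (for example "-")."""
    match = _EXPERIMENT.match(cell.strip())
    if match is None:
        return None
    total, up, down = (int(group) for group in match.groups())
    return total, up, down


def reverse_complement(kmer):
    """Complement each base, then reverse the sequence."""
    return kmer.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(kmer):
    """True if the k-mer equals its own reverse complement (assumed
    definition of palindromicity for this illustration)."""
    return kmer.upper() == reverse_complement(kmer)


if __name__ == "__main__":
    print(parse_experiment("8(7/1)"))   # counts for CCCGCCC
    print(is_palindromic("TGACGTCA"))   # known CREB site
    print(is_palindromic("CCCGCCC"))    # known Sp1 site
```

Under this definition, TGACGTCA and CGCATGCG read the same on both strands, whereas CCCGCCC does not.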