Figure 2

GEMS identification of PfM2.1 from the Sexual Development cluster (GO:GNF0004). a) List of words derived from the promoter regions of genes in the Sexual Development cluster. The words are ranked by log10P scores derived from the hypergeometric distribution, which represent the degree of word enrichment in the promoters of genes in the Sexual Development cluster (positive set) versus the remainder of the genome (negative set). In this case, the seed word "GTACATAC" led to PfM2.1 (highlighted in red). b) A re-ordered list of the seed word "GTACATAC" (highlighted in red) and all other words that differ from it by one mismatch, ranked again by log10P score. A PWM is generated from this list, with the contribution of each word to the PWM weighted by its |log10P| score. c) A re-ordered list of all words ranked by their similarity scores to the generated PWM. The similarity score of a word is the geometric mean of the PWM elements corresponding to that word. The similarity threshold whose included words yield the lowest p-value is identified as optimal (highlighted in blue, blue asterisk). d) Visual depiction of parameter optimization through minimization of the p-value across different numbers of mismatches and similarity thresholds. The local minima corresponding to 0, 1, 2, and 3 mismatches are highlighted by circles (red, blue, magenta, and green, respectively). In this case, the optimal log10P score (-21.2) is found with one mismatch and a similarity score threshold of > 0.57.
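
The scoring steps described in panels a to c can be illustrated with a minimal Python sketch. It is not the GEMS implementation: the function names, the pseudocount, and the example words and scores are all hypothetical, and the enrichment p-value is assumed to be the upper tail of the hypergeometric distribution.

```python
import math

BASES = "ACGT"

def log10_hypergeom_p(k, n_pos, K, N):
    """log10 P(X >= k): k of n_pos positive-set promoters contain the word,
    K of all N promoters contain it (assumed upper-tail test)."""
    tail = sum(math.comb(K, i) * math.comb(N - K, n_pos - i)
               for i in range(k, min(K, n_pos) + 1))
    if tail == 0:
        return float("-inf")
    # Subtract logs of the exact integers to avoid floating-point underflow.
    return math.log10(tail) - math.log10(math.comb(N, n_pos))

def within_mismatches(seed, word, max_mismatch=1):
    """True if word has the seed's length and differs at <= max_mismatch positions."""
    return len(seed) == len(word) and \
        sum(a != b for a, b in zip(seed, word)) <= max_mismatch

def build_pwm(words, log10p_scores, pseudocount=0.01):
    """Column-normalized PWM; each word contributes |log10P| at every position.
    The pseudocount is an assumption that keeps every element non-zero."""
    pwm = []
    for pos in range(len(words[0])):
        col = {b: pseudocount for b in BASES}
        for word, score in zip(words, log10p_scores):
            col[word[pos]] += abs(score)
        total = sum(col.values())
        pwm.append({b: v / total for b, v in col.items()})
    return pwm

def similarity(word, pwm):
    """Geometric mean of the PWM elements selected by the word's bases."""
    log_sum = sum(math.log(pwm[i][b]) for i, b in enumerate(word))
    return math.exp(log_sum / len(word))

# Hypothetical seed, neighbours, and scores (not the values shown in the figure).
seed = "GTACATAC"
words = [w for w in ["GTACATAC", "GTACATAT", "GTGCATAC"]
         if within_mismatches(seed, w)]
scores = [-12.0, -6.5, -4.0]
pwm = build_pwm(words, scores)
ranked = sorted(words, key=lambda w: similarity(w, pwm), reverse=True)
```

Selecting the optimal threshold as in panels c and d would then amount to recomputing `log10_hypergeom_p` for the set of words above each candidate similarity cut-off, for each mismatch setting, and keeping the combination with the lowest value.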