Hi, I don't have much experience with motif searches, and I would like to hear your advice on the following task:

I have a DNA sequence (~300 bp) that hypothetically contains a regulatory motif, for example a 300 bp region upstream of the TSS of a gene. There is no prior knowledge of what could be binding there, and I would like to get some predictions. What tool would be best to scan for motifs similar to any of the known TF binding motifs in Drosophila?
Further, what would be a good tool for submitting a multi-species alignment and finding a conserved motif? (Again Drosophila; I don't want to find just any motif, but one corresponding to a known factor.)

As kennethcondon2007 said, no matter what you choose to search for TFBSs in a single sequence, you will get lots of false positives. Comparing the promoters of multiple co-regulated genes for an enriched signal is one way of reducing FPs. The second option is to search SNP databases (this is somewhat similar to conservation), as the binding sites of some TFs tend to be highly conserved, with a much lower SNP probability. The third option is to focus on TFBSs and/or motifs that have a very narrow window of possible locations (the TATA box, for example) and build from them using known TF-TF interactions. Fourth, if you know your gene is regulated by one TF and definitely not by another, you can force the TFBS for the first one in and force the second one out. Fifth, look up orthologous and paralogous genes, including pseudogenes; their promoter organization is sometimes conserved. Sixth, if you use PWMs, say from TRANSFAC with Match, make sure you know the origin of the matrix and what its direction means. For some TFBSs the direction is important, and usually this is the direction relative to some other nearby TFBS. I am not an expert on Drosophila or your gene, but your gene may have alternative transcription start sites and alternative promoters. Finally, some TFs bind downstream of the TSS, and yours might be of this kind...
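To illustrate the point about matrix direction, here is a minimal sketch of scanning a sequence with a PWM on both strands and reporting which strand each hit is on. It is not how Match or any other specific tool works internally; the toy count matrix, the uniform background of 0.25 per base, the pseudocount and the score threshold are all assumptions chosen for illustration.

```python
import math

BACKGROUND = 0.25  # assumed uniform base composition
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def log_odds(count_matrix, pseudocount=0.01):
    """Turn per-position base counts into log2-odds scores vs. the background."""
    pwm = []
    for column in count_matrix:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / BACKGROUND)
            for base in "ACGT"
        })
    return pwm


def score(pwm, site):
    """Sum the per-position scores of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))


def scan(pwm, seq, threshold):
    """Return (position, strand, score) for every window scoring >= threshold."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if not set(window) <= set("ACGT"):
            continue  # skip windows containing N or other ambiguity codes
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            s = score(pwm, site)
            if s >= threshold:
                hits.append((i, strand, round(s, 2)))
    return hits


# Hypothetical toy matrix for a TATAAA-like site (not taken from any database).
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}, {"A": 10}, {"A": 10}]
toy_pwm = log_odds(toy_counts)
forward_hits = scan(toy_pwm, "GGTATAAAGG", threshold=5.0)
reverse_hits = scan(toy_pwm, "CCTTTATACC", threshold=5.0)
```

The same site is reported on the "+" strand in the first sequence and on the "-" strand in the second, which is the information you need when the orientation relative to a neighbouring site matters.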

Thanks for the insights. I didn't explain the exact biological question, just an analogous example for simplicity. In reality it's a region within the first intron, not at the TSS. This region is well conserved across the 12 Drosophila genomes, and there is a peak of DNA accessibility in D. mel. These observations make me think that some factor must be binding there (not necessarily a known one). I would like to check for the presence of possible known motifs there, fully aware that there could be false positives, but I don't know where else to start. There are a few other genes that seem to be co-regulated, but that could be for other reasons, so I am not sure whether adding them would help or hurt. I looked at the ChIP-seq/ChIP-chip data for this region on the modENCODE browser, but the data are from embryos and cover only a few TFs, and there weren't any convincing peaks.

Interesting. Do you see that conservation and the accessibility peak in the same region in other species? Do you have access to a wet lab, or funding to order some wet-lab tests elsewhere, or is this a purely bioinformatic task for you?

The problem with a single sequence is the size of the search space. In your case you have a single 300-base sequence. You have no idea how long the possible motifs are, if there are any, or where they occur. You would be searching against a database of many motifs of many different lengths. The number of possible matches is enormous, so any results you get could be occurring completely by chance and have no biological relevance whatsoever.
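As a rough back-of-the-envelope sketch of why chance matches are expected, assume equal base frequencies and fully specified (non-degenerate) motifs. Both are simplifying assumptions, and the database size used below is a made-up number for illustration. Real PWMs are degenerate, so they match by chance more often than this estimate suggests.

```python
def expected_chance_hits(seq_len, motif_len, n_motifs, both_strands=True):
    """Expected number of exact matches arising purely by chance.

    Assumes each base occurs with probability 0.25 and that every motif
    is a single fully specified word of length motif_len.
    """
    positions = max(seq_len - motif_len + 1, 0)  # windows on one strand
    strands = 2 if both_strands else 1
    per_window_probability = 0.25 ** motif_len
    return n_motifs * strands * positions * per_window_probability


# One 6-mer motif against 300 bp, both strands: about 0.14 chance hits.
single_motif = expected_chance_hits(300, 6, 1)

# A hypothetical database of 500 such motifs: about 72 chance hits.
whole_database = expected_chance_hits(300, 6, 500)
```

Even under these generous assumptions, dozens of hits are expected in a single 300 bp sequence with no biology behind them.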

A motif search is usually carried out on a group of related sequences (not a single sequence) to find short seeds that are enriched. For example, if you have a set of 10 co-expressed genes, you can extract the 300 bases upstream of the TSS of each. This set of 10 x 300-base sequences can then be analysed for short fragments that are enriched within it. It is those short fragments that are then used as queries against a database of TF motifs.
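A minimal sketch of the "short seeds shared across a set of sequences" idea: count, for every k-mer, how many of the input sequences contain it on either strand, and keep the ones present in most of them. This is a toy illustration and not a replacement for a dedicated motif discovery tool. In particular it does not compare against a background set, and the value of k and the min_fraction cutoff are arbitrary assumptions.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by a single key."""
    return min(kmer, reverse_complement(kmer))


def kmer_sequence_support(sequences, k):
    """Count how many sequences contain each k-mer (on either strand)."""
    support = Counter()
    for seq in sequences:
        seq = seq.upper()
        seen = set()  # each k-mer is counted at most once per sequence
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                seen.add(canonical(kmer))
        support.update(seen)
    return support


def shared_kmers(sequences, k, min_fraction=0.8):
    """Return (k-mer, n_sequences) pairs found in >= min_fraction of sequences."""
    support = kmer_sequence_support(sequences, k)
    cutoff = min_fraction * len(sequences)
    hits = [(kmer, n) for kmer, n in support.items() if n >= cutoff]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Made-up example sequences; the third carries the site on the opposite strand.
example = ["AAAATATAAACCC", "GGGTATAAAGG", "CCTTTATAGG"]
seeds = shared_kmers(example, k=6, min_fraction=1.0)
```

The k-mers that survive are the candidates you would then compare against known TF motifs.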

You need more sequences that are closely related to your current one. "Closely related" can be defined in many ways: co-expressed, tissue-specific, homologous, and so on.

Right, I understand that the results might not be relevant, especially with only one sequence. But if I have a set of sequences that are potentially co-regulated, as you say, what tool could I use to predict TF binding sites (Drosophila)? Or is there a tool that takes conservation into account? Thanks