I want to find out how common my gene is within Bacteria. I have started blasting my gene against all sequenced bacterial genomes, but of course this is time-consuming as I have to check every species manually. Are there other ways besides BLAST? Help would be greatly appreciated!

Hi marifly81! I'm also interested in determining the prevalence of a given gene (or protein) across sequenced bacterial genomes. I have tried the links provided by bioinfo with no success. Have you found a way to do this?

I still haven't found a way to get the info I want. I basically want to know whether a given gene is present or absent in a species. The output I would like to see would be "this gene is found in 90% of the members of this genus, or taxon, or family". I've tried a few things, like doing a BLAST search against the sequenced genomes at NCBI and getting a taxonomy report. The output tells you how many hits were found for a given species, but only if there is a hit. So I still don't know in which species the gene is absent...
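For what it's worth, the missing piece in that kind of report is the denominator: a list of every genome that was searched, with its taxon, so that genomes without a hit still get counted. Below is a minimal sketch of the tally in Python. The genome IDs, genus assignments and the function name `prevalence_by_taxon` are made up for illustration; in practice the list of searched genomes would come from whatever genome collection you actually searched.

```python
from collections import defaultdict

def prevalence_by_taxon(genome_to_taxon, genomes_with_hit):
    """Return {taxon: (n_with_hit, n_total, fraction)} for every taxon searched.

    genome_to_taxon  -- dict mapping every searched genome to its taxon
    genomes_with_hit -- set of genomes in which the gene was found
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for genome, taxon in genome_to_taxon.items():
        totals[taxon] += 1
        if genome in genomes_with_hit:
            positives[taxon] += 1
    # Taxa with zero hits are kept, which is what a hits-only report lacks.
    return {t: (positives[t], totals[t], positives[t] / totals[t]) for t in totals}

# Hypothetical example data (not real results)
searched = {
    "genome_A": "Escherichia",
    "genome_B": "Escherichia",
    "genome_C": "Bacillus",
    "genome_D": "Bacillus",
    "genome_E": "Vibrio",
}
hits = {"genome_A", "genome_B", "genome_C"}

for taxon, (n_hit, n_total, frac) in sorted(prevalence_by_taxon(searched, hits).items()):
    print(f"{taxon}: gene found in {n_hit}/{n_total} genomes ({frac:.0%})")
```

The same function works at genus, family or any other rank, as long as the dictionary maps each genome to the rank you care about.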

The question is whether the other taxa have not been sequenced yet or whether there is really no homologue. I guess you need to narrow your search to only a few taxa and check them manually, or you can always download all the sequences from NCBI and run BLAST at home, if you have a spare computer.
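A rough sketch of what the "BLAST at home" route could look like with the standalone NCBI BLAST+ tools, written as a small Python helper that only builds the two commands and does not run them. The file names, the database name and the E-value cut-off are placeholders, and `tblastn` (protein query against nucleotide genomes) is just one reasonable choice; `blastn` or `blastp` would be used the same way for other query/database types.

```python
import shlex

def local_blast_commands(genomes_fasta, query_fasta, db_name="bacterial_genomes",
                         evalue=1e-5, out_path="hits.tsv"):
    """Build (but do not run) the BLAST+ commands for a local search.

    All file and database names are placeholders chosen for this example.
    """
    # Step 1: index the downloaded genomes as a nucleotide BLAST database.
    make_db = ["makeblastdb", "-in", genomes_fasta,
               "-dbtype", "nucl", "-out", db_name]
    # Step 2: search a protein query against it; -outfmt 6 gives a
    # tab-separated table that is easy to parse afterwards.
    search = ["tblastn", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]
    return make_db, search

make_db, search = local_blast_commands("all_genomes.fasta", "my_protein.fasta")
print(shlex.join(make_db))
print(shlex.join(search))
```

With BLAST+ installed, each list can be passed to `subprocess.run(cmd, check=True)` to actually execute it.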

My method for BLASTing large numbers of organisms was to construct a database from the SEED network: download all the genomes of interest, save them as a single FASTA file, then BLAST the gene of interest against this file. This was all done on BioLinux and runs on the Perl programming language. I managed to get large numbers of hits with probability indicators. I still have all the syntax to hand, so get in touch if I can help. Nick
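This is not Nick's Perl code, just a minimal Python sketch of the same idea: tag every FASTA header with its genome before merging everything into one file, so that each hit can be traced back to the genome it came from. The `__` separator, the genome IDs and both function names are assumptions made for this example, and it also assumes that the subject ID column of the BLAST tabular output (`-outfmt 6`) carries the first word of the FASTA header.

```python
def label_fasta(genome_id, fasta_text):
    """Prefix every FASTA header with the genome ID (assumed not to contain '__')."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(f">{genome_id}__{line[1:]}")
        elif line.strip():
            out.append(line)
    return "\n".join(out) + "\n"

def genomes_with_hits(tabular_lines):
    """Collect genome IDs from the subject column (2nd field) of tabular BLAST output."""
    found = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id = line.split("\t")[1]
        found.add(subject_id.split("__", 1)[0])
    return found

# Hypothetical two-genome example
combined = (label_fasta("genome_A", ">contig1 some description\nATGC\n")
            + label_fasta("genome_B", ">contig1\nGGCC\n"))
print(combined)

# One made-up tabular hit line (12 standard columns)
example_hits = [
    "my_gene\tgenome_A__contig1\t98.5\t300\t4\t0\t1\t300\t1000\t1299\t1e-150\t540",
]
print(genomes_with_hits(example_hits))
```

Any genome in the combined file that never shows up in the hit table is then a candidate "absent" genome, subject to the thresholds used in the search.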

As far as I understand it: what you are looking at is evolutionarily related genes. The problem then depends entirely upon where you draw your lines. If you play with the BLAST scores, you will get different results. Many of the tools on the net, such as BLAST on NCBI, will give you the result - if you perform a BLAST (I prefer PSI-BLAST), and then you set your level

Type your gene sequence into the NCBI home page - this gives you a huge number of references. Click on UniGene (note the number of links that have already been assigned). This takes you to a page with: SELECTED PROTEIN SIMILARITIES - Comparison of cluster transcripts with RefSeq proteins. The alignments can suggest function of the cluster.
Click on the top link that matches your protein, and in the submenu I suggest protein/protein matches, which has done the BLAST for you. In my example of CD40 I get a list of matches with the organism each sequence came from. With "BLink" it is very simple to then write all of the species down. I get as far as Cricetulus griseus, the Chinese hamster, and its Tumor necrosis factor receptor superfamily member 22. As I know that CD40 is in this family, I can trust it. If I get a bacterial sequence, I know I can't. Such a thing occurs when you don't want to use BLink but try to find previously unidentified connections - then you enter a grey area occupied by people who don't like being asked to nail jelly to plates but admit that some people are better at it than others.
If you are not happy with the BLink data and want to challenge it, you can do your own BLAST: click the same link in UniGene, go to the protein, select BLAST, and then choose (in this example) the PSI-BLAST option (the default BLAST is also good), leave everything else the same, then do the alignment. Once you are familiar with that, you need to consider how many matches you want to ask for; how much of the server's time you use is worth bearing in mind. You can adjust the choice of matrix (I am not sure of the differences, but they will of course give you different answers) and you can adjust the stringency using the gap-penalties section. You can generate a lifetime's work on one gene alone with the different options - it helps to have a guide if you do this work.
PSI- and PHI-BLAST allow deeper analysis through repeated iterations, each building on the previous one (including the new data each time, as I understand it). The threshold (EXPECT) can be raised to make the search less stringent if you aren't getting any results - now that we are in 2011, this is rarely required, as we have so much sequence data and most genes have homologs that have in effect been sequenced in most organisms. In my example, I have just found a fish match to the CD40 gene - most interesting! From the data I have discussed so far this would not be expected, but we know that CD40 exists in fish. So the question is, are you happy with what BLink gives you (humans and Chinese hamsters are related) or would you have wanted to get as far as fish in the analysis? I expect you have a good understanding of bacteria - I don't, hence my discussion of a mammalian gene, but the same approach still applies. If you want to limit the size of your study, just lower the EXPECT.
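To make the effect of the EXPECT cut-off concrete: a hit is reported only if its E-value is at or below the cut-off, so a smaller cut-off keeps fewer hits and a larger one keeps more. A tiny Python sketch with made-up E-values (not real CD40 results; the function name is also just for illustration):

```python
def count_hits_by_evalue(hit_evalues, cutoffs):
    """Count how many hits would pass each E-value cut-off (E-value <= cut-off)."""
    return {cutoff: sum(1 for e in hit_evalues if e <= cutoff) for cutoff in cutoffs}

# Hypothetical E-values for six hits, from very strong to very weak
evalues = [1e-150, 3e-40, 2e-8, 0.004, 0.5, 7.0]

for cutoff, n in count_hits_by_evalue(evalues, [1e-10, 1e-3, 10]).items():
    print(f"EXPECT <= {cutoff:g}: {n} hits kept")
```

Where to draw the line is a judgment call, which is why presence/absence counts should always be reported together with the cut-off that produced them.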