Mercator Tutorial

Mercator is a tool to batch classify protein or gene sequences into MapMan functional plant categories. Many MapMan categories deal with metabolic pathways and enzyme functions, therefore using this pipeline a draft metabolic network can be established, especially after manual corretion of the automatically derived classification see e.g. May et al. 2008. For a sequence clasification Mercator performs:

The results of the individual searches are then weighted by reliability. E.g. Uniref gets a low reliability since all proteins from Uniref90 are only classified based on keywords. The classifications with the highest reliabilty are retained. Current statistics (December 08):

Step 1 Sequence upload and option setup

JobName: Is a unique identifier given to your job. You can later use it to access your job.

Name: is a name for this search for the users' convienience e.g. myprotein

Email: is needed to alert you when the job is done. Depending on the selected options, the job can take quite some time.

TAIR: check this if you want to blast your sequences against the Arabidopsis genome

PPAP: check this if you want to blast your sequences against a large subset of the Swissprot plat proteins. (Arabidopsis is excluded from this set)

UniRef: check this if you want to blast your sequences against the Uniref90 set. These are clusters of >=90% sequence identity. Please note, that Uniref Clusters are not classified into MapMan bins, rather, keywords are extracted from their description lines, thus this does not add much to the classification accuracy, but might be useful for annotating sequences see below

Blast-Cutoff: This is the minimum bitscore value, below which blast hits are discarded. We suggest a value of 50.

KOG: performs a RPS-blast search versus the KOG database

CDD: performs a RPS-blast search versus the KOG database

IPR: performs a full interproscan, comprising searches for Pfam,

MULTIPLE: allows mutiple classifications of a protein

PARANOID: Treats unknowns as a separate class which is competing for an annotation, this increases specificity at the cost of sensitivity

ANNOTATE: adds the results of the blast and domain/protein family searches to the proteins or genes. This can be useful if the sequences are novel

IS_DNA: if the sequence is a DNA sequence please check this box.

LOAD_SEQUENCE: Please load a fasta formatted sequence file

Step 2 Check if everything is ok

Press "update job" to check if your submission is ok. You will get a small summary of your submission, in this case 22 sequences comprising 11361 nucleotides were submitted. If everything went ok, you will be presented with a new link: start Mercator

You will then be presented with a web site showing your sequence options, and the current status of your job.

Step 3 Completion of batch job and download

Upon completion you will be offered a download link.

The file is available in zip format. After unziping the file you can open it in MapMan, in a text editor, or in MS Excel. If you open the file using excel, make sure that excel keeps everything formatted as text. (There are bins like 26.10 which get rounded or interpreted as dates by Excel sometimes) At the top of the file you will be presented with all available MapMan Bins (classes). At the bottom you will find your sequences, grouped into Bincode and Name (the Classification into the MapMan system), an identifier and the description which has been extracted from the user-supplied FASTA file and enriched by the user checked searches. E.g. weakly similar to ( 107) AT3G03740| Symbols: ATBPM4 | ATBPM4 (BTB-POZ AND MATH DOMAIN 4); Here the number in brackets (107) gives the bitscore. In the following example the bist hit versus Arabidopsis is colored green, the one against Uniref90 in blue and potential domain names in red. All your sequences will be classified as "T" for transcripts.