AureoWiki - The Wiki way of Staphylococcus aureus annotation

AureoWiki is the first tool to support curated annotation of S. aureus biological entities in a wiki-based, community-driven fashion. In its current, initial state AureoWiki provides information for a first draft of the S. aureus pan genome. In addition, it provides information on the S. aureus strains COL, NCTC8325 (HG001), N315, Newman and USA300_FPR3757. These five strains were selected because of their detailed annotation (N315) or their importance for the scientific community (NCTC8325, COL). The main objective of AureoWiki is a step-by-step, community-driven description and annotation of S. aureus biological entities, and to push forward the process of making knowledge from lab-based research available to the interested public. The following direct links give an impression of how a typical gene / protein page is structured:
hld, eno, hlY, splA, pflB, saeR, sarA

Every interested scientist is invited to complement the information shown. This can be done by using the edit icon at the top right of each page or the [Edit] links beside the subheaders or inside some chapters. Changes can be made either as a registered user (you can set up your own account using the button with the small down arrow to the left of the [login] link at the top of every page) or by answering an expert question from the field of Staphylococcus research before the changes are saved. Any user who is experienced in editing Wikipedia articles will easily be able to add information to AureoWiki. AureoWiki pages show a lot of information from a variety of sources.

Information preceded by a filled bullet point is fed from a database behind AureoWiki and cannot be changed. The reason is that we expect sequence-based information in particular to change rarely, so it is maintained in a more or less static database. An [Edit] link makes it possible to add new user-generated AureoWiki content. User-added information is indicated by white bullets. In some places we added placeholders (e.g. in the phenotype section of the strain-specific pages) to encourage users to complement the currently available information. The AureoWiki pages will grow iteratively, depending on the engagement of the community.

The Main Page presents a clear and easy-to-use search mask. Here the interested user can search for locus tags, gene symbols or keywords. The [Getting Started] button brings you to this page.

The tab bar of the AureoWiki pages lets the user switch between the pan genome pages and the strain-specific pages. The pages are linked via an orthologue reference table that is used behind the scenes and was constructed as part of our work. The following chapters explain what kind of information you can find in AureoWiki and where it comes from:

For a comparative analysis of S. aureus strains, an overall alignment of a publicly available set of 32 strain-specific S. aureus genomes was performed.

Accession | Strain
NC_007793 | USA300_FPR3757
NC_010079 | USA300_TCH1516
NC_002951 | COL
NC_007795 | NCTC_8325
NC_016912 | VC40
NC_009641 | Newman
NC_017331 | TW20
NC_017341 | JKD6008
NC_017347 | T0131
NC_017351 | 11819_97
NC_002953 | MSSA476
NC_003923 | MW2
NC_013450 | ED98
NC_017343 | ECT_R_2
NC_009487 | JH9
NC_009632 | JH1
NC_017340 | 04_02981
NC_002745 | n315
NC_002758 | Mu50
NC_009782 | Mu3
NC_016928 | M013
NC_017349 | LGA251
NC_007622 | RF122
NC_017337 | ED133
NC_016941 | MSHR1132
NC_017338 | JKD6159
NC_017763 | HO_5096_0412
NC_002952 | MRSA252
NC_017342 | TCH60
NC_017673 | 71193
NC_017333 | ST3989
NC_018608 | 08BA02176

On this basis it became possible to construct a positionally corrected pan genome containing all genes. This was used to assign homologous genes (at least 50% identity on the DNA level and 70% identity on the protein level) of different S. aureus strains to so-called unified pan genome gene IDs and pan genome gene symbols, which may simplify knowledge transfer between scientists working with different S. aureus strains.
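The homology criterion above can be illustrated with a minimal Python sketch. Only the two thresholds (50% on the DNA level, 70% on the protein level) come from the text. The function names are hypothetical, and the way identity is computed here (matches divided by the number of gap-free alignment columns) is an assumption, since the page does not say how identity was measured in the actual pipeline.

```python
# Thresholds quoted in the text above.
DNA_MIN_IDENTITY = 50.0      # percent identity on the DNA level
PROTEIN_MIN_IDENTITY = 70.0  # percent identity on the protein level


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gaps ('-') are ignored; the real pipeline may count them differently.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_homologous(dna_identity: float, protein_identity: float) -> bool:
    """True if both identity values reach the thresholds given in the text."""
    return dna_identity >= DNA_MIN_IDENTITY and protein_identity >= PROTEIN_MIN_IDENTITY
```

Both conditions must hold at the same time; a gene pair with high DNA identity but only 60% protein identity would not be assigned to the same pan genome gene ID under this reading.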

Finding these unified pan genome based gene symbols was the major challenge in our efforts. In a first round, gene names with an intuitively understandable meaning were extracted from three S. aureus strains and assigned to the corresponding pan genome genes. The best annotated strain, N315, followed by the COL, NCTC8325 and USA300_FPR3757 strains, served as the basis for the initial pan genome gene naming. In a second round, genes that still had no easily understandable symbol were named using the whole S. aureus alignment, complemented iteratively by gene names from S. epidermidis and B. subtilis and, as a last resort, by names of other bacterial homologues stored in GenBank.

The result of our work is presented in AureoWiki on the so-called pan genome pages, which can be accessed via the leftmost tab at the top of the AureoWiki pages. The pan genome pages present the following information:

A unique pan genome ID resulting from the order of all genes within the S. aureus pan genome, followed by the pan genome gene symbol and the source of its naming (the N315, COL or NCTC8325 strain, other S. aureus strains or other bacterial sources). This is complemented by a list of gene descriptions extracted from all available S. aureus genomes, which stem in part from independent annotation efforts. Because the annotations are partly independent, this list is a useful collection of possible functions, especially for genes with vague functional annotation.

Within the positionally standardized pan genome you can find the strand and the pan genome start and end positions of the pan genome genes, as well as the so-called synteny blocks, whose IDs define phylogenetically conserved genome regions. The occurrence entry shows how often a gene has orthologous genes in the 32 analyzed S. aureus genomes that were used for the pan genome construction. 100% means that the gene is part of the core genome, lower values define genes of the dispensable fraction of the pan genome, and genes with occurrences of less than 6% may be considered orphans.
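The occurrence categories can be expressed as a small Python sketch. The category boundaries (100% for the core genome, below 6% for orphans) and the number of genomes (32) are taken from the text. The function names are hypothetical, and the assumption that occurrence is simply the fraction of genomes with an orthologue is ours.

```python
N_GENOMES = 32       # genomes used for the pan genome construction
ORPHAN_CUTOFF = 6.0  # percent; below this a gene may be considered an orphan


def occurrence_percent(n_with_orthologue: int, n_genomes: int = N_GENOMES) -> float:
    """Share of genomes (in percent) that contain an orthologue of the gene."""
    if not 0 <= n_with_orthologue <= n_genomes:
        raise ValueError("count must lie between 0 and the number of genomes")
    return 100.0 * n_with_orthologue / n_genomes


def classify_occurrence(occurrence: float) -> str:
    """Map an occurrence value to the categories described in the text."""
    if occurrence >= 100.0:
        return "core"
    if occurrence < ORPHAN_CUTOFF:
        return "orphan"
    return "dispensable"
```

With 32 genomes, a gene found in a single genome has an occurrence of 3.125% and one found in two genomes has 6.25%, so under this reading only single-genome genes fall below the 6% mark.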

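Meta Function | Gene Functional Class (TIGRFam Main Role) | Color Code
Envelope | Cell envelope | pink
Cellular processes | Cellular processes | red
Metabolism | Amino acid biosynthesis | orange brown
Metabolism | Biosynthesis of cofactors, prosthetic groups, and carriers | orange brown
Metabolism | Central intermediary metabolism | orange brown
Metabolism | Energy metabolism | orange brown
Metabolism | Fatty acid and phospholipid metabolism | orange brown
Metabolism | Purines, pyrimidines, nucleosides, and nucleotides | orange brown
Metabolism | Transport and binding proteins | orange brown
Genetic Info processing | DNA metabolism | blue shades
Genetic Info processing | Mobile and extrachromosomal element functions | blue shades
Genetic Info processing | Protein fate | blue shades
Genetic Info processing | Protein synthesis | blue shades
Genetic Info processing | Transcription | blue shades
Regulation | Regulatory functions | green tones
Regulation | Signal transduction | green tones
Unknown function | Hypothetical proteins | grey
Unknown function | Unknown function | grey
RNAs | RNA genes | black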

A graphical display of the region around the gene of interest in the N315, COL, NCTC8325 and USA300_FPR3757 strains highlights phylogenetically conserved genomic regions. The colors used encode gene functional assignments (see table). Gene sequences were assigned to functions using the TIGRFams collection of hidden Markov models [1].

Some technical remarks on the pan genome construction:

The pan genome was determined on the basis of a MAUVE [2] based whole-genome alignment, followed by an iterative assignment of single genes to the core genome using GenomeRing [3] and OrthologyPredator (Linus Backert, unpublished data).

The pan genome alignment was developed in very close collaboration with the group of Kay Nieselt in Tübingen (L. Backert, A. Hennig, A. Herbig, K. Nieselt).

Much of the strain-specific data comes from GenBank [4] and is complemented by data from many other sources, which are explained later. The gene summary at the top of each strain-specific gene page condenses the most relevant information on the gene currently shown. This is enriched by the unique pan genome gene ID and the pan genome gene symbol.

For AureoWiki we developed a space-saving genome viewer that reduces the genome information to a minimum. It is based on a vector graphics (SVG) file format and is initially aligned to the position of the gene currently shown. The position in the genome browser can be changed by dragging the slider or by using the mouse wheel on the left side of the genome browser. Clicking on a gene arrow brings you to the corresponding gene page, which makes it possible to walk through the genome step by step, page by page. Colors encode the gene functional categories as explained above.

Here we display much of the basic information on S. aureus genes. This includes the strain-specific gene coordinates, the gene length, the gene type (CDS, mRNA ...) and the strain-specific gene symbol. Furthermore, we extracted the description, the strand and the encoding replicon. This GenBank-based information was enriched with essentiality information that we extracted from the DEG [5], [6], which collected the essentiality information from [7], [8].

The phenotype of mutants in the gene of interest can normally only be described in "free text". Phenotype information is very scarce in public database resources. For this reason we would appreciate the active engagement of the S. aureus research community in sharing this knowledge with all interested scientists. We also plan to extract phenotype information from the literature for genes we are interested in.

General information such as the protein symbol (three letters plus an extension in capital letters), description, sequence and length in amino acids (aa) was extracted from GenBank and implemented in AureoWiki. The molecular weight (MW) and isoelectric point (pI) were calculated directly from the amino acid sequence.
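As an illustration of how a molecular weight can be derived from the sequence alone, here is a minimal Python sketch. It sums commonly tabulated average residue masses and adds one water molecule for the free termini. Whether AureoWiki used exactly these values or this procedure is an assumption, the function name is hypothetical, and the pI calculation is left out because it depends on the chosen set of pKa values.

```python
# Average residue masses in Da (amino acid minus one water); commonly tabulated values.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added for the N- and C-terminus


def molecular_weight(sequence: str) -> float:
    """Average molecular weight (Da) of an unmodified protein sequence."""
    seq = sequence.strip().upper().rstrip("*")  # drop a trailing stop symbol
    if not seq:
        return 0.0
    unknown = set(seq) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue symbols: {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER_MASS
```

Post-translational modifications and cleaved signal peptides are not considered in this sketch, so the result is a theoretical value for the unprocessed sequence.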

Functional assignments, or data that may help with a functional assignment, have been generated and extracted in several ways:
The EC numbers were extracted directly from the GenBank sequence information. Additionally, EC numbers from a TIGRFam [9] functional classification have been inserted. EC numbers are complemented by the corresponding reaction equations and, if available, by enzyme names.

TIGRFams were established by TIGR and follow a scheme for gene functional classification first established for Escherichia coli by Riley [10]. TIGRFams describe families of homologous proteins with defined functions in the metabolism / physiology of bacteria, e.g. an enzymatic activity, a transporter with defined specificity, a protein with a structural function and many more. For each family a so-called TIGRFam hidden Markov model (HMM), a general description of family-specific sequence features, has been established; it can be used with profile HMM search software such as HMMER [11] to assign sequences to TIGRFam functions. Because TIGRFams are systematically ordered in a hierarchical system of nested functions (enolase is part of glycolysis, which is part of energy metabolism), these data are displayed in a tree-like structure as known from computer file system hierarchies. Pressing the [+] sign gives access to all TIGRFam assignments and the branches of the hierarchical system. For clarity we inserted a meta level summarizing the TIGR main roles with their subordinate sub roles and TIGRFam functions. The color codes within the genome viewer are based on these meta roles (orange brown - Metabolism, blue shades - Genetic information processing, green tones - Signal Processing, pink - Cell Wall and Envelope, red - Cellular Processes, black - RNA Genes, grey - Hypothetical and Proteins with no known function). The TIGRFams are displayed in the order of their HMMER scores, which measure the quality of the alignment between the gene and the TIGRFam HMM. Especially for genes of unknown function, even low scores may give a hint of a function.
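The tree-like, score-ordered display described above can be sketched in Python as follows. The input format (tuples of main role, sub role, family ID and score) and the function name are assumptions made for illustration; they do not describe the data structures actually used behind AureoWiki.

```python
from collections import defaultdict


def build_role_tree(hits):
    """Group TIGRFam hits into a main role -> sub role -> [(family, score)] tree.

    hits: iterable of (main_role, sub_role, family_id, score) tuples.
    Hits are inserted in descending score order, so the best hit comes first
    within each branch and branches appear in the order of their best hit.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for main_role, sub_role, family_id, score in sorted(
        hits, key=lambda hit: hit[3], reverse=True
    ):
        tree[main_role][sub_role].append((family_id, score))
    # Convert to plain dicts so the result prints and compares cleanly.
    return {main: dict(subs) for main, subs in tree.items()}


if __name__ == "__main__":
    # Hypothetical hits; family IDs and scores are placeholders, not real results.
    example_hits = [
        ("Energy metabolism", "Glycolysis/gluconeogenesis", "FAM_A", 120.5),
        ("Unknown function", "General", "FAM_B", 15.2),
        ("Energy metabolism", "Glycolysis/gluconeogenesis", "FAM_C", 300.0),
    ]
    for main_role, sub_roles in build_role_tree(example_hits).items():
        print(main_role)
        for sub_role, families in sub_roles.items():
            print("  " + sub_role)
            for family_id, score in families:
                print(f"    {family_id} (score {score})")
```

Low-scoring hits are kept rather than filtered out, in line with the remark that even low scores may hint at a function for genes of unknown function.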

The SEED [12] is an open project for specialist-driven functional classification of DNA/protein sequences. Like TIGRFams, this system is hierarchically structured. Multiple assignments can be accessed by expanding the entry with the [+] sign. Because the assignment is based on manual curation, no further quality measure is available.

As described for TIGRFams, the assignment of sequences to Pfam protein families [13] is based on HMMs. Because Pfam families are organized in clans, the results are displayed hierarchically, with the Pfam clans on the top level and the Pfam families as sublevels. The HMM scores reflect the quality of the assignments, with the best HMM scores shown on top.

Information on domains, protein modifications and cofactors is not yet included but will be incorporated soon.

Some results from N315-based studies on regulators and their effectors, which are stored in RegPrecise [14], are shown. Effectors are biomolecules that modulate the activity of other biomolecules, in this case of regulators. The N315 results have been transferred to the other strains on the basis of sequence homology to the regulators of the other S. aureus strains.

A study in strain MRSA252 [15], a strain not yet included in AureoWiki, revealed a set of several thousand protein interactions. Using orthologue mapping, these interactions have been included for the available S. aureus strains and are indicated accordingly.
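The orthologue-based transfer of interactions can be sketched in Python as follows. The function name, the input formats and the rule of dropping pairs without an orthologue for both partners are assumptions for illustration; the page does not describe the actual procedure in this detail.

```python
def transfer_interactions(interactions, orthologues):
    """Map interaction pairs from a source strain onto a target strain.

    interactions: iterable of (source_gene_a, source_gene_b) pairs.
    orthologues:  dict mapping a source strain gene to its target strain orthologue.
    Returns a sorted list of unique, unordered target gene pairs. Pairs for which
    either partner has no orthologue in the target strain are dropped (assumption).
    """
    transferred = set()
    for gene_a, gene_b in interactions:
        if gene_a in orthologues and gene_b in orthologues:
            # Sort each pair so that (x, y) and (y, x) count as the same interaction.
            pair = tuple(sorted((orthologues[gene_a], orthologues[gene_b])))
            transferred.add(pair)
    return sorted(transferred)
```

The same lookup pattern applies wherever data from one strain are carried over to another by homology, for example the N315 regulator data.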

Several algorithms are available for predicting the (extra)cellular localization of proteins. We applied three of them (PSORTb [16], LocateP [17], SignalP [18]) and present their results. The number of transmembrane domains predicted by the TMHMM [19] tool is also shown. Clicking [+] reveals the more detailed data that served as the basis for predicting the theoretical localization of the protein.

If there is experimental evidence that the protein as a whole, or one or more of its peptides, is actually synthesized or accumulated, this is stated. For this purpose a collection of global proteome analysis papers has been analyzed. The papers are given as references.

Regulatory data have been extracted from RegPrecise [21]. RegPrecise currently holds S. aureus regulatory data exclusively for strain N315. The N315 data have been transferred by homology analysis to the other strains in AureoWiki; such transferred data are indicated accordingly.

A list of papers from which data have been inserted into AureoWiki. These references are indicated on the wiki pages by the [Citations symbol]. On mouse-over, the corresponding citation is shown or highlighted.