Title: Metagenomics of Antarctic Lakes: a Model for Defining Microbial Biogeochemical Processes in the Cold

Description:

Abstract:

Metadata record for data from ASAC Project 2899
See the link below for public details on this project.
We conducted a genomic analysis of Archaea and Bacteria collected from lakes in the Vestfold Hills, Antarctica. This provided
a new level of understanding about the life forms inhabiting these cold lakes. Linked to knowledge of meteorological, geological,
chemical and physical data that has been collected over years of previous research, the new genomic data will generate a complete
understanding of how the microorganisms have evolved and how they have transformed and presently interact with the Antarctic
environment. Deriving an integrated understanding of microbial ecology is essential for determining ways of preserving the
health of the World's ecosystems.
The data are available for download as an Excel spreadsheet and a Word document from the URL given below.
The GPS coordinates of the sampling sites are as follows:
(Note: these are UTM (Universal Transverse Mercator) coordinates in zone 44D.)
Ace Lake: 44D 0384881 (easting), 2401821 (northing)
Deep Lake: 44D 0385351, 2391772
Organic Lake: 44D 0384928, 2403550
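As a rough illustration of how the three sites relate spatially, the sketch below computes planar distances directly from the UTM coordinates listed above. It assumes all three points lie in the same UTM zone (44D, as stated) and ignores the small UTM scale distortion, so the results are approximate. The names SITES and utm_distance_m are hypothetical and are not part of the dataset.

```python
import math

# UTM zone 44D coordinates (easting, northing) in metres, as listed in this record.
SITES = {
    "Ace Lake": (384881, 2401821),
    "Deep Lake": (385351, 2391772),
    "Organic Lake": (384928, 2403550),
}


def utm_distance_m(site_a, site_b):
    """Approximate planar distance in metres between two sites in the same UTM zone."""
    (e1, n1), (e2, n2) = SITES[site_a], SITES[site_b]
    return math.hypot(e2 - e1, n2 - n1)


if __name__ == "__main__":
    names = sorted(SITES)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(f"{a} to {b}: {utm_distance_m(a, b) / 1000:.2f} km")
```

On these coordinates Organic Lake is roughly 1.7 km from Ace Lake, and Deep Lake is roughly 10 km from Ace Lake.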
The fields in this dataset are:
Water temperature - degrees Celsius
Specific conductivity - microsiemens per centimetre
Conductivity - microsiemens per centimetre
Salinity - parts per thousand
Dissolved oxygen % - %
Dissolved oxygen concentration - milligrams per litre
Dissolved oxygen charge - This is an engineering value. The value is unitless and the recommended reading is 50 plus or minus
25. A low reading generally means the membrane needs replacing, and a high reading means the probe needs
reconditioning.
PressureA (This is a depth reading of the Sonde) - pounds-force per square inch absolute
Water depth - metres
pH
pHmV (This is the pH millivolt reading that the probe is outputting to the Sonde) - millivolts
Turbidity - nephelometric turbidity units
BP (Barometric Air Pressure) - psi (pounds per square inch)
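The sketch below shows how two of the fields above might be checked when working with the spreadsheet. It applies the recommended dissolved oxygen charge range (50 plus or minus 25) and estimates depth from PressureA and BP with the hydrostatic relation. Whether the Sonde derives its Water depth field this way is an assumption, and the default density is for fresh water; real values vary with salinity and temperature. All function names and example values are hypothetical.

```python
PSI_TO_PA = 6894.757  # pascals per pound-force per square inch
G = 9.80665           # standard gravity in m/s^2


def do_charge_advice(charge, target=50.0, tolerance=25.0):
    """Classify a dissolved oxygen charge reading against the 50 +/- 25 guideline."""
    if charge < target - tolerance:
        return "low: membrane may need replacing"
    if charge > target + tolerance:
        return "high: probe may need reconditioning"
    return "ok"


def depth_from_pressure_m(pressure_psia, barometric_psi, density_kg_m3=1000.0):
    """Hydrostatic depth estimate in metres.

    Assumes the water column pressure is the absolute reading minus the
    barometric reading, and that density is constant with depth.
    """
    gauge_pa = (pressure_psia - barometric_psi) * PSI_TO_PA
    return gauge_pa / (density_kg_m3 * G)


if __name__ == "__main__":
    # Made-up example readings, not values from the dataset.
    print(do_charge_advice(48.0))
    print(round(depth_from_pressure_m(29.2, 14.7), 2), "m")
```

Passing a measured density instead of the default would tighten the depth estimate for saline water.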
Taken from the 2008-2009 Progress Report:
Progress against objectives:
New lake and ocean samples, including additional opportunistic samples from Heard Island, were obtained Oct-Dec 2008. All
samples from 2006 forward are being processed. This includes DNA (metagenomics) and protein (proteomics). A great deal of
bioinformatic analysis has been performed on metagenome data. Metaproteomics has also proceeded well. Details of some of
the progress are as follows:
In the reporting period 1,064,488 Sanger sequencing reads were produced with 967,410 passing quality control, which at an
average of 700bp provided 677Mb of sequence data. The reads were produced in batches for each sample. We generated assembly
statistics and phylogenetic profiles after the completion of each batch. Sample diversity then guided the sequence allocation
for each sample. A number of pragmatic software tools have been created to perform the analyses. As an example, for one sample
the whole sample assembly was characterised by read depth, GC content, di-nucleotide frequency (Tetra) and tri-nucleotide
frequency (Tetra) on a per scaffold basis. The intrinsic properties then formed vectors in a feature space on which a self-organising
map clustering analysis was performed. The cluster which comprised the most abundant species was isolated and the genes annotated.
This represented 9 contigs with a total of 1.7 Mb and 1683 predicted genes. For this sample, proteins were extracted and metaproteomics
was performed, yielding 3970 confident peptide matches that identified 504 proteins (at least 2 peptide
matches per protein), representing about 30% coverage of the predicted genes. In comparison, a total of 170 proteins were identified against the non-redundant
database.
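The per-scaffold feature vectors described above can be sketched as follows. This is a minimal illustration, not the project's software: it builds a vector from read depth, GC content, and di- and tri-nucleotide frequencies (k = 2 and k = 3, following the text). The self-organising map clustering step is not shown. The function names are hypothetical, and read depth is assumed to be supplied by the assembler.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq, k):
    """Relative frequency of every ACGT k-mer, in fixed alphabetical order.

    Windows containing ambiguous bases (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def scaffold_features(seq, read_depth):
    """Feature vector: [read depth, GC content, 16 di-nucleotide, 64 tri-nucleotide]."""
    return ([float(read_depth), gc_content(seq)]
            + kmer_frequencies(seq, 2)
            + kmer_frequencies(seq, 3))


if __name__ == "__main__":
    vector = scaffold_features("ACGTACGTGGCCAATT", read_depth=6.0)
    print(len(vector))
```

Each scaffold yields an 82-element vector, which could then be passed to any clustering method. In practice the features would likely need scaling, since read depth and frequencies are on different ranges.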
In other metaproteomic analyses, samples from 4 lake depths yielded a total of 7,925 peptides, allowing the identification
of 1015 proteins against the NCBI non-redundant protein database (matching to annotated metagenome data has not yet been performed).
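The identification criterion quoted earlier (at least 2 peptide matches per protein) can be illustrated with a small filter. This sketch assumes that peptide matches arrive as (peptide, protein) pairs and that only distinct peptide sequences count towards the threshold; the actual search software and its output format are not described in this record, so both are assumptions. The second function reproduces the coverage figure from the reported counts (504 proteins out of 1683 predicted genes).

```python
from collections import defaultdict


def confident_proteins(peptide_matches, min_peptides=2):
    """Keep proteins supported by at least `min_peptides` distinct peptides.

    peptide_matches: iterable of (peptide_sequence, protein_id) pairs.
    Returns a dict mapping protein_id to its sorted supporting peptides.
    """
    peptides_by_protein = defaultdict(set)
    for peptide, protein in peptide_matches:
        peptides_by_protein[protein].add(peptide)
    return {
        protein: sorted(peptides)
        for protein, peptides in peptides_by_protein.items()
        if len(peptides) >= min_peptides
    }


def coverage_fraction(n_identified, n_predicted):
    """Fraction of predicted genes with an identified protein product."""
    return n_identified / n_predicted


if __name__ == "__main__":
    # Made-up matches for illustration only.
    matches = [("PEPTIDEA", "prot1"), ("PEPTIDEB", "prot1"), ("PEPTIDEC", "prot2")]
    print(confident_proteins(matches))
    print(f"{coverage_fraction(504, 1683):.1%}")
```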
For testing detection limits and accuracy of identifications using a metaproteomics approach, a simulated mixed community study
was performed using S. alaskensis and E. coli. This has shown that cell numbers, protein abundance and cell volumes all impact
the ability to detect proteins of individual microorganisms within a population. The type and size of the database the metaproteomic
dataset is searched against (non-redundant versus S. alaskensis + E. coli protein database) also resulted in differences in
protein detection. The work has been useful for optimising parameters used for metaproteomics of the Antarctic samples.
An interesting eukaryotic virus that dominates the biomass of one of the samples is being analysed, with the present work focusing
on classifying and characterising it. Transmission electron microscopy of the water sample revealed virus-like particles of approximately
150 nm, but it was unclear from morphology whether they represented a single virus type or several. Two complementary metagenomic
assembly approaches are being used to produce the most complete assembly possible of the large viral sequences. The first
assembly strategy follows a conventional metagenomic workflow consisting of assembly of the whole metagenomic dataset followed
by taxonomic binning of the constructs. An initial assembly has been constructed after determining the optimum acceptable
degree of error. A high degree of assembly was evident, with the largest scaffold spanning 108 kb at 6X coverage. A BLASTx
search of the five largest contigs (greater than 10 kb) produced two alignments to Major Capsid Protein (MCP) genes: one to
the short MCP gene of Chrysochromulina ericina virus (28% identity) and the other to the full MCP gene of Phaeocystis pouchetii
virus (76% identity). Sequence flanking the full MCP gene corresponds to conserved hypothetical protein sequences from Ostreococcus
virus 5 (45% identity) and Paramecium sp. Chlorella virus AR158 (39% identity). These large deeply assembling contigs will
be used to 'tune' the parameters to improve assembly of the entire metagenome. A preliminary attempt to bin the scaffolds
using tetranucleotide frequencies from the initial assembly has not completely resolved them into clear taxonomic clusters. A
multi-dimensional binning approach including sequence coverage, GC content, nucleotide frequencies along with identification
of marker genes is being developed and will be applied once an optimum whole metagenomic assembly has been completed. Although
the presence of conserved genes is a promising sign of accurate assembly, validation of the scaffolds by comparison to sequenced
virus genomes is uninformative as viruses are poorly represented in the public databases and extremely diverse. Instead, a
second assembly strategy is underway that will conservatively extract and compile the viral sequence. The reads assigned in
an initial MEGAN analysis to the large dsDNA viral clade were used in a preliminary round of assembly. This first assembly
will be used as a reference to recruit more overlapping fragments, which will be combined in another round of assembly, extending the construct
from the high-confidence 'seeds'. Cycles of recruitment and assembly will continue until the assembly reaches an end point.
This is a new method of assembly that potentially can be used to extract and produce confident assemblies of other species
with no sequenced representatives. Comparison between this virus specific assembly and the conventional metagenomic assembly
will allow evaluation of the fidelity of both processes.
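
The recruit-and-assemble cycle described above can be sketched in toy form. This is not the project's pipeline: a real implementation would start from the MEGAN-assigned reads and run an assembler in each cycle. Here the assembly step is replaced by a growing set of k-mers, and a read is recruited if it shares at least one k-mer with that set. The function names, the value of k and the cycle cap are all assumptions chosen for illustration.

```python
def kmers(seq, k):
    """Set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_iteratively(seed_reads, all_reads, k=8, max_cycles=50):
    """Grow a read set outward from high-confidence seeds.

    Each cycle recruits reads sharing at least one k-mer with the current
    reference, then adds their k-mers to the reference. The loop stops when
    a cycle recruits nothing new (the end point) or max_cycles is reached.
    Returns (recruited_reads, number_of_productive_cycles).
    """
    recruited = set(seed_reads)
    reference = set()
    for read in recruited:
        reference |= kmers(read, k)

    cycles = 0
    for _ in range(max_cycles):
        new_reads = {
            read for read in all_reads
            if read not in recruited and kmers(read, k) & reference
        }
        if not new_reads:
            break  # end point: no further overlapping fragments
        recruited |= new_reads
        for read in new_reads:
            reference |= kmers(read, k)
        cycles += 1
    return recruited, cycles


if __name__ == "__main__":
    # Toy reads: three overlap in a chain, one is unrelated.
    reads = ["AAAACCCC", "CCCCGGGG", "GGGGTTTT", "ATATATAT"]
    recruited, cycles = recruit_iteratively(["AAAACCCC"], reads, k=4)
    print(sorted(recruited), cycles)
```

In the toy run the second read is recruited through its overlap with the seed, and the third only becomes reachable once the second has been added, which is the behaviour that repeated cycles are meant to capture. The unrelated read is never recruited.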

This data set conforms to the PICCCBY Attribution License (http://creativecommons.org/licenses/by/3.0/).
Please follow instructions listed in the citation reference provided at http://data.aad.gov.au/aadc/metadata/citation.cfm?entry_id=ASAC_2899
when using these data.

The data from Ace Lake were collected using a YSI Sonde 6600 Multifunction Probe on 20 December 2006.
Biomass was collected between 20 and 25 December. In addition to Ace Lake, surface water samples from Organic and Deep
lakes for filtering were also collected. The biomass samples were collected using a Millipore Filter System with three 293mm
Filter Disk Holders with sequential 3.0, 0.8 and 0.1 micrometer filters. For concentrated filtrate samples we used a Millipore
Pellicon-2 Tangential Flow Filtration System with a Pellicon II Ultrafiltration Biomax 50kD Cassette. The sediment was sampled
with an Ekman sediment grab.
Taken from the 2008-2009 Progress Report:
Variations to work plan or objectives:
Samples were opportunistically obtained from Heard Island. This will broaden the scope of the project in a positive way.
Difficulties affecting project:
The logistics of the season were challenging - both in terms of the impact of the medical incident at Davis in October, and
the unusually overcast and windy weather experienced at Davis during our field period (5 periods with wind exceeding 100 km/h).
Despite the challenges, our field work was remarkably successful.