Abstract

The Sesame Genome Working Group (SGWG) has been formed to sequence and assemble the
sesame (Sesamum indicum L.) genome. The status of this project and our planned analyses are described.

Keywords:

genomics; sequencing; sesame

The importance of the sesame genome

Sesame (Sesamum indicum L., 2n = 26), which belongs to the Sesamum genus of the Pedaliaceae family, is one of the oldest oilseed crops and is cultivated
in tropical and subtropical regions of Asia, Africa and South America [1,2]. Its cultivation history can be traced back to between 5,000 and 5,500 years ago
in the Harappa Valley of the Indian subcontinent [3]. The total area of sesame harvested in the world is currently 7.8 million hectares,
and annual production is 3.84 million tons (2010, UN Food and Agriculture Organization
data). Being one of the four main sesame-producing countries, China has contributed
15.2 to 32.5% of the total world sesame production over the past 10 years (2001 to
2010, UN Food and Agriculture Organization data). Sesame has one of the highest oil
contents: decorticated seeds contain 45 to 63% oil [2]. The seed is also rich in protein, vitamins (including niacin), minerals and lignans
(such as sesamolin and sesamin) [4-7], and it is a popular food and medicine [8-13]. Sequencing and analysis of the sesame genome are essential if we are to elucidate
the evolutionary origins and characteristics of the sesame species.

Sesamum is the main genus in the family Pedaliaceae, which contains 17 genera and 80 species
of annual and perennial herbs that are distributed in the Old World tropics and subtropics
[14]. The taxonomy and cytogenetics of the Sesamum genus have been reviewed and debated for a long time [1,14-17], and many heterogeneous landraces present in various growing areas still need to
be distinguished [1,18]. S. indicum is the sole cultivated species in the Sesamum genus and evolved from wild populations [14,19]. However, the origin and evolution of cultivated sesame are still unclear and require
more detailed investigation [1,15]. Evidence suggests that sesame may have originated in either India or Africa [3,20-26]. Bedigian reported that sesame was derived from the Indian subcontinent (the western
Indian peninsula and parts of Pakistan) thousands of years ago, and believed that
the progenitor of sesame is a taxon named S. orientale var. malabaricum Nar. [22,23], although most species of Sesamum and genera of the Pedaliaceae are native to Africa [27-29]. We hope to clarify the origin and phylogeny of S. indicum by applying comparative genomics and morphological and cytological analyses.

Sesame seed is commonly known as the 'Queen of the oil seeds', perhaps for its resistance
to oxidation and rancidity [3]. As it contains lignans, sesame oil also shows anti-cancer activity both in vitro and in animal bioassays [30-34]. Compared with peanut (Arachis hypogaea), soybean (Glycine max), oilseed rape (Brassica napus), sunflower (Helianthus annuus L.) and other oilseed crops, sesame seed oil has a favorable, nearly equal content of
oleic acid (18:1) (39.6%) and linoleic acid (18:2) (46.0%), and has desirable physiological
effects, including antioxidant activity, and blood pressure- and serum lipid-lowering
potential [2,35,36]. Studies of the genome and functional genome of sesame are essential for elucidating
the regulatory mechanisms underlying fatty acid and storage protein composition and
content, and the secondary metabolism of antioxidant lignans [37-40].

Sesame grows well and gives good yields in both tropical and temperate climates. Its
tolerance of drought and high temperatures makes sesame well suited to land where few
other crops can survive. However, compared with other oilseed crops, sesame seed production
is not consistent, as it is susceptible to pathogens, waterlogging and low temperature
conditions [41]. Sesame breeding objectives, like those for other seed-producing crops, especially
oil crops, are to create new varieties with high quality and yield potential, and
resistance to pathogens (including Fusarium wilt and charcoal rot diseases), insect pests, waterlogging, drought and low temperature
stress [37,42-45]. However, identification of genes or gene families and marker loci associated with
yield, quality, and resistance to disease and abiotic stresses has been hampered
by a lack of information on the sesame genome. Only a few functional genes, mainly
involved in the formation and regulation of fatty acids, seed storage proteins and
secondary metabolites, and salt stress response, have been investigated [46-54]. With the exception of a single amplified fragment length polymorphism (AFLP) marker
associated with the indehiscent-capsule trait reported in 2003 [55], no quantitative trait loci have been found in the linkage map of sesame, let alone
used for marker-assisted selection (MAS) in sesame breeding programs. Integrating
desirable qualities from the few available excellent germplasm resources, including
wild species, will not be achievable rapidly unless considerably more genomic and
functional genomic information is available. In addition, sequencing of the sesame
genome will facilitate studies of other genera of the Pedaliaceae family by providing
a closely related reference genome.

We therefore plan to implement a Sesame Genome Project and sequence the S. indicum genome using the Chinese domestic cultivar, Yuzhi 11, which represents S. indicum cultivars with a simple stem, three flowers per axilla, oblong-quadrangular capsules,
and white flower and seed-coat color. Yuzhi 11 is one of the most important Chinese
cultivars due to its high oil content (56.66%), resistance to fungal diseases such
as Fusarium wilt, charcoal rot and Alternaria leaf spot, and tolerance of waterlogging stress. It is cultivated in the main production regions
of China [56,57].

Phylogenetic position of sesame

S. indicum is placed in the asterid clade of the core eudicotyledons in the Angiosperm Phylogeny
Group II (APG II) classification [58]. Its phylogenetic position, determined using sesame chloroplast genomic data, indicates
that Sesamum (Pedaliaceae family) is a sister genus to the Olea and Jasminum (Oleaceae family) clade and represents the core lineage of the Lamiales families [59]. Among the 19 families shown in Figure 1 (adapted from the NCBI taxonomy database [60]), which together account for the 36 available genomes, Sesamum is closely related to the Solanaceae and Phrymaceae
families, but distantly related to other oil crops such as soybean (Glycine max), castor (Ricinus communis) and rape (Brassica rapa). At present, genomic information on the Pedaliaceae family is quite limited, as
genomes from this family have not previously been sequenced.

Figure 1. Phylogenetic positions of sesame and the 36 land plants with available genome sequences. (a) Refers to sesame (S. indicum L.), a member of the Pedaliaceae family, only 34 genera of which have been entered
in the NCBI taxonomy database.

Overview of the Sesame Genome Project

The Sesame Genome Working Group (SGWG) comprises six major sesame research teams in
China involved in investigating genetic diversity of germplasm resources, functional
genomics, and biotic and abiotic resistance, in addition to sesame genome sequencing.
All members of the SGWG work under the Toronto Statement for prepublication data release
[61]. The main goal of the Sesame Genome Project is to provide a fine map of S. indicum and facilitate global genomic and functional genomic studies. We have already released
a preliminary draft assembly [62] of the sesame genome that can be used according to the conditions outlined in this
letter. A detailed plan for the Sesame Genome Project has been made available on our
website [62].

Properties of the S. indicum genome and available genomic resources

Natural sesame species can be divided into three types based on chromosome numbers,
that is, 2n = 26 (for example, S. indicum, S. alatum), 2n = 32 (for example, S. prostratum, S. angolense) and 2n = 64 (for example, S. radiatum, S. schinzianum) [14,37]. The basic chromosome number in the Sesamum genus is either X = 8 or X = 13, with X = 13 probably resulting from ancient polyploidy [37]. The size of a haploid genome of S. indicum (2n = 26) was reported to be about 0.95 Gb, with a mass of 0.97 pg [63], which is out of proportion with the 0.51 Gb and 0.97 Gb of Ceratotheca sesamoides (2n = 32) and S. radiatum (2n = 64), respectively [64]. Before beginning this genome project, we examined the characteristics of sesame
chromosomes using cv. Yuzhi 11. Results showed that its karyotype formula is 2n =
2x = 26 = 6m + 16sm + 4st, and chromosome length ranges from 1.21 to 2.48 μm (H Zhang,
unpublished data). We distinguished and numbered the chromosomes with 45S rRNA, simple
sequence repeats (SSR) and bacterial artificial chromosome (BAC) sequence probes using
fluorescent in situ hybridization (FISH) and BAC-FISH techniques to facilitate super-scaffold assembly
in the sesame genome (H Zhang, unpublished data). Using the genomes of
Arabidopsis thaliana [65], soybean (cv. Williams 82) [66] and rice (cv. Nipponbare) [67] as size references, we estimated the genome size of S. indicum cv. Yuzhi 11 by flow cytometry to be about 369 Mb (H Zhang, unpublished
data). From our preliminary sequencing data, we estimate the genome size to be approximately
354 Mb, close to this result (see below).

The sesame chloroplast genome was published recently [59]. Sequencing of the chloroplast genome of S. indicum cv. Yuzhi 11 has also been performed (H Zhang, unpublished data), and will be used
for raw read filtering and genome assembly in our Sesame Genome Project. A total of
86,222 unigenes with an average length of 629 bp are available and 46,584 (54.03%)
unigenes have a significant similarity with proteins in the NCBI nonredundant protein
database and Swiss-Prot database (E-value < 10^-5) [39]. Before the beginning of this project, we sequenced sesame transcriptomes from 24
groups of S. indicum materials and treatments using Illumina paired-end sequencing technology to greatly
enrich available information on the functional genome [40,68], obtaining a 40G dataset containing 42,566 unitranscript sequences. We also constructed
a BIBAC (pCLD 04541) library of 80,000 clones with an insert size of 120 kb and a
BAC (CopyControl™ pCC1BAC™) library of 57,600 clones with an insert size of 85 kb.
The genome coverage of the BIBAC and BAC libraries was 27- and 13-fold, respectively (H Zhang,
unpublished data). There are 45,093 S. indicum expressed sequence tags (ESTs) available in the NCBI EST database. Prior to our work,
only two other S. indicum seed-specific cDNA libraries, including one full-length cDNA library, had been constructed,
some clones of which were chosen at random and sequenced [38,69]. In order to explore more genes involved in sesame growth and development, we constructed
a full-length cDNA library of S. indicum cv. Yuzhi 11 containing 300,000 clones, 1,200 clones of which were selected randomly
and sequenced (H Zhang, unpublished data). The genomic and transcriptomic data from
these studies should facilitate genome assembly and analysis. The first sesame linkage
map, which contains 284 microsatellite polymorphic loci, was constructed in 2009 and has
been used as a framework of landmarks for assembly of the whole genome [70]. We recently updated this high-density linkage map with 653 SSR, SNP, AFLP and random
selective amplification of microsatellite polymorphic loci (RSAMPL) markers falling
into 14 linkage groups to facilitate sesame genome assembly and anchoring of trait
loci (H Zhang, unpublished data).
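
The fold coverage quoted for the BIBAC and BAC libraries in this section can be checked with a simple relation: number of clones multiplied by mean insert size, divided by genome size. The following is a minimal sketch, assuming the approximately 354 Mb sequence-based genome size estimate as the denominator; the function name and the choice of denominator are illustrative assumptions, not part of the project pipeline.

```python
def library_coverage(n_clones: int, insert_bp: float, genome_bp: float) -> float:
    """Fold genome coverage of a clone library: total insert length / genome size."""
    return n_clones * insert_bp / genome_bp


# Assumption: genome size taken as the ~354 Mb k-mer based estimate for cv. Yuzhi 11.
GENOME_BP = 354e6

bibac_cov = library_coverage(80_000, 120e3, GENOME_BP)  # BIBAC library, 120 kb inserts
bac_cov = library_coverage(57_600, 85e3, GENOME_BP)     # BAC library, 85 kb inserts

print(f"BIBAC library: {bibac_cov:.1f}x")
print(f"BAC library:   {bac_cov:.1f}x")
```

With these inputs the values come out close to 27-fold for the BIBAC library and a little under 14-fold for the BAC library, broadly in line with the reported 27- and 13-fold coverage; using the 369 Mb flow cytometry estimate instead lowers both figures slightly.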

Sequencing strategy for the S. indicum genome

The Sesame Genome Project is divided into three phases. The first phase, which has
already been completed, involves high coverage Illumina sequencing and draft genome
assembly. We constructed five types of Illumina libraries, including two paired-end
libraries with insert sizes of 300 and 500 bp, and three mate-pair libraries with
insert sizes of 2, 3 and 5 kb. In order to avoid bias in library construction, at
least two libraries for each insert length were constructed. Illumina technology was
used to generate 98 Gb of reads, giving a 276× coverage of the estimated genome (Table
1). Subsequently, the draft genome was assembled using ABySS (v 1.3.3) [71]. Paired-end Illumina reads were first assembled into contigs. Mate-pair reads with
insert sizes of 2, 3 and 5 kb were then aligned to the contigs, and the relationship
between mate-pair reads was used to join contigs and construct scaffolds. As a result,
a preliminary assembly of 293.7 Mb was generated (Table 2).
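
The depth figure for the first phase is the total read length divided by the estimated genome size. A minimal sketch of that arithmetic follows, assuming the approximately 354 Mb estimate as the genome size; it also reports what fraction of that estimate the 293.7 Mb draft assembly represents. Variable and function names are illustrative.

```python
def fold_coverage(total_read_bp: float, genome_bp: float) -> float:
    """Average sequencing depth: total sequenced bases / genome size."""
    return total_read_bp / genome_bp


GENOME_BP = 354e6      # assumption: k-mer based genome size estimate
ILLUMINA_BP = 98e9     # total Illumina read length reported for phase one
ASSEMBLY_BP = 293.7e6  # length of the preliminary assembly

depth = fold_coverage(ILLUMINA_BP, GENOME_BP)
assembled_fraction = ASSEMBLY_BP / GENOME_BP

print(f"Sequencing depth:   {depth:.0f}x")
print(f"Assembled fraction: {assembled_fraction:.1%}")
```

Under this assumption the depth is just under 277×, consistent with the reported 276×, and the draft assembly accounts for roughly 83% of the estimated genome.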

The second phase will involve Roche 454 pyrosequencing, BAC sequencing and fine
map construction. We have constructed Roche 454 paired-end libraries with an insert
size of 20 kb and will generate 3.5 Gb of data giving a 250× coverage of the estimated
genome. We also plan to end-sequence 40,000 sesame BAC clones using conventional Sanger
sequencing, giving a 12× coverage of the estimated genome. To ensure hybrid de novo assembly of the best possible quality, we will use a modified Celera Assembler pipeline
[72]. Roche 454 paired-end reads and BAC-end reads are better for spanning longer repetitive
elements and joining scaffolds into superscaffolds. We will use BAC-end information
to retrieve and select 1,000 specific BAC clones, one end of which aligns well to
the scaffold while the other end is located in a gap region, for full-length sequencing
using the Illumina BAC pooling method. The full-length BAC sequences will fill in
the gaps within superscaffolds and greatly improve genome integrity. At this stage,
we expect to obtain a fine map of Yuzhi 11 with 800 to 1,000 superscaffolds of a putative
N50 length of 1 Mb and N90 length of 250 kb.
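
The N50 and N90 targets above are length-weighted statistics: N50 is the length of the shortest scaffold in the set of longest scaffolds that together cover at least 50% of the assembly, and N90 is the same with a 90% threshold. A minimal sketch using the conventional definition is given below; the function name and the toy scaffold lengths are hypothetical and are not project data.

```python
def nxx(lengths, fraction):
    """Return the Nxx statistic (e.g. fraction=0.5 for N50) of a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches the threshold is Nxx.
        if running >= threshold:
            return length
    raise ValueError("no sequence lengths supplied")


# Hypothetical scaffold lengths (arbitrary units), total = 300.
toy_scaffolds = [100, 80, 60, 40, 20]

n50 = nxx(toy_scaffolds, 0.5)
n90 = nxx(toy_scaffolds, 0.9)
print(f"N50 = {n50}, N90 = {n90}")
```

Because N90 requires covering a larger share of the assembly, it is never larger than N50, which is why the expected N90 (250 kb) is well below the expected N50 (1 Mb).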

In the final phase, the superscaffolds will be anchored to chromosomes. We will first
anchor the BACs containing mapped SSR markers on the updated linkage map [70] (H Zhang, unpublished data). Physical distances between landmarks will then be determined.
Furthermore, we will construct a physical chromosome map based on at least 1,000 BAC
clones using information obtained from BAC-FISH and BAC-end sequencing. At least one BAC will
be anchored on the chromosomes per superscaffold to ensure all superscaffolds are
anchored onto the 13 chromosomes. In order to validate the accuracy and integrity
of the sesame genome assembly, several quality control parameters, such as read depth
of coverage, average quality values per contig, discordant read pairs and gene footprint
coverage, will be examined. To check the accuracy of the assembly of scaffolds, we
will also complete full-length sequencing of 15 BAC clones using conventional Sanger
sequencing and align them to the scaffolds.

Timeline and goals of the Sesame Genome Project

The blueprint for the Sesame Genome Project was conceived and designed by the SGWG
in 2009. We completed the goals of the first phase in March 2012. In the second phase,
Roche 454 paired-end reads will be sequenced by December 2012, and the double-ended
sequencing of the 40,000 BAC clones and full-length sequencing of 1,000 BAC clones
will be completed by June 2013. The final phase of scaffold anchoring will proceed
in parallel with bioinformatics analysis. We expect to complete all the goals of the Sesame
Genome Project and submit a paper by December 2013. To make our data broadly available
prior to publication, the completion of each goal of these phases will be publicly
communicated via our website [62]. Updated versions of assembly data will be made available to any independent research
groups performing non-genome-scale analyses. Sequence data and the preliminary assembly
produced in the first phase are already available on the website.

Status of current preliminary genome assemblies

The current draft assembly of Yuzhi 11 is 293.7 Mb in length, with a GC content of
34.65%. The N50 and N90 sizes of the scaffolds are 22.6 kb and 4.3 kb, respectively
(Table 2). Genome size was estimated to be 354 Mb using the well-established 17-mer method
[73], in line with flow cytometry data that suggest it is 369 Mb (H Zhang, unpublished
data). The 17-mer distribution frequency in 16.77 Gb of trimmed Illumina PE reads
was calculated using Jellyfish (v1.1.4) [74]. We identified a total of 13,931,658,332 k-mers, of which 87,207,553 k-mers
had a frequency <10. The peak k-mer frequency was 39 (Figure 2).

Figure 2. K-mer (17-mer) frequency analysis of the S. indicum genomic sequence. Data produced from 500 bp insert libraries. The peak k-mer frequency is 39 and its
minimum point is 10. Genome size was estimated with the formula: Estimated genome
size (bp) = total number of k-mers with a frequency >10/peak k-mer frequency.
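
The formula in the Figure 2 legend can be applied directly to the counts reported in the text. The sketch below is a minimal illustration, assuming that 13,931,658,332 is the total number of 17-mer occurrences and that the 87,207,553 low-frequency k-mers (likely sequencing errors) are subtracted before dividing by the peak frequency; this reading of the two counts is an interpretation, and the function name is hypothetical.

```python
def estimate_genome_size(total_kmers: int, low_freq_kmers: int, peak_depth: int) -> float:
    """Genome size (bp) = k-mers above the low-frequency cutoff / peak k-mer frequency."""
    return (total_kmers - low_freq_kmers) / peak_depth


TOTAL_KMERS = 13_931_658_332  # reported 17-mer count (read here as total occurrences)
LOW_FREQ_KMERS = 87_207_553   # reported k-mers with frequency below 10
PEAK_DEPTH = 39               # reported peak k-mer frequency

size_bp = estimate_genome_size(TOTAL_KMERS, LOW_FREQ_KMERS, PEAK_DEPTH)
print(f"Estimated genome size: {size_bp / 1e6:.1f} Mb")
```

Under these assumptions the estimate is about 355 Mb, which agrees with the reported value of approximately 354 Mb and with the 369 Mb flow cytometry estimate to within a few percent.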

In order to determine the frequency and complexity of repetitive elements in the draft
assembly, we compared the assembly information with the Arabidopsis repetitive elements database from the RepeatMasker library (version 20120418) and
the sesame de novo database constructed for the Yuzhi 11 draft assembly (RepeatModeler, version 1.0.5)
using RepeatMasker (version open-3.2.9) [75,76]. Thirty-eight percent of the draft assembly was identified as repetitive elements
(Table 3), only approximately 5.7% of which shared homology with the Arabidopsis database.

Table 3. Repeats derived from de novo and homology-based predictions in S. indicum

Quality control of the raw data and intermediate datasets

In order to control the quality of raw data, the SolexaQA package was used to verify
the sequence data generated from each of the 17 Illumina-Solexa libraries [77]. The raw reads were trimmed by DynamicTrim (quality threshold Q ≈ 20) and then filtered
by LengthSort (length cutoff set to 25). Unpaired reads were screened out and
discarded at this stage. Roche 454 read data, which are stored in Standard
Flowgram Format (SFF), were converted into FASTQ format and evaluated using conventional
quality metrics. As Sanger reads may contain vector sequences, the Lucy package was
used to detect and trim vector sequence contamination [78]. Low-quality bases and chimeric reads are handled by the trimming modules of the Celera
Assembler.
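
The Illumina read filtering described above (quality trimming, a length cutoff and removal of unpaired reads) can be illustrated with a short sketch. This is not the SolexaQA implementation: it is a simplified stand-in that keeps the longest contiguous run of bases with Phred quality at or above the cutoff, then drops a pair if either mate is shorter than the length cutoff. The function names, the inclusive comparison and the example reads are assumptions made for illustration.

```python
def dynamic_trim(quals, cutoff=20):
    """Return (start, end) of the longest contiguous run with quality >= cutoff."""
    best = (0, 0)
    start = None
    for i, q in enumerate(list(quals) + [-1]):  # sentinel closes a trailing run
        if q >= cutoff:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def trim_and_filter_pair(read1, read2, cutoff=20, min_len=25):
    """Trim both mates; return the trimmed pair, or None if either mate is too short."""
    trimmed = []
    for seq, quals in (read1, read2):
        s, e = dynamic_trim(quals, cutoff)
        if e - s < min_len:
            return None  # one mate failed the length cutoff, so the pair is discarded
        trimmed.append((seq[s:e], quals[s:e]))
    return tuple(trimmed)


# Hypothetical reads: (sequence, per-base Phred scores).
good = ("ACGT" * 10, [30] * 40)
short = ("ACGT" * 10, [30] * 20 + [5] * 20)

kept = trim_and_filter_pair(good, good)
dropped = trim_and_filter_pair(good, short)
print(kept is not None, dropped is None)
```

Discarding the whole pair when one mate fails keeps the two read files synchronized, which matters for downstream assemblers that rely on pairing information.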

We validated the coding region coverage of the draft assembly using two different
gene footprint coverage methods. Using the Core Eukaryotic Genes Mapping Approach
(CEGMA) [79], 444 (96.9%) of the 458 core eukaryotic genes (CEGs) mapped against the draft assembly
were identified. An RNA-Seq-based method employing Velvet [80] and Oases [81] allowed us to assemble 3.5 Gb of RNA-Seq reads (NCBI accession SRX061117) [39] into 99,589 putative transcripts. Putative transcripts were then translated into
82,549 peptides using ESTScan (version 2.1) [82]. These peptides were aligned against the SWISS-PROT [83] database using BLAST (E-value cutoff 10^-5) to obtain high-confidence peptides. Redundant peptides (such as alternative-splicing
transcripts) were filtered according to BLAST scores and the names of the hits. More
than 99.5% of the 3,584 peptides obtained could be aligned to the draft assembly using
GMAP [84]. The above results indicate that the draft assembly has a high coverage of the coding
region.

Gene prediction for the draft assembly was performed using InchWorm [85]: 3.5 Gb of RNA-Seq reads [GenBank: SRX061117] were assembled into 472,257 contigs
and mapped to the draft genome using GMAP. The GMAP mapping results were used as a
training set for ab initio prediction using AUGUSTUS [86]. As a result, 23,713 gene models were obtained with a total length of 28 Mb (Table
4). Average coding sequence length was 1.2 kb and average GC content was 45%. We obtained
functional annotations of all genes using InterProScan [87], which also determines motifs and domains. Gene Ontology (GO) annotations were given
to 10,656 genes using corresponding InterPro entries and the Pfam database [88]. Visualization of the functional categories of these 10,656 genes was performed using
WEGO [89] (Figure 3).

Figure 3. Functional categories of sesame genes in the preliminary assembly. Results are summarized in three main categories: biological processes, cellular
components and molecular functions. A total of 10,656 genes have been assigned
Gene Ontology terms.

Biological questions to be addressed

We plan to address several key biological questions specific to sesame using this
new genome and transcriptome data. We will compare the sesame genome with the genomes
of monocotyledonous and other dicotyledonous plants to elucidate the phylogeny of
the Sesamum genus and the origin of S. indicum. We will also perform more detailed investigations on the formation and regulation
of fatty acids, storage proteins and secondary metabolites (including sesamin) in
sesame. We will apply the biological information obtained in this genome project in sesame
breeding programs, paying particular attention to the induction and regulation of
resistance to the main sesame diseases, including Fusarium wilt and charcoal rot diseases, and the environmental stress of waterlogging. Other
possible uses of the genomics dataset, such as determining the regulatory mechanisms
of biological characteristics in Sesamum, including simple stem or branch, leaf shape, indeterminate growth habit, flower
number per axilla, capsule carpel number, flower color and other species-specific
traits, will not form part of our analysis. We believe that the main achievement of
this project will be to markedly accelerate sesame genetic research and breeding.
Members of the SGWG also hope to address additional questions about the relationship
between sesame growth and environmental conditions, such as identifying which genes
regulate low temperature responses and drought sensitivity.

Joining the SGWG and using our early release data

This project is being conducted by the SGWG. We invite other research groups to access
and use the draft assembly and raw read data, which have already been released. Any
group performing non-genome-scale analyses, or investigating the above biological
questions, is welcome to use our data without restriction. As a matter of courtesy
and to avoid duplication of effort, we request that competing genome-scale projects
or studies that overlap with the above stated research areas disclose their status
to the SGWG consortium. Formal inquiries and requests to join the working group should
be made to HZ. Updated versions of the genome assembly, further project descriptions
and a complete list of current SGWG members dedicated to this project can be accessed
on our website [62].

Acknowledgements

This work was supported by the earmarked fund for the China Agriculture Research System
(CARS-15), China National '973' Project (2011CB109304), and Henan Zhongyuan Scholar
Fund (092101211100) to HZ. HM was supported by a grant from the China National Key
Technology R & D program (2009BADA8B04-03) and the earmarked fund for China Agriculture
Research System (CARS-15). HL, QW and MY were individually supported by the earmarked
fund for China Agriculture Research System (CARS-15). Special thanks to Dr Joy Fleming
for helpful discussions and suggestions in the manuscript revision process.

Kobayashi T: The wild and cultivated species in the genus Sesamum. In Sesame: Status and Improvement. Proceedings of Expert Consultation: 8-12 December
1980; Rome. Edited by Amram A. Rome; 1981:157-163.