Pathogens such as bacteria and viruses are leading causes of disease worldwide,
which makes it essential to identify them in DNA samples. Instead of analysing raw
DNA sequences, mathematical models based on Variable Length Markov Chains
(VLMCs), known as genomic signatures, make it possible to classify DNA samples
faster than with traditional alignment-based methods. To analyse a set of genomic
signatures, we use clustering, which is an unsupervised machine-learning method.
For the clustering of VLMCs, an accurate and fast similarity measure (distance
function) is needed.
To analyse distance functions and clusters, we define metrics based primarily on
the taxonomic ranks of the underlying organisms. For the distance functions, we
mainly analyse whether the VLMCs within the same taxonomic rank are closest
to each other. For the cluster analysis, we use the silhouette metric to determine
how well separated the clusters are, and we define the average percentages, sensitivity,
and specificity of the captured taxonomic ranks.
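
As a rough illustration of the silhouette metric, the sketch below computes one silhouette value per item from a precomputed distance matrix. The function name `silhouette_scores`, the matrix `toy_dist`, and the labels are hypothetical and not taken from the thesis; the matrix is hand-made and merely stands in for pairwise VLMC distances.

```python
def silhouette_scores(dist, labels):
    """Return one silhouette value per item, given pairwise distances."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical symmetric distance matrix for four signatures (assumption).
toy_dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 5.0],
    [4.0, 4.0, 0.0, 1.0],
    [5.0, 5.0, 1.0, 0.0],
]
toy_labels = ["family_a", "family_a", "family_b", "family_b"]

scores = silhouette_scores(toy_dist, toy_labels)
mean_silhouette = sum(scores) / len(scores)
```

Values close to 1 indicate that an item lies much nearer to its own cluster than to any other cluster, while values near 0 or below indicate overlapping clusters.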
We present a new distance function for VLMCs, called Frobenius-intersection, which
correlates accurately with the well-known Kullback-Leibler distance function, while
also being several orders of magnitude faster. We use average-link clustering together
with the Frobenius-intersection distance to cluster data sets of known viruses
and bacteria with relatively short DNA sequences. The clusters of VLMCs correspond
accurately to the Baltimore types of the viruses as well as the viruses’ and
bacteria’s taxonomic families. However, most of the classifications of viruses are also
subdivided into multiple clusters. Moreover, when combining the set of bacteria and
viruses, the clusters start to mix the viruses and bacteria before finding all of the
taxonomic families.
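
To make the clustering step concrete, here is a minimal sketch of average-link agglomerative clustering on a precomputed distance matrix. It is a naive illustration, not the implementation used in the thesis: the function name `average_link` and the matrix `toy_dist` are hypothetical, and the hand-made distances merely stand in for the Frobenius-intersection distance, whose definition is not given in this abstract.

```python
def average_link(dist, num_clusters):
    """Merge the closest pair of clusters until num_clusters remain.

    The distance between two clusters is the mean of all pairwise
    distances between their members (average linkage).
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # y > x, so deleting index y leaves index x valid
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical symmetric distance matrix for four signatures (assumption).
toy_dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 5.0],
    [4.0, 4.0, 0.0, 1.0],
    [5.0, 5.0, 1.0, 0.0],
]

two_clusters = average_link(toy_dist, 2)
```

Stopping at different values of `num_clusters` corresponds to cutting the cluster hierarchy at different levels, which is one way to see a group split into several clusters before all groups have been separated.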
The clustering of the genomic signatures is accurate with respect to, for instance,
taxonomic ordering. Therefore, it can help in identifying unclassified pathogens.
Future research may reveal other causes of similarity between the genomic signatures.

Create reference, various formats (cut and paste)

BibTeX
@mastersthesis{Gustafsson2018,
  author={Gustafsson, Joel and Norlander, Erik},
  title={Clustering genomic signatures: A new distance measure for variable length Markov chains},
  publisher={Institutionen för data- och informationsteknik (Chalmers), Chalmers tekniska högskola},
  place={Göteborg},
  year={2018},
  keywords={Computer science, Bioinformatics, Master's thesis, Markov chains, Variable length Markov chains, DNA clustering, Genomic signatures, Clustering, Machine learning, Unsupervised learning},
  note={73},
}

RefWorks
RT Generic
SR Electronic
ID 255511
A1 Gustafsson, Joel
A1 Norlander, Erik
T1 Clustering genomic signatures: A new distance measure for variable length Markov chains
YR 2018
PB Institutionen för data- och informationsteknik (Chalmers), Chalmers tekniska högskola
LA eng
LK http://publications.lib.chalmers.se/records/fulltext/255511/255511.pdf
OL 30