Describe what is meant by data clustering and how it can be used for the analysis of gene expression matrices.

Lecture 16 slide #3-5

Clustering of data is a method by which a large set of data is grouped into clusters (groups) of smaller sets of similar data. In the context of gene expression matrices, where rows represent genes and columns represent measurements of gene expression values for samples under different conditions, clustering algorithms can be applied to find groups of similar genes, groups of similar samples, or both.

Describe what is meant by a cluster centroid and what is meant by similarity metrics.

Lecture 16 slide # 6-14, 26

The centroid is taken to be a "virtual" representative object for a cluster. Mathematically, it can be calculated as a point in an M-dimensional space whose parameter values are the mean of the parameter values of all the points in the cluster

(where M is the number of features, parameters or dimensions used for describing each object).

It is a virtual object, since there does not need to be a real object in the cluster with the calculated values.

A similarity metric is a method used for quantifying the similarity between two objects. We typically represent objects as points in an M-dimensional space. Generally, the distance between two points is taken as a common metric to assess the similarity between them. The most commonly used distance metric is the Euclidean metric, which defines the distance between two points p = (p1, p2, ..., pM) and q = (q1, q2, ..., qM) as:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pM - qM)^2)

Other metrics include the Manhattan distance, which is calculated as follows:

d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pM - qM|
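As a quick sanity check, the two metrics above can be written directly in Python (a minimal sketch; the 2-D example points are just illustrative):

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two M-dimensional points."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    """Manhattan (city-block) distance between two M-dimensional points."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```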

Course 341: Introduction to Bioinformatics

2004/2005, 2005/2006, 2006/2007

Moustafa Ghanem

Imperial College London



Make sure to describe the properties of a good similarity metric.

1. The distance between two profiles must be greater than or equal to zero; distances cannot be negative.

2. The distance between a profile and itself must be zero.

3. Conversely, if the distance between two profiles is zero, then the profiles must be identical.

4. The distance between profile A and profile B must be the same as the distance between profile B and profile A.

5. The distance between profile A and profile C must be less than or equal to the sum of the distance between profiles A and B and profiles B and C.
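Properties 1, 2, 4 and 5 can be spot-checked mechanically for the Euclidean metric (a sketch on random 3-D points; property 3, that zero distance implies identical profiles, also holds for the Euclidean metric but is awkward to probe with random data):

```python
import itertools
import random
from math import sqrt

def dist(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

random.seed(0)
points = [tuple(random.uniform(-10, 10) for _ in range(3)) for _ in range(15)]

for a, b, c in itertools.combinations(points, 3):
    assert dist(a, b) >= 0                                # property 1
    assert dist(a, a) == 0                                # property 2
    assert dist(a, b) == dist(b, a)                       # property 4
    assert dist(a, c) <= dist(a, b) + dist(b, c) + 1e-9   # property 5
print("all spot-checks passed")
```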



Make sure to provide formulae for two different similarity metrics that can be used in data clustering.

Provided above: Euclidean and Manhattan.

3. Describe the operation of the hierarchical clustering algorithm.

Lecture 16 slide # 16-24

Hierarchical clustering is a method that successively links objects with similar profiles to form a tree structure. The standard hierarchical clustering algorithm works as follows:

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.

3. Compute the distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
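The four steps above can be sketched in Python (a naive, illustrative version that rescans all cluster pairs each round; passing min as the linkage function gives single linkage, max gives complete linkage):

```python
from math import sqrt

def hierarchical(points, linkage=min):
    """Naive agglomerative clustering; returns the merge history."""
    d = lambda p, q: sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    clusters = [[i] for i in range(len(points))]  # step 1: one item per cluster
    merges = []
    while len(clusters) > 1:
        # step 2: find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = linkage(d(points[a], points[b])
                              for a in clusters[i] for b in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        dij, i, j = best
        merges.append((clusters[i], clusters[j], dij))  # record the merge
        clusters[i] = clusters[i] + clusters[j]         # step 3: merge the pair
        del clusters[j]
    return merges  # step 4: loop until a single cluster of size N remains

# Toy 1-D example: items at positions 0, 1 and 5
history = hierarchical([(0,), (1,), (5,)])
print(history)  # [([0], [1], 1.0), ([0, 1], [2], 4.0)]
```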



Make sure you explain what is meant by a similarity matrix

At each step of the algorithm, we need to compute a similarity matrix (or alternatively a distance matrix) which represents the similarity (alternatively the distance) between the N objects being clustered. At each step you use the matrix to find the two elements with maximum similarity (alternatively minimum distance). The two elements are merged into one element and the matrix is recalculated. The matrix is thus updated during the operation of the algorithm by reducing it to a smaller matrix at each step. You start with an NxN matrix, then an (N-1)x(N-1) matrix, and so on.



Make sure you explain what is meant by single linkage, average linkage and complete linkage

Linkage methods refer to how the distance between clusters (groups of objects) is calculated. Whereas it is straightforward to calculate the distance between two objects, we do have various options when calculating the distance between clusters. These include the single linkage, average linkage and complete linkage methods.

In single linkage we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

In complete linkage we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

In average linkage we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
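The three definitions translate directly into code. A sketch, using as an example the A/G/B situation from the worked problem later in this sheet (where dist(A,B) = 8.6 and dist(G,B) = 8.1):

```python
def single_linkage(c1, c2, dist):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(dist[a][b] for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    """Longest distance between any member of c1 and any member of c2."""
    return max(dist[a][b] for a in c1 for b in c2)

def average_linkage(c1, c2, dist):
    """Average distance over all pairs drawn from c1 and c2."""
    return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

dist = {"A": {"B": 8.6}, "G": {"B": 8.1}}
print(single_linkage(["A", "G"], ["B"], dist))    # 8.1
print(complete_linkage(["A", "G"], ["B"], dist))  # 8.6
print(average_linkage(["A", "G"], ["B"], dist))
```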



Make sure you explain what is meant by a dendrogram.

Dendrograms are used to represent the outputs of hierarchical clustering algorithms. A dendrogram is a binary tree structure whose leaf elements represent the data elements, which are joined up the tree based on their similarity. Internal nodes represent sub-clusters of elements. The root node represents the cluster containing the whole data collection.

The length of each tree branch represents the distance between the clusters it joins.

4. Describe the operation of the k-means clustering algorithm using pseudocode.

Lecture 16 slide # 26

Given a set of N items to be grouped into k clusters:

1. Select an initial partition of k clusters.

2. Assign each object to the cluster with the closest centroid.

3. Compute the new centroids of the clusters.

4. Repeat steps 2 and 3 until no object changes cluster.
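The pseudocode above can be sketched as a short Python function (a minimal illustration in which the initial partition of step 1 is supplied as seed centroids; the toy 1-D data is hypothetical):

```python
def kmeans(points, k, seeds):
    """Minimal k-means sketch; `seeds` are the k initial centroids."""
    centroids = list(seeds)
    assignment = None
    while True:
        # Step 2: assign each object to the cluster with the closest centroid
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - m) ** 2 for p, m in zip(pt, centroids[c])))
            for pt in points
        ]
        # Step 4: stop once no object changes cluster
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))

# Toy 1-D example: two obvious groups of ages
cents, labels = kmeans([(1,), (2,), (10,), (11,)], 2, [(1.0,), (10.0,)])
print(cents, labels)  # [(1.5,), (10.5,)] [0, 0, 1, 1]
```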

5. Compare and contrast the advantages of hierarchical clustering and k-means clustering.

Lecture 16 slide # 34

The table in the slides provides the required comparison from a computational perspective. In general, hierarchical clustering is more informative since it provides a more detailed output showing the similarity between individual items in the data set. However, its space and time complexity are higher than k-means clustering, since you need to start with an NxN matrix; in k-means you don't. Also, the output of k-means may change based on the seed clusters, so it can generate different results each time you execute it.

6. Explain briefly the operation of the SOM algorithm and how it relates to the k-means algorithm.

Lecture 16 slide # 35


7. Explain briefly what is meant by dimensionality reduction and why it may be important in data analysis.

Lecture 16 slide # 36

8. Explain briefly how both MDS and PCA work and compare them.

Lecture 16 slide # 37-30

9. What is the main difference between clustering and classification?

In classification you already know the groups that the data is divided into; this is provided by a label (e.g. diseased vs. healthy), and you are trying to find a model in terms of the dimensions (d1…dm) that can predict the class. This type of analysis is useful for predictive modelling.

In clustering you are trying to divide the data into groups based on the values of their dimensions. You choose these groups so as to maximise the similarity inside the groups and maximise the distance between them. This type of analysis is useful for exploratory analysis.

(Problems)

10. If you use k-means clustering on the data in the table below to group the following people by age into 3 groups, how many steps would it take the algorithm to converge if you start with centroids defined by Andy, Burt and Claire? How may

Use hierarchical clustering on the data of question 11 using a Euclidean metric¹ in the following cases:

a. Using single linkage

b. Using complete linkage

Make sure to show the values of your distance matrix at each step

I build a matrix based on distance (not similarity), so at each step I scan for the minimum value. If I used a similarity matrix, I would have to choose the maximum value.

a. Using single linkage

Note that I only have to calculate the distances once; I will operate only on this matrix from now on.

A and G are the most similar items, so I merge them to get the first link between two elements. I draw the connection and label the length on the scale bar.

I need to update the matrix: I delete the row and column for A and the row and column for G. I insert a new row and column called AG. The entries for AG need to be calculated. Since I use single linkage, I keep the minimum value for (AG, B), i.e. min(dist(A,B), dist(G,B)) = min(8.6, 8.1) = 8.1, the distance from G to B. All other entries that do not involve AG remain the same. The updated values are shown in italics.

I repeat the process. This time I have a choice, since the distance between F and B is 1.4, the distance between F and H is also 1.4, and so is the distance between C and D. I arbitrarily choose to link F and B together.

¹ This is a rather large problem to solve by hand, but it is given to show how you can do it.

     A     B     C     D     E     F     G     H
A    X   8.6  19.1    20  11.7   8.5     1   8.6
B    X     X  12.3  13.6   3.2   1.4   8.1   2.8
C    X     X     X   1.4   9.8  13.6    19  14.9
D    X     X     X     X  11.2  14.9    20  16.2
E    X     X     X     X     X     4  11.2   5.1
F    X     X     X     X     X     X   7.8   1.4
G    X     X     X     X     X     X     X   7.8
H    X     X     X     X     X     X     X     X

      A-G     B     C     D     E     F     H
A-G     X   8.1    19    20  11.2   7.8   7.8
B       X     X  12.3  13.6   3.2   1.4   2.8
C       X     X     X   1.4   9.8  13.6  14.9
D       X     X     X     X  11.2  14.9  16.2
E       X     X     X     X     X     4   5.1
F       X     X     X     X     X     X   1.4
H       X     X     X     X     X     X     X

[Dendrogram sketches: A and G joined at height 1 on the scale bar; then F and B joined as well.]


I repeat and now link BF and H.

Now I link C and D.

I now link BFH and E.

      AG   B-F     C     D     E     H
AG     X   7.8    19    20  11.2   7.8
B-F    X     X  12.3  13.6   3.2   1.4
C      X     X     X   1.4   9.8  14.9
D      X     X     X     X  11.2  16.2
E      X     X     X     X     X   5.1
H      X     X     X     X     X     X

       AG  BF-H     C     D     E
AG      X   7.8    19    20  11.2
BF-H    X     X  12.3  13.6   3.2
C       X     X     X   1.4   9.8
D       X     X     X     X  11.2
E       X     X     X     X     X

      AG   BFH   C-D     E
AG     X   7.8    19  11.2
BFH    X     X  12.3   3.2
C-D    X     X     X   9.8
E      X     X     X     X

[Dendrogram sketches: the partial tree over A, G, F, B, H, and then with C and D joined, showing the links made so far.]


I now link AG and BFHE, and then the final cluster AGBFHE and CD, giving me the final dendrogram shown below. Compare this to the scatter plot shown in the previous problem and see if it makes sense.

Here is the dendrogram generated by the KDE data mining tools.

        AG  BFH-E    CD
AG       X    7.8    19
BFH-E    X      X   9.8
CD       X      X     X

           AG-BFHE    CD
AG-BFHE          X   9.8
CD               X     X

[Final single-linkage dendrogram: leaves F, B, H, E, A, G, C, D, with merge heights of roughly 1, 3, 8 and 10 on the scale bar.]


b) For complete linkage, we do the same thing, but when updating the matrix we choose the maximum distance between clusters rather than the minimum distance.

I still start by choosing A and G, starting with the same matrix, since they still have the minimum distance.

Now when updating the matrix, I set the distance between AG and B to be the maximum of dist(A,B) and dist(G,B), i.e. 8.6 rather than 8.1 as in the previous case.

I choose to merge B and F since they have the minimum distance.

I choose to merge BF and H, etc.

Here is the dendrogram generated by the KDE data mining tool. First compare it to the one above. Then generate your own dendrogram and compare it to the one below.

     A     B     C     D     E     F     G     H
A    X   8.6  19.1    20  11.7   8.5     1   8.6
B    X     X  12.3  13.6   3.2   1.4   8.1   2.8
C    X     X     X   1.4   9.8  13.6    19  14.9
D    X     X     X     X  11.2  14.9    20  16.2
E    X     X     X     X     X     4  11.2   5.1
F    X     X     X     X     X     X   7.8   1.4
G    X     X     X     X     X     X     X   7.8
H    X     X     X     X     X     X     X     X

      A-G     B     C     D     E     F     H
A-G     X   8.6  19.1    20  11.7   8.5   8.6
B       X     X  12.3  13.6   3.2   1.4   2.8
C       X     X     X   1.4   9.8  13.6  14.9
D       X     X     X     X  11.2  14.9  16.2
E       X     X     X     X     X     4   5.1
F       X     X     X     X     X     X   1.4
H       X     X     X     X     X     X     X

      AG   B-F     C     D     E     H
AG     X   8.6  19.1    20  11.7   8.6
B-F    X     X  13.6  14.9     4   2.8
C      X     X     X   1.4   9.8  14.9
D      X     X     X     X  11.2  16.2
E      X     X     X     X     X   5.1
H      X     X     X     X     X     X


13. The following table shows the gene expression values for 8 genes under five types of cancer. You are interested in discovering the similarity relationship between the eight genes.

ID    C1    C2    C3    C4    C5
A      1     1     1     1     2
B      1     2     1     1     1
C     14    15    15    15    15
D     15    15    15    15    15
E     16    16    16    16    16
F      6     6     5     6     6
G      4     4     4     4     4
H      5     5     5     5     5

a. Using Manhattan distance and single linkage, show the resulting dendrogram.

Work out the calculation yourself by hand. When you do it, you will end up with a dendrogram like the one below.

Note that even though there are more dimensions than in the previous problem (five features as opposed to only two), you will mainly be dealing with the same size of distance matrix (8x8), since this is defined by the number of elements being clustered. In general it will be as tedious to solve as the previous one, but get your hand working at it to figure out the pattern of doing it. Clearly, as the computation progresses, the matrix size gets smaller.
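If you want to check your hand calculation, a short script can generate the full 8x8 Manhattan distance matrix between the genes (a sketch; the expression values are copied from the table in question 13):

```python
# Gene expression values copied from the table in question 13
genes = {
    "A": (1, 1, 1, 1, 2),      "B": (1, 2, 1, 1, 1),
    "C": (14, 15, 15, 15, 15), "D": (15, 15, 15, 15, 15),
    "E": (16, 16, 16, 16, 16), "F": (6, 6, 5, 6, 6),
    "G": (4, 4, 4, 4, 4),      "H": (5, 5, 5, 5, 5),
}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Upper-triangular distance matrix, keyed by gene pair
names = sorted(genes)
matrix = {(g, h): manhattan(genes[g], genes[h])
          for i, g in enumerate(names) for h in names[i + 1:]}

# The first single-linkage merge is the closest pair
closest = min(matrix, key=matrix.get)
print(closest, matrix[closest])  # ('C', 'D') 1
```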

b. How would memory storage requirements change if you use complete linkage? If you use average linkage?

In complete linkage the requirements are the same; you just pick values from the initial distance matrix but update them differently.

In average linkage you need to calculate the distance between every pair of elements in both clusters. You would need to keep the initial distance matrix to look up this information, in addition to the one you are updating.

14. Based on the table in question 12, use hierarchical clustering (Manhattan distance and single linkage) to study the similarity between the five cancer types (C1..C5). How can this form of analysis be useful?

The analysis is useful when you want to study similarity between diseases (see question 4 in tutorial 1). Here is the distance matrix for this problem; it is easier to calculate because of the Manhattan distance.

      C1    C2    C3    C4    C5
C1     X     X     X     X     X
C2     2     X     X     X     X
C3     2     2     X     X     X
C4     1     1     1     X     X
C5     2     2     2     1     X
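These entries can be verified mechanically. A sketch, with the columns of the expression table copied as vectors over the genes A..H:

```python
# Columns of the expression table in question 13 (rows A..H, in order)
cancer = {
    "C1": (1, 1, 14, 15, 16, 6, 4, 5),
    "C2": (1, 2, 15, 15, 16, 6, 4, 5),
    "C3": (1, 1, 15, 15, 16, 5, 4, 5),
    "C4": (1, 1, 15, 15, 16, 6, 4, 5),
    "C5": (2, 1, 15, 15, 16, 6, 4, 5),
}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Print every pairwise distance; each should match the matrix above
for a in sorted(cancer):
    for b in sorted(cancer):
        if a < b:
            print(a, b, manhattan(cancer[a], cancer[b]))
```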


There are many different ways to proceed since there are lots of 1s; the dendrogram can take any shape based on which diseases you link up, since the distance that separates them is always 1.