gCLUTO (Graphical CLUstering TOolkit) is a graphical front-end for the CLUTO
data clustering library. Its purpose is to make CLUTO's clustering abilities
available through a user-friendly graphical interface. In addition, gCLUTO provides
several ways to interactively visualize clustered results. A copy of gCLUTO
can be found at
http://www.cs.umn.edu/~mrasmus/gcluto. For more information about CLUTO visit
http://www.cs.umn.edu/~karypis/cluto.

Read the README.txt file to locate the correct version of gCLUTO
for your operating system.

Windows users can make a desktop shortcut to gcluto.exe by
locating gcluto.exe in the file manager, right-clicking the icon,
dragging it to the desktop, and choosing "Create Shortcut Here" from the
pop-up menu.

Linux users can create a symbolic link to the gcluto binary
("ln -s gcluto wherever/you/want/the/link"). Place the
symbolic link wherever is most convenient (e.g., ~/bin).

Note: the actual executables (gcluto, gcluto.exe) must stay within their
folders in order to work properly. Do not relocate them.

When clustering data, many pieces of information are involved, such as data
files, clustering solution files, and visualizations. Like many other
applications, gCLUTO uses the concept of a project to organize the
user's data and workflow. When a project has been loaded, its contents will be
displayed in the tree view located at (a) in Figure 3.1.

Each item in the project is presented as an icon in the tree.

Project - This represents the project itself. It is the root of the project
tree.

Data - After importing data into a project, one of these icons will appear in
the project tree. A project can contain many different data items.

Solution - After clustering one of the data items, a solution item will be
created and placed underneath the data item from which it was clustered.

Matrix Visualization - This is a visualization that can be generated after
clustering. All visualizations appear under the solutions they were generated
from.

Mountain Visualization - This is another visualization, which depicts the
inter-relationships of clusters as a 3D terrain.

Right-clicking on any item will bring up a pop-up menu listing the available
operations that can be performed on the item. Double-clicking on any item will
open its contents in a new window called a view, similar to the windows (b),
(c), and (d) in Figure 3.1. While working in one of these views, extra menu
options specific to the view's content will appear in the menu bar.

When gCLUTO first opens it starts with an empty project tree. To begin work, a
new project must be created. To create a new project, go to the menu bar
and choose "File" and then "New Project". A file dialog window will appear.
Specify a name for your project and a location on your computer to save it.

gCLUTO will create a directory, called the project directory, which is named
after the project and stored at the specified location. Within the project
directory, gCLUTO will save all the information related to the project.

To open an existing project, choose the "File" menu and then
"Open Project". A file dialog will appear. Navigate to the location of the
project directory and open it. Within the project directory there will be a
file named "project_name.prj", where project_name will be the name
of the project. Choose this file and click "Open".

After these steps, a project will be loaded and displayed in the project tree.

To import a new data item go to the menu bar and choose "Project" and then
"Import Data". The Import Data dialog will appear allowing the user to
specify the location of a file for each of the file types listed above.
Clicking on a "Browse" button will bring up a file dialog to allow the user to
locate the needed files. Only the *.mat file is required. The user must also
specify whether the *.mat file contains matrix data or graph data by selecting
the appropriate option.

If the *.mat file is chosen first, gCLUTO will try to guess the location of
the optional files (*.rlabel, *.clabel, *.rclass) by appending the extension
onto the *.mat filename. For example, for a file named genes.mat,
gCLUTO will guess genes.mat.rlabel for a row label file. If such a file exists,
gCLUTO will make it the default file to open in the "Browse" file dialog.
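
The guessing rule described above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not gCLUTO's own code; the function name guess_optional_files and the dictionary it returns are hypothetical.

```python
import os
import tempfile

# Optional companion files named in the text above.
OPTIONAL_EXTENSIONS = (".rlabel", ".clabel", ".rclass")

def guess_optional_files(mat_path):
    """Map each optional extension to the guessed path, or None if absent."""
    guesses = {}
    for ext in OPTIONAL_EXTENSIONS:
        candidate = mat_path + ext  # e.g. genes.mat -> genes.mat.rlabel
        guesses[ext] = candidate if os.path.isfile(candidate) else None
    return guesses

# Usage: only genes.mat.rlabel exists next to genes.mat in this temporary folder.
with tempfile.TemporaryDirectory() as folder:
    mat_path = os.path.join(folder, "genes.mat")
    for name in (mat_path, mat_path + ".rlabel"):
        open(name, "w").close()
    guesses = guess_optional_files(mat_path)

print(guesses[".rlabel"] is not None, guesses[".clabel"])
```

Note that the extension is appended to the full *.mat filename rather than replacing ".mat".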

After specifying these files, the user may give a label for the data item.
If no label is given, the data item will be labeled after its *.mat
file with the extension removed. After clicking "OK" in the Import Data
dialog, gCLUTO will attempt to read in the chosen files. If no errors are
encountered, gCLUTO will add the new data item to the project tree and open a
Data View. The Data View allows the user to view the data and verify
that it has been loaded correctly.

If data has been imported using the steps given in 3.3 then
it is ready to be clustered. Clustering can be initiated in two
ways. The first is to choose "Cluster" from the pop-up menu that
appears when you right-click on a data item in the project tree. The second is
to choose "Cluster" from the "Data" menu in the menu bar, which is available
when a Data View is open.

After choosing "Cluster" in either menu a Clustering Options dialog will
appear with all the options available for clustering. These options work
exactly the same as in CLUTO. For an explanation of their meanings see CLUTO's
documentation. Only particular options make sense together. To help make
sensible choices, gCLUTO will automatically update the dialog as the user makes
choices to ensure that only reasonable choices are available.

Once the clustering options are chosen, click "Cluster" in the Clustering
Options dialog. After gCLUTO finishes the clustering calculations it will
respond by creating a solution item under the clustered data item in the
project tree.

gCLUTO will also automatically open a Solution View similar to (b)
in Figure 3.1. This view contains the options used for clustering and several
statistics about the clusters. The report is designed after the report given by
CLUTO. For further explanation of its meaning see CLUTO's documentation. In
addition, the report contains links, similar to a web page. Clicking on these
links allows for quick navigation between related information in large reports.

gCLUTO has been designed to facilitate clustering of the same data multiple
times. If a previously clustered data item is chosen for clustering again, the
Clustering Options dialog will appear with the options that were used the
previous time. To reload the options used for creating a particular solution,
right-click the desired solution item in the project tree and choose "Recluster"
from the pop-up menu. This will bring up the Clustering Options dialog with
the solution's options loaded. This feature eases the process of repeated
adjustments to clustering options.

Currently, gCLUTO contains two visualizations: the Matrix Visualization
and the Mountain Visualization. Visualizations can be generated from
solutions by choosing the desired visualization from the solution menu. This
menu can be found by right-clicking on a solution item in the project tree or in
the menu bar under "Solution" if the user is currently working in a Solution
View.

The Matrix Visualization is similar to the matrix visualization produced by
CLUTO. The former extends the latter by making the matrix interactive. A
detailed explanation of the visualization is given in CLUTO's documentation.

In the Matrix Visualization, the original data matrix is displayed such that
colors are used to graphically represent the values present in the matrix.
gCLUTO uses white to represent values near zero, increasingly darker shades of
red to represent large positive values, and increasingly darker shades of green
to represent negative values. The rows of the matrix are reordered so that rows
of the same cluster are together. Black horizontal dividers separate the
clusters.
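
As a rough illustration of the coloring and row ordering just described, the following Python sketch maps a value to an RGB color and groups rows by cluster. The linear color ramp, the scaling by max_abs, and both function names are assumptions made for this example; the exact shades and the row order within a cluster used by gCLUTO are not specified here.

```python
def value_to_rgb(value, max_abs):
    """Map a matrix value to an (r, g, b) tuple with components in 0..255.

    Assumed linear ramp: 0 -> white, +max_abs -> dark red, -max_abs -> dark green.
    """
    if max_abs <= 0:
        return (255, 255, 255)
    t = min(abs(value) / max_abs, 1.0)   # 0 near zero, 1 at the extreme value
    fade = int(round(255 * (1.0 - t)))   # channels that fade out as |value| grows
    keep = int(round(255 - 127 * t))     # dominant channel darkens from 255 to 128
    if value >= 0:
        return (keep, fade, fade)        # shades of red
    return (fade, keep, fade)            # shades of green

def reorder_rows_by_cluster(rows, cluster_ids):
    """Group rows of the same cluster together (stable within a cluster)."""
    order = sorted(range(len(rows)), key=lambda i: cluster_ids[i])
    return [rows[i] for i in order], [cluster_ids[i] for i in order]

rows = [[0.0, 5.0], [-5.0, 1.0], [2.0, 2.0]]
sorted_rows, sorted_ids = reorder_rows_by_cluster(rows, [1, 0, 1])
print(sorted_ids, value_to_rgb(5.0, 5.0), value_to_rgb(-5.0, 5.0))
```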

If tree building is enabled, the Matrix Visualization will contain trees located
above and to the left of the matrix. If an agglomerative clustering algorithm
was used, the tree generated during clustering is displayed as the Row Tree.
Otherwise, a tree is generated to fit the clustering solution.
The Column Tree is generated by performing
agglomerative clustering on the transpose of the matrix.

If row and column labels were chosen when the data was imported, then they will
appear below and to the right of the matrix. Labels will only show if space is
available to display them.

To help explore the information contained within the Matrix Visualization,
several features have been implemented. First, the size of the matrix can be
scaled in multiple ways. Second, the trees can be used to collapse and expand
areas of interest within the matrix.

3.5.1 Matrix Visualization - Scaling

The easiest way to scale the matrix is with the scaling controls located
directly above the matrix. Scaling can be changed by entering a new size in the
text box, or by clicking on either of the up or down arrows. The control
labeled with "W" controls the width of the matrix and the control labeled "H"
affects the height. These scaling controls change the dimensions of the entire
matrix and are convenient for zooming in and out of areas of interest in the
matrix.

Often the user needs to enlarge one area of the matrix, yet shrink
areas that are not as important. This type of scaling can also be done. To
resize only a portion of the matrix, start by selecting the area to be resized.
Selection is done by clicking on any cell and dragging the mouse to another cell.
These two cells will become the corners of the selected region. Cells that are
selected are shaded blue. To resize the selected region, place the mouse over
any edge of the region. The cursor will change to a resizing cursor. Click and
drag the edge to the desired location. The selected cells will then resize
to fit within the new region.

Lastly, the matrix can be restored to its original scaling by choosing "Matrix"
from the menu bar and then "Reset Sizing". The matrix can also be automatically
scaled to fit the screen by choosing "Fit to Screen" in the "Matrix" menu.

3.5.2 Matrix Visualization - Using the Trees

The Row and Column Trees allow for collapsing and expanding of the matrix.
Blue squares in the tree represent nodes that are fully expanded. Clicking on
any expanded node will collapse it. Collapsed nodes are represented as pink
squares. When a node is collapsed, all of its descendants are hidden. If a
node in the Row Tree is collapsed, all of the rows of the collapsed region are
hidden and replaced with a single row that contains their average. Simply click
a collapsed node to expand it again. The Column Tree works in a similar manner.
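
The averaging step for a collapsed Row Tree node can be sketched as follows. This is a minimal illustration that assumes the matrix is a list of equal-length rows; the function name collapse_rows is hypothetical.

```python
def collapse_rows(matrix, row_indices):
    """Return the single averaged row that stands in for the given rows."""
    if not row_indices:
        raise ValueError("a collapsed region must contain at least one row")
    n_cols = len(matrix[0])
    count = len(row_indices)
    # Average each column over the rows that belong to the collapsed node.
    return [sum(matrix[i][j] for i in row_indices) / count
            for j in range(n_cols)]

matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(collapse_rows(matrix, [0, 1]))
```

Collapsing a Column Tree node would average columns in the same way.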

The labels will change to describe the collapsed regions. If a region contains
rows which all belong to the same cluster, then it will be labeled with the
cluster id. If multiple clusters are present in a collapsed region then it will
be labeled "multi-cluster".
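
The labeling rule for collapsed regions can be written as a short Python sketch. The function name collapsed_region_label is hypothetical, and the exact label text for a single cluster is assumed to be the cluster id as a string.

```python
def collapsed_region_label(cluster_ids):
    """Label a collapsed region from the cluster ids of the rows it hides."""
    unique_ids = set(cluster_ids)
    if not unique_ids:
        raise ValueError("a collapsed region must contain at least one row")
    if len(unique_ids) == 1:
        return str(next(iter(unique_ids)))  # all rows share one cluster id
    return "multi-cluster"

print(collapsed_region_label([3, 3, 3]), collapsed_region_label([1, 2]))
```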

The Mountain Visualization is used to visualize the relative similarity
of clusters as well as their size, internal similarity, and internal deviation.
In the mountain visualization, each cluster is represented as a peak in the 3D
terrain. A peak's location, volume, height, and color are all used to portray
information about the associated cluster.

The user can navigate through and around the 3D visualization by clicking and
dragging the mouse over the 3D display. Different mouse buttons perform
different actions.

Left Click - Rotates the terrain.

Right Click - Moves the terrain up, down, left, and right.

Middle Click - Zooms in and out.

The location of the peaks in the plane is determined using Multidimensional
Scaling (MDS) on each of the cluster mid-points. MDS
attempts to preserve the distances between vertices as they are mapped from a
high dimensional space down to a lower dimensional space. In this application,
cluster mid-points are used as vertices in MDS and are mapped to a two
dimensional plane.
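
To make the idea concrete, here is a small pure-Python sketch of classical (Torgerson) MDS that maps points to a two dimensional plane. It is not gCLUTO's code. Which MDS variant gCLUTO uses is not stated here, so the classical formulation, the power-iteration eigen solver, and the function name classical_mds_2d are all assumptions chosen for illustration.

```python
import math

def classical_mds_2d(points, iterations=500):
    """Map points (lists of floats) to 2D coordinates with classical MDS."""
    n = len(points)
    if n == 0:
        return []
    # Squared Euclidean distance between every pair of points.
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
          for p in points]
    # Double centering turns squared distances into inner products.
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        # Power iteration: repeatedly multiply by b and renormalize.
        v = [float(i + 1) for i in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # The Rayleigh quotient gives the eigenvalue for the final vector.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        vv = sum(x * x for x in v)
        eigenvalue = sum(x * y for x, y in zip(v, bv)) / vv
        scale = math.sqrt(max(eigenvalue, 0.0) / vv)
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate so the next pass finds the second eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j] / vv
    return coords

# Four hypothetical cluster mid-points in 3D that happen to lie in a plane.
midpoints = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0], [3.0, 4.0, 0.0]]
layout = classical_mds_2d(midpoints)
print(layout)
```

Because these four mid-points already lie in a plane, the 2D layout preserves their pairwise distances exactly (up to rounding). In general, MDS can only approximate the original distances.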

MDS allows users to make inferences about their data using the
Mountain Visualization. For example, in Figure 3.3 a data matrix was clustered
into ten clusters. The Mountain Visualization represents these ten clusters as
ten peaks labeled by their cluster id. Although ten clusters were
requested, MDS has placed the peaks in two distinct groups. We can infer that
clusters within each group are strongly similar, while widely different from
clusters in the other group. Thus, the visualization suggests the data would
better lend itself to a two-way clustering.

The shape of each peak is a Gaussian curve. This shape is used as a rough
estimate of the distribution of the data within each cluster. The height of
each peak is proportional to the cluster's internal similarity. The volume of a
peak is proportional to the number of elements contained within the cluster. The
resulting Gaussian curves are added together to form the terrain of the Mountain
Visualization.

Note: When comparing peak heights, keep in mind that the Mountain Visualization
has added the peak curves together. As seen in Figure 3.4, where peaks overlap
the resultant height is taller than the true height of an individual peak.
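
The way peaks combine can be sketched in Python. This example assumes each peak is an isotropic 2D Gaussian, so that a peak with height h and volume V has a width sigma satisfying V = 2 * pi * h * sigma^2. It also takes the height and volume directly as the internal similarity and element count, with no scaling constants. These choices and the function names are assumptions for illustration only.

```python
import math

def peak_sigma(height, volume):
    """Width of an isotropic 2D Gaussian with the given height and volume."""
    return math.sqrt(volume / (2.0 * math.pi * height))

def terrain_height(x, y, peaks):
    """Sum the Gaussian peaks at (x, y); each peak is (cx, cy, height, volume)."""
    total = 0.0
    for cx, cy, height, volume in peaks:
        sigma = peak_sigma(height, volume)
        r2 = (x - cx) ** 2 + (y - cy) ** 2
        total += height * math.exp(-r2 / (2.0 * sigma * sigma))
    return total

# Two hypothetical clusters of ten elements each, placed close together.
peaks = [(0.0, 0.0, 1.0, 10.0), (1.0, 0.0, 0.5, 10.0)]
alone = terrain_height(0.0, 0.0, peaks[:1])   # first peak by itself
summed = terrain_height(0.0, 0.0, peaks)      # terrain with both peaks added
print(alone, summed)
```

The summed terrain at the first peak's location is higher than that peak alone, which is the effect the note above warns about.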

The color of a peak reflects the cluster's internal deviation. Red
indicates low deviation whereas blue indicates high deviation. Only the color
at the tip of a peak is significant. At all other areas, the color is
determined by blending to create a smooth transition.
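
One simple way to picture the tip color is a linear blend from red to blue. The sketch below assumes such a blend and a known range of deviations; the actual color scale used by gCLUTO may differ, and the function name deviation_to_rgb is hypothetical.

```python
def deviation_to_rgb(deviation, min_dev, max_dev):
    """Blend from red (lowest deviation) to blue (highest deviation)."""
    if max_dev <= min_dev:
        return (255, 0, 0)  # degenerate range: treat everything as low deviation
    t = (deviation - min_dev) / (max_dev - min_dev)
    t = max(0.0, min(1.0, t))  # clamp to the known range
    return (int(round(255 * (1.0 - t))), 0, int(round(255 * t)))

print(deviation_to_rgb(0.1, 0.1, 0.9), deviation_to_rgb(0.9, 0.1, 0.9))
```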

Clicking on any label will load statistics about the associated cluster into the
text window located below the visualization. This information is identical to
the information found in the Solution Report. If column labels have been chosen
for this data, then the Mountain Visualization can display the most common
features above each peak. This option is called "Show Features" and is found
in the "Mountain" menu.