The Coordinating and Bioinformatics unit is responsible for creating the
software and informatics infrastructure for the consortium and for facilitating
the efforts of the mouse engineering centers. This page provides information about
the infrastructure created for the consortium as well as the software created for
the scientific community.

Lab Personnel

Mike Aufiero, Systems Analyst

Danilo Guesela, Systems Analyst

Jiaqi Li, PhD Student

Colby Williams, Curator

Infrastructure Information

DiaComp IT Infrastructure
Our programming paradigm is to develop software systems based on an n-tier architecture,
separating the presentation layer, the business logic, and the data layer into distinct
software systems. These systems have been developed to minimize maintenance while
providing a robust, scalable model for future growth and for interaction at the national
level with other organism databases. They were designed using the Unified Modeling
Language (UML), with the designs available to the general public. The two UML modeling
tools we use are Rational Rose and PowerDesigner.
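The layer separation described above can be sketched in miniature. The Python below is purely illustrative (DiaComp's actual systems are C#/.NET, and the class and strain names here are invented); it shows the key property of an n-tier design: each tier talks only to the tier directly below it.

```python
# Illustrative n-tier sketch (not DiaComp's actual code): data access,
# business logic, and presentation live in separate layers.

class StrainDataLayer:
    """Data tier: the only layer that touches storage."""
    def __init__(self):
        self._rows = {1: {"id": 1, "name": "C57BL/6J"}}  # stand-in for the database

    def fetch_strain(self, strain_id):
        return self._rows.get(strain_id)

class StrainBusinessLayer:
    """Business tier: validation and domain rules, no storage details."""
    def __init__(self, data_layer):
        self._data = data_layer

    def get_strain(self, strain_id):
        if strain_id <= 0:
            raise ValueError("strain id must be positive")
        return self._data.fetch_strain(strain_id)

class StrainPresentationLayer:
    """Presentation tier: formats results for display only."""
    def __init__(self, logic):
        self._logic = logic

    def render(self, strain_id):
        strain = self._logic.get_strain(strain_id)
        return f"Strain {strain['id']}: {strain['name']}" if strain else "Not found"

ui = StrainPresentationLayer(StrainBusinessLayer(StrainDataLayer()))
print(ui.render(1))  # Strain 1: C57BL/6J
```

Because each tier depends only on the interface of the one below it, the data tier can be swapped (for example, from SQL Server 2000 to 2005) without touching the presentation code.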

DiaComp Data Model
The core relational data model for the DiaComp was created using SQL Server 2000 and
was based on a number of existing schemas containing our key subject areas: animal
models, genotypes (including array experiment data), histopathology, and phenotype
Assays. The Mouse Models of Human Cancer Consortium (MMHCC) and the Jackson Labs
were particularly helpful, and shared several successful models. Currently DiaComp
Data Model has been migrated to SQL Server 2005 and has been modified to include
MMPC (National Mouse Metabolic Phenotyping Centers) Data Schema. The current version
of the database addresses several domains, including DiaComp - MMPC administration,
models, strains, publications, external database references, experiments, phenotype
assays, microarray data, histology, images and dataset persistence. Current data
model has 250 tables, 55 functions, 994 stored procedures, 141 data views and a
total of 9344 lines of code.

* Note: The links above require Internet Explorer version 5.0 or above to view the data
model with zoom capability. Please also accept the ActiveX warning to start the viewer.
The viewer links to the different data schemas through a navigation drop-down box;
choose a schema and click Go next to the links to load it.

DiaComp Object Model
The DiaComp Object Model (DiaComp-OM) created for the consortium fully describes the
activities of the DiaComp and provides an
OOP API to access the data generated by
the consortium. The DiaComp-OM was designed using Powerdesigner and UML, written in
C# and compiled as a .NET DLL. The object model contains both administrative and
domain specific classes. However, only the data centric classes are available to
the public. The Domain classes provide both object specific classes (e.g. Model,
Strain, Experiment, Protocol, etc.) as well as DataManager and SearchCriteria classes
used to retrieve data from the system. These DataManager classes are specific for
each of the data types maintained by DiaComp.
For example, the StrainMgr class provides
methods to retrieve strain specific data. The SearchCriteria classes are also datatype
specific and are used by the DataManager classes to query the database using different
type specific parameters. For example, the StrainSearchCriteria class provides queryable
properties specific for the Strain data in the system.
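The DataManager/SearchCriteria pattern can be sketched as follows. This is a hypothetical Python rendering for illustration only: the real DiaComp-OM is a C# .NET library, and while the class names StrainMgr and StrainSearchCriteria come from the text above, their members here are invented.

```python
# Hypothetical sketch of the DataManager/SearchCriteria pattern described
# in the text. The record fields and method names are assumptions.

class StrainSearchCriteria:
    """Type-specific, queryable properties for Strain searches."""
    def __init__(self, name_contains=None, background=None):
        self.name_contains = name_contains
        self.background = background

    def matches(self, strain):
        if self.name_contains and self.name_contains not in strain["name"]:
            return False
        if self.background and strain["background"] != self.background:
            return False
        return True

class StrainMgr:
    """DataManager for the Strain data type: retrieves strains by criteria."""
    def __init__(self, records):
        self._records = records  # stand-in for the database

    def get_strains(self, criteria):
        return [s for s in self._records if criteria.matches(s)]

strains = [
    {"name": "C57BL/6J", "background": "B6"},
    {"name": "DBA/2J", "background": "DBA"},
]
mgr = StrainMgr(strains)
hits = mgr.get_strains(StrainSearchCriteria(name_contains="C57"))
print([s["name"] for s in hits])  # ['C57BL/6J']
```

Pairing each data type with its own manager and criteria class keeps the query surface type-safe: callers can only filter Strain data on properties that actually exist for strains.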

To provide the broadest access to the data, we are also creating a WebService
that exposes specific portions of the DiaComp-OM to the public. Specifically, the
WebService will provide access to all of the object-specific classes as well as the
DataManager and SearchCriteria classes. This provides a mechanism for programmers
to create local DiaComp-OM objects in other languages. The current version of the
DiaComp-OM has 185 object classes.

DiaComp-Web Services
The DiaComp Web Services layer exposes classes and methods of the DiaComp object model
which can be used by users to interact with the DiaComp object model using custom
built web applications or even without a user interface. Details about the interfaces
are provided to users through an XML document called a Web Services Description
Language (WSDL) document. There are several tools available to read a WSDL file
and generate the code required to communicate with an XML Web service including
a very capable “Add Web Reference” tool used in Microsoft Visual Studio. DiaComp web
services layer makes available public data search and retrieval methods for animals,
strains, experiments, histology images, investigators, phenotype assays and publications.
The exposed web service methods can be consumed through customized client ASP.NET
applications using SOAP calls or through traditional HTTP GET/POST METHODS without
the use of an API. The framework has been designed to be independent of any particular
programming model and other implementation specific semantics. A complete documentation
for each of the web service methods is available providing information about data
return type, input parameters and exceptions thrown. In addition, users may choose
to download a zipped Visual Studio 2008 solution file containing a sample ASP .NET
client application and C# class library project.
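Consuming a method over plain HTTP GET, as described above, amounts to composing a URL with the method name and query parameters. The sketch below is hedged: the endpoint and method name are placeholders, not the service's real addresses (those come from the published WSDL and documentation).

```python
# Hedged sketch of calling a web service method over HTTP GET. The base
# URL and method name here are hypothetical placeholders.
from urllib.parse import urlencode

BASE_URL = "https://example.org/DiaCompService.asmx"  # placeholder endpoint

def build_get_request(method, **params):
    """Compose the GET URL for a web service method call."""
    query = urlencode(params)
    return f"{BASE_URL}/{method}?{query}" if query else f"{BASE_URL}/{method}"

url = build_get_request("GetStrainsByName", name="C57BL/6J")
print(url)
# A client would then issue the request (e.g. with urllib.request.urlopen)
# and parse the XML payload the service returns.
```

A SOAP client generated from the WSDL (for instance by Visual Studio's "Add Web Reference") wraps the same calls in typed proxy methods, so the GET form is mainly useful for quick scripting without an API.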

ParaKMeans

ParaKMeans is a high-performance, parallel-processing implementation of the k-means clustering algorithm. We designed the software so it can be deployed on most Windows operating systems. The applications are written for the .NET Framework v1.1 using the C# programming language. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Because we use a web service, at least one computer must have Internet Information Services (IIS v5.0 or better) installed and running. The parallel k-means algorithm used in this application is based on the work of Ben Zhang, Meichun Hsu, and George Forman. Documentation Available Here
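The parallel scheme described above splits the expensive step across workers. The sketch below illustrates the idea in Python (ParaKMeans itself is C#, and its workers are web services; here each "worker" is just a function call on one chunk of the data): chunks are assigned independently, then the coordinator recomputes the centroids.

```python
# Illustrative sketch of the parallel k-means scheme: distance calculation
# and cluster assignment are chunked (the parallelizable worker step), and
# the coordinator gathers labels and updates the centroids.

def assign_chunk(points, centroids):
    """Worker step: nearest-centroid assignment for one chunk of points."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def kmeans(points, centroids, n_chunks=2, iters=10):
    for _ in range(iters):
        # Scatter: each chunk is assigned independently.
        size = (len(points) + n_chunks - 1) // n_chunks
        chunks = [points[i:i + size] for i in range(0, len(points), size)]
        labels = [l for chunk in chunks for l in assign_chunk(chunk, centroids)]
        # Gather: recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

pts = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
labels, cents = kmeans(pts, centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels)  # [0, 0, 1, 1]
```

Because the assignment step dominates the cost and each chunk needs only the current centroids, the workers can run on separate machines with minimal communication, which is what the web-service design exploits.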

HPCluster

Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high I/O costs involved and the large distance matrices calculated, most clustering algorithms fail on large datasets (30,000+ genes / 200+ arrays). We propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm, with the second stage being a conventional k-means clustering technique. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to completion compared to popular k-means programs. The software was written in C# (.NET 1.1). This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data.
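The two-stage idea can be sketched as follows. This Python is a simplified stand-in (HPCluster is C#, and BIRCH's CF tree is replaced here by a flat radius-based merge): a single-scan pass collapses nearby points into weighted micro-clusters, then conventional k-means runs on the much smaller reduced set.

```python
# Hedged sketch of the two-stage scheme: single-scan data reduction
# (a crude stand-in for BIRCH), then k-means on the reduced set.

def reduce_pass(points, radius=1.0):
    """Stage 1: one scan; merge each point into the first micro-cluster
    within `radius`, or open a new one. Returns (centroid, count) pairs."""
    micro = []  # list of [sum_vector, count]
    for p in points:
        for m in micro:
            c = [s / m[1] for s in m[0]]
            if sum((a - b) ** 2 for a, b in zip(p, c)) <= radius ** 2:
                m[0] = [s + v for s, v in zip(m[0], p)]
                m[1] += 1
                break
        else:
            micro.append([list(p), 1])
    return [([s / m[1] for s in m[0]], m[1]) for m in micro]

def weighted_kmeans(micro, centroids, iters=10):
    """Stage 2: k-means over micro-cluster centroids, weighted by counts."""
    for _ in range(iters):
        sums = [[0.0] * len(centroids[0]) for _ in centroids]
        counts = [0] * len(centroids)
        for c, w in micro:
            d = [sum((a - b) ** 2 for a, b in zip(c, k)) for k in centroids]
            j = d.index(min(d))
            sums[j] = [s + w * v for s, v in zip(sums[j], c)]
            counts[j] += w
        centroids = [[s / n for s in row] if n else k
                     for row, n, k in zip(sums, counts, centroids)]
    return centroids

pts = [[0.0, 0.0], [0.2, 0.1], [6.0, 6.0], [6.1, 5.9]]
micro = reduce_pass(pts)  # far fewer entries than raw points
cents = weighted_kmeans(micro, centroids=[[0.0, 0.0], [6.0, 6.0]])
print(len(micro))  # 2
```

The memory saving comes from stage 2 operating on micro-cluster summaries rather than the full distance matrix, which is why the reduced problem fits where a direct k-means on all genes would not.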

ParaSAM

Significance Analysis of Microarrays (SAM) is a permutation-based method that relies on estimating the false discovery rate (FDR) to determine significance. SAM is freely available as an Excel plug-in and as an R-package module. However, for large datasets the memory requirements are high and the algorithm fails. To overcome these memory limitations, we have developed a parallelized version of the SAM algorithm called ParaSAM. This high-performance, multithreaded application does not require programming experience to run and is designed to provide the general scientific community with an easy, manageable client-server Windows application. The parallel nature of the application comes from the use of web services to perform the permutations. The software is written in C# (.NET 1.1) and is designed in a modular fashion to provide both deployment flexibility and flexibility in the user interface. Our results indicate that ParaSAM is not only faster than the serial versions but can also analyze extremely large datasets that cannot be processed on a single PC.
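The permutation idea behind SAM can be sketched in a few lines. This Python example is illustrative only: it uses a plain difference of group means where SAM uses a moderated t-like d-statistic, and a single gene where SAM processes thousands; the permutation loop is the part that ParaSAM farms out to web-service workers.

```python
# Illustrative sketch of permutation-based significance: compute a statistic
# on the real group labels, then re-compute it under random label
# permutations to build the null distribution used for FDR estimation.
import random

def mean_diff(values, labels):
    """Simple two-group statistic (SAM itself uses a moderated d-statistic)."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def permutation_pvalue(values, labels, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = abs(mean_diff(values, labels))
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]          # each permutation is independent,
        rng.shuffle(shuffled)         # so this loop parallelizes cleanly
        if abs(mean_diff(values, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

# One "gene" with strong group separation -> small permutation p-value.
expr = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
grps = [0, 0, 0, 1, 1, 1]
p = permutation_pvalue(expr, grps)
print(p)
```

Since each permutation is independent of the others, distributing them across workers gives near-linear speedup, and no worker ever needs more than one copy of the expression matrix, which is how the memory limit of a single machine is avoided.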