The Daylight Chemistry cartridge is the result of several distinct
collaborative development projects within Oracle, Novartis, Ontogen
and Daylight. This whitepaper/presentation will describe Oracle Cartridge
technology, review the history of the internal and external projects which
have contributed to this effort, describe the current Daylight Cartridge
implementation, and provide a view of future direction for the cartridge.

Oracle Cartridge Technology:

An Oracle Data Cartridge is a bundled set of tools which extends Oracle clients
and servers with new capabilities. Typically, a single data cartridge deals
with a particular 'information domain'. Some examples of current data
cartridges include: image processing, spatial processing, audio processing.

A cartridge consists of a number of different components which work together to
provide complete database capabilities within the information domain. The
components include:

User-defined Object types,

Packages - PL/SQL utilities/libraries,

Interfaces to store, retrieve, search, and index the new types.

A cartridge extends the server. The new capabilites are tightly integrated
with the database engine. The Cartridge interface specification provides
interfaces to the query parser, optimizer, indexing engine, etc. Each of
these subsystems of the server learns about the cartridge capabilities through
these interfaces. The cartridge can use SQL, PL/SQL, Java, C, etc.
to implement the functions.

From Oracle's point of view, the cartridge idea allows third-party
organizations to expand the capabilities of the Oracle database server in a
modular, supportable fashion.

Project History:

In late 1997 - early 1998, two independent projects were undertaken
relating to 'alternate database interfaces' to the Daylight system.

Alberto Gobbi, Paul Gatewood (Novartis CP), Johnny Peterson (Oracle CH)
collaborated on a Pilot project for data warehousing with Oracle. Results
from this project were discussed and demo'ed at Mug98.

John Cargill (Ontogen) and Norah MacCuish (Daylight) collaborated on a
prototype Informix Datablade. This work was also presented in a Mug98
talk.

Interest generated from Mug98 resulted in an ongoing Daylight / Oracle
project to generate requirements, develop a cartridge prototype, and
refine its functionality. The paradigm for this version of the cartridge
was the "dayblob". Dayblob is a complete set of chemical functions and
a merlin-pool-like implementation of chemical searching embedded completely
within Oracle. This was most recently discussed in a
talk by Dave Weininger during Mug99.

In Spring, 1999, Oracle released Oracle 8i (version 8.1.5), the first
production version to support full cartridge capabilities. Version 8.1.6
is projected for general release in December of this year. It will have
some notable features which will impact the Daylight cartridge implementation
for the long term.

In light of Oracle 8.1.5 and 8.1.6 functionalities and performance issues,
we've revisited some of the design constraints related to the Dayblob
architecture and have been working on "Version 09" of the cartridge.

So, what is the Daylight Cartridge? First, what it isn't. It isn't
a full-functioned toolkit interface / environment. It is a set of chemical
utilities (normalizations, data conversions, comparison functions) which
provide the missing pieces required to manage chemical data in an Oracle
database.

Design Goals:

The key design goals for this version haven't changed from those presented
in previous discussions:

Provide high performance chemical functionality within an RDBMS framework,

Deliver this functionality in a robust, flexible, supportable,
expandable fashion,

The one new design goal added to the equation is the following:

Deliver the chemical functionality in a seamless, integrated fashion,
from the perspective of the RDBMS user.

The main point here is that we recognize that in addition to the differences
in the database itself (the servers, databases, tables, indexes), the members
of the Oracle user community (the developers, database administrators, users)
have very different perspectives than those which we are used to dealing with.

In order for the product to be successful, the Daylight cartridge must
behave in a predictable, understandable way. The paradigms we use within
our cartridge must match those which Oracle developers and DBAs already use
and understand.

Architecture:

The Daylight toolkit interfaces with the Oracle server via callouts to
the "extproc" utility. This utility provides a RPC-like
mechanism for performing C-language function calls. Daylight toolkit
code is wrapped inside this RPC layer for each of the defined cartridge
functions.

One of the first concerns about this architecture is the efficiency of
the RPC mechanism, and the potential that extproc will be a performance
bottleneck. Some simple tests indicate that this isn't a problem. Consider
the contrib program "cansmi", which takes as input SMILES and
outputs canonical SMILES. A simple example, with 1999 SMILES from the
medchem demo database runs rapidly:

The overhead of executing functions through the extproc utility is
roughly 250 ms in this case. This represents approximately
5000 - 10000 round-trips per second. For all but the most trivial
toolkit processing, this will be a minor factor.

Another issue is data bandwidth between the extproc utility and
the server. Tests on the blob-based index indicate that we can
retrieve upwards of 100,000 rows per second from a tabular data
source, and 60 MB per second from a blob-based data source.

Each of these data throughput and round-trip limits represents a
design constraint which must be considered in the overall cartridge
design. The design described in this whitepaper is what we
consider to be the optimum tradeoff between resource efficiency,
search performance, and the ability to maintain the full transaction
concurrency model within Oracle for all cartridge data.

Cartridge Specification:

As indicated earlier, an Oracle Cartridge typically consists of three
sets of functionalities: object type definitions, packaged functions, and
indexing / data access tools. In turn, each of these three areas will be
discussed with respect to the Daylight cartridge implementation.

Object Types:

The daylight cartridge does not define any new Oracle object types!!!

To review, within the Daylight Toolkit, each object has an internal object
type and an external representation. The external representation is used to
store objects in files, Thor databases, to communicate objects between
processes, etc.

Daylight Object

External Language

molecule

SMILES

reaction

SMILES

pattern

SMARTS

transform

SMIRKS

fingerprint

Encoded ASCII

depiction

2D coordinate list

conformation

3D coordinate list

datatree

TDT

monomer

Monomer SMILES

multimer

CHUCKLES

varimer

CHORTLES

varipattern

CHARTS

Note that our external languages are all very expressive and well-behaved.
That is, they are compact, with high information density. They are all
printable ASCII strings. They all have well defined syntax and semantics.

So, our strategy with respect to the Oracle Cartridge is the following:

When needed, the cartridge instantiates internal Daylight objects within
the cartridge to perform a specific task (eg. calculate a molecular weight
from a SMILES). The interfaces between Oracle and Daylight always pass
objects as their external string representations.

Only the endpoints of communication must understand what the objects mean:
the Daylight Oracle cartridge provides the server-side endpoint, and a
front end user interface provides the client-side endpoint. Otherwise, the
objects effectively "tunnel" through the middle layers.

This does not preclude the design of an Oracle-specific object layer on top
of the cartridge system (eg. ODBC); we simply don't require one, and
don't dictate which model, if any, you use.

There is an overhead associated with instantiating objects on demand.
Naturally, the functions which are provided in the cartridge perform higher
granularity operations than in the native Daylight toolkit.

The important points to recognize are that the function takes a SMILES string
as input, and returns a SMILES string as output. Inside the function, it
creates an internal molecule or reaction object from the SMILES, canonicalizes
it, and then destroys the internal molecule or reaction object before
returning. The internal Daylight object only exists for the duration of the
call. The function is stateless with respect to the Daylight toolkit.

Although there is some additional overhead for instantiating objects and
throwing them away, this must be weighed against the cost of performing the
extproc call, the overhead of performing the IPC for the call, and the overhead
of providing state information (if one were to attempt to implement persistant
objects within extproc). The additional overhead has minimal impact on
overall throughput.

Furthermore, there are very real opportunities for optimizations and shortcuts
which we can exploit at the string level. For example, caching the most
recent calculation can benefit queries like:

Canonicalize an input SMILES. Returns a canonical SMILES string. Type is
either 0 or 1, for unique or absolute SMILES, respectively. [This might
be better as two functions, eg; smi2asmi(), smi2usmi() without the
type argument]

Return the euclidean distance between two fingerprints. If the fingerprints
are not the same size (nbits()), then the larger will be folded automatically
to match the size of the smaller before comparison. The returned value is
a floating point number between 0.0 and 1.0.

Return the tanimoto distance between two fingerprints. If the fingerprints
are not the same size (nbits()), then the larger will be folded automatically
to match the size of the smaller before comparison. The returned value is
a floating point nubmer between 0.0 and 1.0.

function ftversky ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2,
alpha IN NUMBER, beta IN NUMBER ) => NUMBER
operator tversky ( fpstr1 IN VARCHAR2, fpstr2 IN VARCHAR2,
alpha IN NUMBER, beta IN NUMBER ) => NUMBER

Return the tversky distance between two fingerprints. If the fingerprints
are not the same size (nbits()), then the larger will be folded automatically
to match the size of the smaller before comparison. The parameters alpha
and beta must be in the range of 0.0 - 1.0. The returned value is
a floating point nubmer between 0.0 and 1.0.

Returns 1 if all of the bits in fpstr1 are also present in fpstr2. That is,
the fingerprint from fpstr1 represents a possible substructure of fpstr2.
If the fingerprints are not the same size (nbits()), then the larger will be
folded automatically to match the size of the smaller before comparison.
The returned value is either 0 or 1.

Returns a potentially reordered N-tuple string for the given list VARCHAR2
input parameter. The list string is reordered based on the canonical atom
or bond ordering of the input SMILES. ntuple is the number of comma-separated
values per atom or bond, and isotype is 0 for unique SMILES canonicalization
and 1 for absolute SMILES canonicalization.

Comparison functions:

function fexact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER
operator exact ( a IN VARCHAR2, b IN VARCHAR2) => NUMBER

Returns 1 if the two input strings are identical, 0 otherwise. The operator
is optionally backed by the ddexact indextype.

Returns 1 if the smarts expression matches the given SMILES, 0 otherwise.
The operator is optionally backed by the ddblob indextype.

function fsimilar ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2,
threshold IN NUMBER ) => NUMBER
operator similar ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2,
threshold IN NUMBER ) => NUMBER
operator score ( smiles1 IN VARCHAR2, smiles2 IN VARCHAR2,
threshold IN NUMBER ) => NUMBER

Returns 1 if the tanimoto similarity between the two SMILES is greater than
the threshold. Calculates normal fingerprints. The operators are optionally
backed by the ddblob indextype. score() is an ancillary operator to the
similar() operator, and allows the convenient selection of the tanimoto score
during index-based queries.

These functions are part of the interface, but are only needed by the
extensible index code. These should not be used; they are only mentioned
becuase they appear in the data dictionary. They may be hidden in the
production release.

Extensible Indexes:

The the major change in implementation between version 08 and version 09
of the cartridge is the implementation of indexes. In version 08 and before,
all index operations were implemented in a single blob-based index. Version 09
splits the indexes up into four distinct indextypes. The reasons for
this change are:

Simplify the blob-based index. In order to support all of the search
types required, the blob required multiple copies of the SMILES, multiple
hash tables, etc. Very complex and resource intensive.

Descrease the size of the blob-based index. As the blob-based index is
used for searching and must be sent between the Oracle server and the
extproc for a query, minimizing the size of the index increases search
performance. Caching of the blob in extproc is possible with version 09,
however caching the blob requires the availability of memory in extproc
sufficient to hold the entire blob.

Allow user chosen indexes. With the previous monolithic blob, all
indexes were implemented all the time. In this version, one can choose
any combination of the four indextypes for a given column. One can also
use any combination of the indextypes on multiple columns in a single
table.

Simplify maintainence of the transaction model. The three new tabular
indexes easily obey Oracle's transaction concurrency model. The simplified
blob-based index can also now obey the transaction model, whereas the
previous version could not.

Exact lookup indextype (ddexact):

This indextype supports the exact() operator. This is the least interesting
index, as it implements a simple string comparison between the indexed column
and the query. In that sense, the built in BTree index within Oracle has
the similar functionality. The BTree index even supports the "greater
than" and "less than" operators, which this index doesn't.

The main advantage of the ddexact indextype is that it can index strings up
to 4000 bytes (the maximum varchar2 length), while the internal BTree can
only index strings up to approximately 40% of a block; for a database with
a 2048 byte blocksize, this means that the largest indexable varchar2 string
using the BTree is approximately 700 bytes. Similarly, in a future version
of the cartridge this index will support the comparison of LOB datatypes;
the BTree index does not.

This index requires approximately 22 bytes per row, independent of the
length of the SMILES being indexed.

This indextype supports the graph() and tautomer() operators. It creates a
tabular index of the graph and tautomeric information (hydrogen count and
net charge) for rapid retrieval. The indexed search only requires an index
lookup based on the graph of the query, followed by comparison of the net
charge and hydrogen count for a tautomer search.

This index requires approximately 30 bytes per row, independent of the
length of the SMILES being indexed.

This indextype supports component level searches for exact molecules within
reactions by role. It creates a tabular index of each component in every
reaction, by role. Component level searches are implemented simply by a
index lookup of the component, followed by validation of the role.

This index requires approximately 22 bytes per component, independent of the
length of the SMILES being indexed.

This index caches the SMILES and fingerprints in a single binary large
object for rapid searching. Substructure and similarity searches are
implemented through this blob-based image of the structural data. The blob
is persistant; it is automatically stored in the Oracle database and
retrieved for processing. The blob is kept synchronized with the base
table in real time and strictly obeys the Oracle transaction model.
In addition, this indextype supports a user-controlled extproc cache
of the blob data. On a per-session basis, one can hold the entire blob
in memory in extproc; this eliminates the overhead of passing the blob
data from the Oracle server to extproc for each query.

The timings are on a Sun Ultra 60, with 768 MB of real memory. The Oracle
installation is a vanilla 8.1.5 install, with the only changes being the
increase of db_block_buffers, shared_pool_size, log_buffers parameters from
default values in init.ora.

Index Creation Times

Table name

Index type

Creation time (hh:mm:ss)

small

exact

2:20

small

graph

7:40

small

blob

16:00*

medium

exact

21:00

medium

graph

1:15:00

medium

blob

3:27:00*

large

exact

2:46:00

large

graph

7:24:00

rxn

exact

2:01

rxn

role

19:30

rxn

blob

30:19*

(* - Includes fingerprint generation time)

Query Performance:

Table name

Query

Hits

Time (mm:ss) no cache / cache

small

contains(smiles, 'OC(=O)C1')

0

00:00.35

00:00.02

small

contains(smiles, 'OC(=O)CS')

492

00:01.45

00:01.13

small

matches(smiles, '[OH]C(=O)CS')

268

00:01.27

00:00.95

small

similar(smiles, 'OC(=O)CS', 0.8)

7

00:01.43

00:01.10

medium

exact(smiles, 'Oc1ccccc1')

1

00:00.11

medium

graph(smiles, 'Oc1ccccc1')

14

00:00.06

medium

contains(smiles, 'OC(=O)C1')

0

00:09.50

00:00.03

medium

contains(smiles, 'OC(=O)CS')

5597

00:47

00:33

medium

matches(smiles, '[OH]C(=O)CS')

1592

00:35

00:28

medium

similar(smiles, 'OC(=O)CS', 0.8)

19

00:18

00:09.10

large

exact(smiles, 'NCCc1ccccc1')

1

00:00.12

large

graph(smiles, 'c1ccccc1')

28

00:00.11

rxn

reactant(smiles, 'Sc1ccccc1')

181

00:09.48

rxn

product(smiles, 'Oc1ccccc1F')

1

00:00.11

rxn

contains(smiles, 'OC(=O)C1')

0

00:00.50

00:00.03

rxn

contains(smiles, '>>OC(=O)CS')

754

00:05.70

00:04.90

rxn

matches(smiles, '[OH]C(=O)CS>>')

167

00:09.80

00:09.50

rxn

similar(smiles, 'OC(=O)CCl>>OC(=O)CS', 0.5)

68

00:01.45

00:00.98

Summary:

To summarize, we can deliver a set of Oracle utilities via a cartridge
which will provide high performance chemical information processing
within the Oracle system. These tools allows the handling of molecules
and reactions, complete with coordinates for display and fingerprints for
searching efficiency. Four chemistry-specific indextypes are provided,
which allow the database creator to tailor the database for specific
searches and applications.

To Do:

These are several outstanding issues which I haven't covered in
the above:

SMI->MDL and MDL->SMI functions.

"Packaging" of the cartridge. Oracle has a specific process
whereby a new third-party cartridge is registered. This includes the
assignment of unique naming conventions specific to the third-party
developer. We haven't gone through this registry process yet. Hence,
function and index names may change.

Fingerprint handling in the current blob-based index. The index
currently computes its own fingerprints. We will provide the option
for the blob to get its fingerprints from a user column. The
blob-based index can and will support tversky, and fingertest operators.

We will provide the ability to create a "fingerprint-only"
index, which will support fingerprint searching and screening based
on user-supplied fingerprints.

Timetable:

Our target date for a limited release of Version 09 is the end of November,
1999. Our target date for full release of Version 09 is February 1, 2000.
This release will be supported on Solaris 2.7. Other platforms will be
made available as dictated by demand.

Futures:

Oracle indicated that version 8.1.6 will be available in production by
December, 1999. One of the important new features in 8.1.6 is the ability
to run Java in the server. This will impact the design for future
versions of the Daylight cartridge, specifically the design of the
blob-based indexes.

Informix has been "back-burnered". Note that all of the
various interfaces and tools described here have direct parallels within
the Informix Datablade model. It is possible to provide an Informix
datablade; this will be dictated by demand.

Program objects can be called from extproc. It is feasible to provide
an extensible interface through program objects which doesn't require
any user extproc programming.

One last possibility down the road. Consider the implications of the
following PL/SQL code, which runs under 8.1.5.. The first section
is a simple PL/SQL program. The second is a user-defined function which
takes a molecule or reaction object handle as an argument.

It is possible for Daylight to provide a general PL/SQL toolkit interface
within Oracle, including Smiles, Smarts, Fingerprint, Depict, Reaction,
Monomer, and Program object but maybe not Thor or Merlin.
Issues to be resolved include memory management, memory persistance and
multiprocessing within extproc. We must understand Oracles long-term plans
for extproc before proceeding. Of course, more than just a novelty, work
on this product will be dictated by customer demand.