This document describes some of the operations performed
to generate the downloadable bulk files from the NCI Open Database
structures and biological test data (cancer and AIDS, see here
for more information). The aim of this document is to show how to combine the tools
of the SDF Toolkit and to provide tricks and recipes by showing real examples.
All these examples should be run on a Unix system.

Input files

All input files are based on the publicly and freely available
data from NCI's Developmental Therapeutics Program (DTP).
We collected the structures and biological data from DTP (cancer data
as of August 1999, AIDS data as of October 1999), combined them where applicable,
and generated MDL SD files from this information.

SDF Toolkit

The SDF_Toolkit can be downloaded here.
You'll need version 1.06 or later.

Merge and remove duplicates from two SD files

Objective: Merge two SD files and remove duplicates. Duplicates are
recognized by a shared identifier (a non-chemical data entry in the
SD files). The identifier (here the NSC number) must be present in both input
files.
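
The idea behind this recipe can be sketched in a few lines of Python. This is
not the toolkit's own command, only an illustration of the technique. It
assumes the usual SD layout in which each record ends with a "$$$$" line and
the identifier is stored as a data item headed "> <NSC>"; the function names
(split_records, get_field, merge_unique) are made up for this sketch.

```python
def split_records(sdf_text):
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def get_field(record, tag):
    """Return the first value line of data item `tag`, or None if absent."""
    lines = record.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.startswith(">") and f"<{tag}>" in line:
            return lines[i + 1].strip()
    return None


def merge_unique(first, second, tag="NSC"):
    """Concatenate two SD texts, keeping the first record seen per identifier."""
    seen, kept = set(), []
    for record in split_records(first) + split_records(second):
        key = get_field(record, tag)
        if key is None:
            # The recipe requires the identifier in both input files.
            raise ValueError(f"record without <{tag}> data item")
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return "".join(kept)
```

In this sketch a record from the first file always wins over a record with
the same identifier from the second file, so the order of the arguments
matters.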

Merge and remove duplicates from two SD files and make a list of the new entries

Objective: Merge two SD files and remove duplicates. Duplicates are
recognized by a shared identifier (a non-chemical data entry in the
SD files). The identifier must be present in both input files. Make a list
of the new entries.

Notes: tee is a standard Unix command that reads from standard input,
writes to standard output, and saves a copy to a file. cactus_2d_nci is a
Tcl script (not part of the SDF_Toolkit) that calculates 2D coordinates.
This script makes use of the CACTVS system.
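
As a rough illustration of what "a list of the new entries" means, the
following Python sketch reports the identifiers that occur in the second file
but not in the first. It is not part of the SDF_Toolkit; the data item name
"NSC" and the function names are assumptions made for the example.

```python
def identifiers(sdf_text, tag="NSC"):
    """Yield the value of data item `tag` for every record that carries it."""
    lines = sdf_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.startswith(">") and f"<{tag}>" in line:
            yield lines[i + 1].strip()


def new_entries(old_text, new_text, tag="NSC"):
    """Identifiers found in new_text but not in old_text, in file order."""
    known = set(identifiers(old_text, tag))
    fresh = []
    for ident in identifiers(new_text, tag):
        if ident not in known:
            known.add(ident)  # report each new identifier only once
            fresh.append(ident)
    return fresh
```

Writing the returned list to a file, one identifier per line, gives the same
kind of side output that tee produces in a shell pipeline.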

Remove entries with a special filter

Objective: Remove entries from the NCI files that have an NSC number greater
than or equal to 900,000 (these are combinatorial library entries).
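
The filter itself is a simple numeric test on the identifier. A minimal Python
sketch, assuming the NSC number is stored as a data item headed "> <NSC>" and
that records end with "$$$$" (the function names are hypothetical and this is
not the toolkit's filter syntax):

```python
def split_records(sdf_text):
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def nsc_number(record, tag="NSC"):
    """Return the identifier as an int, or None if missing or non-numeric."""
    lines = record.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.startswith(">") and f"<{tag}>" in line:
            try:
                return int(lines[i + 1].strip())
            except ValueError:
                return None
    return None


def drop_high_nsc(sdf_text, limit=900_000, tag="NSC"):
    """Keep only records whose NSC number is below `limit`."""
    kept = []
    for record in split_records(sdf_text):
        nsc = nsc_number(record, tag)
        # Assumption of this sketch: records without a usable number are kept.
        if nsc is None or nsc < limit:
            kept.append(record)
    return "".join(kept)
```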

Input files:

open_397.mol - NCI data file released in March 1997

689_aug99_0D.sdf - supplemental structures from the August 99 release

2212_oct99_0D.sdf - supplemental structures from the October 99 release

Prepare biological data file

Objective: The NCI cancer screen data are comma-separated value files which, unfortunately,
cannot be used directly by the add_prop_sdf tool. The problem is that the
data for one molecule (NSC number) are split over several lines.
The solution is to combine all the data belonging to one entry into a
single line. The Perl script nciscreen2csv was written for that purpose.
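
The regrouping step can be illustrated with a short Python sketch. It does not
reproduce nciscreen2csv or the real column layout of the NCI screen files; it
simply assumes, for the sake of the example, that the first column of every
line is the NSC number and that the remaining columns are to be appended in
the order they appear.

```python
import csv
import io


def combine_rows(csv_text):
    """Merge all rows that share the same first column into one row."""
    combined = {}  # dicts keep insertion order, so NSC order is preserved
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore empty lines
        combined.setdefault(row[0], []).extend(row[1:])

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for nsc, values in combined.items():
        writer.writerow([nsc] + values)
    return out.getvalue()
```

With this layout, three input lines "1,A,5.2", "1,B,6.0" and "2,A,4.1" become
two output lines, "1,A,5.2,B,6.0" and "2,A,4.1".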

-perlclass is a special option of the add_prop_sdf tool. Its argument,
NCI_screen, is the name of a customized Perl class derived from the class
that processes standard CSV (comma-separated value) table files. Its purpose
is to reformat the biological data. See the file NCI_screen.pm in the toolkit
(this will probably interest only Perl 5 programmers).

-noskip is an option that instructs the tool to keep all entries, even those
for which biological data is not available.

Add biological data to an SD file and filter out entries for which biological data is not available

Objective: Add AIDS and cancer cell data to an SD file in one operation.
Same as before, but now keep only the structures for which all biological
data (AIDS and cancer cells) is available.

Input files:

nciopen_LMCH_oct99_2D.sdf - NCI data file (includes 2D information)

cancer_screened_gi50_a99.csv - comma separated value file with a special
format which matches the NCI_screen format. Each line contains all the data for one NSC entry.
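
The difference between keeping and dropping entries without data can be
sketched as follows. This is an illustration in Python, not the behavior of
add_prop_sdf itself: the lookup table, the data item name "GI50", and the
keep_missing flag are all hypothetical, the latter standing in for what the
text says about -noskip.

```python
def split_records(sdf_text):
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def get_field(record, tag):
    """Return the first value line of data item `tag`, or None if absent."""
    lines = record.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.startswith(">") and f"<{tag}>" in line:
            return lines[i + 1].strip()
    return None


def add_properties(sdf_text, table, tag="NSC", keep_missing=False):
    """Append data items from `table` (identifier -> {name: value}) to records.

    Records with no entry in `table` are dropped unless keep_missing is True.
    """
    out = []
    for record in split_records(sdf_text):
        props = table.get(get_field(record, tag))
        if props is None:
            if keep_missing:
                out.append(record)
            continue
        body = record[: record.rindex("$$$$")]  # everything before the terminator
        items = "".join(f"> <{name}>\n{value}\n\n" for name, value in props.items())
        out.append(body + items + "$$$$\n")
    return "".join(out)
```

Requiring both AIDS and cancer data then amounts to building the table only
from identifiers that appear in both data sources before calling the function.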
