DTP Data Warehouse Model
Introduction
The Developmental Therapeutics Program
(DTP) of the National Cancer Institute
has long been committed to providing researchers access to
chemical and
biological data. In order to make this data even more useful,
we have developed a common data architecture for all the data DTP
makes available. The basis for this architecture is a set of
relational database tables. We have also developed a library
of JAVA classes
that provides a consistent programming interface to the database.
This architecture has been implemented in a publicly accessible
database.
We hope that this model encourages the development of applications
that not only make use of a wider range of DTP data, but also
integrate data from other sources.
Overview
A data architecture needs to provide three main things:
- a way to store the actual results
- a way to store the additional information necessary to fully characterize
the results (protocol, compounds used, cell lines used, etc.)
- useful ways to search and retrieve the results
It is common to achieve all three goals by making each result a row in a relational
database table. There would be a column for the result and columns for compound
and/or cell line identifiers. Additional information could be contained in additional
columns if searching this information was desired or it could be contained in written
documentation that explained the specific data the particular table was designed to hold.
The problem with this strategy of one specifically designed table per set of assay
results is that it becomes difficult to handle data in a general fashion. Data can not
be loaded or accessed until the appropriate tables and supporting software are developed.
Search tools are difficult to maintain because the software has to be aware of every
new table and its structure.
We have decided to create a more general scheme by separating the storage of actual
results and the characterization of the results into separate sets of tables. There is
one table that stores actual results, no matter what kind. There are tables that
provide for general characterization of results (which assay, what kind of result,
which particular batch of data, etc.) and tables that allow more specific characterization
(which compound(s) and/or which cell line(s) were used).
Result
Actual results are stored in the
TESTRESULTS table. This table allows for four types of results:
- Numerical
- Free text
- restricted text
- chemical structure
All or any subset of these types may be used in a particular record. There are fields that can be used to
specify the unit of the result, the number of observations summarized by the result,
a dispersion type (standard deviation, range, etc.) and a dispersion value. In order to insure
consistent usage, all except the last of
these are keys to records in the appropriate tables. The free text field allows any text, while the
restricted text field is a key to a
DESCRIPTIVEDATA record. This allows the result to be restricted to one out of a defined set of
text strings. There is also a priority field that can be used to provide a sorting key for similar results.
The chemical structure result is a key to a record in the
STRUCTURES table.
Addition of an image result is under discussion.
Overall Characterization of results
The primary table for providing an overall characterization of results is
TESTMASTER.
This table provides fields for a short descriptive name and a longer text explanation.
The longer text explanation should describe the use of any more specific characterizations
described below. The table also provides pointers to records in
DATACLASSES
and DATATYPES,
allowing data under this TESTMASTER to be described by those categories. There are text fields
for protocol description (see below) and for a release identifier. The TESTMASTER record
also contains fields that point to where the data can be found. The database pointed to need
not be the same database that holds the particular TESTMASTER record. This means data
can be distributed, with local databases not needing to copy all data from central sources.
Finally, the TESTMASTER record also contains fields that hold URLs. The WEBLINKDOC field
should contain a URL that provides more detailed documentation and explanation for the
particular set of data referred to by the TESTMASTER and the WEBLINKCGI field should
contain a URL for a CGI or servlet type program that allows web retrieval of the data.
The specific syntax of the servlet may vary, but for all DTP servers adding
?searchtype=help to the end of the WEBLINKCGI value will always return full documentation
for use of the servlet.
Further general characterization of a particular set of data is provided by
the TESTSUBSET
table. As implied by the name, each TESTSUBSET record is a subclassification of
a particular TESTMASTER. A priority field is included to allow for sorting of
TESTSUBSETs within a given TESTMASTER.
Specific Characterization of results
In principle, the TESTMASTER and TESTSUBSET tables provide for complete characterization
of a given result, but in practice there are situations where this would not be useful.
For example, the same protocol may be applied to test a large number of compounds. It is
possible to assign the result for each different compound to a different TESTSUBSET,
but that makes searching by compound (especially across different TESTMASTERs) very
difficult to reliably implement. We have attacked this problem by making specific tables
to handle specific result characterizations. As of December 2001, there are two such
tables;
TESTCOMPOUNDS
and TESTCELLLINES.
The basic function of these tables is to link an internal database key for the compound or cell line
to a particular TESTRESULT record. There is also a "position" field in each table that allows
for unambiguous linking of a particular test record to not only a specific TESTRESULT record,
but also to a specific place in the protocol. For example, a TESTMASTER protocol
might read "CELLLINE 1 was plated at time zero and media conditioned for 48 hours on CELLLINE
2 was added. At 24 hours COMPOUND 1 was added and at 48 hours COMPOUND 2 was added." The compound
and cell line numbers refer the value in the "position" field in the test record. Note that the TESTCOMPOUND
table also includes fields to specify the concentration of compound used (with unit).
Notice that the above only addresses half the problem. We have connected a TESTRESULT record to
a compound or cell line, but only via an internal database key representing it. We still
need to make a connection between this key and some externally meaningful identifier. This is
done via the
TRANSLATECOMPOUNDNBR and
TRANSLATECELLINENBR tables. These tables make the association between the internal database key and
specific externally meaningful identifiers. Each has a number of fields that can be used. The
particular fields used and their meanings are stored in the
COMPOUNDIDSYSTEMS
and
CELLLINEIDSYSTEMS tables. Note that:
- The internal database keys are specific to a particular database and can not be assumed to
be consistent from database to database.
- There is no requirement that a particular external identifier map to one and only
one database key.
These facts make database maintenance much easier. Maintaining consistent internal database keys
across multiple databases in multiple places essentially means making those keys a new externally
meaningful identifier, a monumental task that is well beyond the scope of this project. Also,
integrating data with different identifiers becomes easier. If one set of data identifies a
compound as NSC X and a different set identifies the same compound as CAS YY-YY-Y, the equivalence
of the two external identifiers need not be known at load time. When the equivalence is
realized, the only thing that needs to be done to insure that both results come up when
a search is done for either X or YY-YY-Y is to add the appropriate records in the
TRANSLATECOMPOUNDNBR table.
Publicly available implementation
The DTP data warehouse model has been implemented in an Oracle database that is
publicly accessible. The specifics required for connection are:
- host name - dtpax2.ncifcrf.gov
- instance name - webdb
- port - 1523
- read only user - anyuser
- read only password - look
The DTPConnect class in our JAVA library is set up to connect to this instance
by default.
As of January 2004, this database contains
- 266,873 NSC numbers
- 124,453 CAS numbers
- 262,298 molecular formulas
- 261,179 molecular weights
- 255,356 SMILES strings
- 257,697 2D chemical structures
- 235,661 3D chemical structures
- 225,179 chemical names
- 131,527 results from the AIDS screen
- 4,581,555 results from the 60 cell line/one dose screen
- 7,430,929 results from the 60 cell line/five dose screen
- 513,667 results from the Stage 0 yeast screen
- 169,308 results from the Stage 1 yeast screen
- 123,120 results from the Stage 2 yeast screen
- 2,888,757 results from the 60 cell line screen on natural products
- 37,817 results from various molecularly targeted assays
- 175,449 results from the 6 day cancer screen
- >1.1 million results from the old mouse in vivo antitumor screen
Programming
Some programming examples.