DTP Data Warehouse Model

Introduction

The Developmental Therapeutics Program (DTP) of the National Cancer Institute has long been committed to providing researchers access to chemical and biological data. In order to make this data even more useful, we have developed a common data architecture for all the data DTP makes available. The basis for this architecture is a set of relational database tables. We have also developed a library of JAVA classes that provides a consistent programming interface to the database. This architecture has been implemented in a publicly accessible database. We hope that this model encourages the development of applications that not only make use of a wider range of DTP data, but also integrate data from other sources.

Overview

A data architecture needs to provide three main things:
  1. a way to store the actual results
  2. a way to store the additional information necessary to fully characterize the results (protocol, compounds used, cell lines used, etc.)
  3. useful ways to search and retrieve the results
It is common to achieve all three goals by making each result a row in a relational database table. There would be a column for the result and columns for compound and/or cell line identifiers. Additional information could be contained in additional columns if searching this information was desired or it could be contained in written documentation that explained the specific data the particular table was designed to hold. The problem with this strategy of one specifically designed table per set of assay results is that it becomes difficult to handle data in a general fashion. Data can not be loaded or accessed until the appropriate tables and supporting software are developed. Search tools are difficult to maintain because the software has to be aware of every new table and its structure.

We have decided to create a more general scheme by separating the storage of actual results and the characterization of the results into separate sets of tables. There is one table that stores actual results, no matter what kind. There are tables that provide for general characterization of results (which assay, what kind of result, which particular batch of data, etc.) and tables that allow more specific characterization (which compound(s) and/or which cell line(s) were used).

Result

Actual results are stored in the TESTRESULTS table. This table allows for four types of results:
  1. Numerical
  2. Free text
  3. restricted text
  4. chemical structure
All or any subset of these types may be used in a particular record. There are fields that can be used to specify the unit of the result, the number of observations summarized by the result, a dispersion type (standard deviation, range, etc.) and a dispersion value. In order to insure consistent usage, all except the last of these are keys to records in the appropriate tables. The free text field allows any text, while the restricted text field is a key to a DESCRIPTIVEDATA record. This allows the result to be restricted to one out of a defined set of text strings. There is also a priority field that can be used to provide a sorting key for similar results. The chemical structure result is a key to a record in the STRUCTURES table.

Addition of an image result is under discussion.

Overall Characterization of results

The primary table for providing an overall characterization of results is TESTMASTER. This table provides fields for a short descriptive name and a longer text explanation. The longer text explanation should describe the use of any more specific characterizations described below. The table also provides pointers to records in DATACLASSES and DATATYPES, allowing data under this TESTMASTER to be described by those categories. There are text fields for protocol description (see below) and for a release identifier. The TESTMASTER record also contains fields that point to where the data can be found. The database pointed to need not be the same database that holds the particular TESTMASTER record. This means data can be distributed, with local databases not needing to copy all data from central sources. Finally, the TESTMASTER record also contains fields that hold URLs. The WEBLINKDOC field should contain a URL that provides more detailed documentation and explanation for the particular set of data referred to by the TESTMASTER and the WEBLINKCGI field should contain a URL for a CGI or servlet type program that allows web retrieval of the data. The specific syntax of the servlet may vary, but for all DTP servers adding ?searchtype=help to the end of the WEBLINKCGI value will always return full documentation for use of the servlet.

Further general characterization of a particular set of data is provided by the TESTSUBSET table. As implied by the name, each TESTSUBSET record is a subclassification of a particular TESTMASTER. A priority field is included to allow for sorting of TESTSUBSETs within a given TESTMASTER.

Specific Characterization of results

In principle, the TESTMASTER and TESTSUBSET tables provide for complete characterization of a given result, but in practice there are situations where this would not be useful. For example, the same protocol may be applied to test a large number of compounds. It is possible to assign the result for each different compound to a different TESTSUBSET, but that makes searching by compound (especially across different TESTMASTERs) very difficult to reliably implement. We have attacked this problem by making specific tables to handle specific result characterizations. As of December 2001, there are two such tables; TESTCOMPOUNDS and TESTCELLLINES. The basic function of these tables is to link an internal database key for the compound or cell line to a particular TESTRESULT record. There is also a "position" field in each table that allows for unambiguous linking of a particular test record to not only a specific TESTRESULT record, but also to a specific place in the protocol. For example, a TESTMASTER protocol might read "CELLLINE 1 was plated at time zero and media conditioned for 48 hours on CELLLINE 2 was added. At 24 hours COMPOUND 1 was added and at 48 hours COMPOUND 2 was added." The compound and cell line numbers refer the value in the "position" field in the test record. Note that the TESTCOMPOUND table also includes fields to specify the concentration of compound used (with unit).

Notice that the above only addresses half the problem. We have connected a TESTRESULT record to a compound or cell line, but only via an internal database key representing it. We still need to make a connection between this key and some externally meaningful identifier. This is done via the TRANSLATECOMPOUNDNBR and TRANSLATECELLINENBR tables. These tables make the association between the internal database key and specific externally meaningful identifiers. Each has a number of fields that can be used. The particular fields used and their meanings are stored in the COMPOUNDIDSYSTEMS and CELLLINEIDSYSTEMS tables. Note that:

  1. The internal database keys are specific to a particular database and can not be assumed to be consistent from database to database.
  2. There is no requirement that a particular external identifier map to one and only one database key.
These facts make database maintenance much easier. Maintaining consistent internal database keys across multiple databases in multiple places essentially means making those keys a new externally meaningful identifier, a monumental task that is well beyond the scope of this project. Also, integrating data with different identifiers becomes easier. If one set of data identifies a compound as NSC X and a different set identifies the same compound as CAS YY-YY-Y, the equivalence of the two external identifiers need not be known at load time. When the equivalence is realized, the only thing that needs to be done to insure that both results come up when a search is done for either X or YY-YY-Y is to add the appropriate records in the TRANSLATECOMPOUNDNBR table.

Publicly available implementation

The DTP data warehouse model has been implemented in an Oracle database that is publicly accessible. The specifics required for connection are: The DTPConnect class in our JAVA library is set up to connect to this instance by default. As of January 2004, this database contains

Programming

Some programming examples.