A Workshop organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities, and held at CAB International, Wallingford, Oxon, UK, October 1996
Validating the factual databases, keeping standards high -- Rolf Apweiler
SWISS-PROT is a curated protein sequence database established in 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since 1987. It currently contains nearly 60 000 protein sequence entries with a high level of annotation, a minimal level of redundancy, and a high degree of integration with other databases.
The curation element is essential: SWISS-PROT is an R&D tool that was, and still is, built by biologists for biologists. Its scientific quality is assured and it is highly integrated with other databases. The 59 000 SWISS-PROT entries are abstracted from 50 000 references and linked by 250 000 direct pointers to 27 related or specialised data collections.
Every database has its own quality needs. Factual databanks are often dismissed as a collection of numbers but these numbers are naturally just as important to science as words. The essential criteria for a sequence data bank are that it should:
* be complete with minimal redundancy
* contain as much up-to-date information as possible on each sequence
* allow the information items to be retrieved by computer programs in a consistent manner
* be integrated (cross referenced) with other sequence related data banks
Preparing SWISS-PROT takes time, which means that it can face a currency problem. As industry and academia demand up-to-date databases, a solution has been sought in TREMBL, a computer-annotated supplement to SWISS-PROT. This database is made by translating coding sequences from EMBL into SWISS-PROT format, allowing "pre-SWISS-PROT" data to be made available for research purposes. When the SWISS-PROT record is completed, the TREMBL record is deleted, so that a continually updated and current set of data is always available. Together, SWISS-PROT + TREMBL offer a complete and up-to-date protein sequence collection.
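The relationship between the two collections can be sketched in a few lines of Python. This is only an illustration of the lifecycle described above, not the production procedure: the function names, the `cds_id` keys and the toy sequences are all hypothetical.

```python
def build_trembl(embl_translations, curated_ids):
    """Keep every translated CDS that has no curated SWISS-PROT record yet.

    embl_translations: {cds_id: protein_sequence} (hypothetical structure)
    curated_ids: ids of CDS already covered by a completed SWISS-PROT record
    """
    return {
        cds_id: protein
        for cds_id, protein in embl_translations.items()
        if cds_id not in curated_ids
    }


def combined_collection(swissprot, trembl):
    """SWISS-PROT + TREMBL; a curated record wins if an id is in both."""
    combined = dict(trembl)
    combined.update(swissprot)
    return combined


# Toy data, invented for illustration only.
embl_translations = {"cds1": "MKV", "cds2": "MAL", "cds3": "MST"}
swissprot = {"cds1": "MKV"}

trembl = build_trembl(embl_translations, set(swissprot))
print(sorted(trembl))  # cds2 and cds3 are still awaiting curation

# Once the curated record for cds2 is completed, it drops out of TREMBL.
swissprot["cds2"] = "MAL"
trembl = build_trembl(embl_translations, set(swissprot))
print(sorted(trembl))
print(len(combined_collection(swissprot, trembl)))
```

Rebuilding the supplement from the current EMBL translations, rather than editing it by hand, is one simple way to guarantee that a record never appears in both collections.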
A deeper integration between the EMBL Nucleotide Sequence Database and SWISS-PROT + TREMBL has been achieved by using PID numbers in addition to accession numbers. PID stands for the "Protein IDentification" number, found in EMBL entries in a qualifier called "/db_xref" which is tagged to every CDS in the nucleotide database.
Example:
FT   CDS             54..1382
FT                   /note="ribulose-1,5-biphosphate carboxylase/
FT                   oxygenase activase precursor"
FT                   /db_xref="PID:g1006835"
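A program might pull such cross-references out of feature-table lines roughly as follows. This is a minimal sketch: the regular expression, the function name `extract_db_xrefs` and the column layout of the sample text are assumptions made for illustration, not part of any EMBL tool.

```python
import re

# Sample feature-table text based on the example above (layout assumed).
FT_LINES = '''\
FT   CDS             54..1382
FT                   /note="ribulose-1,5-biphosphate carboxylase/
FT                   oxygenase activase precursor"
FT                   /db_xref="PID:g1006835"
'''

# Matches /db_xref="DATABASE:IDENTIFIER" and captures both parts.
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def extract_db_xrefs(ft_text):
    """Return (database, identifier) pairs for every /db_xref qualifier."""
    return [(m.group(1), m.group(2)) for m in DB_XREF.finditer(ft_text)]


print(extract_db_xrefs(FT_LINES))
```

Because the qualifier has a fixed `database:identifier` shape, the same function would also pick up cross-references to databases other than PID.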
When an EMBL database CDS exists as a sequence report in SWISS-PROT, the DR lines of the corresponding SWISS-PROT entry have been updated by citing the PID as a secondary identifier. In all cases where a PID has been integrated into SWISS-PROT, a "/db_xref" qualifier citing the corresponding SWISS-PROT entry has been added to the EMBL database CDS labelled with this PID.
Example:
FT   CDS             144556_15695
FT                   /gene="cytochrome b"
FT                   /codon_start=1
FT                   /product="apoprotein"
FT                   /db_xref="PID:g463170"
FT                   /db_xref="SWISS-PROT:P12778"
This approach enables SWISS-PROT to point precisely from a given SWISS-PROT entry to one of potentially many CDS in the corresponding EMBL entry and vice versa. A STATUS_IDENTIFIER in the cross-references from SWISS-PROT to the EMBL entries provides information about the relationship between the sequence in the SWISS-PROT entry and the CDS in the corresponding EMBL entry.
The data is therefore validated by cross-checking. For this to work, it is essential that the data has stable identifiers, which are used to link the records and thus provide interoperability between the datasets.
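One way to picture this cross-checking is a test that every link is reciprocated in the other database. The sketch below assumes the links have already been extracted into two simple dictionaries. Those structures, the function name and the second identifier pair are hypothetical; only `g463170` and `P12778` come from the example above.

```python
def find_unreciprocated(embl_links, swissprot_links):
    """Report cross-references that exist in one direction only.

    embl_links: {pid: swissprot_accession} taken from /db_xref qualifiers
    swissprot_links: {swissprot_accession: set of pids} taken from DR lines
    """
    problems = []
    for pid, accession in embl_links.items():
        if pid not in swissprot_links.get(accession, set()):
            problems.append(("EMBL to SWISS-PROT only", pid, accession))
    for accession, pids in swissprot_links.items():
        for pid in sorted(pids):
            if embl_links.get(pid) != accession:
                problems.append(("SWISS-PROT to EMBL only", pid, accession))
    return problems


embl_links = {
    "g463170": "P12778",
    "g0000001": "X00001",  # invented pair with no matching DR line
}
swissprot_links = {"P12778": {"g463170"}}

for problem in find_unreciprocated(embl_links, swissprot_links):
    print(problem)
```

A check of this kind is only meaningful if the identifiers on both sides are stable, since a renamed identifier would be indistinguishable from a missing link.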
The computer is an essential element in sequence and annotation validation, as well as for adding value to the data. An important step in the production of SWISS-PROT + TREMBL is the reduction of redundancy. It is important to remember that there is a huge amount of redundancy (for instance, there are 8 HIV genes but more than 13 000 sequence reports). Another important step is the information-enhancing process. For TREMBL to act as a computer-annotated supplement to SWISS-PROT, new procedures have been introduced whereby valuable annotation is added automatically. Currently, special analysis tools, sequence similarity searches and scanning against other databases such as PROSITE, Enzyme and genomic databases are used. At all these levels, potential errors in the data are detected and marked in the database entries. All this requires trained validators. SWISS-PROT has a permanent full-time staff of 14, with 7 based in Geneva and another 7 in Hinxton, but also relies upon some 200 external scientists contributing their skills voluntarily as experts.
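The redundancy reduction step can be illustrated with a deliberately simplified sketch. It assumes that two reports are redundant when they give exactly the same sequence for the same organism. The text does not state the real merging criteria, which are likely more involved, and the function name and sample reports are invented.

```python
from collections import defaultdict


def merge_identical(reports):
    """Collapse sequence reports that share organism and exact sequence.

    reports: iterable of (report_id, organism, sequence) tuples.
    Returns one merged record per distinct (organism, sequence) pair,
    keeping the ids of all contributing reports as sources.
    """
    groups = defaultdict(list)
    for report_id, organism, sequence in reports:
        groups[(organism, sequence.upper())].append(report_id)
    return [
        {"organism": organism, "sequence": sequence, "sources": ids}
        for (organism, sequence), ids in groups.items()
    ]


# Invented sample reports, for illustration only.
reports = [
    ("r1", "HIV-1", "MGARAS"),
    ("r2", "HIV-1", "mgaras"),  # same sequence, different letter case
    ("r3", "HIV-1", "MGARAT"),
]

for record in merge_identical(reports):
    print(record)
```

Keeping the list of source reports on each merged record preserves the trail back to the original submissions, which matters when corrections have to be traced later.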
Because the data changes, SWISS-PROT offers a "snap-shot" of the situation at any one time. Corrections need to be tracked so that a story can be developed. All in all, SWISS-PROT is a living database that offers highly annotated information in a complex, cross-referenced and therefore cross-checked, medium. It is not for nothing that SWISS-PROT is increasingly the "core database" used when navigating through a series of databases answering a particular question.