A Workshop organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities, and held at Purmerend, The Netherlands, May 1997
The workshop, "Financing Biotechnology Databases", was held at the Golden Tulip Hotel, Purmerend, The Netherlands, in May 1997.
Databases from Universities - PRINTS, a research tool that has grown into a resource -- Teresa T Attwood
The database today known as PRINTS is based upon work that began at Leeds University, in 1989, as part of a post-doctoral project to model G-protein coupled receptors (GPCRs). These are membrane proteins of great interest to the pharmaceutical industry, being the targets for ~75% of all current drugs.
Sequence analysis revealed GPCRs to be one of the fastest growing families in sequence databases, and so a reliable mechanism for extracting all the known GPCRs from the primary (sequence) databases was required.
Early work showed that characteristic signatures, or fingerprints, denoting family membership could be derived using groups of conserved motifs within sequence alignments. This gave birth to a new analytical technique, whereby a new sequence matching all fingerprint elements (motifs) could be reliably diagnosed as a true 'hit'.
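The fingerprint idea can be sketched in a few lines of Python. This is an illustration only, not the PRINTS search software: the function names, the representation of a motif as one set of allowed residues per alignment column, the requirement that motifs occur in order without overlap, and the toy sequences are all assumptions made for the example.

```python
from typing import List, Optional, Set

# Hypothetical representation: one set of allowed residues per motif column.
Motif = List[Set[str]]


def build_motif(aligned_segments: List[str]) -> Motif:
    """Collapse equal-length aligned segments into per-column residue sets."""
    width = len(aligned_segments[0])
    if any(len(seg) != width for seg in aligned_segments):
        raise ValueError("motif segments must all have the same length")
    return [{seg[i] for seg in aligned_segments} for i in range(width)]


def find_motif(sequence: str, motif: Motif, start: int = 0) -> Optional[int]:
    """Return the first offset >= start where every column matches, else None."""
    width = len(motif)
    for offset in range(start, len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        if all(res in allowed for res, allowed in zip(window, motif)):
            return offset
    return None


def match_fingerprint(sequence: str, fingerprint: List[Motif]) -> List[Optional[int]]:
    """Match motifs in order, without overlap; None marks a missing motif."""
    positions: List[Optional[int]] = []
    cursor = 0
    for motif in fingerprint:
        pos = find_motif(sequence, motif, cursor)
        positions.append(pos)
        if pos is not None:
            cursor = pos + len(motif)  # the next motif must start after this one
    return positions


def is_true_hit(sequence: str, fingerprint: List[Motif]) -> bool:
    """A sequence is a true hit only if it matches every motif."""
    return all(p is not None for p in match_fingerprint(sequence, fingerprint))


# Toy two-motif fingerprint built from made-up alignment segments.
fingerprint = [build_motif(["ACD", "ACE"]), build_motif(["WYK", "WFK"])]
full_match = "MMACEGGGWFKLL"   # contains both motifs, in order
partial_match = "MMACEGGGLL"   # contains only the first motif

print(match_fingerprint(full_match, fingerprint), is_true_hit(full_match, fingerprint))
print(match_fingerprint(partial_match, fingerprint), is_true_hit(partial_match, fingerprint))
```

A single-motif method would accept both toy sequences on the strength of the first motif alone, whereas the fingerprint test flags the second as only a partial match. A real implementation would presumably score motif matches rather than demand exact membership in the residue sets.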
Fingerprinting offered a significant improvement over single-motif methods and it was soon apparent that the technique could find wide application. This realisation afforded the opportunity to start a new database, and in 1991 work was extended to other families. Ten fingerprints were therefore compiled into a prototype database, and work began, sponsored by a software company that held an exclusive licence to commercialise the software needed to carry out this task. The team at Leeds then included three undergraduate summer students (who created database entries) and a PhD student (who provided search software). A collaboration with the SEQNET manager yielded the new database query language.
Unfortunately, the high hopes raised in this period were not maintained. 1992/1993 saw the project enter the doldrums as the software was not commercialised; the PhD student graduated and left, so ending our programming support; two of the summer students graduated and left; my contract finished, but I was able to stay on using "ancillary funding"; and the third student was unemployed, but continued to help.
In spite of there being no official funding, we sustained the database, and by July '93 it contained 100 entries. In October '93, I was awarded a five-year independent Fellowship from the Royal Society. This precipitated a move from Leeds to University College London (UCL) and signified the first direct funding to support the resource. This event was marked by naming the database PRINTS, formally identifying it with UCL. The first PRINTS release (10/93) contained 150 entries.
But our problems were not over. Universities now protect their intellectual property very aggressively and, perhaps seeing a lost opportunity, Leeds felt that, since the work had originated there, they "owned it". Intense discussions concerning ownership raged for the next two years between Leeds and UCLi, the department that protects UCL's property. Eventually, the DTI urged a settlement in UCL's favour, but the matter has still not been formally resolved.
In spite of these difficulties, work continued and, in August 1994, we established a Web site at UCL to serve PRINTS and related databases and tools. The server is presently accessed ~3,000 times/day; as the database is also mounted on other servers and ftp sites around the world, the actual usage is probably considerably greater. PRINTS continues to grow. It is released quarterly, and version 16.0 contains 750 entries.
However, the very success of the database threatens its future - keeping PRINTS alive has become a Sisyphean task, and the truth is that its maintenance is now a major project in its own right. No longer just an interesting research tool, it is an increasingly important resource accessed by researchers worldwide, who rely on the next release - on time. Curation is all-consuming, and there is little time for research. Yet despite the need for resources of this type, financial support is not guaranteed: the UK Research Councils will not fund database maintenance; universities are not database providers; and the EBI is not, and cannot be, a foster home for our teenage databases, now that they've grown and are proving difficult to manage. On the other hand, if a database is successful, one person can't do it all... no matter how hard s/he tries.
Although essential research tools, realistically, secondary "motif" databases do not begin to address the flood of primary sequence data. There are >250,000 protein sequences in non-redundant primary databases (>1,000,000 in EST databases); and probably >10,000 protein families. Deriving diagnostic patterns, annotating this number of families, and keeping pace with the primary resources is a virtually impossible (individual) human task; but the work has to be done by experts. Although there are now several automatically-derived secondary databases, these are without annotation and result validation. The inherent value of annotated databases is that they are a vital step towards converting information into knowledge; they are therefore still people-dependent.
The need for expert curation is key. At present, only 3 groups in the world produce annotated secondary (i.e., protein motif/pattern) databases:
| Geneva | ISREC | UCL |
| PROSITE | Profiles | PRINTS |
| Amos Bairoch | Kay Hofmann | Teresa Attwood |
| | Philipp Bucher | |
PROSITE was the first, is the largest and most widely used, and the current version documents 947 families/sites. But this project is also feeling the strain and the last PROSITE releases were 15 months apart. Profiles are slowly being compiled and released as part of PROSITE, but only ~24 have been made available. PRINTS is still on schedule, just, but maintenance is the proverbial uphill struggle. Overall, there is simply too much work for 4 individuals; and, like it or not, if we consider databases of this type to be important, then they need financial support. Happily, all three groups work together and continue to put in grant applications, with a view to integrating our efforts. But even if such grants are funded, these are ultimately only short-term solutions. Most of my group is funded by the pharmaceutical industry to work on PRINTS-related tasks (i.e., not PRINTS per se), and my own funding is only guaranteed for another 16 months - if I'm lucky, I might get a further 3 years.
The key question for me, and for our users, is whether PRINTS can survive. Under the present conditions, 12-weekly releases are almost impossible to sustain (a single release involves researching 50 new families). It is clear that PRINTS has no guaranteed long- or even medium-term future. The team is not funded in anything like the manner it needs to establish a stable, regular, updating and validating system. PRINTS needs a "life sentence", which means providing the resources to keep it regularly maintained, updated, error-checked, validated, annotated, etc. In short, long-term annotators and software support are essential - secondary databases need professional, not amateur, curators. Ideally, we need international cooperation, coordination and collaboration (e.g. from publishers, industry and academia) to make the most of our sequence analysis resources. We must define standards, eliminate duplication of effort, and invest in the future. But we need people, or organisations, or an infrastructure that is prepared to make that investment. Otherwise, the dedication of committed research groups will have been a waste, the lessons learned through blood, sweat and tears will be lost, and the enthusiasm of future scientists will be frittered away in Sisyphean perpetuity. In the final analysis, are we so affluent that we can afford to squander the time, the talent and the money already invested in these vital secondary databases?
Collaborators
| Alex Michie (UCL) | Web s/w |
| Martin Jones (UCL) | Server support |
| David Parry-Smith (Pfizer) | Analysis s/w |
| Alan Bleasby (Daresbury) | Database s/w |
| Mike Beck (Leeds) | Fingerprints |
| Kirill Degtyarenko (Leeds) | Fingerprints |