Financing Biotechnology Databases

A Workshop organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities, and held at Purmerend, The Netherlands, May 1997


The workshop "Financing Biotechnology Databases" was held at the Golden Tulip Hotel, Purmerend, The Netherlands, in May 1997. It was organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities.


Electronic Databases and the Scientific Record -- Graham Cameron

In the discussion of electronic publishing much attention typically focuses on the activities of conventional publishers and the move towards electronic mechanisms in the publishing process. A different viewpoint comes from the role of electronic archives of scientific information which were originally developed fundamentally as databases but have become part of the basic scientific record. We now see something of a convergence of two different kinds of activities. Publication is becoming more database-like. In the electronic era the structure of published information and the tool-set to exploit it invoke more and more sophisticated information technology. At the same time databases are starting to become, in some senses, more like publications. As the shared repositories of scientific information become part of the underpinning infrastructure of science, they play an increasing role in the scientific record. I want to discuss here the role of the databases in the revolution of the scientific record in the electronic era.

1 THE ORIGIN OF SCIENTIFIC DATABASES

Scientific databases have various origins, but typically they were never actually conceived as databases. They turned into databases almost by accident. In the first instance people used computers to compute on scientific data - to help them do their sums. Soon repetitive tasks led to reusable programs, and those programs processed data in "standard formats". Scientists working in particular domains developed growing collections of information in the so-called "standard format" of the programs they used, and began to write more and more software to process data in these formats. Soon the information stored this way began to be referred to as a database.

Another thrust came from the kind of experiments made possible by electronic and technological developments. Experimentation which collects huge amounts of data has now become possible, and automatic data capture tools in many areas of science have resulted in enormous repositories of data which would have been inconceivable without electronic tools. Systems were built to capture and store those data, again typically in some kind of "standard format", and the resulting pool of information soon became referred to as a database. Many of the resulting so-called databases would cause any database designer to throw up their hands in horror, for they were never designed; they were simply an ad hoc method of capturing a lot of scientific information.

2 THE EUROPEAN BIOINFORMATICS INSTITUTE

The European Bioinformatics Institute (EBI) grew out of such a data collection effort. Many of you will know that, for a couple of decades, it has been possible to determine DNA sequences - the exact sequence of the base-pairs of the genetic information in the cells of organisms. Indeed, in recent years methodological advances have turned this into a very large scale data collection effort.

As long ago as 1980 the European Molecular Biology Laboratory (EMBL) established its Nucleotide Sequence Data Library, with the goal of collecting all such information. This information had hitherto been published on the pages of scientific journals, and the goal of the Data Library was to build a scientific database to incorporate it, organise it and distribute it.

Time has moved on since then, and biology has become information-intensive in many different areas. Current and emerging methods to determine the sequences of bio-macromolecules (DNA, RNA and protein), molecular structures and the biological functions of those molecules have all generated huge amounts of information. In response to this, in December 1992 the European Molecular Biology Laboratory decided to establish the European Bioinformatics Institute, and this decision was gradually implemented through to 1995, including relocating the operation to the UK. The EBI has an institutional mission to provide public domain information services for molecular biological and biotechnological research. It incorporates and extends the mandate of the original EMBL Data Library. It is worth noting that the EBI is in fact one of five locations at which EMBL operates, its headquarters being in Heidelberg, in Germany.

The domain of the EBI is largely in the area of macromolecular information - information about the large molecules that are important in biological functions. The original motivation for the establishment of the group in Heidelberg, and indeed still the single biggest project, is the Nucleotide Sequence Database. The EBI is now also deeply involved in the SWISS-PROT Protein Sequence Database and, in collaboration with a US group, a protein structure database, as well as various aspects of documenting protein function. The core services of the EBI hinge on databases in these areas, which are made available in various modalities, with World Wide Web access being, of course, the preferred method of using the databases. The core collections are doubling in size in less than two years and the growth is, if anything, accelerating. The databases acquire a new sequence once every minute, 24 hours a day, 365 days a year. They represent information from 20,000 or so organisms, with about 40% of the data being human information.
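To give a feel for these rates, the arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the two-year doubling time is used as an upper bound taken from the "less than two years" figure quoted above; this is an illustration of the quoted numbers, not a description of how the EBI measures growth.

```python
# Back-of-envelope arithmetic for the acquisition and growth rates quoted above.
# Function names are illustrative only.

MINUTES_PER_YEAR = 60 * 24 * 365  # 24 hours a day, 365 days a year


def sequences_per_year(per_minute: float = 1.0) -> int:
    """Sequences acquired in a year at a steady rate of `per_minute`."""
    return int(per_minute * MINUTES_PER_YEAR)


def growth_factor(years: float, doubling_time_years: float = 2.0) -> float:
    """Size multiplier after `years` if the collection doubles every
    `doubling_time_years` (assumed constant; the text suggests it is shrinking)."""
    return 2.0 ** (years / doubling_time_years)


if __name__ == "__main__":
    print(sequences_per_year())   # sequences per year at one per minute
    print(growth_factor(10))      # multiplier after a decade of two-year doublings
```

At one sequence a minute the collection gains over half a million sequences a year, and a steady two-year doubling time would already imply a 32-fold increase in a decade.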

These activities are conducted in a long-established international collaboration with the groups in the USA and Japan, which ensures that data entering the collection in Japan, the USA or Europe is exchanged on a daily basis. The user community is broad and diverse, with tens of thousands of accesses to the databases each day from areas such as basic biological research, biotechnology, pharmaceutical research, medicine, and agriculture, in both the academic and the commercial sectors.

Much of the usage is analogous to the use of bibliographic databases, with simple look-up to see what has been sequenced and what is in the database. What is unusual, compared with typical database usage, is the kind of computation that users perform on the entire collection. For example, it is commonplace for people who have the sequence of a particular gene to search it against the entire database to look for other sequences that are biologically similar. Indeed, the business of deciding what constitutes biological similarity is a research topic in its own right.
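A toy version of such a whole-collection search can be sketched as follows. It ranks every entry by the fraction of short subsequences (k-mers) it shares with the query. This measure, the function names and the tiny example collection are all assumptions made for illustration; real similarity searches rely on far more refined alignment methods, and, as noted above, what counts as biological similarity is itself an open question.

```python
# Toy similarity search: rank database entries by shared k-mers with a query.
# This is NOT the method used by production search tools; it only illustrates
# the idea of comparing one sequence against an entire collection.


def kmers(seq: str, k: int = 4) -> set[str]:
    """All distinct substrings of length k in the sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def search(query: str, database: dict[str, str], k: int = 4,
           top: int = 3) -> list[tuple[str, float]]:
    """Score every entry against the query and return the best `top` hits."""
    q = kmers(query, k)
    scored = [(name, jaccard(q, kmers(seq, k))) for name, seq in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


if __name__ == "__main__":
    # Invented entries, not real database records.
    toy_db = {"a": "ACGTACGTAC", "b": "TTTTTTTTTT", "c": "ACGTACGTTT"}
    for name, score in search("ACGTACGTAC", toy_db):
        print(name, round(score, 3))
```

Note that every query touches every entry, which is why this kind of usage is so much more demanding than a bibliographic look-up.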

Perhaps the single biggest computational problem studied with the combination of databases available from the EBI is the relationship between sequence, structure and function. DNA sequences code for protein sequences, which in turn are responsible for determining the final 3-dimensional structure of the proteins involved. It is that 3-dimensional structure which is in large part responsible for the function of the proteins. However, whilst it is accepted that sequence and structure are deterministically linked, their relationship is still poorly understood and an enormous amount of research is carried out in investigating that relationship.
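Only the first step of this chain, from DNA sequence to protein sequence, is mechanical enough to show in a few lines. The sketch below uses the standard genetic code and assumes a coding sequence that starts in frame and contains only the letters A, C, G and T; the function name is hypothetical. The later steps, from protein sequence to structure and function, are exactly the poorly understood part and are not attempted here.

```python
# Translate a DNA coding sequence into a protein sequence using the
# standard genetic code. Assumes the sequence starts in frame.
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...);
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Read codons from the first base and stop at the first stop codon.

    Raises KeyError on characters other than A, C, G, T.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGAAATGGTGA"))
```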

Another important use of the database is in the area of molecular evolution. The understanding of the relationship between different organisms in the evolutionary process is now best studied by looking at the differences between the DNA sequences of those organisms.
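The simplest measure of such differences is the proportion of positions at which two aligned sequences disagree. The sketch below assumes the sequences have already been aligned to equal length, and the entry names and sequences are invented for illustration. Real evolutionary analyses use more elaborate models than this raw proportion.

```python
# Proportion of differing sites between aligned sequences, and the closest
# pair in a small collection. Assumes pre-aligned, equal-length sequences.
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for x, y in zip(a.upper(), b.upper()) if x != y)
    return differences / len(a)


def closest_pair(sequences: dict[str, str]) -> tuple[str, str]:
    """Names of the two entries with the smallest p-distance."""
    return min(
        combinations(sorted(sequences), 2),
        key=lambda pair: p_distance(sequences[pair[0]], sequences[pair[1]]),
    )


if __name__ == "__main__":
    # Invented toy sequences, not real organisms.
    toy = {"org_a": "ACGTACGT", "org_b": "ACGTACGA", "org_c": "ACTTAGGA"}
    print(closest_pair(toy))
```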

3 SUPPORT OF INFORMATION PROVIDERS

In recognition of the importance of such information sources, the EBI is a publicly-funded information provider (as indeed are our analogous organisations in the US and Japan). There are commercial suppliers in the same domain although these do not collect and collate the data in the way the EBI and its colleague institutions do. The existence of organisations like the EBI reflects the acceptance that the understanding of living systems is utterly dependent on this shared pool of information. Indeed society in general depends on many such information sources, and it is my belief that, if we regard this as so crucially important, greater attention should be given to the overall principles of information provision.

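There are several information funding models:

- public funding

- charging according to usage

- charging for the right to use the information (for example, by subscription)

There are also mixtures of all of these.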

The public model funds the information sources from the public purse and gives the information away free of charge. This is typical of situations where the users and suppliers are funded from the same source; basically, either the database builder is paid to produce the database and give it away free, or the user is paid to buy the data.

The usage models mean that the user pays a charge which is in some way related to usage. The database supplier then recoups the costs of building the product. Another mechanism is to charge for the right to use the information, perhaps by a subscription to the service in question.

There are examples of public funding being mixed with other forms of payment.

Public funding is simple, and can reduce bureaucracy. It creates equal access for all users, reflects policy as well as market forces and can encourage the exploitation of the information by others. It is, usually, also continuous. However, it is hard to know if the taxpayer is getting value for money, it is unresponsive to the market, and it can threaten commercial operations.

Usage models are responsive to the market, and the user pays only for what he wants. They are, however, often complex, they favour the rich, they make continuity hard to ensure, and they discourage further exploitation of the data.

It is difficult to say that any one mechanism is best. The EBI is currently part of a publicly funded organisation and operates for the academic community. The EBI believes that it is essential that no financial barriers are raised to prevent the use of the databases, and it is also essential that scientists are encouraged to submit data, which they will best do if they are working with a non-profit organisation from which they too can get data for free. It is clear that the free supply of such data should not undermine the further use of that data.

4 PUBLICATIONS AND DATABASES

Let us compare for a moment conventional publication to the information in databases and collections such as those we maintain at the EBI. My view of the goals of conventional publication is, I think, consistent with that of other contributors to the meeting. Its goals are (at least):

- communication

- creation of a scientific archive

- creation of a citable record

- establishment of scientific priority

- ensuring appropriate credit for work.

Databases like ours were established with similar but not identical motivation. The goal of communicating scientific findings was clear, but the databases were also intended to enable scientists to compute on that information and to re-use scientific data for purposes other than those for which they had originally been gathered. In many cases an explicit goal was to provide information ancillary to conventional publications. However, more recently, databases have begun to stray into what might be seen as conventional publishing territory. In molecular biology:

- people now cite databases

- they have come to be seen as a part of the archival record of science

- patent lawyers (at least) have started to use them to establish scientific priority

- some US funding agencies explicitly give scientific credit on the basis of "database submissions".

The role of the databases by comparison with the traditional scientific record is worryingly ill-defined. The traditional scientific record is, at its best:

- high quality

- permanent

- citable

- accessible

There are procedures and practices which have evolved over hundreds of years to ensure that this is the case. Databases raise new issues:

- Databases can be updated, so it is hard to determine where the definitive version of a particular piece of information is.

- The history of information in the database (what was included, and when) is often hard to determine.

- Data are often made available by network sources that may come and go with the enthusiasm of the group that supports them.

- The procedures of quality control are typically quite ill-defined.

Indeed there is a difference of motivation between the archive of the scientific record in traditional publication and that of a database. The traditional scientific archive values permanence, citability and immutability. All these are seen as necessary to enable us to trace scientific activity. Databases often have a completely different motivation. The goal is to present "today's best-bet" at what is the scientific truth. They correct errors when they are discovered, they delete wrong or superseded information, they add new information as it becomes available. This is all designed to make them as useful as possible, but it makes them difficult to cite, lacking in permanence and hard to trace.

An easy conclusion would be that we should simply archive all versions of everything that ever appeared in the database. Sadly, it is not as straightforward as it seems. Databases are subject to so many updates that this is typically infeasible. Our nucleotide sequence database, for example, changes tens of thousands of times every single day. Also, databases in today's networked or "webbed" world don't stand alone. They often refer to external electronic authorities, e.g. to get nomenclature for legumes we might go to the ILDIS database maintained at Southampton University. Often the most up-to-date information can nowadays be obtained "on the fly" as the user accesses your database.

Databases are dynamic. Users are interacting with them while they are changing, and, even if you are logging all the changes to your own information, you may be using external resources whose changes you cannot detect.
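The logging of one's own changes can be sketched as an append-only history that answers the question "what did this entry say at a given time?". The class, method names and accession string below are hypothetical and do not describe any actual EBI system; timestamps are plain numbers for simplicity. The sketch deliberately shows the limit noted above: it records only local changes and cannot see changes in external resources consulted on the fly.

```python
# Append-only version log: every update is kept, so an entry can be cited
# "as of" a given time. Covers local changes only, not external resources.
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    # accession -> list of (timestamp, value), kept in time order
    _history: dict = field(default_factory=dict)

    def update(self, accession: str, timestamp: float, value: str) -> None:
        """Record a new version; earlier versions are never overwritten."""
        versions = self._history.setdefault(accession, [])
        if versions and timestamp < versions[-1][0]:
            raise ValueError("updates must arrive in time order")
        versions.append((timestamp, value))

    def as_of(self, accession: str, timestamp: float):
        """Value of the entry at `timestamp`, or None if it did not exist yet."""
        versions = self._history.get(accession, [])
        times = [t for t, _ in versions]
        index = bisect_right(times, timestamp)
        return versions[index - 1][1] if index else None


if __name__ == "__main__":
    store = VersionedStore()
    store.update("X00001", 1, "ACGT")   # hypothetical accession
    store.update("X00001", 5, "ACGA")   # later correction
    print(store.as_of("X00001", 3))     # the version a citation at time 3 saw
```

With tens of thousands of changes a day such a log grows quickly, which is the practical objection raised above to archiving every version.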

5 DERIVED KNOWLEDGE

Another problem, pronounced in the biosequence databases, is what can be referred to as derived or secondary knowledge. As the databases become tools in our research, the meaning of new information in the database is often determined by analogy to existing information. When conclusions determined by analogy are added back to the pool of information the mixture can get us into trouble. It may then be used to build new analogies, thus creating a spurious impression of a large knowledge base, when in fact the raw knowledge is very sparse indeed. I often comment that if you feed the databases on their own offal you end up with electronic BSE.
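One way to keep this problem visible is to record, for every conclusion, whether it rests on direct evidence or on analogy to another entry, and to measure how far each entry is from direct evidence. The sketch below is one possible bookkeeping scheme under that assumption; the class, field and entry names are hypothetical and it is not a description of how any existing database records provenance.

```python
# Track whether each annotation is primary (direct evidence) or derived by
# analogy, and how many inference steps separate it from direct evidence.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    entry: str
    function: str
    source: Optional[str] = None  # entry it was inferred from; None = direct evidence


def inference_depth(entry: str, annotations: list[Annotation]) -> int:
    """Number of analogy steps between an entry and direct evidence."""
    by_entry = {a.entry: a for a in annotations}
    depth = 0
    seen = set()
    current = by_entry[entry]
    while current.source is not None:
        if current.entry in seen:
            raise ValueError("circular chain of inference")
        seen.add(current.entry)
        current = by_entry[current.source]
        depth += 1
    return depth


def primary_fraction(annotations: list[Annotation]) -> float:
    """Share of annotations that rest on direct evidence."""
    primary = sum(1 for a in annotations if a.source is None)
    return primary / len(annotations)


if __name__ == "__main__":
    # Invented example: one primary finding, two conclusions by analogy.
    notes = [
        Annotation("A", "kinase"),
        Annotation("B", "kinase", source="A"),
        Annotation("C", "kinase", source="B"),
    ]
    print(inference_depth("C", notes), primary_fraction(notes))
```

In the invented example three entries carry the same annotation, yet only one of them reflects raw knowledge, which is the spurious impression of a large knowledge base described above.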

6 NETWORK ANARCHY

In terms of trying to ensure the robustness of the scientific record, the anarchy of today's networks creates rather than solves problems. Anyone can mount a web site. Sources come and go and it is hard to determine what is behind a home page, whether it will be permanent and what its quality is. Even if we can determine quality, it is nearly impossible to locate resources of interest among all the useless information.

7 CONCLUSIONS

Sadly, I am afraid this discussion raises more problems than it solves. In facing up to the electronic era, I argue that the undisciplined use of electronic media will be at least as damaging as the undisciplined use of conventional publication. However, I do believe that the optimal exploitation of electronic media will create new opportunities: opportunities which will not diminish the cost of information provision but can enormously enhance the utility gained by exploiting that information. Recycling of data, using it for purposes other than that for which it was gathered, and data-mining to find new patterns of information can all yield novel insights. I feel that we can capitalise on the opportunity offered by the electronic era only if we behave in a disciplined manner. The issues are not technical; they are rather those of conventions and protocol and the establishment of good practice in dealing with electronic information.

I am also convinced that the concerns of publishers about the electronic medium destroying the market are unrealistic. The economic activity in electronic information provision will surely be as great as that in conventional publishing, but it will require that the players are prepared to retool to deal with the new medium, technology and mind-set. In this exciting future, commercial organisations such as publishers, alongside publicly funded organisations such as the EBI, have a major role to play.

