Building and Owning Biotechnology Databases

A Workshop organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities, and held at Purmerend, The Netherlands, 22-23 September 1998


The workshop "Building and Owning Biotechnology Databases" was held at the Golden Tulip Hotel, Purmerend, The Netherlands, on 22-23 September 1998. It was organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities.


Discussion

Why bother with IPR?

It is clear that in the sixteen months since the last BTSF meeting, on Financing Biotechnology Databases, much has changed. Then, in May 1997, hardly anyone thought (or, if they did, dared to mention) that SWISS-PROT would start to charge licence fees, nor was it envisaged that SRS, a software package chosen by many in the academic bioinformatics community as the linking service, would also be privatised.

However, there were signs that changes were beginning to reach bioinformatics. The ETI had demonstrated that central (software) facilities could be developed for academics to use in exchange for submitting data for common use. They had illustrated that the pool of data could be exploited for scientific and commercial purposes and that large heterogeneous teams of scientists could cooperate well and that the products could be "sold". The PRINTS story had highlighted that databases cost more than dedication; and there were also indications that funding organisations increasingly realised that databases cost a great deal of money and that this could not continually be found from R&D funds.

This meeting has brought these problems and opportunities for bioinformatics further into the open. Bioinformatics will (have to) continue to evolve and expand, but it will be impossible to gain all the necessary funding required from public R&D sources. Changing to a more market-driven economy will, for many of the academic teams in the area, be difficult and against their principles but there might be no alternative. However one looks at it, databases, and the software to organise and manipulate them, cost money. Furthermore, the institutions that have hitherto allowed their staff to develop these projects "on the side" are now aware that many of these projects have become recognised research tools which have to be delivered to their users in a professional manner. Few institutes are yet in a position to take these products to the market but many are considering how they should; and there are bound to be misunderstandings and opportunities, lost and found, before a structure is found that best suits this field.

There is therefore certainly a more commercial atmosphere surrounding bioinformatics today but this meeting made it equally clear that for many in this field, it is not just the principle of academic exchange and cooperation that is at stake. Several of the databases in question could not survive unless the academic community gave their services and data freely (it is even questionable whether "icons" such as SWISS-PROT will be able to continue to rely upon the academic community once they begin to charge) and so many databases will try to continue as they are, running on the labours of dedicated staff who maintain their project in their own hours next to their research activities. However, if these individuals are as successful as they aim to be, they too will eventually produce stable products with an important role in the bioinformatics infrastructure, which again will demand the same professional standards of validation and timeliness and these cost money. It is almost inevitable that success will bring with it responsibilities that cannot/will not be funded from the public purse.

Thus there was general agreement, reluctant or otherwise, that outside investment will regularly be required. In all these cases, and even where a product remains in the public domain, it is essential that the product has a clear owner. Today, the ownership of many databases is confused. Many scientists assume that the curator, who has the scientific responsibility and is the visible authenticator of the data and therefore of its quality, is the owner; but this may not be so in the legal world, where the claims of the institute providing the infrastructure and general funding behind the product may be important. To confuse matters more, the curator might well have moved institutions, been joined along the way by different staff (often transient, short-stay post-docs), and depended on some external grants for support. And these changes might have taken place during the past decade, when attitudes concerning ownership were far looser than today.

The EU Directive on Database Protection has focussed minds and attitudes on this topic, and even spurred some people to take decisions on ownership which will certainly be challenged. Too often there is no clear owner and the situation in biotechnology is further confused as databases and software programs regularly benefit from cooperation with other institutes and groups who place data and other improvements in the common pool. Thus there are many who feel that they can claim at least partial ownership of the product; and this normally comes to the surface when the scent of success is in the air.

To date, too few databases seem to have secured the necessary clarity on ownership, relying too often on an untested "moral" right which would probably not win their case in a court of law. Furthermore, such untested or undefined rights are certainly insufficient to offer any third party investor the guarantees they require before putting money into the product to take it further. Nevertheless, the legal framework to claim and secure IPR and copyright is in place and, while it has perhaps come too late for some biotechnology databases which have been running for almost a decade or more, it is now clear that ANY database producer should handle the IPR situation BEFORE rather than AFTER starting their product.

Who are the owners?

It is generally accepted that the raw data or primary databases such as the nucleotide sequence databanks, should remain in the public domain. These are currently built and maintained by large international centres such as the EMBL Outstation, the EBI. The storage and release of data in these databases is presently "governed" by a series of decisions known as the Bermuda Principles (http://hugo.gdb.org/bermuda2.htm) - which, in this respect, state that all such data should be left in the public domain. Unfortunately, the Bermuda meeting did not secure long term funding for these databases and so it is not clear how these will be maintained. Basically, if a database is to remain in the public domain it will probably be funded from the public purse and long term agreements are needed to provide the security the research world requires; otherwise they, like SWISS-PROT, will have to seek ways of raising funds from the market. In these cases, it is also important to remember that the databases’ success is due largely to their completeness: the scientific community accepts they have a duty to submit data for free and as rapidly as possible so that this can be used in the other products emanating from the raw data collections. If such databases were no longer seen as "public", the willingness to submit data so freely might well diminish.

Another generally accepted rule in this area is that "added value" can be charged for. To date, this has generally meant that customers have been asked to pay a licence or user fee for value-added databases, although there has been a general exemption for academic users. This exemption is now being defined, for instance in the reported SWISS-PROT case, as being a recognition that academics have "paid in kind" - i.e. they provide the refereeing and other educated support the database needs. Whether that will work in the future remains to be seen. Some scientists from commercial organisations who presently help validate might also request the same rights, as might companies who have sponsored staff who work on these databases; and the academics might also prefer to ask for a fee now that the results of their labours are being used to generate revenue for other organisations.

The SWISS-PROT case is at least simple in that the owners of the database are clearly defined and continue to work together. Most academic institutions now claim that products and IPR developed by their staff ultimately belong to the institution, and the majority of staff contracts have clear rules and regulations as to what can be developed and on what conditions; furthermore, many universities have Technology Transfer Officers who are able to advise and protect the staff and the institution.

Thus academics are being taught to look at the exploitation possibilities of their work, and some funders, e.g. the EC, make exploitation of the results of research a contractual obligation. That this attitude conflicts with many academics’ desire for the free exchange of information is of course no surprise, but it is perhaps odd that many universities are even more protective of their "IPR" than the commercial companies they have for so long reviled! This is not only the case in Europe: in the USA, where public domain rights are championed even more loudly, some Technology Transfer Officers are gaining a reputation of being every bit as hard-nosed as the commercial companies they claim to "fight" (see the recent report prepared for the NIH by the expert group chaired by Professor Rebecca Eisenberg).

One complicating factor in this scenario is that many of these databases can be exploited and maintained only by the staff who developed the product - institutions rarely set up departments and appoint staff to exploit a database. Thus some form of compromise might be needed whereby it is in the interests of all parties for the product to be owned by both the supporting and the creating bodies. Certainly, the research market will not be helped if databases and other research tools become the victim of legal battles, or of staff changes, which result in their loss. While it may be obvious to those inside the field, some institutional players have not realised that a database is of no use unless it is kept up-to-date, curated, refereed and validated and that these tasks are all too often the responsibility of one, or a few, dedicated individuals. Furthermore, when institutes claim ownership and charge fees, they enter into contracts with users which then confer responsibilities. At present, while databases are clearly by-products from a research process and distributed rather than sold, the market has to take a generally relaxed attitude towards the data and there can hardly be any claims for damages brought about by mistakes or errors made after using such a product. This would certainly change when a database is sold - entering into a contract with a customer brings obligations which have to be followed, and sometimes defended.

The detailed discussion surrounding the PRINTS case showed that questions of database ownership in this area are still not going to be straightforward, not least because the EU Directive does not readily cater for the inherent complexity of biological databases. These are not simply collections of data objects that anyone could assemble from the public domain, but are the results of considerable research effort. Every individual entry within protein pattern databases such as PRINTS requires many man-hours of labour to create: the process requires the use of software to detect and align protein sequences, to identify conserved regions, to scan the primary sequence databases for family relationships, and to collate all of these diverse pieces of information into a single template. Crucially, the process then involves the compilation of detailed written documentation of the technical procedures involved in producing the template, scouring the literature to provide coherent protein family abstracts, manually cross-referencing kindred biological databases for additional information, and internal cross-referencing to related entries. The Directive has no concepts for dealing with the fruits of research projects like this, but simply addresses itself to the arrangement of database contents. Significantly, it also makes no provision for the conceptual difference between the format of a given database entry, and the format or structure of the database itself. To this extent, the Directive seems unhelpful, or even potentially dangerous, because it becomes a matter of how it is interpreted in special cases like PRINTS.

However, in this case a number of issues of general importance to database builders do seem to be worthy of further discussion, namely:

Under the sui generis right, it seems clear that some databases can change owners; in the case of PRINTS, although Leeds University is claiming copyright, the content has been updated to such a degree that the ownership could also be claimed by UCL (or the author).

Furthermore, as mentioned earlier, it is also clear that a full-blown legal dispute would be pointless: if the institute in question won, what they would own would be a "dead" database, and if the individual won then the delays and costs incurred would also have reduced the competitive value of the product. Certainly science would have lost, and all the investment made to date would be wasted.

(In this case it is perhaps also pertinent to remember that the predecessor to this database was "born" in 1991, 6 years before the EC Directive came into being, and so much of the debate is taking place in the context of subsequent legislative developments. There is little doubt that more examples will appear before correct procedures for protecting all those involved are in place; equally, there is little doubt that the database environment will have changed a great deal more in the coming 8 years, and so it is increasingly essential to ensure that ownership details are fixed to avoid continual disputes.)

Rights and Rights Holders

In any event, it seems that "ownership" is probably the wrong concept; what is more important is to ascertain who is the "Rights Holder", defined as the natural person or persons who create the database (or their employer). In many bioinformatics databases, the scientific community, through publications and grant awards, will clearly identify the curator/creator, and this pedigree will probably remain highly important given that the quality and added value of these databases are paramount. The curator will then have to ensure that they have the rights or the recompense they feel is appropriate.

The EC directive lays down clear guidelines as to what rights an owner of a database or the data can confer on others. There are some divided opinions on these rights, as many people, especially in the USA, feel that science could be damaged by proprietors protecting their databases to such an extent that data cannot be used for research and teaching uses. These voices have so far managed to delay the implementation of a similar set of rules into US legislation (see footnote to Paul Uhlir's article) but database protection legislation is expected to be taken up again by the Congress in 1999.

The main concern is that the new directive, and its American equivalent, will restrict the "fair use" of databases by academics but, in Europe anyway, little has changed and it seems highly unlikely that database producers will restrict academic use. Furthermore, the American viewpoint is not so clear cut that there is a unanimous opinion against the new rules. Several academic institutions welcome the chance to exploit their intellectual property and the need is clearly for a balanced debate and a balanced outcome.

The meeting also concluded that several databases might actually benefit if they were charged for: there is still a feeling among many users that "free" means "low quality" and it is essential that the scientific arena develops a series of recognised, respected, databases, much as they have in the world of journals, where the science can be guaranteed and the medium (the database) will be sustained. In many ways the database is following the same route that the journal did - starting as a medium for interested colleagues (who often then formed into a society) - before growing into an international collection of refereed material which became a respected resource. Database builders will therefore have to make their refereeing and validation procedures more transparent and will have to ensure these are sustainable. The user will also want to know where the data came from. There will increasingly be a need for scientific validation of database materials and there again issues of copyright and ownership will return.

Where does the data in databases come from?

There are indeed many similarities between today’s databases and scientific journals; both receive data which has to be edited - validated, refereed, formatted etc. - and both are open to large user groups who can benefit from accessing what they require from the collections of ordered information. (However, it is interesting to note that while many people complain about the price of journals, few feel that they should actually be free; many scientists, especially in biotechnology but not in, say, chemistry, appear to feel that their databases should be free, although the argument that validation and production processes should be paid for should increasingly apply.)

Journals are collections of validated articles and most journal publishers request that authors transfer the copyright of their articles to the title on acceptance. This action is sometimes contested, and in the USA some grants forbid the transfer of copyright so the article has to remain in the public domain; but there are many arguments for the convention, not the least being that a journal can protect the integrity of an article far better than can an individual.

Database producers should follow these arguments, the more so now that the content will increasingly come from different sources, some of which will not be protected. The nucleotide sequence databanks hold the copyright on their deposited sequences and the ETI go to great lengths to clear the copyright on the materials they place in their products (up to 30% of the effort put into building an ETI CD ROM product goes on clearing copyright). Individual scientists who wish to deposit their data in these central repositories should therefore protect their information for their own use in other ways; for instance through a patent.

Certainly new compilations of data, like the BioImage Database, which is collecting images relevant to biological research, will have to assume responsibility to protect their contents and the database. Here, materials will be submitted from many thousands of scientists in many different forms. Another complication could be that while the database will store, for instance, materials as 3-D computer generated images, the same "image" might have been published in 2-D form as a photograph in a journal. According to European opinion, both forms can be copyrighted but there will be a need for flexibility and cooperation between the author, the database and the other organisations - e.g. journals - and with the institution where the work was carried out.

This database could be completely paralysed unless the ownership and administration details are secured before it is exploited. Two immediate models spring to mind: one that all submitting scientists "own" the database, but perhaps waive their rights to an income in the name of science; or a model similar to the ETI scheme where a group develop, maintain and exploit the database and scientists can benefit from having their materials involved (and perhaps using elements from the database when they require).

The ETI model clearly lays down rights and obligations to all those who use their Linnaeus II software. Such pre-arrangements are essential if annoyance and anger are to be avoided and other databases should follow similar procedures. Databases should also recognise that their situation might change. While today’s R&D funding may allow many products to be developed and disseminated for free, this might not always be so. Funding may be reduced, the database could become too expensive to maintain, or the "owning institution" could change its financial demands. Thus a database has to ensure that IF it later wishes to charge for access to, or usage of, its materials, it has the rights to do so and has the copyright needed to allow a fee to be recovered without interminable demands and litigation. It must also make sure that it does not antagonise those that contributed freely to the database by later claiming copyright and ownership of materials that had been submitted in a different spirit (this is not a new phenomenon: the Chemical Registry Numbers of the American Chemical Society, which were developed on the back of an international database effort, were later copyrighted and protected by the ACS; many feel that these were an international resource that should have remained in the public domain).

The European case is that the submitting individuals in such cases can own the database but they will have to establish a structure and a legal agreement which defines their rights and obligations. Certainly the EU allows project results that have been developed with EC grant money to be owned and exploited by the grant recipients. The meeting stressed that such regulations have to be agreed before the project is started; it is also essential that ownership rights are clear and agreed - the more so now that many databases interlink and rely upon others for part of their value.

What may also happen is that journals and databases become increasingly inter-dependent, and such an increased service aspect could offer someone (publishers?) a new role: to arrange and package data for use. There is a great deal of activity in data integration and it would be a pity if copyright laws limited the fullest useful exploitation of the data. Ways of allowing the use and re-use of data in different settings must be found even though this means that different content owners/providers will be involved in the same information process (e.g., some journals already use videos and other multi-media records in a composite format. In these cases, the journal copyright will be far ranging but the publisher will have to make sure that the stored materials can be accessed and disseminated if it is to receive the full protection of the EC law. This might mean interacting with more specialised databases in this sector).

Subsidiary data are also increasingly needed for a journal to referee a paper. This is especially so in the case of structural biology papers where coordinate and other related data are required to check the measurements and decisions made. All too often this material is difficult to locate and use; but steps should be taken to make it available - if necessary under controlled conditions (for instance, journal referees might be placed under an embargo not to release that data until the article has appeared, thus guaranteeing precedence rights).

Databases that are here this (micro) second, gone the next, and databases that are called journals.

Database users wish to link databases together and, increasingly, to "slice through" different databases collecting relevant pieces from different inputs so as to build up a composite answer to a question. Thus database hosts such as DIMDI can now set up automated searches that scan, say, four databases and combine pieces of the records into a composite record with additional value. There is no legal problem in doing this if all the rights of the four databases are honoured and dues are paid, and again, if the new record is not exploited without the express permission of the individual rights holders.
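The kind of "slicing" described above can be illustrated with a minimal sketch in which every field of a composite record carries the name of the database and rights holder it came from, so that the obligations to each rights holder remain visible. This is an illustration only, not a description of how DIMDI or any other host works: the class, the function names, the database names and the permission table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SourceRecord:
    database: str       # hypothetical name of the source database
    rights_holder: str  # hypothetical party whose rights must be honoured
    fields: dict        # field name -> value


def build_composite(records, wanted):
    """Copy selected fields from several source records into one record.

    `wanted` maps a database name to the field names to take from it.
    Each copied field keeps its source and rights holder attached.
    """
    composite = {}
    rights_holders = set()
    for record in records:
        for name in wanted.get(record.database, []):
            if name in record.fields:
                composite[name] = {
                    "value": record.fields[name],
                    "source": record.database,
                    "rights_holder": record.rights_holder,
                }
                rights_holders.add(record.rights_holder)
    return composite, rights_holders


def may_exploit(rights_holders, permissions):
    """Allow exploitation only if every rights holder has given permission."""
    return all(permissions.get(holder, False) for holder in rights_holders)


# Illustrative data only; none of these names refer to real databases.
records = [
    SourceRecord("db_a", "Institute A", {"sequence": "MKT", "organism": "E. coli"}),
    SourceRecord("db_b", "Company B", {"pattern": "K-T", "notes": "draft"}),
]
wanted = {"db_a": ["sequence"], "db_b": ["pattern"]}
composite, holders = build_composite(records, wanted)
```

In this sketch the composite record could be exploited only when `may_exploit(holders, permissions)` is true, that is, when the permission table records the express agreement of every rights holder whose material appears in the record.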

Not only commercial database hosts offer the facilities to mix and slice data. The SRS software also allows one to prepare "virtual databases" from the databases under its control. Servers allowing this must take care that the individual rights holders of the databases under their control are not disadvantaged as it is simple to develop new products from the combinations of data being examined. Some such compilations might, in the future, become databases in their own right but great care is needed from the compiler - not to abuse other copyrights, and the rights holders - not to prevent new, more powerful data compilations from being put together. As data mining becomes ever more simple this danger will increase.

Public today, private tomorrow

Two household names in bioinformatics, SWISS-PROT and SRS, are about to be taken out of the public domain. The message is clear: public funding does not guarantee sustainability.

Some discussants, while sympathetic to the needs and the chosen cure, nevertheless expressed concern that such "central pillars" could undergo such a change in status. For instance, many EU funded projects had been encouraged to use SWISS-PROT and SRS by colleagues and grant providers so as to cut costs and concentrate resources. Some of those involved were commercial organisations and so would now be charged a licence fee to use the research tools concerned. Had this been known at the start of the project they might have looked for another solution. This situation is being exacerbated by the EC’s push for the exploitation of research results and, while this is laudable, the implications should also be explained to those involved. At the very least, those that had earlier versions of the database, or of SRS, should be allowed to use these in the future, even if this slightly or transiently diminishes the market potential for the future commercial product or service. Furthermore, funders will have to take great care not to link commercial and non-commercial and to-be-commercial projects together, unless they have clear plans for the future integration of such combined efforts (e.g. the InterPro project where PROSITE and PRINTS will be linked despite the possible confusions that will occur if and when PROSITE, as part of the SWISS-PROT family, is privatised - i.e. access to the constituent databases might be compromised if one charges and the other does not). Sites such as ADLIB, which offer a platform for mixed-background databases, might offer a service to manage this potential problem.

