Biotechnology Information:
Access, Storage, Validation and Security

A Workshop organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities, and held at CAB International, Wallingford, Oxon, UK, October 1996


Discussion

The Panel:

E. Scott, R. Apweiler, P. van Wiechen, P. Scott, A. Doyle, C. Rusbridge, A. Parsons.

Facilitator: J. Franklin, Chairman: J. Gilmore.

While the title of this workshop is "Access, Storage, Validation and Security", it makes more sense to reverse the order of these keywords when setting out needs and discussion points.

Security is an issue that confronts both the information industry and the information user. In biotechnology information, security comes down to avoiding two things:

* malicious damage - how to avoid someone breaking into the computer system and wiping out files and data; as well as

* information loss - how to prevent data from being stolen, or rendered "un-patentable".

But security problems can occur in the most unforeseen areas. Analysing the route taken across a series of WWW servers might give a not-so-innocent observer an insight into the information needs, and therefore the R&D interests, of a competitor. Analysing a search in a database would give even more information. To date, the latter has been prevented through strict "rules of conduct" for database hosts, but the increasing need to use databases on academic hosts, where specialised software is available, means that this protection is lost; this can also cause difficulties with regard to patent applications. The cost of drug and other advanced biotechnology product research is such that the end result HAS to be patented!

This point is well illustrated by the use of databases in molecular biology. The databases, and the software required to manipulate the data, are presently stored at and disseminated from the EBI and the EMBnet nodes. Neither of these can offer the security industrial users require, and so many commercial companies have become "EMBnet nodes", receiving the nightly updates their molecular biology users require. This means that companies themselves have to establish complex computer systems and find the staff required to maintain the services. As the growth in data and databases continues (some predict a 100-fold growth in protein-related data in the coming five years, not to mention the associated sequence databases), there is little doubt that fewer and fewer companies will be able to mirror the facilities required. After all, a decade ago many companies held their own literature databases; now they use protected services on industrial hosts.

There is therefore every sign that some form of "protected database host", offering industrial users the security they require for patentable activities, will have to evolve. This will raise questions of cost: most EMBnet hosts are free, or charge a modest fee, with few security facilities being offered in exchange. However, the premise that "added value can be paid for" could fit here, the added value being security.

Internal security is another issue which occupies centre stage in many debates about the Internet. There are dangers, but these can be over-emphasised; at the same time, in the words of one of the discussants, "a little paranoia can be excellent when looking at security issues", and there is no doubt that care has to be taken when opening computers to outside access and when sending confidential data around the networks. There is a great deal of hype about what people can do on the network, and people have attached all sorts of concerns to this medium while ignoring the dangers in more traditional ways of sending information. For instance, someone can easily intercept a letter with your credit card number inside, and many people freely quote their credit card number over the telephone when buying "online". There are also increasing calls for encryption services, and these will come; but one can already encrypt a fax, and few do, trusting that this medium is somehow "more secure". There also appears to be a far greater danger of internal mistakes leading to security leaks than of determined hackers breaking in; and, while the Internet has alerted many to the dangers of hackers and mail interception, industrial espionage has been around for a long time.

The simplest way to avoid such leaks is to ensure that you do not place data in an unprotected environment. The same applies to the first danger, malicious damage. Here the key is to ensure that the information cannot be reached through any outside connection, and that the computer operators are well trained in protecting their files. Protection should start inside, not outside, the institute. This means that protocols have to be checked and that operators have to verify for themselves that systems work (and not just take the hardware or software supplier's word for it). On balance, this question is not for the information industry to handle; it is more a "user" problem, although safer ways of getting the data to the client could be of importance to the whole community.

Validation is an increasingly difficult issue in these days of ever-changing data and databases. It is clear that some kind of full stop is required: a point at which a publication is validated to an accepted degree. At present the peer-reviewed scientific article remains the basis of such scientific stories, but it seems likely that computer programs such as those used in SWISS-PROT will soon allow a similar kind of check of scientific data to be made. Thus "predicted" can be checked against "observed", and mathematical algorithms that carry out error checking with a high degree of accuracy will allow biotechnology data to be checked and validated. "Linking", which has become a lot easier since technologies such as the WWW and specialist software such as SRS emerged, offers an excellent method of validation, as it allows the user to follow a story and look for inconsistent statements or conclusions and for errors.

It is clear that the industry needs some form of "quality control" on databases so that the users can be guaranteed that the data they access is of a set standard and reproducible. The culture collections have already introduced standards and the database world would do well to consider a similar course of action. Databases should also adhere to commonly agreed standards for validity, timeliness etc.

This will become more important as the amount of information overburdens the market. Already some factual databases only offer a snapshot of their content at any one time; the evolution of the data in that database is thus no longer searchable. Literature databases must also ensure that they point to stored materials: the role of the secondary database is likely to grow as electronic sources become more widespread, so these databases must be able to guarantee that what they cover is "permanent".

In fact, the information world is moving so fast that one might be tempted to ask why "validate at all"? Let the rules of the market, and open refereeing, judge the content. Furthermore, why "set anything in stone"? The short answer to both these questions is that scientific research has to be able to rest upon previous knowledge, and that few users can validate everything they need for their work; i.e. there has to be a measurable factor that guarantees a minimum level of assurance that the data is accurate and valid. The "publishing act" does this, and preserves the integrity of information, so ensuring that the content remains accessible in the form in which it was written and prepared. There is, however, at present only a moral and no legal obligation to do this, and again standards might be required to ensure that we do not lose essential information. The received wisdom is indeed "that 99% of science is not worth saving but no-one knows what this 99% is". At present the tendency is to save everything, but as time goes on this will become more and more costly.

Publishers set these bricks of biotechnology and other "learned information" into a wall. To extend the metaphor, perhaps present technologies will allow the publisher to "loosely mortar" these stones in place; to keep them available by assigning them a position in time and space and to ensure that such material is worthy of being preserved.

There is therefore still a need for refereeing, or filtering. The debate surrounding peer reviewing has not changed: while it remains a poor standard, it is a standard. However, electronic distribution of reports and records now allows pre-prints and pre-refereeing, and this could increasingly be used. (It is worth noting that many industrial companies already carry out formal refereeing and vetting of manuscripts before publication, and university groups, anxious to maintain a high R&D profile, do likewise. Such internal refereeing procedures could easily be expanded using the new technologies so that "expert networks" would re-invent the role of the journals in the early days of the learned societies, which provided a peer group to check and authorise a publication.)

Archiving

Biotechnology information has to be stored and kept accessible, but there is no clear division of responsibility in terms of who has to archive material. Publishers, to date, have not played an archiving role; they have prepared the products and asked, in most cases insisted, that librarians and others keep those products indexed and accessible. Electronic archiving will allow, and perhaps induce, the publisher to archive as well as distribute, and there is every likelihood that the publisher will actually refuse to allow others to archive electronic material while it has a commercial value. Instead, libraries will be asked to route requests for materials to the publisher's archive so that the articles can be sold. In this way the publishers can satisfy the users' need for access to 'living science' and maximise their return on investment, but they will presumably want to deposit these electronic files in central archives when they are no longer commercially viable. The European Commission and some national governments (e.g. the Netherlands) are looking at this activity, and it is important that such examinations take account of the new forms of electronic data and databases used in biotechnology.

In such an environment, the publisher could take on an extra task and go a step further: collecting, collating, distributing and archiving material from a variety of sites and sources, and offering the user a validated journey through these various steps. In such a scenario, today's bulletin boards, with added features such as refereeing and data locking, could become yesterday's journal.

Use and re-use

A key demand of users is to be able to manipulate the data they have purchased. At present, many users feel that publishers are too restrictive. In one recent example, a company asked for permission to store 110 papers; 80% of the requests were refused, and the average cost of the rest was $75.00 per paper. This restricts the use of information which, after all, was given free to the publisher, and could lead to conflicts and to a wider spread of "alternative publishing routes" as centres protect their own information through copyright before passing it on to publishers for "re-use".

The key factor is "what use will the user make of the articles?" Will that compete with the publisher? Most publishers recognise that users want to license the materials for their own further use. As a start, a common standard for the use of, say, "reference publications" might be very useful.

Users admit they do not yet have a totally clear idea of what they want to do with the information. But they do understand the terrain they want to work in and will welcome the opportunity to debate boundaries with the publishers.

The future

Today's factual and literature databases are being built with the computer technologies of the 40's. Present moves to upgrade the technologies used have to ensure that data is not lost. This is possible through new systems such as CORBA but there is also the danger that changing technologies too swiftly will shut out users.

This increase in the power of technology means that the technical differences between the secondary and the emerging primary (article) databases are blurring. The primary publishers can offer their materials to the market, which can, in theory, access them without locating tools such as the literature databases. The present evidence, however, points to the need for locating tools: indexing removes the false hits that full-text searching introduces (e.g. a search retrieving papers on both hearing aids and AIDS) and offers the market access to all the published materials. Small industrial companies without access to large libraries need such locating tools and if, as expected, more and more people start to "publish", some form of "locating service" will be more, and not less, necessary. The indexed abstract offers this.
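
The hearing aids and AIDS example can be made concrete with a small sketch. It is only an illustration under stated assumptions: the three records, their titles and their index terms are invented for the purpose, and the two functions are hypothetical helpers, not the method of any particular database host. It compares a case-insensitive full-text word search with a lookup on indexer-assigned subject terms.

```python
import re

# Hypothetical records: a title plus subject terms assigned by an indexer.
# None of these come from a real database; they only illustrate the point.
RECORDS = [
    {"id": 1, "title": "New hearing aids for elderly patients",
     "index_terms": {"hearing aids"}},
    {"id": 2, "title": "Protease inhibitors in AIDS therapy",
     "index_terms": {"AIDS", "protease inhibitors"}},
    {"id": 3, "title": "Visual aids in teaching molecular biology",
     "index_terms": {"education"}},
]


def full_text_search(records, word):
    """Return ids of records whose title contains the word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [r["id"] for r in records if pattern.search(r["title"])]


def index_search(records, term):
    """Return ids of records that carry exactly this index term."""
    return [r["id"] for r in records if term in r["index_terms"]]


# Full-text search cannot tell the disease from the device or the teaching tool.
print(full_text_search(RECORDS, "aids"))  # [1, 2, 3]

# The index term was assigned by someone who read the paper.
print(index_search(RECORDS, "AIDS"))      # [2]
```

In this toy case the full-text search returns two false hits out of three, while the indexed search returns only the relevant record; the cost, of course, is the indexing effort itself.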

There is some evidence that the better availability of the abstracts of primary articles might improve the coverage of the secondary services so that they benefit the whole market. In this regard automatic indexing is an important aspect that should be better researched. A common service for smaller publishers to use would benefit the market.

Standardisation, however, requires standards, and publishers appear reluctant to use standards for their own projects. The ADLIB project has produced a Database Manual in which the different databases cross-reference their structures and terminologies. Something akin to this in the primary field might help.

Users will also require more added value. Technology can and will offer such advances as automatic indexing, cross referencing and cross linkages but the primary product will have to be standardised if it is to benefit. A standard format would help cross searching.

Publishers might benefit from looking at the quality controls laid down for some of their customers and authors. Culture collections have stringent regulations on storage, recording and distribution. Again standards are needed, and the STM market would benefit from common agreements and standards, especially where factual and literature databases are involved. A working party on this, perhaps drawn from the BTSF, would be an ideal solution. It should establish WHY standards are required and ensure that they are useful to all concerned, users and producers alike.

Links will add quality anyway. Navigation through databases is an ideal way of checking that data is sound and in agreement with other data. The growth of informal publications endangers this quality check. The need for scientists to refer to publications is still present, however. In many ways we are seeing a repeat of the emergence of the primary journal, established to tell peers what is going on. E-journals are emerging and use, at present, the same criteria. There are also advantages: colour, the ability to integrate sources into a composite "publication", the ability to add reviewers' comments easily to the "living document", etc.

This living document has to be archived and stored in a manner that allows us to "future proof" it. A central archiving centre might be established to which authors could submit materials for storage. How this could be controlled, and indeed whether it should be controlled, is a matter for the future.

