Biotechnology Information:
Access, Storage, Validation and Security

A Workshop organised by The Biotechnology Information Strategic Forum, with support from DGXII of the Commission of the European Communities, and held at CAB International, Wallingford, Oxon, UK, October 1996


Preservation: who, how and for how long? -- Chris Rusbridge

It is quite clear that we have already lost many "treasures". Some of these date back to archaeological times but others are more recent. "Deciphering history" is always difficult but "written evidence", if we have it, is a wonderfully reliable way of gaining an insight into the past. Thus we know little of the life of Ludwig's fictitious younger brother, 'Wolfgang van Beethoven', but we do know that should we locate his manuscripts in some forgotten trunk we are likely to be able to read them.

More recently, however, we have begun to see the dangers that technically-dependent solutions can bring. The music of Pierre Boulez' (equally imaginary) younger brother, Michel, an early computer hacker who wrote electronic music on lashups from the first microcomputers, would very likely not be readable even after a very much shorter time. And, lest you take this example with too large a pinch of salt, many serious examples of data having been lost or rendered unreadable already exist (for instance, the first Amazon rain forest satellite surveys are no longer usable, and parts of the 1960 US Census and records from the 1971 UK Census have "disappeared" or cannot be used).

There is a generally agreed need for preservation, and for information central to scholarship to be archived. But many "jewels" of our present and past cultures are also "information" and it is clear that there is a general need to keep records and examples for future generations.

If one is to archive, then for how long? Would it be less important if the 1960 census data had been lost in 2060? Basically, we cannot expect to keep data for more than about 10 years without taking some action to refresh it, and we therefore face an enormous problem in the future.

Storage of information is difficult. There are preservation problems involving media stability (paper, tape and even the CD ROM all have "shelf lives") and there is also the environmental obsolescence of the various tools - hardware, software, the data structure and required documentation needed to access and use the stored materials. Thus any serious attempt to preserve "for ever" will require maintaining a complete set of the materials required to read/store that material now.

Some form of neutral "Esperanto" is probably required to enable data files to be stored and read in the future. Data might have to be kept in a neutral format that can be refreshed as it ages. It is possible that several key technologies might also be maintained (e.g. a version of Windows plus the required tools to operate it). Other solutions include emulating technologies, or migrating data from one form to another, or storing data in "archaeological formats".

You may need to

* refresh data on storage media (e.g. magnetic tape) - migrate data from one medium to another.

This is not enough if the technology changes, so you may also need to

* preserve the obsolete technology, or

* emulate the obsolete technology, or

* migrate the information content or meaning of the data forward to current technology (likely to imply information loss).

Should all these have been neglected, for high value data there remains the possibility of 'data archaeology', extracting whatever information is possible from the obsolete medium and interpreting it using cryptologic techniques.

But these tasks all place huge logistical responsibilities on whoever is charged with their execution.
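As an illustration of the simplest of these tasks, refreshing data onto a new medium, the following is a minimal sketch rather than a recommended archival procedure. It assumes the data is held as ordinary files, and the function names `sha256_of` and `refresh_copy` are hypothetical, introduced here only for the example. It copies a file to new storage and uses a checksum to confirm that the copy is bit-identical to the original.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, destination: Path) -> str:
    """Copy one file to new storage and check the copy against the original.

    Returns the checksum so that it can be recorded with the archive.
    Raises IOError if the copy differs from the source.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"refresh failed: {source} and {destination} differ")
    return before
```

A check of this kind guards only against decay or copying errors in the stored bits. It does nothing about obsolescence of the hardware, software or data format needed to interpret them, which is why the other options in the list remain necessary.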

Another question must be:

Who is to do this?

Modern thinking points to a series of possible archivers, such as:

Option 1: data owner

Option 1a: legal deposit library/archive

Option 2: voluntary deposit in archives

Option 3: fail-safe rescue powers of archives.

There are projects looking at these needs; for instance, the eLib team in the UK has "digital preservation" as a major element of its next programme, and some Dutch libraries and publishers are cooperating on future solutions. There are studies connected to the Legal Deposit proposals of central libraries such as the British Library in the UK. Other work is looking at the relevance of current archival practice to digital preservation, and costing models for long term preservation of digital materials, as well as possible methods, are being investigated.

Even so, time is not on the archiver's side. The attitudes of rights owners to digital preservation are still not firm, and there are no policies for post hoc rescue or for data archaeology. The immediate future will require a number of actions, including securing proposals from potential digital archives and seeking funding for proposals to advance digital archives. Only then can the first major experiments in archival application of technologies and services begin. Furthermore, before we can start on meaningful archiving, we also need to look at such issues as:

* Attitudes of rights owners to digital preservation

* Fail-safe mechanisms for aggressive rescue

* Forums on digital archives

* Standards, criteria etc. to certify repositories as archives

* Coordinating digital preservation in the US with overseas efforts

* Commissioning follow-on case studies of digital archiving.

Of particular relevance to this meeting is the fact that, in the past decade, biotechnology sequence and protein research has generated huge, and rapidly growing, databases which must be preserved for future use. These, like the other examples mentioned, require specific software and sometimes hardware tools for their analysis. They also require increasing degrees of annotation and so become increasingly complex.

