Proceedings of the Second ICSU/UNESCO International Conference on
Electronic Publishing in Science

Paris, 20-23 February 2001

E-BioSci: an Europe-based Platform for E-publishing and Information Integration in the Life Sciences

Les Grivell

European Molecular Biology Organisation, Heidelberg, Germany

Received 18 February 2001, published online 20 June 2001

©  2001   International Council for Science,
UNESCO and individual contributors




Abstract

Just as the emerging field of genomics is changing the way in which molecular biologists plan, execute and interpret their research, so is the transition from traditional to electronic publishing technologies changing the ways in which the results of this research are disseminated to and used by other scientists. Paper is turning out to be an inadequate medium for the flood of new data that often demands both further manipulation and new methods of visualization as an aid to interpretation. The typical research scientist therefore wishes to have free, or at least unhindered, access to as wide a range of electronic information sources as possible, to be able to navigate effortlessly between them and to search, select, integrate and manipulate information without leaving his or her desk. It is against this background of changing practices and expectations that researchers in the bio-medical sciences have also come to question established editorial, reviewing and publishing practices and even to reconsider the nature of the publication itself. Following calls from the PubMed Central and Public Library of Science initiatives in the USA, various publishers now offer more or less unrestricted access to their journal archives, and they host e-journals with no, or low, peer-review barriers to publication. However, these changes address only part of the problem. Much still remains to be done to ensure optimal accessibility of the biological literature and its linkage to databases containing sequences, sequence-related information or digital images. These are the goals of E-BioSci, a European initiative to provide a variety of high-quality electronic services relating to full text access and retrieval in the life sciences. E-BioSci will involve a broad spectrum of publishers, service providers, scientific authors and academic institutes.
E-BioSci aims to foster trans-national cooperation in the search for effective solutions to the challenge of information access and retrieval in the Life Sciences. The platform will act in a coordinating capacity to remove barriers to the free exchange of information and to encourage the pooling of national expertise and the development and evaluation of common protocols for database linkage. It will also work to promote complementarity in the further growth of national information archives.

In the summer of 1989, the Genetics Society of America organized one of its bi-annual meetings on Yeast Genetics. I was a participant at that meeting, and two sessions still stand out for me. The first, a part of the official programme, was a session entitled "Who's working on my/your gene?" The other was a brief, slightly conspiratorial meeting of a handful of European scientists gathered around the conference hotel's Steinway, focused on preparations for an ambitious, EU-funded project aimed at determining the complete DNA sequence of the genome of the yeast Saccharomyces cerevisiae. The outcome of the first session was a long list of genes, each remarkable for being involved in so many apparently different cellular processes that it had emerged time after time in different guises and under different names in different laboratories. The session clearly illustrated the benefits of sharing information. It also highlighted the need for well-structured databases capable of storing and retrieving different types of information derived from many different experimental techniques, in a way that would allow researchers to construct as complete a picture as possible of all the facets of a given gene and its functional relationships to others involved in the same or related cellular processes.


Genomics and the Information Explosion

The final outcome of the second session was the publication on 24th April 1996 of the complete sequence of the 13.3 M base pairs that make up the 16 chromosomes of Saccharomyces cerevisiae, the simple, single-cell organism that is so often used as a model for more complex and experimentally less accessible eukaryotic cells [Goffeau, 2001]. The relatively brief period since this historic date has been one of unprecedentedly rapid progress, culminating on June 26th of last year with the announcement of the first draft of the 3000 M base pairs of the human genome (see Figure 1). As of February 2001, the number of completely sequenced genomes stands at 800. The total amount of DNA sequence stored in the joint EMBL/DDBJ/GenBank databanks has risen to a mind-boggling 11,526,750,544 base pairs in 10,711,124 records and continues to increase at an exponential rate (see Figure 2).

Figure 1

Figure 2

The fields of genomics and bio-informatics have firmly established themselves in research programmes and teaching curricula, together with the related areas of functional and structural genomics and their derived specializations of transcriptomics, proteomics and metabolomics, which deal respectively with all the RNAs, proteins and metabolites present in a cell.

Common to all these areas is the production of vast amounts of raw data. Common, too, is the increasing dependence on the internet as a means of disseminating or acquiring data and of providing access to specialized software for analysis. Paper is turning out to be an inadequate medium for the flood of new data that often demands both further manipulation and new methods of visualization as an aid to interpretation. More often than not, paper publications contain little more than summary pointers to data tables that are too large to print, or to videos and multi-dimensional images that cannot be printed at all. It is against this background of changing practices and expectations that bio-medical researchers have also come to question established editorial, reviewing and publishing practices and even to reconsider the nature of the publication itself.


From Data to Knowledge

For some, this veritable embarras de richesses of data is seen as the death knell for hypothesis-driven research and the dawn of an era in which data-mining will generate novel leads and concepts for innovative research. For others, it signals just the opposite: a means of enabling biologists to construct, for the first time, precise, detailed and experimentally verifiable models of cellular function. Either way, the success of data analysis depends on the ready availability of as complete a set of data as possible. Several recent developments are likely to contribute to the achievement of this ideal:

Encouraging though these developments are, a number of hurdles still have to be cleared. They include:

With respect to this last point, PubMed ( http://www.ncbi.nlm.nih.gov:80/entrez/query/static/overview.html) in the USA has provided a first set of tools for searching and retrieval of information from the MEDLINE collection of abstracts ( http://www.nlm.nih.gov/databases/freemedl.html), which is linked to DNA and protein sequence databases. The system is interlinked at the level of keywords and identifiers. There are, however, clear needs for innovation and refinement: to increase the sophistication of search algorithms, to develop methods for searching full text publications, to develop better discriminative criteria for interlinking and establishing relationships between published documents, and to link publications with data in a variety of other formats, including structures, images and animations. E-BioMed (subsequently PubMed Central - http://www.pubmedcentral.nih.gov/), the first attempt to establish a single site for the storage and retrieval of electronic text and data, was an important initiative of the NIH in the USA. Unfortunately, two controversial aspects of this proposal prevented its implementation as originally conceived: the absence, or possibly only optional presence, of peer review, and a lack of realism in the aim of distributing without charge content owned by others. Even now, the more important issues mentioned above are being pushed into the background by a controversial call ( http://www.publiclibraryofscience.org/) for open access to the published literature, addressed to individual scientists by a number of the original proponents of PubMed Central. The call, in the form of an open letter to publishing organisations, encourages scientists to show their support for open access by pledging to publish in, review for or edit only those journals that grant unrestricted distribution rights to PubMed Central and similar entities within 6 months of publication.
Inexplicably, the call focuses only on primary journal publications, ignoring a potentially much more serious problem: a growing tendency towards limited access to database information. This problem has been highlighted recently by the acceptance by the publisher of the journal Science of Celera's terms ( http://www.celera.com/) for the release of its human genome sequence data. The data will not be submitted to public databanks, and access at Celera's own site will be restricted to those agreeing not to "redistribute" the information. The implications of the latter restriction are crippling since, depending on the exact interpretation of what is meant by redistribution, they may well extend to severe limitation of the freedom to publish studies based on the data, to carry out large-scale bio-informatic analysis and to incorporate derived data into other databases.
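
The interlinking "at the level of keywords and identifiers" described above for PubMed can be illustrated with a minimal sketch. It is not PubMed's actual implementation: the records, identifiers and function names below are hypothetical, and the only assumption is that each abstract carries a set of keywords and a set of sequence accession identifiers. Two abstracts are treated as linked when they share at least one value in the chosen field.

```python
from collections import defaultdict

# Hypothetical abstract records; the ids, keywords and accession
# identifiers are invented purely for illustration.
ABSTRACTS = [
    {"id": "A1", "keywords": {"yeast", "mitochondria"}, "accessions": {"SEQ001"}},
    {"id": "A2", "keywords": {"yeast", "genome"}, "accessions": {"SEQ001", "SEQ002"}},
    {"id": "A3", "keywords": {"human", "genome"}, "accessions": set()},
]


def build_index(records, field):
    """Map each keyword or identifier to the ids of the records that carry it."""
    index = defaultdict(set)
    for record in records:
        for value in record[field]:
            index[value].add(record["id"])
    return index


def linked_abstracts(abstract_id, records, field):
    """Return the sorted ids of records sharing at least one value of `field`."""
    index = build_index(records, field)
    by_id = {record["id"]: record for record in records}
    related = set()
    for value in by_id[abstract_id][field]:
        related |= index[value]
    related.discard(abstract_id)  # a record is not linked to itself
    return sorted(related)


if __name__ == "__main__":
    print(linked_abstracts("A1", ABSTRACTS, "accessions"))
    print(linked_abstracts("A3", ABSTRACTS, "keywords"))
```

Exact matching of this kind is cheap but coarse, which is one way of seeing why the text calls for better discriminative criteria for establishing relationships between documents.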

The E-BioSci Platform for Information Access and Retrieval

It is against this confused background that EMBO, the European Molecular Biology Organisation, has decided to take the lead in a collaborative effort to establish E-BioSci as a European-based information resource network with a global role. A series of discussions (see http://www.embo.org/E_Pub_pages.html) with interested parties (including research organisations, learned societies, publishers, individual research scientists and representatives of a large number of EU member states) identified the shortcomings of earlier proposals and led to the formulation of the current initiative. This defines E-BioSci as a networked platform that will combine the skills and content already present, or being developed, in various centres in Europe. It will work in harmony with other global initiatives such as PubMed Central, with publishers and with other information providers. Although superficially more complex, this setup more accurately reflects the European dimension of the project. Additionally, it offers potential advantages in terms of speed of access and the provision of backup or secure storage facilities, and it will allow queries to be performed in different languages.

By providing an extensive set of linkages through the biological information chain E-BioSci will:

The E-BioSci network will:

As indicated in these last two points, E-BioSci will, besides acting as an information portal, provide hosting services for electronic publications. The aim here will be to provide a platform for the dissemination of material that has previously undergone peer review and authentication by an independent body. E-BioSci need not be the sole repository of such material, and authors may choose to submit their reviewed and authenticated manuscripts to as many sites as they wish. This emphasis on a reliable form of quality assessment and control distinguishes E-BioSci from a number of other e-publishing initiatives, including those modeled on the Los Alamos Physics Archive (e.g. the eprint based Cogprints server - http://www.eprints.org/), or commercially based services such as those offered by BioMed-Central ( http://www.biomedcentral.com/). One of the main issues here is that authors rely on the perceived quality of their publications as support for funding applications and career advancement, and are thus likely to be reluctant to abandon a tried and trusted model of assessment in the absence of reliable and widely accepted alternatives. Additionally, from the reader's point of view, some degree of editorial control is, at least in part, a guarantee that technical standards have been met, that the conclusions are adequately supported by the experimental data and that the presentation meets acceptable standards of clarity. In cases in which a submission is accompanied by significant amounts of supplementary data, the peer review process also provides an appropriate opportunity for watermarking of both manuscript and data to protect against tampering at a later stage.
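
The idea of protecting a reviewed manuscript and its supplementary data against later tampering can be sketched as follows. The text does not specify a watermarking scheme, so this example substitutes a simpler and different technique, a cryptographic digest recorded at the time of acceptance. It detects later alteration but, unlike a true watermark, is not embedded in the files themselves. The function names are hypothetical.

```python
import hashlib


def fingerprint(manuscript, data_files):
    """Return a SHA-256 digest covering the manuscript and all data files.

    Each part is prefixed with its length so that moving bytes from one
    file to another changes the digest.
    """
    hasher = hashlib.sha256()
    for part in [manuscript] + list(data_files):
        hasher.update(len(part).to_bytes(8, "big"))
        hasher.update(part)
    return hasher.hexdigest()


def is_unaltered(manuscript, data_files, recorded):
    """Check a submission against the digest recorded at acceptance."""
    return fingerprint(manuscript, data_files) == recorded


if __name__ == "__main__":
    # Digest recorded by the (hypothetical) reviewing body at acceptance.
    recorded = fingerprint(b"reviewed text", [b"supplementary table"])
    print(is_unaltered(b"reviewed text", [b"supplementary table"], recorded))
    print(is_unaltered(b"reviewed text", [b"edited table"], recorded))
```

In such a scheme the digest would have to be held by the independent reviewing body rather than by the hosting site alone, so that a copy deposited at any of several repositories could be checked against it.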


Conclusions and Prospects

Just as the emerging field of genomics is changing the way in which molecular biologists plan, execute and interpret their research, so is the transition from traditional to electronic publishing technologies changing the ways in which the results of this research are disseminated to and used by other scientists. In this brief overview, I have presented a perspective largely based on that of the individual scientist, who wishes to have free, or at least unhindered, access to as wide a range of electronic information sources as possible, to be able to navigate effortlessly between them and to search, select, integrate and manipulate information without leaving his or her desk. I have outlined a number of recent developments that will contribute to the achievement of this goal. E-BioSci is one of these. Much still remains to be done, however, and a brief wish list of a typical user might include:


References

Goffeau A., Four years of post-genomic life with 6,000 yeast genes. Review. FEBS Lett., 480(1), 2001, pp. 37-41.

Etzold T., A. Ulyanov, P. Argos, SRS: Information retrieval system for molecular biology data banks. Methods Enzymol., 266, 1996, pp. 114-128.

Carazo J. M., E. H. Stelzer, The BioImage Database Project: organizing multidimensional biological images in an object-relational database. J. Struct. Biol., 125, 1999, pp. 97-102.

Gonzalez-Couto E., B. Hayes, A. Danckaert, The life sciences global image database. Nucleic Acids Res., 29, 2001, pp. 336-339.

Discala C., X. Benigni, E. Barillot, G. Vaysseix, DBcat: a catalog of 500 biological databases. Nucleic Acids Res., 28, 2000, pp. 8-9.

Eisen M. B., P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 1998, pp. 14863-14868.

Schwikowski B., P. Uetz, S. Fields, A network of protein-protein interactions in yeast. Nat. Biotechnol., 18, 2000, pp. 1257-1261.



 
