Les Grivell
European Molecular Biology Organisation, Heidelberg, Germany
Received 18 February 2001, published online 20 June 2001
© 2001 International Council for Science, UNESCO and individual contributors
Abstract
In the summer of 1989, the Genetics Society of America organized one of its bi-annual meetings on Yeast Genetics. I was a participant at that meeting, and two sessions still stand out for me. The first, a part of the official programme, was a session entitled "Who's working on my/your gene?" The other was a brief, slightly conspiratorial meeting of a handful of European scientists gathered around the conference hotel's Steinway, focused on preparations for an ambitious, EU-funded project to determine the complete DNA sequence of the genome of the yeast Saccharomyces cerevisiae. The outcome of the first session was a long list of genes, each remarkable for being involved in so many apparently different cellular processes that it had emerged time after time in different guises and under different names in different laboratories. The session clearly illustrated the benefits of sharing information. It also highlighted the need for well-structured databases capable of storing and retrieving different types of information derived from many different experimental techniques, in a way that would allow researchers to construct as complete a picture as possible of all the facets of a given gene and its functional relationships to others involved in the same or related cellular processes.
Genomics and the Information Explosion
Figure 1

Figure 2
The fields of genomics and bio-informatics have firmly established themselves in research programmes and teaching curricula, together with the related areas of functional and structural genomics and their derived specializations of transcriptomics, proteomics and metabolomics, which deal, respectively, with all the RNAs, proteins and metabolites present in a cell.
Common to all these areas is the production of vast amounts of raw data. Common, too, is the increasing dependence on the internet as a means of disseminating or acquiring data and of providing access to specialized software for analysis. Paper is turning out to be an inadequate medium for the flood of new data, which often demands both further manipulation and new methods of visualization as an aid to interpretation. More often than not, paper publications contain little more than summary pointers to data tables that are too large to print, or to videos and multi-dimensional images that cannot be printed. It is against this background of changing practices and expectations that bio-medical researchers have also come to question established editorial, reviewing and publishing practices and even to reconsider the nature of the publication itself.
From Data to Knowledge
For some, this veritable embarras de richesses of data is seen as the death knell for hypothesis-driven research and the dawn of an era in which data-mining will generate novel leads and concepts for innovative research. For others, it signals just the opposite - a means of enabling biologists to construct, for the first time, precise, detailed and experimentally verifiable models of cellular function. Either way, the success of data analysis depends on the ready availability of as complete a set of data as possible. Several recent developments are likely to contribute to the achievement of this ideal:
Encouraging though these developments are, a number of hurdles still have to be cleared. They include:
Figure 3

Figure 4
With respect to this last point, PubMed ( http://www.ncbi.nlm.nih.gov:80/entrez/query/static/overview.html) in the USA has provided a first set of tools for searching and retrieval of information from the MEDLINE collection of abstracts ( http://www.nlm.nih.gov/databases/freemedl.html), which is linked to DNA and protein sequence databases. The system is interlinked at the level of keywords and identifiers. There are, however, clear needs for innovation and refinement: to increase the sophistication of search algorithms, to develop methods for searching full-text publications, to develop better discriminative criteria for interlinking and establishing relationships between published documents, and to link publications with data in a variety of other formats, including structures, images and animations. E-BioMed (subsequently PubMed Central - http://www.pubmedcentral.nih.gov/), the first attempt to establish a single site for the storage and retrieval of electronic text and data, was an important initiative of the NIH in the USA. Unfortunately, however, controversial aspects of the proposal prevented its implementation as originally conceived: peer review would have been absent or at best optional, and the aim of distributing, without charge, content owned by others was unrealistic. Even now, the more important issues mentioned above are being pushed into the background by a controversial call ( http://www.publiclibraryofscience.org/) for open access to the published literature, aimed at individual scientists and issued by a number of the original proponents of PubMed Central. The call, in the form of an open letter to publishing organisations, encourages scientists to show their support for open access by pledging to publish in, review for or edit only those journals that grant unrestricted distribution rights to PubMed Central and similar entities within 6 months of publication.
Inexplicably, the call focuses only on primary journal publications, ignoring a potentially much more serious problem: a growing tendency towards limited access to database information. This problem has been highlighted recently by the acceptance, by the publisher of the journal Science, of Celera's terms ( http://www.celera.com/) for the release of its human genome sequence data. The data will not be submitted to public databanks, and access at Celera's own site will be restricted to those agreeing not to "redistribute" the information. The implications of the latter restriction are crippling since, depending on the exact interpretation of what is meant by redistribution, they may well extend to severe limitation of the freedom to publish studies based on the data, to carry out large-scale bio-informatic analysis and to incorporate derived data into other databases.
The E-BioSci Platform for Information Access and Retrieval
It is against this confused background that EMBO, the European Molecular Biology Organisation, has decided to take the lead in a collaborative effort to establish E-BioSci as a European-based information resource network with a global role. A series of discussions (see http://www.embo.org/E_Pub_pages.html) with interested parties (including research organisations, learned societies, publishers, individual research scientists and representatives of a large number of EU member states) identified the shortcomings of earlier proposals and led to the formulation of the current initiative. This defines E-BioSci as a networked platform that will extensively combine the skills and content already present or being developed in various centres in Europe. It will work in harmony with other global initiatives such as PubMed Central, and with publishers and other information providers. Although superficially more complex, this setup more accurately reflects the European dimension of the project. Additionally, it offers potential advantages in terms of speed of access and provision of backup or secure storage facilities, and it will allow queries to be performed in different language formats.
By providing an extensive set of linkages through the biological information chain E-BioSci will:
The E-BioSci network will:
As indicated in these last two points, E-BioSci will, besides acting as an information portal, provide hosting services for electronic publications. The aim here will be to provide a platform for the dissemination of material that has previously undergone peer review and authentication by an independent body. E-BioSci need not be the sole repository of such material, and authors may choose to submit their reviewed and authenticated manuscripts to as many sites as they wish. This emphasis on a reliable form of quality assessment and control distinguishes E-BioSci from a number of other e-publishing initiatives, including those modeled on the Los Alamos Physics Archive (e.g. the eprint-based Cogprints server - http://www.eprints.org/), or commercially based services such as those offered by BioMed-Central ( http://www.biomedcentral.com/). One of the main issues here is that authors rely on the perceived quality of their publications as support for funding applications and career advancement, and are thus likely to be reluctant to abandon a tried and trusted model of assessment in the absence of reliable and widely accepted alternatives. Additionally, from the reader's point of view, some degree of editorial control is, at least in part, a guarantee that technical standards have been met, that the conclusions are adequately supported by the experimental data and that the presentation meets acceptable standards of clarity. In cases in which a submission is accompanied by significant amounts of supplementary data, the peer review process also provides an appropriate opportunity for watermarking both manuscript and data to protect against tampering at a later stage.
Conclusions and Prospects
Just as the emerging field of genomics is changing the way in which molecular biologists plan, execute and interpret their research, so is the transition from traditional to electronic publishing technologies changing the ways in which the results of this research are disseminated to and used by other scientists. In this brief overview, I have presented a perspective largely based on that of the individual scientist, who wishes to have free, or at least unhindered, access to as wide a range of electronic information sources as possible, to be able to navigate effortlessly between them and to search, select, integrate and manipulate information without leaving his or her desk. I have outlined a number of recent developments that will contribute to the achievement of this goal. E-BioSci is one of these. Much still remains to be done, however, and a brief wish list of a typical user might include:
References
Goffeau A., Four years of post-genomic life with 6,000 yeast genes. Review. FEBS Lett., 480(1), 2000, pp. 37-41.
Etzold T., A. Ulyanov, P. Argos, SRS: Information retrieval system for molecular biology data banks. Methods Enzymol., 266, 1996, pp. 114-128.
Carazo J. M., E. H. Stelzer, The BioImage Database Project: organizing multidimensional biological images in an object-relational database. J. Struct. Biol., 125, 1999, pp. 97-102.
Gonzalez-Couto E., B. Hayes, A. Danckaert, The life sciences global image database. Nucleic Acids Res., 29, 2001, pp. 336-339.
Discala C., X. Benigni, E. Barillot, G. Vaysseix, DBcat: a catalog of 500 biological databases. Nucleic Acids Res., 28, 2000, pp. 8-9.
Eisen M. B., P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A., 95, 1998, pp. 14863-14868.
Schwikowski B., P. Uetz, S. Fields, A network of protein-protein interactions in yeast. Nat. Biotechnol., 18, 2000, pp. 1257-1261.