BIOZON: a unified knowledge resource on DNA sequences, proteins, complexes and cellular pathways

Golan Yona
Cornell University
Computer Science

The function of genes depends on their extended biological context - their relations to other genes, the set of interactions they form, the
pathways they participate in, their subcellular location, and so on. Given this view, there is a growing need to corroborate and integrate
data from different resources and aspects of biological systems in order to analyze new genes effectively. To address this need, the BIOZON project aims to construct a new unified biological
resource and a comprehensive protein and DNA characterization, classification and management system that analyzes biological entities
ranging from genes to protein families, biochemical pathways and organisms. BIOZON is based on an extensive database schema that integrates information from a variety of resources, at the macro-molecular level as well as at the cellular level.

This resource already stores extensive information about more than 5,000,000 protein and DNA sequences (integrating sequence, structure,
protein-protein interactions, pathways and expression data), totaling about 35 million documents from several different databases as well
as from in-house computations, and 2.5 billion relations between documents (including explicit relations between objects, and derived
or computed relations based on sequence similarities, profile-profile similarities, structural similarities and more). This knowledge
resource will be made available in early 2004 (a preliminary beta version is accessible at biozon.org) with the expectation that other
existing data types will be integrated gradually (by hosting other databases). Since new technologies keep generating new data types, the
database was designed to be as general as possible, to allow easy integration of future databases.

One of the unique aspects of BIOZON is that it allows complex queries that span multiple data types (e.g. a protein sequence, a structure,
and a pathway). For example, using the web interface, one can easily form a query that will return all proteins that are known to
participate in known pathways and have a solved 3D structure. Or, one can ask for all protein structures of proteins that are involved in
known interactions. Another example is a query for all DNA sequences that code for protein kinases belonging to enzyme families that catalyze
reactions in known pathways. With the integration of similarity data we intend to extend these queries to accommodate fuzzy
relations.
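
To make the first example query concrete, here is a minimal sketch in Python of how a query spanning several data types might be evaluated over a graph of typed documents and typed relations. It is an illustration only and not the BIOZON schema or interface: the class ToyGraph, the relation names "participates_in" and "has_structure", and all identifiers (P1, W1, S1, ...) are hypothetical.

```python
from collections import defaultdict


class ToyGraph:
    """A toy in-memory graph of typed documents and typed relations.

    Hypothetical illustration only; it does not reflect the BIOZON schema.
    """

    def __init__(self):
        self.doc_type = {}              # document id -> document type
        self.edges = defaultdict(set)   # (source id, relation) -> target ids

    def add_doc(self, doc_id, doc_type):
        self.doc_type[doc_id] = doc_type

    def relate(self, source_id, relation, target_id):
        self.edges[(source_id, relation)].add(target_id)

    def docs_of_type(self, doc_type):
        return {d for d, t in self.doc_type.items() if t == doc_type}

    def having(self, doc_ids, relation, target_type):
        """Keep documents with at least one `relation` edge to a `target_type` document."""
        return {
            d for d in doc_ids
            if any(self.doc_type.get(t) == target_type
                   for t in self.edges.get((d, relation), ()))
        }


graph = ToyGraph()

# Made-up documents of three different data types.
for doc_id, doc_type in [
    ("P1", "protein"), ("P2", "protein"), ("P3", "protein"),
    ("W1", "pathway"), ("S1", "structure"), ("S2", "structure"),
]:
    graph.add_doc(doc_id, doc_type)

# Made-up relations: P1 has both a pathway and a structure,
# P2 has only a pathway, P3 has only a structure.
graph.relate("P1", "participates_in", "W1")
graph.relate("P1", "has_structure", "S1")
graph.relate("P2", "participates_in", "W1")
graph.relate("P3", "has_structure", "S2")

# "All proteins that participate in a known pathway and have a solved 3D structure."
proteins = graph.docs_of_type("protein")
in_pathway = graph.having(proteins, "participates_in", "pathway")
result = sorted(graph.having(in_pathway, "has_structure", "structure"))

print(result)
```

Each additional constraint in a query is simply another filtering step over the same graph, which is why keeping all data types in one store of documents and relations makes such combined queries straightforward to express.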

In this seminar I will present several elements of the BIOZON system. The system uses algorithms and mathematical models that we have
developed for detecting domains and similarities between proteins and protein families, as well as novel embedding techniques that we
use to construct a complete "road map" of the protein universe.


Back to Workshop I: High Throughput Technologies and Methods of Analysis