Data Space: Protocols and Services for Distributed Data Mining and Remote Data Analysis
R. L. Grossman, M. Mazzucco, and H. Sivakumar
Laboratory for Advanced Computing
University of Illinois at Chicago
Although 10 GB data sets are common today and data sets of 100 GB to 1 TB and larger are becoming common, we do not yet have the technology to casually explore remote data or to mine distributed data of this size. Over the next several years, as the amount of dark fiber grows, bandwidth is expected to become a commodity. In this environment, it is possible to imagine an infrastructure that allows the casual exploration and distributed mining of large data sets. In this talk, we introduce such an infrastructure.
Our philosophy is to develop an infrastructure similar to the web, but designed to support the casual exploration of remote columns of data and the mining of distributed columns. We introduce Data Space, an infrastructure for creating a web of data rather than documents, and illustrate some Data Space tools for the mining and analysis of remote and distributed scientific data.
We describe protocols for viewing, exploring, and mining remote and distributed data. We establish that these protocols are effective for distributed workstation clusters connected with high performance networks (super-clusters) and with commodity networks (meta-clusters). We also introduce tools that allow the casual analysis and mining of distributed data with a point-and-click interface over the commodity internet. We cover the design and implementation of these tools and services.
We also describe the Terra Wide Data Mining (TWDM) Testbed, which is an open, distributed testbed for Data Space tools, services, and protocols. It consists of ten sites distributed over three continents connected by high performance links. The testbed includes a variety of distributed scientific data mining applications, including the analysis of high energy physics data, earth science data, and network weather data.