The demands for and on data mining continue to multiply. The timely extraction of useful insight from data collections is widely recognized to be critical to business and government decision making. Yet, the scale and complexity of data sets are growing rapidly. For time-critical mining of large data sets, high performance computing is becoming essential.
The existing high performance computing world evolved to serve the needs of scientific and engineering computations. However, the computational needs of data mining can be quite different. How well can existing high performance machines and paradigms address the needs of data miners? This talk will attempt to answer this question for one particularly challenging realm of data mining: graph analysis. For very unstructured data sets, graph algorithms have a number of features that are a poor match for traditional high performance machines. Their data access patterns don't map well to caches and hierarchical memories. Their parallelism tends to be fine-grained and variable. These features have impeded successful parallelization of graph algorithms on traditional supercomputers. However, a non-traditional approach to high performance computing known as massive multithreading holds promise for these kinds of applications. The talk will describe work applying massive multithreading to graph analysis problems and suggest larger lessons and research directions for the broader data mining community.