White Paper: Science at Extreme Scales: Where Big Data Meets Large-Scale Computing
This white paper is an outcome of IPAM’s fall 2018 Long Program, Science at Extreme Scales: Where Big Data Meets Large-Scale Computing.
Computing has revolutionized science: simulation, or model-based computing, has allowed us to investigate phenomena far more complex than theoretical analysis alone can access, and data-based computing is now allowing us to more effectively explore, understand, and use data resulting from experiments, observation, and simulation. The simulation community has long driven High Performance Computing (HPC), and the data science community has driven the Big Data (BD) revolution. These two computing approaches have usually been addressed independently, but the need for HPC in data-based computing and the overwhelming availability of data in model-based computing indicate that the integration of HPC and BD can forge even stronger links between theory and experiment or observation, a bridge enabling the fusion of the two communities’ methodologies. By hosting this Long Program, IPAM took a leading role in fostering fruitful conversations across a range of scientific disciplines, allowing the mathematical and neighboring sciences communities to consider this topic of groundbreaking potential both more deeply and more broadly.
This convergence comes at a critical time. We are entering a period where we have the capability to transition from interpretive qualitative models to truly predictive quantitative models of complex systems through computing. However, to realize this goal, we must deal with increasing complexity in models, algorithms, software, and hardware. The exponential growth in computing capability driven by Moore’s law is stalling, and HPC hardware architectures are necessarily becoming more complex. Accordingly, to implement the more complex models as well as to make good use of these new architectures, algorithms are becoming more complex, too. The fusion of theory- and data-based computing concepts will help tame this complexity and access more of the available computing power.
The new computational science, which will advance by leveraging the best of simulation and data science, is not, however, a foregone conclusion. Much work remains to be done. To further advance scientific understanding through data analysis techniques such as Machine Learning (ML), methods must be devised to exploit domain knowledge and to enable the interpretation of ML-derived models. Information integration is a paramount step towards a predictive science, but existing approaches have limitations that become evident in large-scale applications. The logistics of data management are not to be overlooked as an area in need of advances; data must be findable, accessible, interoperable, and reusable to enable this theory- and data-driven future. Data and information are not the same thing, and we must make judicious use of all forms of data reduction techniques to preserve information content efficiently; a small illustrative sketch follows below. Finally, both the changing hardware landscape and the increased availability and use of data will require the development of new algorithms. In this report, we summarize the ideas that emerged in the many discussions held during the IPAM Long Program, describing these topics and the outlook for a new computational science paradigm in more detail.
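As a concrete illustration of the data-reduction point above, the following Python/NumPy sketch compresses a synthetic low-rank data matrix with a truncated singular value decomposition (SVD), keeping only enough singular values to retain 99% of the variance. Truncated SVD is one standard reduction technique among the many alluded to here; the synthetic matrix, the 99% threshold, and all names in the sketch are illustrative assumptions, not methods prescribed by the program.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic "simulation output": 1000 snapshots of a 500-dimensional
    # field, constructed to be approximately rank 20 plus a little noise
    # (an assumption made purely for illustration).
    X = rng.standard_normal((1000, 20)) @ rng.standard_normal((20, 500))
    X += 0.01 * rng.standard_normal(X.shape)

    # Truncated SVD: keep the smallest rank k whose singular values
    # capture 99% of the total variance (energy) in the data.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, 0.99)) + 1

    # Reconstruct from the k retained components and measure what was lost.
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]
    rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
    stored = U[:, :k].size + s[:k].size + Vt[:k].size
    print(f"rank {k}: relative error {rel_err:.2e}, "
          f"storage {stored / X.size:.1%} of the original")

The point of the sketch is the trade-off itself: nearly all of the information content survives in a small fraction of the original storage, which is exactly the kind of judicious reduction the paragraph above calls for.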
Read the full report.