This talk surveys two areas of the atmospheric sciences in which interesting scientific problems hinge on the mining and modeling of large and complicated datasets. The examples point to some emerging areas of useful statistical methodology and illustrate the diversity of data, from observational datasets to model output.
To illustrate the close connection between empirical statistical models and large datasets, we discuss the problem of building weather generators for daily meteorological data for the Southeastern US. The goal is a stochastic model that provides weather inputs to crop yield models. In a broad sense we must mine the observational record to find parsimonious representations of daily weather. These models must reproduce the non-Gaussian distributions of weather variables, their temporal dependence, and the patchy spatial dependence of precipitation. We discuss the use of hierarchical and latent variable models for this purpose. A new addition is an observation-driven model that generates the temporal dependence in precipitation incidence.
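As a rough illustration of what an observation-driven model for precipitation incidence can look like, the sketch below lets the probability of a wet day depend on the previous day's observed wet/dry state through a logistic link, and draws wet-day amounts from a gamma distribution. This is a minimal sketch under stated assumptions, not the model presented in the talk: the function names, the one-day lag, the logistic form, the gamma amounts, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_occurrence(n_days, beta0, beta1, rng):
    """Simulate a 0/1 wet-day sequence.

    Assumed (hypothetical) form: logit P(wet today) = beta0 + beta1 * wet_yesterday,
    so the conditional probability is driven by the previously observed state.
    """
    wet = np.zeros(n_days, dtype=int)  # day 0 is taken to be dry
    for t in range(1, n_days):
        logit = beta0 + beta1 * wet[t - 1]
        p_wet = 1.0 / (1.0 + np.exp(-logit))
        wet[t] = int(rng.random() < p_wet)
    return wet


def simulate_amounts(wet, shape, scale, rng):
    """Draw non-Gaussian (gamma) precipitation amounts on wet days, zero on dry days.

    The gamma distribution is an assumption of this sketch.
    """
    amounts = np.zeros(len(wet))
    n_wet = int(wet.sum())
    amounts[wet == 1] = rng.gamma(shape, scale, size=n_wet)
    return amounts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wet = simulate_occurrence(3650, beta0=-1.0, beta1=2.0, rng=rng)
    rain = simulate_amounts(wet, shape=0.8, scale=10.0, rng=rng)
    print("fraction of wet days:", wet.mean())
    print("mean wet-day amount:", rain[wet == 1].mean())
```

A positive beta1 makes wet days more likely to follow wet days, which is one simple way to produce persistence in the simulated record. Spatial dependence across stations is not addressed in this sketch.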
To mark a new area for statisticians we also present the analysis of deterministic numerical model output. Here we search for dominant regimes in the (simulated) atmosphere from a long (1M years) run of the Community Climate Model, an early general circulation model (GCM). GCMs lie at the heart of climate models and so are fundamental systems to understand. To a data miner this is a clustering exercise, but we make use of the fact that our data come from a dynamical system. The statistical approach is to link together several well-known techniques (neural network regression, K-means clustering, and cross-validation) in a way that yields some new results. Here we deal with the five-dimensional multivariate time series obtained by projecting the wind field (500 mb stream function) onto its first five principal components. Neural network regression models are used to determine the nonlinear autoregressive component in this multivariate time series. The Jacobians of the nonlinear map are then used to cluster the state space into regions with distinct dynamics. Having identified these regions, one can estimate average transit times between regimes and other useful statistics.
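The Jacobian-clustering step can be sketched as follows. This is only an illustrative outline under loud assumptions: the one-hidden-layer tanh network, its random (untrained) weights, the random stand-in for the five principal-component scores, and the helper names network_map, network_jacobians, and kmeans are all hypothetical. In the analysis described above the network would be fit to the time series, and cross-validation, which is omitted here, would guide choices such as the number of clusters.

```python
import numpy as np


def network_map(X, W1, b1, W2, b2):
    """One-hidden-layer map f(x) = W2 tanh(W1 x + b1) + b2 (stand-in for a fitted network)."""
    return np.tanh(X @ W1.T + b1) @ W2.T + b2


def network_jacobians(X, W1, b1, W2):
    """Analytic Jacobians df/dx at each row of X, shape (n, out_dim, in_dim).

    J(x) = W2 diag(1 - tanh^2(W1 x + b1)) W1
    """
    H = np.tanh(X @ W1.T + b1)   # hidden activations, shape (n, hidden)
    D = 1.0 - H ** 2             # derivative of tanh at each hidden unit
    return np.einsum("oh,nh,hi->noi", W2, D, W1)


def kmeans(points, k, n_iter, rng):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centers


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Stand-in for the five principal-component scores (hypothetical data).
    X = rng.normal(size=(500, 5))
    # Random weights stand in for a trained network (assumption).
    W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=8)
    W2 = rng.normal(size=(5, 8))
    J = network_jacobians(X, W1, b1, W2)
    # Cluster states by the local dynamics: each 5x5 Jacobian flattened to 25 numbers.
    labels, _ = kmeans(J.reshape(len(X), -1), k=3, n_iter=25, rng=rng)
    print("cluster sizes:", np.bincount(labels, minlength=3))
```

Clustering on the flattened Jacobians rather than on the states themselves groups points by how the map behaves locally, which is the sense in which the regions have distinct dynamics.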