I will start this talk by describing a recurring problem in unsupervised data exploration. Clustering is the task of grouping a set of objects so that objects in the same group (cluster) are more similar to each other, in some sense, than to objects in other clusters. Different techniques lead to different solutions: which one is the right one? We formulate a consensus clustering solution that integrates these different solutions and distills their good qualities. I will present this novel framework, based on nonnegative matrix factorization (NMF), and show how to handle large-scale problems. This work led to new computational algorithms for big-data NMF, one of the most critical tools in data science. These algorithms exploit structured random compression and were released as a very efficient software package for large-scale data analysis.
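To make the idea of NMF-based consensus clustering concrete, here is a minimal sketch. It is an illustration only and not the algorithm presented in the talk: it assumes the base clusterings are combined into a co-association matrix (the fraction of clusterings that place two items together), which is then approximated by a nonnegative factor H with a standard damped multiplicative update for symmetric NMF. The function names and the toy data are hypothetical, and the structured random compression step used for large-scale problems is omitted.

```python
import numpy as np

def coassociation(labelings):
    """Fraction of base clusterings that put items i and j in the same cluster."""
    L = np.asarray(labelings)                # shape (n_clusterings, n_items)
    same = L[:, :, None] == L[:, None, :]    # shape (n_clusterings, n_items, n_items)
    return same.mean(axis=0)

def symmetric_nmf(A, k, n_iter=500, seed=0, eps=1e-12):
    """Approximate A by H @ H.T with H >= 0 via damped multiplicative updates."""
    rng = np.random.default_rng(seed)
    H = rng.uniform(0.1, 1.0, size=(A.shape[0], k))
    for _ in range(n_iter):
        # Elementwise update; entries stay nonnegative because every factor is.
        H *= 0.5 + 0.5 * (A @ H) / (H @ (H.T @ H) + eps)
    return H

def consensus_labels(labelings, k):
    """Assign each item to the column of H where its weight is largest."""
    A = coassociation(labelings)
    H = symmetric_nmf(A, k)
    return H.argmax(axis=1)

# Toy input (assumed for illustration): three base clusterings of six items.
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition, different label names
    [0, 0, 1, 1, 1, 1],   # disagrees with the others on item 2
]
labels = consensus_labels(base, k=2)
print(labels)
```

The consensus labels are only defined up to a renaming of the clusters, so what matters is which items share a label, not the label values themselves.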
Bonus track:
I will briefly describe a unique art+technology project. We use emerging technologies to develop a new model of the "engaged museum," one that reaches out to the public of all ages and involves them in "reconnecting" works of art to their original context through interactive and gaming displays. The first prototype is now part of the permanent collection of the Nasher Museum at Duke University.