Kernel classifiers and regressors designed for structured data, such as molecular structures, have significantly advanced a number of interdisciplinary areas, including computational biology and materials and drug design. Typically, a pairwise similarity function, called the kernel, is designed beforehand based on the structure of the data, either by exploiting statistics of the structures or by making use of probabilistic generative models; a discriminative classifier is then learned on top of the kernel via convex optimization. However, this elegant two-stage approach also prevents kernel methods from scaling up to millions of data points and from exploiting discriminative information to learn feature representations.
In this talk, I will present structure2vec, an effective and scalable approach for representing molecular structures, based on the idea of embedding latent variable models into feature spaces and learning such feature spaces using discriminative information. Interestingly, structure2vec extracts features by performing a sequence of nonlinear function mappings in a way similar to graphical model inference procedures, such as mean field and belief propagation. In applications involving millions of solar panel materials, we showed that structure2vec runs 2 times faster and produces models that are 10,000 times smaller, while at the same time achieving state-of-the-art property prediction performance.
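
As a rough illustration of the feature-extraction step described above, the following Python sketch applies a mean-field-style update on a toy molecular graph: each atom's embedding is repeatedly recomputed as a nonlinear function of its own features and the sum of its neighbors' embeddings, and the atom embeddings are then summed into one vector for the molecule. This is a minimal sketch under stated assumptions, not the implementation behind the results quoted here: the function name `embed_graph`, the ReLU nonlinearity, the embedding size, the number of rounds, and the random weights `W1` and `W2` are all illustrative choices. In the approach described in this talk, such parameters would be learned from the prediction target rather than drawn at random.

```python
import random


def relu(v):
    """Elementwise max(0, x) on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def vec_add(u, v):
    """Elementwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def embed_graph(node_feats, neighbors, W1, W2, rounds=3):
    """Mean-field-style embedding of a graph (illustrative sketch).

    node_feats: one feature vector per node (e.g. one-hot atom type)
    neighbors:  neighbors[i] lists the node indices bonded to node i
    W1, W2:     weight matrices (random here; assumed learned in practice)
    """
    d = len(W1)
    mu = [[0.0] * d for _ in node_feats]  # all embeddings start at zero
    for _ in range(rounds):
        new_mu = []
        for i, x in enumerate(node_feats):
            # Aggregate the current embeddings of the neighbors of node i.
            agg = [0.0] * d
            for j in neighbors[i]:
                agg = vec_add(agg, mu[j])
            # Nonlinear map of own features plus aggregated neighbor state.
            new_mu.append(relu(vec_add(matvec(W1, x), matvec(W2, agg))))
        mu = new_mu  # synchronous update of all nodes
    # Pool node embeddings into a single graph-level feature vector.
    graph_vec = [0.0] * d
    for m in mu:
        graph_vec = vec_add(graph_vec, m)
    return graph_vec


# Toy example: an H-O-H chain with one-hot atom types [H, O].
rng = random.Random(0)
d = 4  # embedding dimension (arbitrary for this sketch)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(d)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
neighbors = [[1], [0, 2], [1]]
vec = embed_graph(node_feats, neighbors, W1, W2)
```

Because the update only sums over neighbors and the final pooling sums over nodes, the resulting vector does not depend on how the atoms are numbered. The vector `vec` could then be passed to any downstream regressor or classifier for property prediction.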