Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. In the first part of the talk, I'll introduce WSABIE, a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method outperforms several baseline methods while being faster and consuming less memory. We also show that the learned model is interpretable: annotations with alternate spellings, or even in different languages, lie close together in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations.
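To make the joint embedding idea concrete, here is a minimal sketch (not the authors' implementation) of how annotations can be ranked for an image once both have been mapped into a shared low-dimensional space; the dimensions, variable names, and random initialization below are illustrative assumptions, and the actual training step that optimizes precision at k is omitted.

```python
import numpy as np

# Sketch only: images and annotations live in one shared low-dimensional
# space, and annotations are ranked for an image by their inner product
# with the image's embedding. All sizes here are assumed for illustration.
rng = np.random.default_rng(0)

image_dim, n_annotations, embed_dim = 4096, 10_000, 100

V = rng.normal(scale=0.01, size=(embed_dim, image_dim))      # image -> shared space
W = rng.normal(scale=0.01, size=(n_annotations, embed_dim))  # one row per annotation

def rank_annotations(image_features):
    """Rank every annotation for one image by similarity in the joint space."""
    image_embedding = V @ image_features   # project the image into the shared space
    scores = W @ image_embedding           # dot product with every annotation embedding
    return np.argsort(-scores)             # highest-scoring annotations first

# During training, V and W would be learned with a ranking loss that
# rewards placing the correct annotations near the top of this list.
ranked = rank_annotations(rng.normal(size=image_dim))
print(ranked[:5])  # indices of the top-5 predicted annotations
```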
In the second part of the talk, I'll show how the same approach, WSABIE, can be extended to the multi-task case, where one simultaneously learns to embed various kinds of music-related information, such as artist names, music genres, and audio tracks, in the same space in order to optimize different but related costs.
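As a rough illustration of that multi-task setup (an assumed structure, not the talk's exact formulation), the sketch below places artists, genres, and audio tracks in one shared space and reuses the same similarity-based ranking rule for each task; the entity counts, feature sizes, and function names are hypothetical.

```python
import numpy as np

# Sketch only: several music-related entity types share one embedding
# space, and each task ranks one entity type against a query from another.
rng = np.random.default_rng(0)
embed_dim, audio_dim = 100, 512
n_artists, n_genres = 5_000, 200

artist_emb = rng.normal(scale=0.01, size=(n_artists, embed_dim))
genre_emb  = rng.normal(scale=0.01, size=(n_genres, embed_dim))
A = rng.normal(scale=0.01, size=(embed_dim, audio_dim))  # audio features -> shared space

def embed_audio(audio_features):
    """Map raw audio features into the shared embedding space."""
    return A @ audio_features

def rank_entities(query_vec, entity_emb):
    """Shared ranking rule used by every task: similarity in the joint space."""
    return np.argsort(-(entity_emb @ query_vec))

track = embed_audio(rng.normal(size=audio_dim))
print(rank_entities(track, artist_emb)[:3])  # one task: match a track to likely artists
print(rank_entities(track, genre_emb)[:3])   # a related task: match the same track to genres
```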