A common problem in modern statistical applications is to select, from a large set of candidates, a subset of variables that are important for determining an outcome of interest. For instance, the outcome may be disease status and the variables may be hundreds of thousands of single nucleotide polymorphisms on the genome. In this talk, we present an entirely new read of the knockoffs framework of Barber and Candès (2015), which proposes a general solution for performing variable selection under rigorous type-I error control, without relying on strong modeling assumptions. We show how to apply this solution to a rich family of problems where the distribution of the covariates can be described by a hidden Markov model (HMM). In particular, we develop an exact and efficient algorithm to sample knockoff copies of an HMM, and then argue that, combined with the knockoffs selective framework, these copies provide a natural and powerful tool for performing principled inference in genome-wide association studies with guaranteed false discovery rate (FDR) control. Finally, we apply our methodology to several datasets aimed at studying Crohn's disease and several continuous phenotypes, e.g. cholesterol levels.
This is joint work with Rina Barber, Yingying Fan, Lucas Janson, Jinchi Lv, Chiara Sabatti and Matteo Sesia.