I will take a pattern recognition perspective on how documents should be represented for tasks such as classification, clustering, and summarization. The main focus will be on trying to formalize some of the ideas of context, corpus-dependent feature extraction, and conditioning that are critical to working with such high-dimensional and complex data. I will illustrate some of these ideas on a small collection of articles from Science News. The goal is to provoke discussion and thought, as well as to explain the title of the talk.