Processing and management of the ever-increasing amount of spoken and written information appears to be a huge challenge for statisticians, computer scientists, engineers and linguists; the situation is further aggravated by the explosive growth of the web, the largest known electronic document collection.
There is a pressing need for high-accuracy Information Retrieval (IR) systems, Speech Recognition systems, and “smart” Natural Language Processing (NLP) systems. For tackling many problems in these fields, most approaches rely on:
This workshop on Document Space has the goal of bringing together researchers in Mathematics, Statistics, Electrical Engineering, Computer Science and Linguistics; the hope is that a unified theory describing “document space” will emerge that will become the vehicle for the development of algorithms for tackling efficiently (both in accuracy and computational complexity) the challenges mentioned above.
Text documents are sequences of words, usually with high syntactic structure, where the number of distinct words per document ranges from a few hundreds to a few thousands. Much effort has been devoted to finding (e.g., through statistical means) useful low-dimensional representations of these inherently high-dimensional documents, that would facilitate NLP tasks such as document categorization, question answering, machine translation, unstructured information management, etc. Moreover, many of these tasks can be formulated as problems of clustering, outlier detection, and statistical modeling. Many important questions arise:
We expect that this workshop will lead the way toward well-justified answers (in terms of theory and experimental results) to the questions above, and, hopefully, contribute to a better understanding of the rich medium of language.
Damianos Karakos
(Johns Hopkins University)
Mauro Maggioni
(Yale University)
David Marchette
(Naval Surface Warfare Center)
Carey Priebe, Chair
(Johns Hopkins University)