Actions

CLIP Colloquium (Fall 2012)

Computational Linguistics and Information Processing

Revision as of 21:55, 23 October 2012 by Jimmylin (talk | contribs)

08/20/2012: TopSig – Signature Files Revisited

Speaker: Shlomo Geva, Queensland University of Technology, Australia
Time: Monday, August 20, 2012, 11:00 AM
Venue: AVW 2120

Abstract: Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a modern approach to the construction of file signatures - many advances in semantic hashing and dimensionality reduction have been made in recent times, but these were not so far linked to general purpose, signature file based, search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature file based indexing and retrieval, performance that is comparable to that of state of the art inverted file based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and position the file signature model in the class of Vector Space retrieval models. TopSig is an open-source search engine from QUT and it can be discussed too if there is an interest.

About the Speaker: Associate Professor Shlomo Geva is the discipline leader for Computational Intelligence and Signal Processing in the Computer Science Department at the Queensland University of Technology in Brisbane, Australia. His research interests include clustering, cross-language information retrieval, focused information retrieval, link discovery, and xml indexing.

Host: Doug Oard, oard@umd.edu

09/05/2012: 5 Minute Madness (Part I)

09/12/2012: 5 Minute Madness (Part II)

09/19/2012: CoB: Pairwise Similarity on Large Text Collections with MapReduce

Speaker: Earl Wagner, University of Maryland
Time: Wednesday, September 19, 2012, 11:00 AM
Venue: AVW 3258

Faced with high-volume information streams, intelligence analysts often rely on standing queries to retrieve materials that they need to see. Results of these queries are currently extended by effective and efficient probabilistic techniques that find similar, non-matching content. We discuss research looking further afield to find additional useful documents via MapReduce techniques performing rapid clustering of documents. This approach is intended to provide an improved “peripheral vision” to overcome some blind spots, yielding both immediate utility (detection of documents that otherwise would not have been found) and the potential for improvements to specific standing queries.

About the Speaker: Earl J. Wagner is a Postdoctoral Research Associate at the University of Maryland, College Park in the College of Information Studies (Maryland's iSchool). He was previously a Research Assistant at Northwestern University where he earned his Ph.D. in Computer Science.

09/26/2012: Better! Faster! Stronger (theorems)! Learning to Balance Accuracy and Efficiency when Predicting Linguistic Structures

Speaker: Hal Daume III, University of Maryland
Time: Wednesday, September 26, 2012, 11:00 AM
Venue: AVW 3258

Viewed abstractly, many classic problems in natural language processing can be cast as trying to map a complex input (eg., a sequence of words) to a complex output (eg., a syntax tree or semantic graph). This task is challenging both because language is ambiguous (learning difficulties) and represented with discrete combinatorial structures (computational difficulties). I will describe my multi-pronged research effort to develope learning algorithms that explicitly learn to trade-off accuracy and efficiency, applied to a variety of language processing phenomena. Moreover, I will show that in some cases, we can actually obtain model that is faster and more accurate by exploiting smarter learning algorithms. And yes, those algorithms come with stronger theoretical guarantees too.

The key insight that makes this possible is a connection between the task of predicting structured objects (what I care about) and imitation learning (a subfield in robotics). This insight came about as a result of my work a few years ago, and has formed the backbone of much of my work since then. These connections have led other NLP and robotics researchers to make their own independent advances using many of these ideas.

At the end of the talk, I'll briefly survey some of my other contributions in the areas of domain adaptation and multilingual modeling, both of which also fall under the general rubric of "what goes wrong when I try to apply off-the-shelf machine learning models to real language processing problems?"

10/03/2012: Consistent and Efficient Algorithms for Latent-Variable PCFGs

Speaker: Shay Cohen, Columbia University
Time: Wednesday, October 3, 2012, 11:00 AM
Venue: AVW 3258

In the past few years, there has been an increased interest in the machine learning community in spectral algorithms for estimating models with latent variables. Examples include algorithms for estimating mixture of Gaussians or for estimating the parameters of a hidden Markov model.

The EM algorithm has been the mainstay for estimation with latent variables, but because it is not guaranteed to converge to a global maximum of the likelihood, it is not a consistent estimator. Spectral algorithms, on the other hand, are often shown to be consistent.

In this talk, I am interested in presenting a spectral algorithm for latent-variable PCFGs, a model widely used in the NLP community for parsing. This model, originally introduced by Matsuzaki et al. (2005), augments with a latent state the nonterminals in an underlying PCFG grammar. These latent states re-fine the nonterminal category in order to capture subtle syntactic nuances in the data. This model has been successfully implemented in state-of-the-art parsers such as the Berkeley parser (Petrov et al., 2006).

Our spectral algorithm for latent-variable PCFGs is based on a novel tensor formulation designed for inference with PCFGs. This tensor formulation yields an "observable operator model" for PCFGs which can be readily used for spectral estimation.

The algorithm we developed is considerably faster than EM, and makes only one pass over the data. Statistics are collected from the data in this pass, and singular value decomposition is performed on matrices containing these statistics. Our algorithm is also provably consistent in the sense that, given enough samples, it will estimate probabilities for test trees close to their true probabilities under the latent-variable PCFG model.

If time permits, I will also present a method to improve the efficiency of parsing with latent-variable PCFGs. This method relies on tensor decomposition of the latent-variable PCFG. The tensor decomposition is approximate, and therefore the new parser is an approximate parser as well. Still, the quality of approximation can be guaranteed theoretically by inspecting how errors from the approximation propagate in the parse trees.

10/10/2012: Beyond MaltParser - Advances in Transition-Based Dependency Parsing

Speaker: Joakim Nivre, Uppsala University / Google
Time: Wednesday, October 10, 2012, 11:00 AM
Venue: AVW 3258

The transition-based approach to dependency parsing has become popular thanks to its simplicity and efficiency. Systems like MaltParser achieve linear-time parsing with projective dependency trees using locally trained classifiers to predict the next parsing action and greedy best-first search to retrieve the optimal parse tree, assuming that the input sentence has been morphologically disambiguated using a part-of-speech tagger. In this talk, I survey recent developments in transition-based dependency parsing that address some of the limitations of the basic transition-based approach. First, I show how globally trained classifiers and beam search can be used to mitigate error propagation and enable richer feature representations. Secondly, I discuss different methods for extending the coverage to non-projective trees, which are required for linguistic adequacy in many languages.Finally, I present a model for joint tagging and parsing that leads to improvements in both tagging and parsing accuracy as compared to the standard pipeline approach.

About the Speaker: Joakim Nivre is Professor of Computational Linguistics at Uppsala University and currently visiting scientist at Google, New York. He holds a Ph.D. in General Linguistics from the University of Gothenburg and a Ph.D. in Computer Science from Växjö University. Joakim's research focuses on data-driven methods for natural language processing, in particular for syntactic and semantic analysis. He is one of the main developers of the transition-based approach to syntactic dependency parsing, described in his 2006 book Inductive Dependency Parsing and implemented in the MaltParser system. Joakim's current research interests include the analysis of mildly non-projective dependency structures, the integration of morphological and syntactic processing for richly inflected languages, and methods for cross-framework parser evaluation. He has produced over 150 scientific publications, including 3 books, and has given nearly 70 invited talks at conferences and institutions around the world. He is the current secretary of the European Chapter of the Association for Computational Linguistics.

Host: Hal Daume III, hal@umd.edu