
Events: Difference between revisions

Computational Linguistics and Information Processing

Revision as of 13:00, 25 November 2012

The CLIP Colloquium is a weekly speaker series organized and hosted by the CLIP Lab. The talks are open to everyone. Unless otherwise noted, talks are held at 11 AM in AV Williams 3258. External speakers typically have slots for one-on-one meetings with Maryland researchers before and after their talks; contact the host if you would like to schedule a meeting.

To join the cl-colloquium@umiacs.umd.edu list, or for other questions about the colloquium series, e-mail Jimmy Lin, the current organizer.



12/05/2012: Combining Statistical Translation Techniques for Cross-Language Information Retrieval

Speaker: Ferhan Ture, University of Maryland
Time: Wednesday, December 5, 2012, 11:00 AM
Venue: AVW 3258

Cross-language information retrieval today is dominated by techniques that rely principally on context-independent token-to-token mappings despite the fact that state-of-the-art statistical machine translation systems now have far richer translation models available in their internal representations. This paper explores combination-of-evidence techniques using three types of statistical translation models: context-independent token translation, token translation using phrase-dependent contexts, and token translation using sentence-dependent contexts. Context-independent translation is performed using statistically-aligned tokens in parallel text, phrase-dependent translation is performed using aligned statistical phrases, and sentence-dependent translation is performed using those same aligned phrases together with an n-gram language model. Experiments on retrieval of Arabic, Chinese, and French documents using English queries show that no one technique is optimal for all queries, but that statistically significant improvements in mean average precision over strong baselines can be achieved by combining translation evidence from all three techniques. The optimal combination is, however, found to be resource-dependent, indicating a need for future work on robust tuning to the characteristics of individual collections.

This is a practice talk for COLING 2012.
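
As a rough illustration of the combination-of-evidence idea in the abstract above, the sketch below linearly interpolates three translation distributions for a single query token. The interpolation scheme, the function name `combine_translations`, the weights, and the example probabilities are all assumptions made for illustration; the abstract does not specify how the evidence is actually combined.

```python
from collections import defaultdict


def combine_translations(distributions, weights):
    """Linearly interpolate translation distributions for one query token.

    distributions: list of dicts mapping target token -> probability.
    weights: one non-negative weight per distribution (normalized here).
    """
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    combined = defaultdict(float)
    for dist, weight in zip(distributions, weights):
        for target, prob in dist.items():
            combined[target] += (weight / total) * prob
    return dict(combined)


# Hypothetical probabilities for the English query token "bank" (French targets).
context_independent = {"banque": 0.6, "rive": 0.3, "banc": 0.1}
phrase_dependent = {"banque": 0.8, "rive": 0.2}
sentence_dependent = {"banque": 1.0}

combined = combine_translations(
    [context_independent, phrase_dependent, sentence_dependent],
    [0.5, 0.25, 0.25],  # hypothetical weights; would need tuning per collection
)
```

Because the abstract reports that the best combination is resource-dependent, the weights in a sketch like this would have to be tuned separately for each collection rather than fixed globally.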

01/30/2013: Human Translation and Machine Translation

Speaker: Philipp Koehn, University of Edinburgh
Time: Wednesday, January 30, 2013, 11:00 AM
Venue: AVW 3258

04/10/2013: Learning with Marginalized Corrupted Features

Speaker: Kilian Weinberger, Washington University in St. Louis
Time: Wednesday, April 10, 2013, 11:00 AM
Venue: AVW 3258

If infinite amounts of labeled data are provided, many machine learning algorithms become perfect. With finite amounts of data, regularization or priors have to be used to introduce bias into a classifier. We propose a third option: learning with marginalized corrupted features. We corrupt existing data as a means to generate infinitely many additional training samples from a slightly different data distribution, explicitly in a way that the corruption can be marginalized out in closed form. This leads to machine learning algorithms that are fast, effective, and naturally scale to very large data sets. We showcase this technology in two settings: (1) learning text document representations from unlabeled data, and (2) performing supervised learning with closed-form gradient updates for empirical risk minimization.

Text documents (and often images) are traditionally expressed as bag-of-words feature vectors (e.g. as tf-idf). By training linear denoisers that recover unlabeled data from partial corruption, we can learn new data-specific representations. With these, we can match the world-record accuracy on the Amazon transfer learning benchmark with a simple linear classifier. In comparison with the record holder (stacked denoising autoencoders) our approach shrinks the training time from several days to a few minutes.

Finally, we present a variety of loss functions and corrupting distributions, which can be applied out-of-the-box with empirical risk minimization. We show that our formulation leads to significant improvements in document classification tasks over the typically used l_p norm regularization. The new learning framework is extremely versatile, generalizes better, is more stable at test time (under distribution drift), and adds only a few lines of code to typical risk minimization.
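
To make "marginalized out in closed form" concrete, here is one worked case under assumptions that the abstract does not state: quadratic loss, and unbiased blankout corruption that independently zeroes each feature with probability q and rescales surviving features by 1/(1-q). The symbols (w, x_n, y_n, q, N, D) are introduced here for illustration only.

```latex
% Assumptions (not from the abstract): quadratic loss; unbiased blankout
% noise with rate q, applied independently to each feature d of example n.
\begin{align*}
\tilde{x}_{nd} &=
  \begin{cases}
    0 & \text{with probability } q,\\
    x_{nd}/(1-q) & \text{with probability } 1-q,
  \end{cases}
\\[4pt]
\mathbb{E}[\tilde{x}_{nd}] &= x_{nd},
\qquad
\operatorname{Var}[\tilde{x}_{nd}] = \frac{x_{nd}^{2}}{1-q} - x_{nd}^{2}
  = \frac{q}{1-q}\,x_{nd}^{2},
\\[4pt]
\mathcal{L}(w) &= \sum_{n=1}^{N}
  \mathbb{E}\!\left[\left(w^{\top}\tilde{x}_{n} - y_{n}\right)^{2}\right]
  = \sum_{n=1}^{N} \left(w^{\top}x_{n} - y_{n}\right)^{2}
  + \frac{q}{1-q} \sum_{d=1}^{D} w_{d}^{2} \sum_{n=1}^{N} x_{nd}^{2}.
\end{align*}
```

Under these assumptions, the expectation over infinitely many corrupted copies reduces to ordinary least squares plus a data-dependent quadratic penalty on w, so no corrupted samples ever need to be generated.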

About the Speaker: Kilian Q. Weinberger is an Assistant Professor in the Department of Computer Science & Engineering at Washington University in St. Louis. He received his Ph.D. in Machine Learning from the University of Pennsylvania under the supervision of Lawrence Saul. Prior to this, he obtained his undergraduate degree in Mathematics and Computer Science at the University of Oxford. During his career he has won several best paper awards at ICML, CVPR, and AISTATS. In 2011 he was awarded the AAAI senior program chair award, and in 2012 he received the NSF CAREER award. Kilian Weinberger's research is in Machine Learning and its applications. In particular, he focuses on high-dimensional data analysis, metric learning, machine-learned web-search ranking, transfer and multi-task learning, and biomedical applications.


Previous Talks