CLIP Colloquium (Spring 2012)

Computational Linguistics and Information Processing

June 14: Practice Talks


  • 13:00 Jordan: Besting the Quiz Master: Crowdsourcing Incremental Classification Games
  • 13:30 Jags: Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
  • 14:00 An: SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations
  • 14:30 Jordan: Topic Models for Dynamic Translation Model Adaptation
  • 14:50 Yuening: Efficient Tree-Based Topic Modeling
  • 15:10 Ke: Modeling Images using Transformed Indian Buffet Processes
  • 15:30 Michael: Thesis defense practice talk

May 16: Dan Goldwasser: Towards Natural Instructions based Machine Learning


Machine learning is traditionally formalized and researched as the study of learning concepts and decision functions from labeled examples. We are interested in providing an alternative way of communicating knowledge to an automated system, by allowing a human teacher to interact with an automated learner using natural language instructions, thus allowing the teacher to communicate the relevant domain expertise to the learner without necessarily knowing anything about the internal representations used in the learning process. The process of learning a decision function is therefore viewed as a natural language lesson interpretation problem instead of learning from labeled examples. The lesson interpretation problem, framed as a structure prediction problem, is typically approached using supervised machine learning techniques which are often as costly and difficult as learning the original decision function.

In this talk I will discuss how to approach this learning problem without direct supervision, and present learning protocols which rely on indirect supervision originating from evaluating the learner's performance on the final concept taught. This learning scenario, which relies on the connection between the two learning tasks, can be generally applied to other learning problems in natural language processing. I will also discuss how to model this problem more broadly and show results in several natural language processing domains, such as textual entailment, transliteration and semantic parsing.

May 9: Planning for Fall

We'll discuss the future of the CLIP colloquium. On the agenda:

  • Whom to invite
  • How to run the colloquium
  • When to have the colloquium

May 2: Jacob Eisenstein: Searching for social meanings in social media

Social interaction is increasingly conducted through online platforms such as Facebook and Twitter, leaving a recorded trace of millions of individual interactions. While some have focused on the supposed deficiencies of social media with respect to more traditional communication channels, language in social media features the same rich connections with personal and group identity, style, and social context. However, social media's unique set of linguistic affordances causes social meanings to be expressed in new and perhaps surprising ways. This talk will describe research that builds on large-scale social media corpora using analytic tools from statistical machine learning. I will focus on some of the ways in which social media data allow us to go beyond traditional sociolinguistic methods, but I will also discuss lessons from the sociolinguistics literature that the new generation of "big data" research might do well to heed.

This research includes collaborations with David Bamman, Brendan O'Connor, Tyler Schnoebelen, Noah A. Smith, and Eric P. Xing.

Bio: Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on social media analysis, discourse, and non-verbal communication. Jacob was a Postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.

April 25: Youngjoong Ko: A Wikipedia Based Query Translation and Expansion Technique for Cross-Language Information Retrieval

Translation resources, such as bilingual word lists, parallel and comparable corpora and machine translation systems, are important factors affecting the effectiveness of cross-language information retrieval systems. However, fully developed machine translation systems, and even topic-appropriate parallel and comparable corpora, will sometimes not be available. I will therefore present a CLIR system that uses Wikipedia as a bilingual resource for query translation and expansion. Bilingual word lists, including bilingual pairs, same-language synonymy sets based on redirect pages, and polysemy sets based on disambiguation pages, are first automatically extracted from Wikipedia. The bilingual word lists are then used as a basis for query translation, and a concept link graph is automatically created from Wikipedia link information for use in query expansion using a random walk algorithm. Evaluation results on the NTCIR-5 English-Korean test collection indicate significant improvements over systems that use bilingual word lists extracted from machine-readable dictionaries and overall results that are comparable to those of monolingual retrieval.

Bio: Youngjoong Ko is an associate professor of Computer Engineering at Dong-A University in Korea. He received his PhD in 2003 at Sogang University. His research focuses on Information Retrieval (CLIR, text classification/summarization), Opinion Mining (comparative mining, aspect/sentiment classification), Topic Modeling, and Dialogue Systems (speech-act analysis, dialogue modeling). He is currently a visiting scholar at the CLIP laboratory of UMIACS at the University of Maryland.

April 18: Amit Goyal, Streaming and Sketch Algorithms for Large Data NLP

Streaming and sketch algorithms offer a memory-efficient way for a user of an average desktop (8 GB of memory) to exploit large data. At the same time, these algorithms can easily be implemented in a distributed setting and hence provide a memory- and time-efficient solution for commodity cluster users. The availability of large and rich examples of text data is due to the emergence of the World Wide Web, social media and mobile devices. Such vast data sets have led to leaps in the performance of many language-based tasks: the concept is that simple models trained on big data can outperform more complex models trained on fewer examples. Streaming and sketch algorithms process the large data sets in one pass and represent a large data set with a compact summary, typically much smaller than the full size of the input. However, the memory and time savings come at the expense of approximate solutions; I will nonetheless demonstrate that approximate solutions achieved on large data are comparable to exact solutions on large data, and outperform exact solutions on smaller data.

First, I show the empirical effectiveness of approximate streaming language models on a statistical machine translation task. Second, I demonstrate that a version of the Count-Min sketch accurately solves three large-scale NLP problems using a small bounded memory footprint. Third, I conduct a systematic study and compare many 'sketch' algorithms that approximate the counts of items, with a focus on large-scale NLP tasks. Finally, I will talk about my future work on constructing large nearest-neighbor graphs in a memory- and time-efficient manner.
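For readers unfamiliar with the Count-Min sketch, the following is a minimal illustration of the basic data structure, not the specific variant evaluated in the talk. The class name, the default table sizes, and the use of salted MD5 digests as row hashes are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate counter backed by a fixed depth x width table of integers."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Increment one cell in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


def count_tokens(tokens, width=1000, depth=5):
    """Stream tokens through a sketch in a single pass and return the sketch."""
    sketch = CountMinSketch(width, depth)
    for tok in tokens:
        sketch.add(tok)
    return sketch
```

Memory use depends only on `width * depth`, not on the number of distinct tokens. Estimates never fall below the true count, and they overestimate more often as the table shrinks relative to the stream.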

April 11: Jordan Boyd-Graber, Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large scale topic modeling package in MapReduce (Mr. LDA). As opposed to other techniques which use Gibbs sampling, our proposed framework uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions of the models possible with this scalable framework: informed priors to guide topic discovery and extracting topics from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large scale topic modeling package.

This is a practice talk for WWW 2012.

April 4: Doug Oard, Evaluating E-Discovery Search: The TREC Legal Track

Civil litigation in the USA relies on each side making relevant evidence available to the other, a process known as "discovery." The explosive growth of information in digital form has led to an increasing focus on how search technology can best be applied to balance costs and responsiveness in what has come to be known as "e-discovery". This is now a multi-billion dollar business, one in which new vendors are entering the market frequently, usually with impressive claims about the efficacy of their products or services. Courts, attorneys, and companies are actively looking to understand what should constitute best practice, both in the design of search technology and in how that technology is employed. In this talk I will provide an overview of the e-discovery process, and then I will use that background to motivate a discussion of which aspects of that process the TREC Legal Track is seeking to model. I will then spend most of the talk describing two novel aspects of evaluation design: (1) recall-focused evaluation in large collections, and (2) modeling an interactive process for "responsive review" with fairly high fidelity. Although I will draw on the results of participating teams to illustrate what we have learned, my principal focus will be on discussing what we presently understand to be the strengths and weaknesses of our evaluation designs.

Bio: Douglas Oard is a Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies, where he currently serves as director of the Computational Linguistics and Information Processing lab. He earned his Ph.D. in Electrical Engineering from the University of Maryland. Dr. Oard’s research interests center around the use of emerging technologies to support information seeking by end users. His recent work has focused on interactive techniques for cross-language information retrieval, searching conversational media such as speech and email, evaluation design for e-discovery in the TREC Legal Track, and support for sense-making in large digital archival collections. Additional information is available at

March 28: Noah Smith, Linguistic Structure Prediction with AD3

In this talk, I will present AD3 (Alternating Directions Dual Decomposition), an algorithm for approximate MAP inference in loopy graphical models with discrete random variables, including structured prediction problems. AD3 is simple to implement and well-suited to problems with hard constraints expressed in first-order logic. It often finds the exact MAP solution, giving a certificate when it does; when it doesn’t, it can be embedded within an exact branch and bound technique. I’ll show experimental results on two natural language processing tasks, dependency parsing and frame-semantic parsing. This work was done in collaboration with Andre Martins, Dipanjan Das, Pedro Aguiar, Mario Figueiredo, and Eric Xing.

Bio: Noah Smith is the Finmeccanica Associate Professor of Language Technologies and Machine Learning in the School of Computer Science at Carnegie Mellon University. He received his Ph.D. in Computer Science, as a Hertz Foundation Fellow, from Johns Hopkins University in 2006 and his B.S. in Computer Science and B.A. in Linguistics from the University of Maryland in 2001. His research interests include statistical natural language processing, especially unsupervised methods, machine learning for structured data, and applications of natural language processing. His book, Linguistic Structure Prediction, covers many of these topics. He serves on the editorial board of the journal Computational Linguistics and the Journal of Artificial Intelligence Research and received a best paper award at the ACL 2009 conference. His research group, Noah's ARK, is supported by the NSF (including an NSF CAREER award), DARPA, Qatar NRF, IARPA, ARO, Portugal FCT, and gifts from Google, HP Labs, IBM Research, and Yahoo Research.

March 14: Michael Bloodgood, Stopping Active Learning based on Stabilizing Predictions

Abstract: How do you know when to stop annotating additional training data? The marginal value of additional data can vary a lot. In the context of active learning, where one is trying to optimally select training data so as to minimize annotation cost, I will discuss methods for determining when to stop the annotation process. I'll propose a method based on stabilizing predictions; give empirical results for this approach; and provide an analytical explanation for understanding its behavior.
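The abstract does not spell out the stopping rule, so the following is only one plausible reading of "stabilizing predictions". After each active learning round the current model labels a fixed set of unlabeled examples, and annotation stops once successive rounds agree closely for several rounds in a row. The choice of Cohen's kappa as the agreement measure, the function names, and the default `threshold` and `window` values are assumptions made for this sketch.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from the two label distributions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences consist of the same single label
    return (observed - expected) / (1.0 - expected)


def should_stop(prediction_history, threshold=0.99, window=3):
    """Return True once the last `window` consecutive pairs of rounds
    all agree with kappa at or above `threshold`."""
    if len(prediction_history) < window + 1:
        return False
    recent = prediction_history[-(window + 1):]
    return all(cohen_kappa(prev, cur) >= threshold
               for prev, cur in zip(recent, recent[1:]))
```

Here `prediction_history` holds one list of predicted labels per round, all over the same fixed set of examples. Requiring agreement across several consecutive rounds guards against stopping on a single round where the predictions happened not to change.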

Bio: Michael Bloodgood is a researcher at the Center for Advanced Study of Language.

March 12: George Foster, Instance Weighting meets MIRA: Domain Adaptation in Statistical Machine Translation via Phrase-Level Features

In previous work on SMT domain adaptation (Foster et al, EMNLP 2010), we demonstrated the effectiveness of weighting phrase pairs according to generality and domain relevance. In this talk I will describe recent work that extends this approach in two ways. First, rather than assuming a substantial amount of in-domain material and an undifferentiated out-of-domain corpus, we consider a more general setting with an in-domain development set but otherwise heterogeneous training material. Second, rather than adjusting phrase-pair counts using maximum likelihood in order to circumvent MERT limitations, we incorporate adaptive features directly into the log-linear model, and tune with MIRA. I will give empirical results for this approach, and compare them to strong mixture-model baselines.

Bio: George Foster is a senior researcher at the National Research Council of Canada. He is on the board of AMTA, and the editorial boards of Computational Linguistics and Machine Translation. His research has mainly focused on applications for translation technology, beginning with tools for translators, and evolving as statistical MT has become more viable. His doctoral work led to the TransType project on interactive MT via sentence completion. In 2003 he led a JHU CLSP workshop on confidence estimation for SMT. More recently he has worked on adaptation and discriminative estimation. George is the proud originator of the "desperation-based MT" technique, first deployed by the University of Montreal for the NIST 2003 evaluation, and widely adopted by participants in subsequent evaluations.

March 7: Slav Petrov, Fast, Accurate and Robust Multilingual Syntactic Analysis

To build computer systems that can 'understand' natural language, we need to go beyond bag-of-words models and take the grammatical structure of language into account. Part-of-speech tag sequences and dependency parse trees are one form of such structural analysis that is easy to understand and use. This talk will cover three topics. First, I will present a coarse-to-fine architecture for dependency parsing that uses linear-time vine pruning and structured prediction cascades. The resulting pruned third-order model is twice as fast as an unpruned first-order model and compares favorably to a state-of-the-art transition-based parser in terms of speed and accuracy. I will then present a simple online algorithm for training structured prediction models with extrinsic loss functions. By tuning a parser with a loss function for machine translation reordering, we can show that parsing accuracy matters for downstream application quality, producing improvements of more than 1 BLEU point on an end-to-end machine translation task. Finally, I will present approaches for projecting part-of-speech taggers and syntactic parsers across language boundaries, allowing us to build models for languages with no labeled training data. Our projected models significantly outperform state-of-the-art unsupervised models and constitute a first step towards a universal parser.

This is joint work with Ryan McDonald, Keith Hall, Dipanjan Das, Alexander Rush, Michael Ringgaard and Kuzman Ganchev (a.k.a. the Natural Language Parsing Team at Google).

Bio: Slav Petrov is a Senior Research Scientist in Google's New York office. He works on problems at the intersection of natural language processing and machine learning. He is in particular interested in syntactic parsing and its applications to machine translation and information extraction. He also teaches a class on Statistical Natural Language Processing at New York University every Fall.

Prior to Google, Slav completed his PhD degree at UC Berkeley, where he worked with Dan Klein. He holds a Master's degree from the Free University of Berlin, and also spent a year as an exchange student at Duke University. Slav was a member of the FU-Fighters team that won the RoboCup 2004 world championship in robotic soccer and recently won a best paper award at ACL 2011 for his work on multilingual syntactic analysis.

Slav grew up in Berlin, Germany, but is originally from Sofia, Bulgaria. He therefore considers himself a Berliner from Bulgaria. Whenever Bulgaria plays Germany in soccer, he supports Bulgaria.

February 29: Vlad Eidelman, Unsupervised Textual Analysis with Rich Features

Learning how to properly partition a set of documents into categories in an unsupervised manner is quite challenging, since documents are inherently multidimensional, and a given set of documents can be correctly partitioned along a number of dimensions, depending on the criterion. Since the partition criterion for a supervised model is encoded in the data via the class labels, even the standard information retrieval representation of a document as a vector of term frequencies is sufficient for many state-of-the-art classification models. This representation is especially well suited for the most common application: topic (or thematic) analysis, where term presence is highly indicative of class. Furthermore, for tasks where term presence may not be adequate, such as sentiment or perspective analysis, discriminative models have the ability to incorporate complex features, allowing them to generalize and adapt to the specific domain. In the case where we do not have access to resources for supervised training, we must turn to unsupervised clustering models. Clustering models rely almost exclusively on a simple bag-of-words vector representation, which performs well for topic analysis, but unfortunately, is not guaranteed to perform well for a different task.

In this talk, I will present a feature-enhanced unsupervised model for categorizing textual data. The presented model allows for the integration of arbitrary features of the observations within a document. While in generative models the observed context is usually a single unigram or bigram, our model can robustly expand the context to extract features from a larger block of text. After presenting the model derivation, I will describe the use of complex automatically derived linguistic and statistical features across three practical tasks with different criteria: perspective, sentiment, and topic analysis. I show that by introducing domain-relevant features, we can guide the model towards the task-specific partition we want to learn. For each task, our feature-enhanced model outperforms strong baselines and state-of-the-art models.

Bio: Vladimir Eidelman is a fourth-year Ph.D. student in the Department of Computer Science at the University of Maryland, working primarily with Philip Resnik. He received his B.S. in Computer Science and Philosophy from Columbia University in 2008 and an M.S. in Computer Science from UMD in 2010. His research interests are in machine learning and natural language processing problems, such as machine translation, structured prediction, and unsupervised learning. He is the recipient of the National Science Foundation Graduate Research and National Defense Science and Engineering Graduate Fellowships.

Feb 22: Rebecca Hwa, The Role of Machine Translation in Modeling English as a Second Language (ESL) Writings

Much of the English text on the web is written by people whose native language is not English. Many English as a Second Language (ESL) writers, even those with a high level of proficiency, make common grammatical mistakes. In this talk, I will present some of our recent work on modeling ESL writings. I will first talk about characterizing ESL students’ writing revision process with syntax-driven machine translation models. Then, I will focus on modeling similarities in ESL writers’ preposition selection mistakes using a distance metric learning algorithm.

Bio: Rebecca Hwa is an Associate Professor in the Department of Computer Science at the University of Pittsburgh. Before joining Pitt, she was a postdoc at the University of Maryland. She received her PhD in Computer Science from Harvard University in 2001 and her B.S. in Computer Science and Engineering from UCLA in 1993. Dr. Hwa's primary research interests include multilingual processing, machine translation, and semi-supervised learning methods. She is a recipient of the NSF CAREER Award. Her work has also been supported by NIH and DARPA. Dr. Hwa is a past chair of the executive board of the North American Chapter of the Association for Computational Linguistics.

Feb 15: Lidan Wang, Learning to Efficiently Rank (AVW 3258)

Technological advances have led to increases in the types and amounts of data, and there is a great need for developing methods to manage and find relevant information from such data to satisfy users' information needs. Learning to rank is an emerging discipline at the intersection of machine learning, data mining, and information retrieval. It develops principled machine learning algorithms to construct ranking (i.e., retrieval) models, for finding and ranking relevant information to user queries over large amounts of data. Although learning to rank approaches are capable of learning highly effective ranking functions, they have mostly ignored the important issue of model efficiency (i.e., model speed). Given that efficiency and effectiveness are competing forces that often counteract each other, models that are optimized for effectiveness alone may not meet the strict efficiency requirements when dealing with real-world large-scale datasets.

My Ph.D. thesis introduces the Learning to Efficiently Rank framework for learning large-scale ranking models that facilitate fast and effective retrieval, by exploiting and optimizing the tradeoffs between model complexity (i.e., speed) and accuracy. At a basic level, this framework learns ranking models whose speed and accuracy can be explicitly controlled. I proposed and designed solutions for three problems within this framework: 1) learning large-scale ranking models according to a desired tradeoff between model speed and accuracy; 2) constructing temporally-constrained models capable of returning results under time budgets; 3) breaking through the speed/accuracy tradeoff barrier by developing a novel cascade ranking model, and learning the cascade model structure and parameters with a novel boosting-based learning algorithm. My research extends the conventional effectiveness-centric approach in model learning and takes an efficiency-minded look at building effective retrieval models. Results show that models learned this way significantly outperform traditional machine-learned models in terms of speed without sacrificing result effectiveness. Moreover, the new models work particularly well when users impose stringent time requirements for ranked retrieval on very large datasets.
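The abstract does not describe the cascade model in detail, so the following is only a generic illustration of the cascade idea, not the model or learning algorithm from the thesis. A cheap scorer prunes the candidate list first, and a costlier scorer then reorders the survivors. The function name, the document fields, and the stage sizes are hypothetical.

```python
def cascade_rank(candidates, stages):
    """Pass candidates through (score_fn, keep) stages, pruning after each."""
    survivors = list(candidates)
    for score_fn, keep in stages:
        # Rank the current survivors with this stage's scorer, best first.
        survivors = sorted(survivors, key=score_fn, reverse=True)
        # Keep only the top `keep`, so later (costlier) scorers see fewer items.
        survivors = survivors[:keep]
    return survivors


# Hypothetical documents with a cheap and a costly relevance score.
docs = [
    {"id": "d1", "cheap": 0.9, "costly": 0.2},
    {"id": "d2", "cheap": 0.8, "costly": 0.7},
    {"id": "d3", "cheap": 0.1, "costly": 0.99},
    {"id": "d4", "cheap": 0.6, "costly": 0.5},
]
stages = [
    (lambda d: d["cheap"], 3),   # stage 1: cheap scorer keeps the top 3
    (lambda d: d["costly"], 2),  # stage 2: costly scorer keeps the top 2
]
top = cascade_rank(docs, stages)
```

In this toy run `d3` is pruned by the cheap stage even though the costly scorer would have ranked it first, which is the kind of speed/accuracy tradeoff the abstract refers to.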

Bio: I am a final-year Ph.D. student in the Computer Science Department at the University of Maryland, College Park. My main research interests are in information retrieval, machine learning, and text mining, with an emphasis on efficient and scalable methods for managing, learning, and retrieving information from large-scale data. I have a Master's degree from the Computer Science Department, University of Wisconsin, Madison, and a Bachelor's degree from the Computer Science and Engineering Department, University of Florida. My primary advisor is Prof. Jimmy Lin.

Feb 8: Janyce Wiebe, Subjectivity and Sentiment Analysis: from Words to Discourse (AVW 3258)

There is growing interest in the automatic extraction of opinions, emotions, and sentiments in text (subjectivity analysis) to support natural language processing applications, ranging from mining product reviews and summarization, to automatic question answering and information extraction. In this talk, I will describe work on tasks in subjectivity analysis along a continuum: from subjectivity word sense labeling through discourse-level opinion interpretation and stance recognition.

Bio: Janyce Wiebe is Professor of Computer Science and Director of the Intelligent Systems Program at the University of Pittsburgh. Her research with students and colleagues has been in discourse processing, pragmatics, and word-sense disambiguation. A major concentration of her research is "subjectivity analysis", recognizing and interpreting expressions of opinions and sentiments in text, to support NLP applications such as question answering, information extraction, text categorization, and summarization. Her professional roles have included ACL Program Co-Chair, NAACL Program Chair, NAACL Executive Board member, Computational Linguistics and Language Resources and Evaluation Editorial Board member, AAAI Workshop Co-Chair, ACM Special Interest Group on Artificial Intelligence (SIGART) Vice-Chair, and ACM-SIGART/AAAI Doctoral Consortium Chair.