Computational Linguistics and Information Processing

Bayesian Modeling

Faculty Jordan Boyd-Graber, Naomi Feldman, Hal Daumé III, Philip Resnik
Postdocs Taesun Moon
Graduate Students Viet-An Nguyen, Yuening Hu, Ke Zhai

What We Do

Bayesian modeling is a rigorous mathematical formalism that allows us to build systems that reflect our uncertainty about the world. Applied to language, Bayesian models capture the "latent" aspects of communication such as topic, part of speech, syntax, or sentiment. Using posterior inference, we can use these models to discover the latent features that best explain observed language.
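As a rough illustration of posterior inference over a latent variable, the sketch below applies Bayes' rule to a toy two-topic model. The topics, prior, and word probabilities are hypothetical values made up for this example, and the independence assumption is a simplification, not a description of the lab's models.

```python
import math

# Hypothetical toy model: a prior over two latent topics and
# per-topic word probabilities (all values invented for illustration).
PRIOR = {"sports": 0.5, "politics": 0.5}
WORD_PROBS = {
    "sports": {"game": 0.5, "vote": 0.1, "team": 0.4},
    "politics": {"game": 0.1, "vote": 0.6, "team": 0.3},
}


def topic_posterior(words):
    """Return P(topic | words), assuming words are independent given the topic."""
    # Work in log space: log P(topic) + sum of log P(word | topic).
    log_joint = {
        topic: math.log(prior) + sum(math.log(WORD_PROBS[topic][w]) for w in words)
        for topic, prior in PRIOR.items()
    }
    # Normalize with the log-sum-exp trick for numerical stability.
    max_log = max(log_joint.values())
    unnormalized = {t: math.exp(v - max_log) for t, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {t: u / total for t, u in unnormalized.items()}


posterior = topic_posterior(["game", "team"])
print(posterior)
```

Here the observed words "game" and "team" make the latent topic "sports" far more probable than "politics", which is the sense in which the posterior identifies the latent feature that best explains the observation.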

In the CLIP lab, we are interested in

  • building tools that make it easier for people to work with Bayesian models
  • scaling inference for Bayesian models up to Web scale
  • understanding how humans interpret and understand the latent variables in Bayesian models


Machine Translation

Bonnie Dorr interlingual and hybrid machine translation, MT evaluation
Mary Harper multilingual parsing, language modeling
Philip Resnik linguistically informed translation modeling, crowdsourcing and translation
Hal Daumé III domain adaptation for translation; translation with linguistic universals
Postdocs Kristy Hollingshead
Graduate Students Vladimir Eidelman

The CLIP Laboratory's current work in machine translation continues the lab's long tradition of research in this area. Like most of the field, we work within the framework of statistical MT, but with an emphasis on taking appropriate advantage of knowledge-driven or linguistically informed model structures, features, and priors. Some current areas of research include syntactically informed language models, linguistically informed translation model features, the use of unsupervised methods in translation modeling, exploitation of large-scale "cloud computing" methods, and human-machine collaborative translation via crowdsourcing.

Some Representative Publications:

Some Project Pages


Paraphrase

Paraphrase, the ability to express the same meaning in multiple ways, is an active area of research within the NLP community and here in the CLIP Laboratory. Our work in paraphrase includes the use of paraphrase in MT evaluation and parameter estimation, lattice and forest translation, and collaborative translation, as well as research on lexical and phrasal semantic similarity measures, meaning preservation in machine translation and summarization, and large-scale document similarity computation via cloud computing methods.
Bonnie Dorr paraphrasing, summarization, language understanding
Philip Resnik linguistically informed NLP, paraphrasing
Students Olivia Buzek, Yakov Kronrod

Some Project Pages

Some Representative Publications

Text Summarization

Bonnie Dorr evaluation
David Zajic sentence compression, sentence selection
Hal Daumé III summarization of technical documents; sentence compression
Graduate Students

Text Summarization is the creation of a short document to serve as a surrogate for a longer document. The CLIP Laboratory's approach to summarization enhances the extractive method of selecting source document sentences for inclusion in a summary by using sentence compression to enlarge the pool of available sentences, and by combining fluent text with topic terms. Our sentence compression technology has encompassed both statistical and linguistic methodologies. We have developed an extrinsic evaluation measure for summarization, Relevance Prediction, which is grounded in a real-world task using summarized documents. The CLIP Laboratory, in collaboration with BBN, has been a regular participant in NIST's summarization evaluations (Document Understanding Conferences and Text Analysis Conferences), and has contributed summarization components to DARPA Translingual Information Detection, Extraction and Summarization (TIDES), Surprise Language Exercise (SLE), and Global Autonomous Language Exploitation (GALE) programs, and to the iOpener project.

Parsing and Tagging

Mary Harper latent variable parsing, speech
Graduate Students Vladimir Eidelman, Zhongqiang Huang

Parsing involves

Representative Publications and Project Pages:

Computational Social Science

Jordan Boyd-Graber scientific literature analysis, persuasion
Bonnie Dorr sentiment analysis, scientific literature analysis
Jimmy Lin social media
Douglas W. Oard topical relation detection
Louiqa Raschid diffusion, prediction, event detection, recommendation
Philip Resnik sentiment, persuasion
Amy Weinberg sentiment, persuasion
Postdocs Stanley Kok
Graduate Students Asad Sayed, Hassan Sayyadi, Shanchan Wu

Computational social science involves the use of computational methods and models to leverage "the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors". Research in the CLIP Laboratory is at the forefront of this emerging area, and includes sentiment analysis (computational modeling and prediction of opinions, perspective, and other private states), automatic analysis and visualization of the scientific literature, modeling the diffusion of technological innovations, modeling and prediction of social goals and actions such as persuasion, monitoring and prediction (tracking events, predicting new links or articles) and recommendation (personalized recommendations, learning to rank).

Representative Publications and Project Pages:

Information Retrieval

Jimmy Lin
Douglas W. Oard
Postdocs William Webber, Earl Wagner
Graduate Students Mossaab Bagdouri, Sergey Golitsynskiy, Govind Kothari, Levon Mkrtchyan, Ferhan Ture, Lidan Wang, Tan Xu

The goal of information retrieval is to help people find what they are looking for. Information retrieval research in the CLIP lab focuses principally on retrieval based on the language contained in text, in speech, and in document images. We work across a broad range of content types, from tweets to tomes, from talking to texting, and from Cebuano to Chinese. Three perspectives inform our work:

  • we integrate a broad range of computational linguistics techniques,
  • we focus on scalable techniques that can accommodate very large collections, and
  • we sometimes draw the boundaries of our “systems” very broadly to include both the automated tools that we create and the process by which users can best employ those tools.

One example that illustrates these perspectives is our work with “cross-language information retrieval,” in which close coupling of machine translation and information retrieval techniques makes it possible for people to find and use information written in languages that they can neither read nor write. Another example is our work on the design and evaluation of “question answering” systems that can automatically find and present answers to complex questions, which serves as a bridge between our work on information retrieval and summarization.
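To make the coupling of translation and retrieval concrete, here is a minimal sketch of dictionary-based query translation followed by term-count ranking. The bilingual dictionary, the documents, and the function names are all hypothetical stand-ins; a real system would use a full MT component and a proper ranking function.

```python
from collections import Counter

# Hypothetical Spanish-to-English dictionary standing in for an MT component.
TRANSLATIONS = {"gato": ["cat"], "casa": ["house", "home"], "rojo": ["red"]}

# Hypothetical English document collection.
DOCUMENTS = {
    "d1": "the red house on the hill",
    "d2": "a cat sat on the mat",
    "d3": "home is where the cat is",
}


def translate_query(query):
    """Expand each query word into all of its dictionary translations."""
    terms = []
    for word in query.lower().split():
        # Words missing from the dictionary are silently dropped.
        terms.extend(TRANSLATIONS.get(word, []))
    return terms


def search(query):
    """Rank documents by how often the translated query terms occur in them."""
    terms = translate_query(query)
    scores = {}
    for doc_id, text in DOCUMENTS.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


results = search("gato casa")
print(results)
```

Keeping every candidate translation of a query word, rather than committing to one, is one simple way to let the retrieval step absorb translation ambiguity.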

Representative Publications and Project Pages:


Disambiguation

Jordan Boyd-Graber
Judith Klavans
Philip Resnik
Graduate Students Raul David Guerra

Disambiguation is the process of determining the meaning, or sense, of a word in its context. It remains one of the most challenging NLP problems, since discovering word senses involves syntactic, semantic, and pragmatic contextual inference, along with a rich knowledge base on which to base the selection. For example, the "wing" of a theater differs from the "wing" of an airplane, and yet another sense applies to furniture ("wing chair"). Disambiguation can often be based on windows of two or three words, but it usually involves more extensive computation. Techniques for disambiguation range from the use of large-scale thesaural resources (such as WordNet) to purely statistical methods.
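A window-based approach can be sketched with a simplified version of the Lesk algorithm, which picks the sense whose gloss shares the most words with the surrounding context. The sense inventory below is hand-written for this example (it is not taken from WordNet), and the window size and stopword list are assumptions.

```python
# Hypothetical sense inventory for "wing"; glosses are hand-written, not from WordNet.
SENSES = {
    "theater": "side area of a stage hidden from the audience where actors wait",
    "aircraft": "surface of an airplane that produces lift during flight",
    "furniture": "high side panel on the back of an upholstered chair",
}
STOPWORDS = {"a", "an", "the", "of", "on", "in", "that", "from", "where", "during", "to", "and"}


def content_words(words):
    """Lowercase the words and drop stopwords."""
    return {w.lower() for w in words} - STOPWORDS


def disambiguate(tokens, target_index, window=3):
    """Pick the sense whose gloss overlaps most with a window around the target word."""
    start = max(0, target_index - window)
    context = tokens[start:target_index] + tokens[target_index + 1:target_index + 1 + window]
    context_set = content_words(context)
    overlaps = {
        sense: len(context_set & content_words(gloss.split()))
        for sense, gloss in SENSES.items()
    }
    # On a tie (including zero overlap everywhere), the first listed sense wins.
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


sentence = "the actors wait in the wing of the stage".split()
sense, overlaps = disambiguate(sentence, sentence.index("wing"))
print(sense, overlaps)
```

With a window of three words on each side, the context words "wait" and "stage" overlap only with the theater gloss, so that sense is selected. A narrow window keeps the computation small but can miss decisive cues that lie farther away.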

Representative Publications and Project Pages:

Annotation and Sense-making

Judith Klavans
Louiqa Raschid
Graduate Students Hassan Sayyadi Shanchan Wu

Annotation and tagging are ways to enhance knowledge in structured or semi-structured resources. Annotation typically references terms from a controlled vocabulary or ontology and is popular in bibliographic, scientific, or museum collections. Tagging is more common in social media, where it is used to label images and documents and, of course, appears as the now-ubiquitous hashtags in tweets. Sense-making, or discovery, is the process of extracting knowledge from these annotated or tagged resources, and can range from simple counting to data and text mining to graph pattern recognition.
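At the simple-counting end of that range, sense-making over tagged resources might look like the sketch below, which tallies hashtags across a handful of tweets. The tweets and the helper name count_hashtags are hypothetical, invented only for illustration.

```python
import re
from collections import Counter

# Hypothetical tweets, invented for this example.
TWEETS = [
    "Great talk on topic models today #nlp #bayes",
    "Crowdsourced translation results are in #NLP #mt",
    "Reading about graph mining #linkeddata",
]

# A hashtag is taken to be "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(tweets):
    """Count hashtag occurrences across tweets, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(tweet))
    return counts


tag_counts = count_hashtags(TWEETS)
print(tag_counts.most_common(1))
```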

In the CLIP lab, we are interested in the following tasks:

  • Tagging and sense-making
  • Pattern discovery in annotated graph datasets from the biomedical domain
  • Data mining with Linked Data

Representative Publications and Project Pages:

Recent Accomplishments in the last 12 months

Jordan Boyd-Graber
Hal Daumé III
David Doermann
Bonnie Dorr
Jimmy Lin
Doug Oard
Louiqa Raschid
PattArAn: NSF grant and collaboration with plant biologists.
SM3: NSF grant and multiple papers in collaboration with Shanchan Wu and Hassan Sayyadi and Bill Rand.
PAnG: Tool for graph data mining of annotated graph datasets; collaboration with Samir Khuller and multiple papers.
Next Generation Financial Cyberinfrastructure: Workshops in July 2010 and July 2012 sponsored by the NSF and CRA/CCC.
Philip Resnik