Difference between revisions of "Events"

Computational Linguistics and Information Processing

== Colloquia ==
+
<center>[[Image:colloq.jpg|center|504px|x]]</center>
  
''Titles and abstracts appear after the calendar.''  Talks are held at 11AM in AV Williams 3258 unless otherwise noted.  All are welcome.  Typically, external speakers have slots for one-on-one meetings with Maryland researchers.  Contact the host if you'd like to have a meeting.
+
== CLIP Colloquium ==
  
To join the cl-colloquium@umiacs.umd.edu list, or for other questions about the colloquium series, e-mail [mailto:jbg@umiacs.umd.edu Jordan Boyd-Graber].
+
The CLIP Colloquium is a weekly speaker series organized and hosted by CLIP Lab. The talks are open to everyone. Most talks are held on Wednesday at 11AM in AV Williams 3258 unless otherwise noted. Typically, external speakers have slots for one-on-one meetings with Maryland researchers.
  
=== Google Calendar for CLIP Speakers===
+
To join the clip-talks@umiacs.umd.edu list, or for other questions about the colloquium series, e-mail [mailto:oard@umiacs.umd.edu Doug Oard], the current organizer.
  
{{#widget:Google Calendar
+
For up-to-date information, see the [https://talks.cs.umd.edu/lists/7 UMD CS Talks page].  (You can also subscribe to the calendar there.)
|id=lqah25nfftkqi2msv25trab8pk@group.calendar.google.com
 
|color=B1440E
 
|title=CLIP Events
 
|view=AGENDA
 
|height=300
 
}}
 
  
== Summer 2012 Speakers ==
+
=== Colloquium Recordings ===
 +
* [[Colloqium Recording (Fall 2020)|Fall 2020]]
 +
* [[Colloqium Recording (Spring 2021)|Spring 2021]]
  
=== June 14: Practice Talks ===
+
=== Previous Talks ===
 +
* [https://talks.cs.umd.edu/lists/7?range=past Past talks, 2013 - present]
 +
* [[CLIP Colloquium (Spring 2012)|Spring 2012]]  [[CLIP Colloquium (Fall 2011)|Fall 2011]]  [[CLIP Colloquium (Spring 2011)|Spring 2011]]  [[CLIP Colloquium (Fall 2010)|Fall 2010]]
  
'''NOTE UNUSUAL TIME: 1PM (Same place)'''
+
== CLIP NEWS  ==
  
* Jordan (15): Topic Models for Dynamic Translation Model Adaptation
+
* News about CLIP researchers on the UMIACS website [http://www.umiacs.umd.edu/about-us/news]
* An (20): SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations
+
* Please follow us on Twitter @umdclip [https://twitter.com/umdclip?lang=en]
* Yuening (10): Efficient Tree-Based Topic Modeling
 
* Jags (20): Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
 
* Ke (15): Modeling Images using Transformed Indian Buffet Processes
 
* Michael (30): Thesis defense practice talk
 
* Jordan (20): Besting the Quiz Master: Crowdsourcing Incremental Classification Games
 
 
 
== Spring 2012 Speakers ==
 
 
 
=== May 16: Dan Goldwasser: Towards Natural Instructions based Machine Learning ===
 
 
 
'''NOTE UNUSUAL TIME: 3PM (Same place)'''
 
 
 
Machine learning is traditionally formalized and researched as the study of learning concepts and decision functions from labeled examples. We are interested in providing an alternative way of communicating knowledge to an automated system, by allowing a human teacher to interact with an automated learner using natural language instructions, thus allowing the teacher to communicate the relevant domain expertise to the learner without necessarily knowing anything about the internal representations used in the learning process. The process of learning a decision function is therefore viewed as a natural language lesson interpretation problem instead of learning from labeled examples. The lesson interpretation problem, framed as a structure prediction problem, is typically approached using supervised machine learning techniques which are often as costly and difficult as learning the original decision function.
 
 
 
In this talk I will discuss how to approach this learning problem without direct supervision, and present learning protocols which rely on indirect supervision originating from evaluating the learner's performance on the final concept taught. This learning scenario, which relies on the connection between the two learning tasks, can be generally applied to other learning problems in natural language processing. I will also discuss how to model this problem more broadly and show results in several natural language processing domains, such as textual entailment, transliteration and semantic parsing.
 
 
 
=== May 9: Planning for Fall ===
 
 
 
We'll discuss the future of the CLIP colloquium.  On the agenda:
 
* Whom to invite
 
* How to run the colloquium
 
* When to have the colloquium
 
 
 
=== May 2: Jacob Eisenstein: Searching for social meanings in social media  ===
 
 
 
Social interaction is increasingly conducted through online
 
platforms such as Facebook and Twitter, leaving a recorded trace of
 
millions of individual interactions. While some have focused on the
 
supposed deficiencies of social media with respect to more traditional
 
communication channels, language in social media features the same
 
rich connections with personal and group identity, style, and social
 
context. However, social media's unique set of linguistic affordances
 
causes social meanings to be expressed in new and perhaps surprising
 
ways. This talk will describe research that builds on large-scale
 
social media corpora using analytic tools from statistical machine
 
learning. I will focus on some of the ways in which social media data
 
allow us to go beyond traditional sociolinguistic methods, but I will
 
also discuss lessons from the sociolinguistics literature that the
 
new generation of "big data" research might do well to heed.
 
 
 
This research includes collaborations with David Bamman, Brendan
 
O'Connor, Tyler Schnoebelen, Noah A. Smith, and Eric P. Xing.
 
 
 
Bio: Jacob Eisenstein is an Assistant Professor in the School of
 
Interactive Computing at Georgia Tech. He works on statistical natural
 
language processing, focusing on social media analysis, discourse, and
 
non-verbal communication. Jacob was a postdoctoral researcher at
 
Carnegie Mellon and the University of Illinois. He completed his
 
Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation
 
award.
 
 
 
=== April 25: Youngjoong Ko: A Wikipedia Based Query Translation and Expansion Technique for Cross-Language Information Retrieval  ===
 
Translation resources, such as bilingual word lists, parallel and
 
comparable corpora and machine translation systems, are important
 
factors affecting the effectiveness of cross-language information
 
retrieval systems. However, fully developed machine translation
 
systems, and even topic-appropriate parallel and comparable corpora,
 
will sometimes not be available. In this talk, I will therefore present a CLIR system that uses Wikipedia as a bilingual resource for query
 
translation and expansion. Bilingual word lists including bilingual pairs, same-language synonymy sets based on redirect pages, and
 
polysemy sets based on disambiguation pages, are first automatically
 
extracted from Wikipedia. The bilingual word lists are then used as a
 
basis for query translation, and a concept link graph is automatically
 
created from Wikipedia link information for use in query expansion
 
using a random walk algorithm. Evaluation results on the NTCIR-5
 
English-Korean test collection indicate significant improvements over
 
systems that use bilingual word lists extracted from machine-readable
 
dictionaries and overall results that are comparable to those of
 
monolingual retrieval.
 
 
 
Bio: Youngjoong Ko is an associate professor of Computer Engineering at Dong-A University in Korea. He received his PhD in 2003 at Sogang University. His research focuses on Information Retrieval (CLIR, text classification/summarization), Opinion Mining (comparative mining, aspect/sentiment classification), Topic Modeling, and Dialogue Systems (speech-act analysis, dialogue modeling). He is currently a visiting scholar at the CLIP laboratory of UMIACS at the University of Maryland.
 
 
 
=== April 18: Amit Goyal, Streaming and Sketch Algorithms for Large Data NLP ===
 
 
 
Streaming and sketch algorithms give the user of an average desktop
 
(8 GB of memory) a memory-efficient way to exploit large data. At the
 
same time, these algorithms can easily be implemented in a distributed
 
setting and hence provide a memory- and time-efficient solution for
 
commodity cluster users. The availability of large and rich examples
 
of text data is due to the emergence of the World Wide Web, social
 
media and mobile devices. Such vast data sets have led to leaps in the
 
performance of many language-based tasks: the concept is that simple
 
models trained on big data can outperform more complex models with
 
fewer examples.  Streaming and sketch algorithms process the large
 
data sets in one pass and represent a large data set with a compact
 
summary, typically much smaller than the full size of the input.
 
However, the memory and time savings come at the expense of
 
approximate solutions, though I will demonstrate that approximate
 
solutions achieved on large data are comparable to exact solutions on
 
large data, and outperform exact solutions on smaller data.
 
 
 
First, I show the empirical effectiveness of approximate streaming
 
language models on a Statistical Machine Translation task. Second, I
 
demonstrate that a version of the Count-Min sketch accurately solves three
 
large-scale NLP problems using a small bounded memory footprint. Third,
 
I conduct a systematic study and compare many 'sketch' algorithms that
 
approximate counts of items, with a focus on large-scale NLP tasks.
 
Finally, I will talk about my future work on constructing large
 
nearest-neighbor graphs in a memory- and time-efficient manner.
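
For readers unfamiliar with the Count-Min sketch mentioned in the abstract, the following is a minimal textbook-style sketch in Python, not the speaker's implementation. The class name, the default width and depth, and the choice of salted BLAKE2b as the hash family are all assumptions made for illustration. It shows the two properties the abstract relies on: the data is processed in one pass, and the counts are kept in a fixed-size table whose estimates can overcount but never undercount.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical helper, not from the talk)."""

    def __init__(self, width=1024, depth=4):
        # Memory use is fixed at width * depth counters, regardless of input size.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Single-pass update: increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over rows
        # is an upper bound on the true count that is as tight as possible.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for token in "the cat sat on the mat the end".split():
    sketch.add(token)
print(sketch.estimate("the"))
```

Widening the table reduces the chance of collisions, and adding rows reduces the chance that every row collides at once, so both parameters trade memory for accuracy.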
 
 
 
=== April 11: Jordan Boyd-Graber, Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce ===
 
 
 
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for
 
exploring document collections.  Because of the increasing prevalence of large
 
datasets, there is a need to improve the scalability of inference for LDA. In
 
this paper, we introduce a novel and flexible large scale topic modeling
 
package in MapReduce (Mr. LDA). As opposed to other techniques which use Gibbs
 
sampling, our proposed framework uses variational inference, which easily fits
 
into a distributed environment. More importantly, this variational
 
implementation, unlike highly tuned and specialized implementations based on
 
Gibbs sampling, is easily extensible. We demonstrate two extensions of the
 
models possible with this scalable framework: informed priors to guide topic
 
discovery and extracting topics from a multilingual corpus. We compare the
 
scalability of Mr. LDA against Mahout, an existing large scale topic modeling
 
package.
 
 
 
This is a practice talk for WWW 2012.
 
 
 
=== April 4: Doug Oard, Evaluating E-Discovery Search: The TREC Legal Track ===
 
 
 
Civil litigation in the USA relies on each side making relevant evidence available
 
to the other, a process known as "discovery." The explosive growth of information in
 
digital form has led to an increasing focus on how search technology can best be applied
 
to balance costs and responsiveness in what has come to be known as "e-discovery".
 
This is now a multi-billion dollar business, one in which new vendors are entering the
 
market frequently, usually with impressive claims about the efficacy of their products
 
or services. Courts, attorneys, and companies are actively looking to understand what
 
should constitute best practice, both in the design of search technology and in how
 
that technology is employed. In this talk I will provide an overview of the e-discovery
 
process, and then I will use that background to motivate a discussion of which aspects
 
of that process the TREC Legal Track is seeking to model. I will then spend most of the
 
talk describing two novel aspects of evaluation design: (1) recall-focused evaluation in
 
large collections, and (2) modeling an interactive process for "responsive review" with
 
fairly high fidelity. Although I will draw on the results of participating teams to illustrate
 
what we have learned, my principal focus will be on discussing what we presently
 
understand to be the strengths and weaknesses of our evaluation designs.
 
 
 
Bio: Douglas Oard is a Professor at the University of Maryland, College Park, with joint
 
appointments in the College of Information Studies and the Institute for Advanced
 
Computer Studies, where he currently serves as director of the Computational Linguistics
 
and Information Processing lab. He earned his Ph.D. in Electrical Engineering from
 
the University of Maryland. Dr. Oard’s research interests center around the use of
 
emerging technologies to support information seeking by end users. His recent work
 
has focused on interactive techniques for cross-language information retrieval, searching
 
conversational media such as speech and email, evaluation design for e-discovery in the
 
TREC Legal Track, and support for sense-making in large digital archival collections.
 
Additional information is available at http://terpconnect.umd.edu/~oard/.
 
 
 
=== March 28: Noah Smith, Linguistic Structure Prediction with AD3 ===
 
 
 
In this talk, I will present AD3 (Alternating Directions Dual Decomposition), an algorithm for approximate MAP inference in loopy graphical models with discrete random variables, including structured prediction problems.  AD3 is simple to implement and well-suited to problems with hard constraints expressed in first-order logic.  It often finds the exact MAP solution, giving a certificate when it does; when it doesn’t, it can be embedded within an exact branch and bound technique.  I’ll show experimental results on two natural language processing tasks, dependency parsing and frame-semantic parsing.  This work was done in collaboration with Andre Martins, Dipanjan Das, Pedro Aguiar, Mario Figueiredo, and Eric Xing.
 
 
 
Bio: Noah Smith is the Finmeccanica Associate Professor of Language Technologies and Machine Learning in the School of Computer Science at Carnegie Mellon University. He received his Ph.D. in Computer Science, as a Hertz Foundation Fellow, from Johns Hopkins University in 2006 and his B.S. in Computer Science and B.A. in Linguistics from the University of Maryland in 2001. His research interests include statistical natural language processing, especially unsupervised methods, machine learning for structured data, and applications of natural language processing. His book, Linguistic Structure Prediction, covers many of these topics. He serves on the editorial board of the journal Computational Linguistics and the Journal of Artificial Intelligence Research and received a best paper award at the ACL 2009 conference. His research group, Noah's ARK, is supported by the NSF (including an NSF CAREER award), DARPA, Qatar NRF, IARPA, ARO, Portugal FCT, and gifts from Google, HP Labs, IBM Research, and Yahoo Research.
 
 
 
=== March 14: Michael Bloodgood, Stopping Active Learning based on Stabilizing Predictions ===
 
 
 
Abstract: How do you know when to stop annotating additional training data?
 
The marginal value of additional data can vary a lot.
 
In the context of active learning, where one is trying to optimally select
 
training data so as to minimize annotation cost, I will discuss methods for
 
determining when to stop the annotation process. I'll propose a method based
 
on stabilizing predictions; give empirical results for this approach;
 
and provide an analytical explanation of its behavior.
 
 
 
Bio: Michael Bloodgood is a researcher at the Center for Advanced Study of Language.
 
 
 
=== March 12: George Foster, Instance Weighting meets MIRA: Domain Adaptation in Statistical Machine Translation via Phrase-Level Features ===
 
 
 
In previous work on SMT domain adaptation (Foster et al., EMNLP 2010), we
 
demonstrated the effectiveness of weighting phrase pairs according to
 
generality and domain relevance. In this talk I will describe recent work that
 
extends this approach in two ways. First, rather than assuming a substantial
 
amount of in-domain material and an undifferentiated out-of-domain corpus, we
 
consider a more general setting with an in-domain development set but otherwise
 
heterogeneous training material. Second, rather than adjusting phrase-pair
 
counts using maximum likelihood in order to circumvent MERT limitations, we
 
incorporate adaptive features directly into the log-linear model, and tune with
 
MIRA. I will give empirical results for this approach, and compare them to
 
strong mixture-model baselines.
 
 
 
Bio: George Foster is a senior researcher at the National Research Council of
 
Canada. He is on the board of AMTA, and the editorial boards of Computational
 
Linguistics and Machine Translation. His research has mainly focused on
 
applications for translation technology, beginning with tools for translators,
 
and evolving as statistical MT has become more viable. His doctoral work led to
 
the TransType project on interactive MT via sentence completion. In 2003 he led
 
a JHU CLSP workshop on confidence estimation for SMT. More recently he has
 
worked on adaptation and discriminative estimation. George is the proud
 
originator of the "desperation-based MT" technique, first deployed by the
 
University of Montreal for the NIST 2003 evaluation, and widely adopted by
 
participants in subsequent evaluations.
 
 
 
=== March 7: Slav Petrov, Fast, Accurate and Robust Multilingual Syntactic Analysis ===
 
 
 
To build computer systems that can 'understand' natural language, we
 
need to go beyond bag-of-words models and take the grammatical
 
structure of language into account. Part-of-speech tag sequences and
 
dependency parse trees are one form of such structural analysis that
 
is easy to understand and use.
 
This talk will cover three topics. First, I will present a
 
coarse-to-fine architecture for dependency parsing that uses
 
linear-time vine pruning and structured prediction cascades. The
 
resulting pruned third-order model is twice as fast as an unpruned
 
first-order model and compares favorably to a state-of-the-art
 
transition-based parser in terms of speed and accuracy. I will then
 
present a simple online algorithm for training structured prediction
 
models with extrinsic loss functions. By tuning a parser with a loss
 
function for machine translation reordering, we can show that parsing
 
accuracy matters for downstream application quality, producing
 
improvements of more than 1 BLEU point on an end-to-end machine
 
translation task. Finally, I will present approaches for projecting
 
part-of-speech taggers and syntactic parsers across language
 
boundaries, allowing us to build models for languages with no labeled
 
training data. Our projected models significantly outperform
 
state-of-the-art unsupervised models and constitute a first step
 
towards a universal parser.
 
 
 
This is joint work with Ryan McDonald, Keith Hall, Dipanjan Das,
 
Alexander Rush, Michael Ringgaard and Kuzman Ganchev (a.k.a. the
 
Natural Language Parsing Team at Google).
 
 
 
Bio: Slav Petrov is a Senior Research Scientist in Google's New York
 
office. He works on problems at the intersection of natural language
 
processing and machine learning. He is in particular interested in
 
syntactic parsing and its applications to machine translation and
 
information extraction. He also teaches a class on Statistical Natural
 
Language Processing at New York University every Fall.
 
 
 
Prior to Google, Slav completed his PhD degree at UC Berkeley, where
 
he worked with Dan Klein. He holds a Master's degree from the Free
 
University of Berlin, and also spent a year as an exchange student at
 
Duke University. Slav was a member of the FU-Fighters team that won
 
the RoboCup 2004 world championship in robotic soccer and recently won
 
a best paper award at ACL 2011 for his work on multilingual syntactic
 
analysis.
 
 
 
Slav grew up in Berlin, Germany, but is originally from Sofia,
 
Bulgaria. He therefore considers himself a Berliner from Bulgaria.
 
Whenever Bulgaria plays Germany in soccer, he supports Bulgaria.
 
 
 
=== February 29: Vlad Eidelman, Unsupervised Textual Analysis with Rich Features ===
 
 
 
Learning how to properly partition a set of documents into categories
 
in an unsupervised manner is quite challenging, since documents are
 
inherently multidimensional, and a given set of documents can be
 
correctly partitioned along a number of dimensions, depending
 
on the criterion. Since the partition criterion for a supervised model
 
is encoded in the data via the class labels, even the standard
 
information retrieval representation of a document as a vector of term
 
frequencies is sufficient for many state-of-the-art classification
 
models. This representation is especially well suited for the most
 
common application: topic (or thematic) analysis, where term presence
 
is highly indicative of class. Furthermore, for tasks where term
 
presence may not be adequate, such as sentiment or perspective
 
analysis, discriminative models have the ability to incorporate
 
complex features, allowing them to generalize and adapt to the
 
specific domain. In the case where we do not have access to resources
 
for supervised training, we must turn to unsupervised clustering
 
models. Clustering models rely almost exclusively on a simple
 
bag-of-words vector representation, which performs well for topic
 
analysis, but unfortunately, is not guaranteed to perform well for a
 
different task.
 
 
 
In this talk, I will present a feature-enhanced unsupervised model for
 
categorizing textual data. The presented model allows for the
 
integration of arbitrary features of the observations within a
 
document. While in generative models the observed context is usually a
 
single unigram, or bigram, our model can robustly expand the context
 
to extract features from a block of text of larger size. After
 
presenting the model derivation, I will describe the use of complex
 
automatically derived linguistic and statistical features across three
 
practical tasks with different criteria: perspective, sentiment, and
 
topic analysis. I show that by introducing domain relevant features,
 
we can guide the model towards the task-specific partition we want to
 
learn. For each task, our feature enhanced model outperforms strong
 
baselines and state-of-the-art models.
 
 
 
Bio: Vladimir Eidelman is a fourth-year Ph.D. student in the
 
Department of Computer Science at the University of Maryland, working
 
primarily with Philip Resnik. He received his B.S. in Computer Science
 
and Philosophy from Columbia University in 2008 and an M.S. in Computer
 
Science from UMD in 2010. His research interests are in machine
 
learning and natural language processing problems, such as machine
 
translation, structured prediction, and unsupervised learning. He is
 
the recipient of the National Science Foundation Graduate Research and
 
National Defense Science and Engineering Graduate Fellowships.
 
 
 
=== Feb 22: Rebecca Hwa, The Role of Machine Translation in Modeling English as a Second Language (ESL) Writings ===
 
 
 
Much of the English text on the web is written by people whose native
 
language is not English. Many English as a Second Language (ESL)
 
writers, even those with a high level of proficiency, make common
 
grammatical mistakes. In this talk, I will present some of our recent
 
work on modeling ESL writings. I will first talk about characterizing
 
ESL students’ writing revision process with syntax-driven machine
 
translation models. Then, I will focus on modeling similarities in ESL
 
writers’ preposition selection mistakes using a distance metric
 
learning algorithm.
 
 
 
Bio: Rebecca Hwa is an Associate Professor in the Department of Computer
 
Science at the University of Pittsburgh. Before joining Pitt, she was
 
a postdoc at University of Maryland. She received her PhD in Computer
 
Science from Harvard University in 2001 and her B.S. in Computer
 
Science and Engineering from UCLA in 1993. Dr. Hwa's primary research
 
interests include multilingual processing, machine translation, and
 
semi-supervised learning methods. She is a recipient of the NSF CAREER
 
Award. Her work has also been supported by NIH and DARPA. Dr. Hwa is a
 
past chair of the executive board of the North American Chapter of the
 
Association for Computational Linguistics.
 
 
 
=== Feb 15: Lidan Wang, Learning to Efficiently Rank (AVW 3258) ===
 
 
 
Technological advances have led to increases in the types and amounts of data, and there is a great need for developing methods
 
to manage and find relevant information from such data to satisfy users' information needs. Learning to rank is an emerging
 
discipline at the intersection of machine learning, data mining, and information retrieval. It develops principled machine
 
learning algorithms to construct ranking (i.e., retrieval) models, for finding and ranking relevant information to user queries
 
over large amounts of data. Although learning to rank approaches are capable of learning highly effective ranking functions, they
 
have mostly ignored the important issue of model efficiency (i.e., model speed). Given that efficiency and effectiveness are
 
competing forces that often counteract each other, models that are optimized for effectiveness alone may not meet the strict
 
efficiency requirements when dealing with real-world large-scale datasets.
 
 
 
My Ph.D. thesis introduces the Learning to Efficiently Rank framework for learning large-scale ranking models that facilitate
 
fast and effective retrieval, by exploiting and optimizing the tradeoffs between model complexity (i.e., speed) and accuracy.  At
 
a basic level, this framework learns ranking models whose speed and accuracy can be explicitly controlled.  I proposed and
 
designed solutions for three problems within this framework: 1) learning large-scale ranking models according to a desired
 
tradeoff between model speed and accuracy; 2) constructing temporally-constrained models capable of returning results under time
 
budgets; 3) breaking through the speed/accuracy tradeoff barrier by developing a novel cascade ranking model, and learning the
 
cascade model structure and parameters with a novel boosting-based learning algorithm. My research extends the conventional
 
effectiveness-centric approach in model learning and takes an efficiency-minded look at building effective retrieval models.
 
Results show that models learned this way significantly outperform traditional machine-learned models in terms of speed without
 
sacrificing result effectiveness. Moreover, the new models work particularly well when users impose stringent time requirements
 
for ranked retrieval on very large datasets.
 
 
 
Bio: I am a final-year Ph.D. student in the Computer Science Department at the University of Maryland, College Park. My main research interests are in information retrieval, machine learning, and text mining, with an emphasis on efficient and scalable methods for managing, learning, and retrieving information from large-scale data. I have a Master's degree from the Computer Science Department, University of Wisconsin, Madison, and a Bachelor's degree from the Computer Science and Engineering Department, University of Florida. My primary advisor is Prof. Jimmy Lin.
 
 
 
 
 
=== Feb 8: Janyce Wiebe, Subjectivity and Sentiment Analysis:  from Words to Discourse (AVW 3258) ===
 
 
 
There is growing interest in the automatic extraction of opinions,
 
emotions, and sentiments in text (subjectivity analysis) to support
 
natural language processing applications, ranging from mining product
 
reviews and summarization, to automatic question answering and
 
information extraction. In this talk, I will describe work on tasks in
 
subjectivity analysis along a continuum: from subjectivity word sense
 
labeling through discourse-level opinion interpretation and stance
 
recognition.
 
 
 
Bio: Janyce Wiebe is Professor of Computer Science and Director of the
 
Intelligent Systems Program at the University of Pittsburgh.  Her
 
research with students and colleagues has been in discourse
 
processing, pragmatics, and word-sense disambiguation.  A major
 
concentration of her research is "subjectivity analysis", recognizing
 
and interpreting expressions of opinions and sentiments in text, to
 
support NLP applications such as question answering, information
 
extraction, text categorization, and summarization.  Her professional
 
roles have included ACL Program Co-Chair, NAACL Program Chair, NAACL
 
Executive Board member, Computational Linguistics and Language
 
Resources and Evaluation Editorial Board member, AAAI Workshop
 
Co-Chair, ACM Special Interest Group on Artificial Intelligence
 
(SIGART) Vice-Chair, and ACM-SIGART/AAAI Doctoral Consortium Chair.
 
 
 
== Fall 2011 Speakers ==
 
 
 
=== Sept 7, 14: 5-Minute Madness (AVW 2460) ===
 
 
 
See what everybody has been working on and get to know who's in the lab!
 
 
 
=== Sept 21: Youngjoong Ko, Comparison Mining (AVW 2460) ===
 
 
 
Almost every day, people face situations in which they must decide on one thing or another. To make better decisions, they often compare the entities they are interested in. These days, many web search engines help people look for entities of interest. It is clear that getting information from the large amount of web data retrieved by search engines is much better and easier than traditional survey methods. However, it is also clear that directly reading each document is not a perfect solution. If people only have access to a small amount of data, they may get a biased point of view. On the other hand, investigating large amounts of data is a time-consuming job. Therefore, a comparison mining system, which can automatically provide a summary of comparisons between two (or more) entities from a large quantity of web documents, would be very useful in many areas such as marketing.
 
 
 
In this talk, I will describe how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types, and 2) mining comparative entities and predicates. We performed various experiments to find relevant features and learning techniques. As a result, we achieved outstanding performance, sufficient for practical use.
 
 
 
Bio: Youngjoong Ko is an associate professor of Computer Engineering at Dong-A University in Korea. He received his PhD in 2003 at Sogang University. His research focuses on text mining (opinion mining, text classification/summarization), Information Retrieval, and Dialogue Systems (speech-act analysis, dialogue modeling). He is currently a visiting scholar at the CLIP laboratory of UMIACS at the University of Maryland. Homepage: http://web.donga.ac.kr/yjko/
 
 
 
=== Sept 28: No Speaker (Rosh Hashana) ===
 
 
 
=== Oct 5: Dave Uthus, Overcoming Information Overload in Navy Chat (AVW 2328) ===
 
 
 
Abstract: In this talk, I will describe the research we are undertaking at the Naval Research Laboratory which revolves around chat (such as Internet Relay Chat) and the problems it causes in the military domain. Chat has become a primary means for command and control communications in the US Navy. Unfortunately, its popularity has contributed to the classic problem of information overload. For example, Navy watchstanders monitor multiple chat rooms while simultaneously performing their other monitoring duties (e.g., tactical situation screens and radio communications). Some researchers have proposed how automated techniques can help to alleviate these problems, but very little research has addressed them.
 
 
 
I will give an overview of the three primary tasks that are the current focus of our research. The first is urgency detection, which involves detecting important chat messages within a dynamic chat stream. The second is summarization, which involves summarizing chat conversations and temporally summarizing sets of chat messages. The third is human-subject studies, which involves simulating a watchstander environment and testing whether our urgency detection and summarization ideas, along with 3D-audio cueing, can aid a watchstander in conducting their duties.
 
 
 
Short Bio: David Uthus is a National Research Council Postdoctoral Fellow hosted at the Naval Research Laboratory, where he is currently undertaking research focusing on analyzing multiparticipant chat. He received his PhD (2010) and MSc (2006) from the University of Auckland in New Zealand and his BSc (2004) from the University of California, Davis. His research interests include microtext analysis, machine learning, metaheuristics, heuristic search, and sport scheduling.
 
 
 
=== Oct 12: AISTATS Paper Clinic (CLIP Lab) ===
 
 
 
=== Oct 13: Nate Chambers, Learning General Events for Specific Event Extraction (AVW 3258) ===
 
 
 
Abstract: There is a wealth of knowledge about the world encoded in written text. How much of this knowledge is accessible to today's unsupervised learning systems, and in what form? There are two primary views that most systems take on interpreting documents: (1) the document primarily describes specific facts, and (2) the document describes general knowledge about "how the world works" through specific descriptions. These two views are largely separated into two subfields within NLP: Information Extraction, and Knowledge Representations/Induction. Information Extraction is mostly concerned with extracting atomic factoids about the world (e.g., Andrew Luck threw three touchdown passes). Knowledge Induction seeks generalized inferences about the world (e.g., Quarterbacks throw footballs). Although the two operate on similar datasets, most systems focus on only one of the two tasks. This talk will describe my efforts over the past few years to merge the goals of both views, performing unsupervised knowledge induction and information extraction in tandem. I describe a model of event schemas that represents common events and their participants (Knowledge Induction), as well as an algorithm that applies this model to extract specific instances of events from newspaper articles (Information Extraction). I will describe my unique learning approach that relies on coreference resolution to learn event schemas, and then will present the first work that performs template-based IE without labeled datasets or prior knowledge.
 
 
 
If time allows, I will also briefly describe my interests in event ordering and temporal reasoning.
 
 
 
Bio:  Nate Chambers is an Assistant Professor in Computer Science at the US Naval Academy.  He recently graduated with his Ph.D. in CS from Stanford University.  His research interests focus on Natural Language Understanding and Knowledge Acquisition from large amounts of text with minimal human supervision.  Before attending Stanford, he worked as a Research Associate at the Florida Institute for Human and Machine Cognition, focusing on human-computer interfaces, dialogue systems, and knowledge representation.  He received his M.S. in Computer Science from the University of Rochester in 2003, and has published over 20 peer-reviewed articles.
 
 
 
=== Oct 17: Michael Collins ===
 
 
 
There has been a long history in combinatorial optimization of methods that exploit structure in complex problems, using methods such as dual decomposition or Lagrangian relaxation. These methods leverage the observation that complex inference problems can often be decomposed into efficiently solvable sub-problems. Thus far, however, these methods are not widely used in NLP.
 
 
 
In this talk I'll describe recent work on inference algorithms for NLP based on Lagrangian relaxation. In the first part of the talk I'll describe work on non-projective parsing. In the second part of the talk I'll describe an exact decoding algorithm for syntax-based statistical translation. If time permits, I'll also briefly describe algorithms for dynamic programming intersections (e.g., the intersection of a PCFG and an HMM), and for phrase-based translation.
 
 
 
For all of the problems that we consider, the resulting algorithms produce exact solutions, with certificates of optimality, on the vast majority of examples; the algorithms are efficient for problems that are either NP-hard (as is the case for non-projective parsing, or for phrase-based translation), or for problems that are solvable in polynomial time using dynamic programming, but where the traditional exact algorithms are far too expensive to be practical.
 
 
 
While the focus of this talk is on NLP problems, there are close connections to inference methods, in particular belief propagation, for graphical models. Our work was inspired by recent work that has used dual decomposition as an alternative to belief propagation in Markov random fields.
 
 
 
This is joint work with Yin-Wen Chang, Tommi Jaakkola, Terry Koo, Sasha Rush, and David Sontag.
 
 
 
 
 
Bio:
 
Michael Collins is the Vikram S. Pandit Professor of Computer Science at Columbia University.  His research interests are in natural language processing and machine learning. He completed a PhD in computer science from the University of Pennsylvania in December 1998. From January 1999 to November 2002 he was a researcher at AT&T Labs-Research, and from January 2003 until December 2010 he was an assistant/associate professor at MIT.  He joined Columbia University in January 2011. Prof. Collins's research has focused on topics including statistical parsing, structured prediction problems in machine learning, and NLP applications including machine translation, dialog systems, and speech recognition. His awards include a Sloan fellowship, an NSF career award, and best paper awards at several conferences: EMNLP (2002 and 2004), UAI (2004 and 2005), CoNLL 2008, and EMNLP 2010.
 
 
 
=== Oct 19: Taesun Moon, Pull your head out of your task: broader context in unsupervised models (AVW 2328) ===
 
 
 
 
 
Abstract: I discuss unsupervised models and how broader context helps in the resolution of unsupervised or distantly supervised approaches. In the first section, I discuss how document boundaries help in two low-level unsupervised tasks that aren't traditionally resolved in terms of documents: unsupervised morphological segmentation/clustering and unsupervised part-of-speech tagging. For unsupervised morphology, I describe an intuitive model that uses document boundaries to strongly constrain how stems may be clustered and segmented with minimal parameter tuning. For unsupervised part-of-speech tagging, I discuss the crouching Dirichlet, hidden Markov model, an unsupervised POS-tagging model which takes advantage of the difference in the statistical variance of content word and function word POS-tags across documents. Next, I discuss a model of inferring probabilistic word meaning as a distribution over potential paraphrases within context. As opposed to many current approaches in lexical semantics which consider a limited subset of words in a sentence to infer meaning in isolation, this model is able to jointly conduct inference over all words in a sentence. Finally, I describe an approach for connecting language and geography that anchors natural language expressions to specific regions of the Earth. The core of the system is a region-topic model, which is used to learn word distributions for each region discussed in a given corpus. This model performs toponym resolution as a by-product, and additionally enables us to characterize a geographic distribution for corpora, individual texts, or even individual words. The last is joint work with Jason Baldridge, Travis Brown, Katrin Erk, and Mike Speriosu at the University of Texas, Austin.
 
 
 
Bio: Taesun Moon received an MA (2009) and PhD (2011) in linguistics from the University of Texas at Austin under the supervision of Katrin Erk. He received a BA (2002) in English literature from Seoul National University in South Korea.
 
 
 
=== Oct 27: Tom Griffiths (3:30 p.m., Bioscience Research Building, 1103) ===
 
 
 
People are remarkably good at acquiring complex knowledge from limited data, as is required in learning causal relationships, categories, or aspects of language. Successfully solving inductive problems of this kind requires having good "inductive biases" -- constraints that guide inductive inference. Viewed abstractly, understanding human learning requires identifying these inductive biases and exploring their origins. I will argue that probabilistic models of cognition provide a framework that can facilitate this project, giving a transparent characterization of the inductive biases of ideal learners. I will outline how probabilistic models are traditionally used to solve this problem, and then present a new approach that uses a mathematical analysis of the effects of cultural transmission as the basis for an experimental method that magnifies the effects of inductive biases.
 
 
 
=== Nov 2: Jason Eisner, A Non-Parametric Bayesian Approach to Inflectional Morphology (AVW 2460) ===
 
 
 
We learn how the words of a language are inflected, given a plain text corpus plus a small supervised set of known paradigms.  The approach is principled, simply performing empirical Bayesian inference under a straightforward generative model that explicitly describes the generation of
 
 
 
1. The grammar and subregularities of the language (via many finite-state transducers coordinated in a Markov Random Field).
 
2. The infinite inventory of types and their inflectional paradigms (via a Dirichlet Process Mixture Model based on the above grammar).
 
3. The corpus of tokens (by sampling inflected words from the above inventory).
 
 
 
Our inference algorithm cleanly integrates several techniques that handle the different levels of the model: classical dynamic programming operations on the finite-state transducers, loopy belief propagation in the Markov Random Field, and MCMC and MCEM for the non-parametric Dirichlet Process Mixture Model.
 
 
 
We will build up the various components of the model in turn, showing experimental results along the way for several intermediate tasks such as lemmatization, transliteration, and inflection.  Finally, we show that modeling paradigms jointly with the Markov Random Field, and learning from unannotated text corpora via the non-parametric model, significantly improves the quality of predicted word inflections.
 
 
 
This is joint work with Markus Dreyer.
 
 
 
Bio: Jason Eisner is Associate Professor of Computer Science at Johns Hopkins University, where he is also affiliated with the Center for Language and Speech Processing, the Cognitive Science Department, and the national Center of Excellence in Human Language Technology. He is particularly interested in designing algorithms that statistically exploit linguistic structure. His 80 or so papers have presented a number of algorithms for parsing and machine translation; algorithms for constructing and training weighted finite-state machines; formalizations, algorithms, theorems and empirical results in computational phonology; and unsupervised or semi-supervised learning methods for domains such as syntax, morphology, and word-sense disambiguation.
 
 
 
=== Nov 3: EACL / WWW Paper Clinic (11AM, CLIP Lab) ===
 
 
 
=== Nov 15: Sergei Nirenburg & Marge McShane, Reference Resolution (10AM, AVW 3258) ===
 
 
 
Most work on reference resolution in natural language processing has been marked by three features: (1) it has concentrated on textual co-reference resolution, which is the linking of text strings with their coreferential, textual antecedents; (2) only a small subset of reference phenomena have been covered – namely, those that are most easily treated by a “corpus annotation + machine learning” development strategy; and (3) the methods used to treat the selected subset do not hold much promise of being extensible to a broader range of more difficult reference phenomena.
 
 
 
Within the theory of Ontological Semantics, we view reference resolution completely differently. For us, resolving reference means linking references of objects and events in a text to their anchors in the fact repository of the system processing the text – or, to use the terminology of intelligent agents, the memory of the agent processing the text. Furthermore, reference relations extend beyond coreference to meronymy, set-member relations, type-instance relations, so-called “bridging” constructions, etc.  The result of reference resolution is the appropriate memory modification of the text processing agent.
 
 
 
In this talk we will briefly introduce OntoSem, our semantically-oriented text processing system and then describe the approach to reference resolution used in OntoSem. We will motivate a semantically oriented approach to reference resolution and show how and why it is currently feasible to develop a new generation of reference resolution engines.
 
 
 
Bio: Dr. Sergei Nirenburg is Professor in the CSEE Department at UMBC and the Director of its Institute for Language and Information Technologies. Before coming to UMBC, Dr. Nirenburg was Director of the Computing Research Laboratory and Professor of Computer Science at New Mexico State University. He received his Ph.D. in Linguistics from the Hebrew University of Jerusalem, Israel, and his M.Sc. in Computational Linguistics from Kharkov State University, USSR. Dr. Nirenburg has written or edited seven books and has published over 130 articles in various areas of computational linguistics and artificial intelligence. Dr. Nirenburg has directed a number of large-scale research and development projects in the areas of natural language processing, knowledge representation, reasoning, knowledge acquisition and cognitive modeling.
 
 
 
Marge McShane is a Research Associate Professor in the Department of Computer Science and Electrical Engineering of UMBC. She received her Ph.D. from Princeton University, with a specialization in linguistics. She works on theoretical and knowledge-oriented aspects of developing language-enabled intelligent agents. She has led several knowledge acquisition and annotation projects, including the development of a general-purpose workbench for developing computationally-tractable descriptions of lesser-studied languages. A special area of Dr. McShane’s interest is reference resolution, particularly its more difficult aspects, such as ellipsis and referential vagueness. She has published two books and over 60 scientific papers.
 
 
 
=== Nov 30: Claire Monteleoni, Clustering Algorithms for Streaming and Online Settings (AVW 2328) ===
 
 
 
ABSTRACT: Clustering techniques are widely used to summarize large quantities of data (e.g. aggregating similar news stories); however, their outputs can be hard to evaluate. While a domain expert could judge the quality of a clustering, having a human in the loop is often impractical. Probabilistic assumptions have been used to analyze clustering algorithms, for example i.i.d. data, or even data generated by a well-separated mixture of Gaussians. Without any distributional assumptions, one can analyze clustering algorithms by formulating some objective function, and proving that a clustering algorithm either optimizes or approximates it. The k-means clustering objective, for Euclidean data, is simple, intuitive, and widely cited; however, it is NP-hard to optimize, and few algorithms approximate it, even in the batch setting (the algorithm known as "k-means" does not have an approximation guarantee). Dasgupta (2008) posed open problems for approximating it on data streams.
 
 
 
In this talk, I will discuss my ongoing work on designing clustering algorithms for streaming and online settings. First I will present a one-pass, streaming clustering algorithm which approximates the k-means objective on finite data streams. This involves analyzing a variant of the k-means++ algorithm, and extending a divide-and-conquer streaming clustering algorithm from the k-medoid objective. Then I will turn to endless data streams, and introduce a family of algorithms for online clustering with experts. We extend algorithms for online learning with experts, to the unsupervised setting, using intermediate k-means costs, instead of prediction errors, to re-weight experts. When the experts are instantiated as k-means approximate (batch) clustering algorithms run on a sliding window of the data stream, we provide novel online approximation bounds that combine regret bounds extended from supervised online learning, with k-means approximation guarantees. Notably, the resulting bounds are with respect to the optimal k-means cost on the entire data stream seen so far, even though the algorithm is online. I will also present encouraging experimental results.
 
 
 
This talk is based on joint work with Nir Ailon, Ragesh Jaiswal, and Anna Choromanska.
 
 
 
BIO: Claire Monteleoni is an assistant professor of Computer Science at George Washington University, and adjunct research faculty at the Center for Computational Learning Systems at Columbia University, where she was previously research faculty. She did a postdoc in Computer Science and Engineering at the University of California, San Diego, and completed her PhD and Masters in Computer Science, at MIT. Her research focus is on machine learning algorithms and theory for problems including learning from data streams, learning from raw (unlabeled) data, learning from private data, and Climate Informatics: accelerating discovery in Climate Science with machine learning. Her papers have received several awards, and she currently serves on the Senior Program Committee of the International Conference on Machine Learning, and the Editorial Board of the Machine Learning Journal.
 
 
 
 
 
=== Dec. 7: Bill Rand: Authority, Trust and Influence: The Complex Network of Social Media (AVW 2328) ===
 
 
 
The dramatic feature of social media is that it gives everyone a voice; anyone can speak out and express their opinion to a crowd of followers with little or no cost or effort, which creates a loud and potentially overwhelming marketplace of ideas.  Given this egalitarian competition, how do users of social media identify authorities in this crowded space?  Who do they trust to provide them with the information and the recommendations that they want? Which tastemakers have the greatest influence on social media users?    Using agent-based modeling, machine learning and network analysis we begin to examine and shed light on these questions and develop a deeper understanding of the complex system of social media.
 
 
 
Bio: Bill Rand received his doctorate in Computer Science from the University of Michigan in 2005, where he worked on the application of evolutionary computation techniques to dynamic environments, and was a regular member of the Center for the Study of Complex Systems, where he built a large-scale agent-based model of suburban sprawl. Before coming to Maryland, he was awarded a postdoctoral research fellowship at Northwestern University in the Northwestern Institute on Complex Systems (NICO), where he worked with the NetLogo development team studying agent-based modeling, evolutionary computation and network science.
 
 
 
== Spring 2011 Speakers ==
 
 
 
 
 
 
 
=== May 11, Dave Blei: Scalable Topic Modeling ===
 
 
 
Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents.  Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes.  This analysis can be used for corpus exploration, document search, and a variety of prediction problems.
 
 
 
In this talk, I will review the state-of-the-art in probabilistic topic models.  I will describe the basic ideas behind latent Dirichlet allocation, and discuss a few of the recent topic modeling algorithms that we have developed in my research group.
 
 
 
I will then describe an online strategy for fitting topic models.  This approach lets us analyze massive document collections and document collections arriving in a stream.  Specifically, we use variational inference to approximate the posterior of the topic model, and we develop a stochastic optimization algorithm for the corresponding objective function.  I will describe online algorithms for finite dimensional topic models and for the Bayesian nonparametric variant based on the hierarchical Dirichlet process.
 
 
 
Our algorithms can fit models to millions of articles in a matter of hours, and I will present a study of 3.3M articles from Wikipedia.  These results show that the online approach finds topic models that are as good as or better than those found with traditional inference algorithms.
 
 
 
Bio: David Blei is an assistant professor of Computer Science at Princeton University.  He received his PhD in 2004 at U.C. Berkeley and was a postdoctoral fellow at Carnegie Mellon University.  His research focuses on probabilistic models, Bayesian nonparametric methods, and approximate posterior inference.  He works on a variety of applications, including text, images, music, social networks, and scientific data.
 
 
 
=== May 4, Sinead Williamson: Nonparametric Bayesian models for dependent data ===
 
 
 
A priori assumptions about the number of parameters required to model our data are often unrealistic. Bayesian nonparametric models circumvent this problem by assigning prior mass to a countably infinite set of parameters, only a finite (but random) number of which will contribute to a given data set. Over recent years, a number of authors have presented dependent nonparametric models -- distributions over collections of random measures associated with values in some covariate space. While the properties of these random measures are allowed to vary across the covariate space, the marginal distribution at each covariate value is given by a known nonparametric distribution. Such distributions are useful for modelling data that vary with some covariate: in image segmentation, proximal pixels are likely to be assigned to the same segment; in modelling documents, topics are likely to increase and decrease in popularity over time.
 
 
 
Most dependent nonparametric models in the literature have Dirichlet process-distributed marginals. While the Dirichlet process is undeniably the most commonly used discrete nonparametric Bayesian prior, this ignores a wide range of interesting models. In my PhD, I have focused on dependent nonparametric models beyond the Dirichlet process -- in particular, on dependent nonparametric models based on the Indian buffet process, a distribution over binary matrices with an infinite number of columns. In this talk, I will give a general introduction to dependent nonparametric models, and describe some of the work I have done in this area.
 
 
 
Bio: Sinead Williamson is a PhD student working with Zoubin Ghahramani at the University of Cambridge, UK. Her main research interests are dependent nonparametric processes and nonparametric latent variable models. She will be visiting the University of Maryland for six months before starting a post doc at Carnegie Mellon University in the Fall.
 
 
 
=== April 27, Michele Gelfand ===
 
 
 
In this presentation, I will describe a perspective on metaphor and negotiation that can help to understand, predict, and manage cultural differences in negotiation. The metaphor approach has its roots in linguistics, cognitive science, and cultural psychology. Metaphors are conceptual systems in which different domains of experience are put into the same category so that knowledge from one domain can be used to make sense of the other. Although they have traditionally been conceived of as linguistic devices, metaphors are a basic mechanism through which humans conceptualize experience (Gibbs, 1990; Lakoff, 1987). In the context of negotiation, metaphors serve a number of critical functions. First, they function to create negotiators’ subjective intentional realities (Bruner, 1980; Miller, 1997), guiding both thought and action in negotiation. Specifically, metaphors provide a basis for answering the question, "What kind of situation is this?" Is it a battle? A game? A dance? A family gathering? A seduction? A visit to the dentist? I will show how metaphoric mappings provide information about what the task is about and dictate specific entailments or scripts that are derived from their source domains. A "Negotiation as Sports" metaphor, for example, suggests a very different task and scripts as compared to a "Negotiation as Dental Work" or a "Negotiation as Marriage" metaphor. Second, shared metaphors function to organize social action in negotiation (Weick, 1979). Through ongoing communicative exchange, negotiators who develop a shared metaphor for negotiation will come to inhabit the same intentional world, will be more organized and "in-sync" in their interactions (Blount & Janicik, 2003), and will be in a better position to negotiate effectively. I also discuss how metaphors for negotiation are selectively developed, activated, and perpetuated through participation in cultural institutions, helping to explain cross-cultural variation in negotiation dynamics, and problems that arise in intercultural negotiations. In the presentation, I will present a number of recent empirical studies, from the lab and the field, and with samples from a number of countries, which provide support for aspects of the theory. I will conclude with a discussion of the role of metaphor in helping create shared reality in intercultural negotiations.
 
 
 
Bio: Dr. Gelfand is a professor in Maryland's psychology department.
 
 
 
=== April 22, Eugene Agichtein: Mining Rich User Interaction Data to Improve Web Search ===
 
 
 
Abstract:
 
Web search engines have advanced greatly over the last decade. In particular, query and click logs have been invaluable to understanding and improving searcher experience. Yet, even the immense logs amassed by the major search engines provide only a narrow glimpse into searcher behavior and goals. I will present novel techniques for acquiring, analyzing, and exploiting a much richer array of searcher interactions including cursor movements, scrolling, and clicks. As a result, we can more accurately infer searcher intent, enabling dramatic improvements for some search tasks. I will also briefly describe a promising medical application of these techniques.
 
 
 
Biosketch:
 
Eugene Agichtein is an Assistant Professor in the Math & CS department at Emory University, where he leads the Intelligent Information Access Lab. Eugene's research centers on Web search and information retrieval, primarily focusing on modeling user interactions in web search and social media to improve access to information on the web. Increasingly, Eugene is collaborating with medical researchers on applications to medical informatics and clinical diagnosis. This work has been supported by NSF, Microsoft Research, HP Labs, Yahoo! Research, and others. More information about Eugene is available at http://www.mathcs.emory.edu/~eugene/.
 
 
 
=== April 20, Lillian Lee: Language as Influence(d) ===
 
 
 
 
 
What effect does language have on people, and what effect do people have on language?
 
 
 
You might say in response, "Who are you to discuss these problems?" and you would be right to do so; these are Major Questions that science has been tackling for many years.  But as a field, I think natural language processing and computational linguistics have much to contribute to the conversation, and I hope to encourage the community to further address these issues.  To this end, I'll describe two efforts I've been involved in.
 
 
 
The first project uncovers previously unexamined contextual biases that people may have when determining which opinions to focus on, using Amazon.com helpfulness votes on reviews as a case study to evaluate competing theories from sociology and social psychology.  The second project considers linguistic style matching between conversational participants, using a novel setting to study factors that affect the degree to which people tend to instantly adapt to each other's conversational styles.
 
 
 
Joint work with Cristian Danescu-Niculescu-Mizil, Jon Kleinberg, and Gueorgi Kossinets.
 
 
 
Bio:
 
 
 
Lillian Lee is a professor of computer science at Cornell University. She is the recipient of the inaugural Best Paper Award at HLT-NAACL 2004 (joint with Regina Barzilay), a citation in "Top Picks: Technology Research Advances of 2004" by Technology Research News (also joint with Regina Barzilay), and an Alfred P. Sloan Research Fellowship, and her group's work has been featured in the New York Times. [http://www.cs.cornell.edu/home/llee Homepage]
 
 
 
=== April 13, Leora Morgenstern: Knowledge Representation in the DARPA Machine Reading Program  ===
 
 
 
The DARPA Machine Reading Program (MRP)  is focused on developing reading systems that serve as a bridge between the informal information found in natural language texts and the powerful AI systems that use formal knowledge. Central to this effort is the integration of knowledge representation and reasoning techniques into standard information retrieval technology.
 
 
 
In this talk, I discuss the knowledge representation components, including the core ontologies and the domain-specific reasoning system, for the MRP reading systems.  I focus on the spatiotemporal reasoning that serves as the cornerstone for the central challenge of Phase 3 of the Machine Reading Program:  building geographical timelines from news reports.
 
 
 
Bio:
 
 
 
Leora Morgenstern is currently PI of the DARPA Machine Reading evaluation and knowledge infrastructure team at SAIC. Prior to joining SAIC, she spent most of her career at the IBM T.J. Watson Research Center, where she combined foundational AI research with developing cutting-edge and highly profitable applications for Fortune-500 companies. She is noted in particular for her contributions in applying her research in semantic networks, nonmonotonic inheritance networks, and business rules for applications in knowledge management, customer relationship management, and decision support.
 
 
 
Dr. Morgenstern is the author of over forty scholarly publications and holds three patents, which have won several IBM awards due to their value to industry. She has served on the editorial boards of JAIR, AMAI, and ETAI. She has edited several special issues of journals, the most recent of which was a volume of Artificial Intelligence (January 2011) dedicated to John McCarthy's leadership in the field of knowledge representation. Together with John McCarthy and Vladimir Lifschitz, she founded the biannual symposium on Logical Formalizations of Commonsense Reasoning, and has served several times as program co-chair of this symposium. She developed and continues to maintain the Commonsense Problem Page, a website devoted to the pursuit of research in formal commonsense knowledge and reasoning.
 
 
 
=== April 11, Giacomo Inches: Investigating the statistical properties of user-generated documents ===
 
 
 
The importance of the Internet as a communication medium is reflected in the large number of documents being generated every day by users of the different services that take place online. We analyzed the properties of some of the established services over the Internet (Kongregate, Twitter, Myspace and Slashdot) and compared them with consolidated collections of standard information retrieval documents (from the Wall Street Journal, Associated Press and Financial Times, as part of the TREC ad-hoc collection). We investigated features such as document similarity, term burstiness, emoticons and Part-Of-Speech analysis, highlighting their similarities and differences.
 
 
 
Giacomo Inches is a Ph.D. student in the Information Retrieval group within the Informatics Faculty at the University of Lugano (Università della Svizzera italiana, USI), Switzerland. His research focuses on short-document analysis, applying IR, text mining and machine learning techniques to user-generated content such as Twitter, chat logs, SMS and police report archives. He is currently working on the SNF ChatMiner project ("Mining of conversational content for topic identification and author identification."). In prior scientific work he investigated image classification and worked in the field of database systems (RIA, web engineering).  Giacomo received his B.Sc. and M.Sc. from the Politecnico di Milano, Italy and holds a Diplom in Informatik from the University of Erlangen-Nuremberg, Germany.
 
 
 
=== April 6, Rachel Pottinger ===
 
 
 
When heterogeneous databases are combined, they typically have different schemas, i.e., descriptions of how the data is stored.  For information to be shared between these databases, there must be some way for differences in representation to be resolved. Combining these heterogeneous sources so that they can be queried uniformly is known as semantic integration.  There are many aspects to semantic integration, ranging from how to create the underlying system that allows queries to be processed to helping the user understand the overwhelming amount of data available.  In this talk, I describe some of the research that my students and I have been doing to increase data utility through semantic integration, particularly when motivated by real world applications.
 
 
 
Bio:
 
 
 
Rachel Pottinger is an assistant professor in Computer Science at the University of British Columbia.  She received her Ph.D. in computer science from the University of Washington in 2004.  Her main research interest is data management, particularly semantic data integration, how to manage metadata (data about data), and how to manage data that is currently not well supported by databases.
 
 
 
=== March 30, Sujith Ravi: Deciphering Natural Language ===
 
 
 
Current research in natural language processing (NLP) relies heavily on supervised techniques, which require labeled training data. But such data does not exist for all languages and domains. Using human annotation to create new resources is not a scalable solution, which raises a key research challenge: How can we circumvent the problem of limited labeled resources for NLP applications?
 
 
 
Interestingly, cryptanalysts and archaeologists have tackled similar challenges in the past for solving decipherment problems. Our work draws inspiration from these successes and we present a novel, unified decipherment-based approach for solving natural language problems without labeled (parallel) data. In this talk, we show how NLP problems can be modeled as decipherment tasks. For example, in statistical language translation one can view the foreign-language text as a cipher for English.
 
 
 
Combining techniques from classical cryptography and statistical NLP, we then develop novel decipherment methods to tackle a wide variety of problems ranging from letter substitution decipherment to sequence labeling tasks (such as part-of-speech tagging) to language translation. We also introduce novel unsupervised algorithms that explicitly search for minimized models during decipherment and outperform existing state-of-the-art systems on several NLP tasks.
 
 
 
Along the way, we show experimental results on several tasks and finally, we demonstrate the first successful attempt at automatic language translation without the use of bilingual resources. Unlike conventional approaches, these decipherment methods can be easily extended to multiple domains and languages (especially resource-poor languages), thereby helping to spread the impact and benefits of NLP research.
 
 
 
 
 
 
 
Bio:
 
 
 
Sujith Ravi is a Ph.D. candidate in Computer Science at the University of Southern California/Information Sciences Institute, working with Kevin Knight. He received his M.S. (2006) degree in Computer Science from USC, and a B.Tech (2004) degree in Computer Science from the National Institute of Technology, Trichy in India. He has also held summer research positions at Google Research and Yahoo Research. His research interests lie in natural language processing, machine learning, computational decipherment and artificial intelligence. His current research focuses on unsupervised and semi-supervised methods with applications in machine translation, transliteration, sequence labeling, large-scale information extraction, syntactic parsing, and information retrieval in discourse. Beyond that, his research experience also includes work on cross-disciplinary areas such as theoretical computer science, computational advertising and computer-aided education. During his graduate student career at USC, he received several awards including an Outstanding Research Assistant Award, an Outstanding Teaching Assistant Award, and an Outstanding Academic Achievement Award.
 
 
 
=== March 16, Mark Liberman: Problems and opportunities in corpus phonetics ===
 
 
 
Techniques developed for speech and language technology can now be
 
applied as research tools in an increasing number of areas, some of
 
them perhaps unexpected: sociolinguistics, psycholinguistics, language
 
teaching, clinical diagnosis and treatment, political science -- and
 
even theoretical phonetics and phonology. Some applications are
 
straightforward, and the short-term prospects for work in this field
 
are excellent, but there are many interesting problems for which
 
satisfactory solutions are not yet available. In contrast to
 
traditional speech-technology application areas, in many of these
 
cases the obvious solutions have not been tried.
 
 
 
Bio (from Wikipedia): Mark has a dual appointment at the University of Pennsylvania, as Trustee Professor of Phonetics in the Department of Linguistics, and as a professor in the Department of Computer and Information Sciences. He is the founder and director of the Linguistic Data Consortium.  His main research interests lie in phonetics, prosody, and other aspects of speech communication.  Liberman is also the founder of (and frequent contributor to) Language Log, a blog with a broad cast of dozens of professional linguists. The concept of the eggcorn was first proposed in one of his posts there.
 
 
 
=== March 9, Asad Sayeed: Finding Target-Relevant Sentiment Words ===
 
 
 
Major indicators of the presence of an opinion and its polarity are the
 
words immediately surrounding a potential opinion "target".  But not all
 
the words near the target are likely to be relevant to finding an
 
opinion.  Furthermore, prior polarity lexica are only of limited value
 
in finding these words given corpora in specialized domains such as the
 
information technology (IT) business press.  There is no ready-made
 
labeled data for this genre and no existing lexica for domain-specific
 
polarity words.
 
 
 
This implementation-level talk describes some work in progress in
 
identifying polarity words in an IT business corpus through
 
crowdsourcing, identifying some of the challenges found in multiple
 
failed attempts.  We found that annotating at a fine-grained level with
 
trained individuals is slow, costly, and unreliable given articles that
 
are sometimes quite long.  In order to crowdsource the task, however,
 
we had to find ways to ask the question that do not require the user to
 
think too hard about exactly what an opinion is and to reduce the
 
propensity to cheat on a difficult question.
 
 
 
We built a CrowdFlower-based interface that uses a drag-and-drop
 
process to classify words in context.  We will demonstrate the interface
 
during the talk and show samples of the results, which we are still in
 
the process of gathering.  We will also show some of the
 
implementation-level challenges of adapting the CrowdFlower interface to
 
a non-standard UI paradigm.
 
 
 
If there is time, we will also discuss one of the ways in which we plan
 
to use the data through a CRF-based model of the syntactic relationship
 
between sentiment words and target mentions which we developed in
 
FACTORIE and Scala.
 
 
 
Bio:

Asad Sayeed is a PhD candidate in computer science and member of the University of Maryland CLIP lab.  He is working on his dissertation in syntactically fine-grained sentiment analysis.
 
 
 
=== March 2, Ned Talley: An Unsupervised View of NIH Grants - Latent Categories and Clusters in an Interactive Format ===
 
 
 
The U.S. National Institutes of Health (NIH) consists of twenty-five Institutes and Centers that award ~80,000 grants each year.  The Institutes have distinct missions and research priorities, but there is substantial overlap in the types of research they support, which creates a funding landscape that can be difficult for researchers and research policy professionals to navigate. We have created a publicly accessible database (https://app.nihmaps.org) in which NIH grants are topic modeled using Latent Dirichlet Allocation, and are clustered using a force-directed algorithm for placing grants as nodes in two-dimensional space, where they can be accessed in an online map-like format.
 
 
 
Ned Talley is an NIH Program Director who manages grants on synaptic transmission, synaptic plasticity, and advanced microscopy and imaging.  For the past two years he has also been focused on NIH grants informatics, in order to address unmet needs at NIH, and to match these needs with burgeoning technologies in artificial intelligence, information retrieval, and information visualization.  He has directed this project through collaborations with investigators from University of Southern California, UC Irvine, Indiana University, and University of Massachusetts.
 
 
 
=== February 16, Ophir Frieder: Humane Computing ===
 
 
 
Humane Computing is the design, development, and implementation of computing systems that directly focus on improving the human condition or experience.  In that light, three efforts are presented, namely, improving foreign name search technology, spam detection algorithms for peer-to-peer file sharing systems, and novel techniques for urinary tract infection treatment.
 
 
 
The first effort is in support of the Yizkor Books project of the Archives Section of the United States Holocaust Memorial Museum.  Yizkor Books are commemorative, firsthand accounts of communities that perished before, during, and after the Holocaust.  Users of such volumes include historians, archivists, educators, and survivors.  Since Yizkor collections are written in 13 different languages, searching them is difficult.  In this effort, novel foreign name search approaches which favorably compare against the state of the art are developed.  By segmenting names, fusing individual results, and filtering via a threshold, our approach statistically significantly improves on traditional Soundex and n-gram based search techniques used in the search of such texts.  Thus, previously unsuccessful searches are now supported.
 
 
 
In the second effort, spam characteristics in peer-to-peer file sharing systems are determined.  Using these characteristics, an approach that does not rely on external information or user feedback is developed.  Cost reduction techniques are employed resulting in a statistically significant reduction of spam.  Thus, the user search experience is improved.
 
 
 
Finally, a novel “self start”, patient-specific approach for the treatment of recurrent urinary tract infections is presented.  Using conventional data mining techniques, an approach that improves patient care, reduces bacterial mutation, and lowers treatment cost is presented.  Thus, the approach provides care that is better in terms of patient comfort, quicker in terms of outbreak duration, and more economical for female patients who suffer from recurrent urinary tract infections.
 
 
 
 
 
Biography
 
Ophir Frieder is the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing and is Chair of the Department of Computer Science at Georgetown University. His research interests focus on scalable information retrieval systems spanning search and retrieval and communications issues.  He is a Fellow of the AAAS, ACM, and IEEE.
 
 
 
=== February 9, Naomi Feldman: Using a developing lexicon to constrain phonetic category acquisition ===
 
 
 
 
 
Variability in the acoustic signal makes speech sound category
 
learning a difficult problem.  Despite this difficulty, human learners
 
are able to acquire phonetic categories at a young age, between six
 
and twelve months.  Learners at this age also show evidence of
 
attending to larger units of speech, particularly in word segmentation
 
tasks.  This work investigates how word-level information can help
 
make the phonetic category learning problem easier.  A hierarchical
 
Bayesian model is constructed that learns to categorize speech sounds
 
and words simultaneously from a corpus of segmented acoustic tokens.
 
No lexical information is given to the model a priori; it is simply
 
allowed to begin learning a set of word types at the same time that it
 
learns to categorize speech sounds.  Simulations compare this model to
 
a purely distributional learner that does not have feedback from a
 
developing lexicon.  Results show that whereas a distributional
 
learner mistakenly merges several sets of overlapping categories, an
 
interactive model successfully disambiguates these categories.  An
 
artificial language learning experiment with human learners
 
demonstrates that people can make use of the type of word-level cues
 
required for interactive learning.  Together, these results suggest
 
that phonetic category learning can be better understood in
 
conjunction with other contemporaneous learning processes and that
 
simultaneous learning of multiple layers of linguistic structure can
 
potentially make the language acquisition problem more tractable.
 
 
 
Bio: Naomi was a graduate student in the Department of Cognitive and Linguistic Sciences at Brown University working with Jim Morgan and Tom Griffiths. She's interested in speech perception and language acquisition, especially the relationship between phonetic category learning, phonological development, and perceptual changes during infancy.  In January 2011, she became an assistant professor in the Department of Linguistics at the University of Maryland.
 
 
 
=== February 2, Ahn Jae-wook: Exploratory user interfaces for personalized information access ===
 
 
 
Personalized information access systems aim to provide tailored information to users according to their various tasks, interests, or contexts.  They have long relied on the ability of algorithms to estimate user interests and generate personalized information.  They observe user behaviors, build mental models of the users, and apply the user model for customizing the information.  This process can be done even without any explicit user intervention.  However, we can add users into the loop of the personalization process, so that the systems can catch user interests even more precisely and the users can flexibly control the behavior of the systems.
 
 
 
In order to exploit the benefits of user interfaces for personalized information access, we have investigated various aspects of exploratory information access systems.  Exploratory information access systems can combine the strengths of algorithms and user interfaces.  Users can learn about and investigate their information needs beyond the simple lookup search strategy.  By adding the idea of exploration to personalized information access, we could devise advanced user interfaces for personalization.  Specifically, we have tried to understand how we could let users learn, manipulate, and control the core component of many personalized systems: user models.  In this presentation, I am going to introduce several ideas about how to present and control user models using different user interfaces.  The example studies include an open/editable user model, tab-based user model and query control, a reference point-based visualization that incorporates the user model and the query spaces, and a named-entity based searching/browsing user interface.  The results and the lessons of the user studies are discussed.
 
 
 
Bio: Jae-wook Ahn defended his Ph.D. dissertation at the School of Information Sciences, University of Pittsburgh in September 2010.  He worked with his Ph.D. mentor Dr. Peter Brusilovsky and Dr. Daqing He.  He is currently a research associate in the Department of Computer Science and the Human Computer Interaction Lab, working with Dr. Ben Shneiderman.
 
 
 
== Fall 2010 Speakers ==
 
 
 
=== October 20, Kristy Hollingshead: Search Errors and Model Errors in Pipeline Systems ===
 
 
 
Pipeline systems, in which data is sequentially processed in stages with the output of one stage providing input to the next, are ubiquitous in the field of natural language processing (NLP) as well as many other research areas. The popularity of the pipeline system architecture may be attributed to the utility of pipelines in improving scalability by reducing search complexity and increasing efficiency of the system. However, pipelines can suffer from the well-known problem of "cascading errors," where errors earlier in the pipeline propagate to later stages in the pipeline. In this talk I will make a distinction between two different types of cascading errors in pipeline systems. The first I will term "search errors," where there exists a higher-scoring candidate (according to the model), but that candidate has been excluded from the search space. The second type of error that I will address might be termed "model errors," where the highest-scoring candidate (according to the model) is not the best candidate (according to some gold standard). Statistical NLP models are imperfect by nature, resulting in model errors. Interestingly, the same pipeline framework that causes search errors can also resolve (or work around) model errors; in this talk I will demonstrate several techniques for detecting and resolving search and model errors, which can result in improved efficiency with no loss in accuracy. I will briefly mention the technique of pipeline iteration, introduced in my ACL'07 paper, and introduce some related results from my dissertation. I will then focus on work done with my PhD advisor Brian Roark on chart cell constraints, as published in our COLING'08 and NAACL'09 papers; this work provably reduces the complexity of a context-free parser to quadratic performance in the worst case (observably linear) with a slight gain in accuracy using the Charniak parser.
While much of this talk will be on parsing pipelines, I am currently extending some of this work to MT pipelines and would welcome discussion along those lines.
 
 
 
Kristy Hollingshead earned her PhD in Computer Science and Engineering this year, from the Center for Spoken Language Understanding (CSLU) at the Oregon Health & Science University (OHSU). She received her B.A. in English-Creative Writing from the University of Colorado in 2000 and her M.S. in Computer Science from OHSU in 2004. Her research interests in natural language processing include parsing, machine translation, evaluation metrics, and assistive technologies. She is also interested in general techniques for improving system efficiency, to allow for richer contextual information to be extracted for use in downstream stages of a pipeline system. Kristy was a National Science Foundation Graduate Research Fellow from 2004-2007.
 
 
 
=== October 27, Stanley Kok: Structure Learning in Markov Logic Networks ===
 
 
 
Statistical learning handles uncertainty in a robust and principled way.
 
Relational learning (also known as inductive logic programming)
 
models domains involving multiple relations. Recent years have seen a
 
surge of interest in the statistical relational learning (SRL) community
 
in combining the two, driven by the realization that many (if not most)
 
applications require both and by the growing maturity of the two fields.
 
 
 
A Markov logic network (MLN) is a statistical relational model that has
 
gained traction within the AI community in recent years because of its
 
robustness to noise and its ability to compactly model complex domains.
 
MLNs combine probability and logic by attaching weights to first-order
 
formulas, and viewing these as templates for features of Markov networks.
 
Learning the structure of an MLN consists of learning both formulas and
 
their weights.
 
 
 
To obtain weighted MLN formulas, we could rely on human experts
 
to specify them. However, this approach is error-prone and requires
 
painstaking knowledge engineering. Further, it will not work on domains
 
where there is no human expert. The ideal solution is to automatically
 
learn MLN structure from data. However, this is a challenging task because
 
of its super-exponential search space. In this talk, we present a series of
 
algorithms that efficiently and accurately learn MLN structure.
 
 
 
=== November 1, Owen Rambow: Relating Language to Cognitive State ===
 
 
 
In the 80s and 90s of the last century, in subdisciplines such as planning,
 
text generation, and dialog systems, there was considerable interest in
 
modeling the cognitive states of interacting autonomous agents.  Theories
 
such as Speech Act Theory (Austin 1962), the belief-desire-intentions model
 
of Bratman (1987), and Rhetorical Structure Theory (Mann and Thompson 1988)
 
together provide a framework in which to link cognitive state with language
 
use.  However, in general natural language processing (NLP), little use was
 
made of such theories, presumably because of the difficulty at the time of
 
some underlying tasks (such as syntactic parsing).  In this talk, I propose
 
that it is time to again think about the explicit modeling of cognitive
 
state for participants in discourse.  In fact, that is the natural way to
 
formulate what NLP is all about.  The perspective of cognitive state can
 
provide a context in which many disparate NLP tasks can be classified and
 
related.  I will present two NLP projects at Columbia which relate to the
 
modeling of cognitive state:
 
 
 
Discourse participants need to model each other's cognitive states, and
 
language makes this possible by providing special morphological, syntactic,
 
and lexical markers.  I present results in automatically determining the
 
degree of belief of a speaker in the propositions in his or her utterance.
 
 
 
Bio: PhD from University of Pennsylvania, 1994, working on German syntax.
 
My office mate was Philip Resnik.  I worked at CoGentex, Inc (a small
 
company) and AT&T Labs -- Research until 2002, and since then at Columbia as
 
a Research Scientist.  My research interests cover both the nuts-and-bolts
 
of languages, specifically syntax, and how language is used in context.
 
 
 
=== November 10, Bob Carpenter: Whence Linguistic Data? ===
 
 
 
The empirical approach to linguistic theory involves collecting
 
data and annotating it according to a coding standard.  The
 
ability of multiple annotators to consistently annotate new
 
data reflects the applicability of the theory.    In this
 
talk, I'll introduce a generative probabilistic model of the
 
annotation process for categorical data.  Given a collection of
 
annotated data, we can infer the true labels of items, the prevalence
 
of some phenomenon (e.g. a given intonation or syntactic alternation),
 
the accuracy and category bias of each annotator, and the codability
 
of the theory as measured by the mean accuracy and bias of annotators
 
and their variability.  Hierarchical model extensions allow us to
 
model item labeling difficulty and take into account annotator
 
background and experience.  I'll demonstrate the efficacy of the
 
approach using expert and non-expert pools of annotators for simple
 
linguistic labeling tasks such as textual inference, morphological
 
tagging, and named-entity extraction.  I'll discuss applications
 
such as monitoring an annotation effort, selecting items with active
 
learning, and generating a probabilistic gold standard for machine
 
learning training and evaluation.
 
 
 
=== November 15, William Webber: Information retrieval effectiveness: measurably going nowhere? ===
 
 
 
Information retrieval works by heuristics; correctness cannot be
 
formally proved, but must be empirically assessed.  Test
 
collections make this evaluation automated and repeatable.
 
Collection-based evaluation has been standard for half a century.
 
The IR community prides itself on the rigour of the
 
experimental tradition that has been built upon this
 
foundation;  it is notoriously difficult to publish in the
 
field without a thorough experimental validation.  No
 
attention, however, has been paid to the question of whether
 
methodological rigour in evaluation has led to verifiable improvement.  In
 
this talk, we present a survey of retrieval results published
 
over the past decade, which fails to find evidence that
 
retrieval effectiveness is in fact improving.  Rather, each
 
experiment's impressive leap forward is preceded by a few
 
careful steps back.
 
 
 
Bio:
 
 
 
William Webber is a Research Associate in the Department of Computer
 
Science and Software Engineering at the University of Melbourne,
 
Australia.  He has recently completed his PhD thesis, "Measurement in
 
Information Retrieval Evaluation", under the supervision of Professors
 
Alistair Moffat and Justin Zobel.
 
 
 
=== December 8, Michael Paul: Summarizing Contrastive Viewpoints in Opinionated Text ===
 
 
 
Performing multi-document summarization of opinionated text has unique
 
challenges because it is important to recognize that the same information
 
may be presented in different ways from different viewpoints. In this talk,
 
we will present a special kind of contrastive summarization approach
 
intended to highlight this phenomenon and to help users digest conflicting
 
opinions. To do this, we introduce a new graph-based algorithm, Comparative
 
LexRank, to score sentences in a summary based on a combination of both
 
representativeness of the collection and comparability between opposing
 
viewpoints. We then address the issue of how to automatically discover and
 
extract viewpoints from unlabeled text, and we experiment with a novel
 
two-dimensional topic model for the task of unsupervised clustering of
 
documents by viewpoint. Finally, we discuss how these two stages can be
 
combined to both automatically extract and summarize viewpoints in an
 
interesting way. Results are presented on two political opinion data sets.
 
 
 
This project was joint work with ChengXiang Zhai and Roxana Girju.
 
 
 
Bio:
 
Michael Paul is a first-year Ph.D. student in Computer Science at the Johns
 
Hopkins University and a member of the Center for Language and Speech
 
Processing. He earned a B.S. from the University of Illinois at
 
Urbana-Champaign in 2009. He is currently a Graduate Research Fellow of the
 
National Science Foundation and a Dean's Fellow of the Whiting School of
 
Engineering.
 
 
 
 
 
=== Roger Levy ===
 
 
 
 
 
 
 
Considering the adversity of the conditions under which linguistic communication takes place in everyday life -- ambiguity of the signal, environmental competition for our attention, speaker error, our limited memory, and so forth -- it is perhaps remarkable that we are as successful at it as we are.  Perhaps the leading explanation of this success is that (a) the linguistic signal is redundant, (b) diverse information sources are generally available that can help us infer something close to the intended message when comprehending an utterance, and (c) we use these diverse information sources very quickly and to the fullest extent possible.  This explanation suggests a theory of language comprehension as a rational, evidential process.  In this talk, I describe recent research on how we can use the tools of computational linguistics to formalize and implement such a theory, and to apply it to a variety of problems in human sentence comprehension, including classic cases of garden-path disambiguation as well as processing difficulty in the absence of structural ambiguity.  In addition, I address a number of phenomena that remain clear puzzles for the rational approach, due to an apparent failure to use information available in a sentence appropriately in global or incremental inferences about the correct interpretation of a sentence.  I argue that the apparent puzzle posed by these phenomena for models of rational sentence comprehension may derive from the failure of existing models to appropriately account for the environmental and cognitive constraints -- in this case, the inherent uncertainty of perceptual input, and humans' ability to compensate for it -- under which comprehension takes place.  I present a new probabilistic model of language comprehension under uncertain input and show that this model leads to solutions to the above puzzles.  I also present behavioral data in support of novel predictions made by the model.
More generally, I suggest that appropriately accounting for environmental and cognitive constraints in probabilistic models can lead to a more nuanced and ultimately more satisfactory picture of key aspects of human cognition.
 
 
 
=== Earl Wagner ===
 
=== Eugene Charniak ===
 
 
 
We present a new syntactic parser that works left-to-right and top
 
down, thus maintaining a fully-connected parse tree for a few
 
alternative parse hypotheses.  All of the commonly used statistical
 
parsers use context-free dynamic programming algorithms and as such
 
work bottom up on the entire sentence.  Thus they only find a complete
 
fully connected parse at the very end.  In contrast, both subjective
 
and experimental evidence show that people understand a sentence
 
word-to-word as they go along, or close to it.  The constraint that
 
the parser keeps one or more fully connected syntactic trees is
 
intended to operationalize this cognitive fact.  Our parser achieves a
 
new best result for top-down parsers of 89.4%, a 20% error reduction
 
over the previous single-parser best result for parsers of this type
 
of 86.8% (Roark01). The improved performance is due to embracing the
 
very large feature set available in exchange for giving up dynamic
 
programming.
 
 
 
 
 
Eugene Charniak is University Professor of Computer Science and
 
Cognitive Science at Brown University and past chair of the Department
 
of Computer Science.  He received his A.B. degree in Physics from
 
University of Chicago, and a Ph.D. from M.I.T. in Computer Science.
 
He has published four books, the most recent being Statistical Language
 
Learning.  He is a Fellow of the American Association for Artificial
 
Intelligence and was previously a Councilor of the organization.  His
 
research has always been in the area of language understanding or
 
technologies which relate to it.  Over the last 20 years he has
 
been interested in statistical techniques for many areas of language
 
processing, including parsing and discourse.
 
 
 
=== Dave Newman ===
 
=== Ray Mooney ===
 
 
 
 
 
 
 
Current systems that learn to process natural language require
 
laboriously constructed human-annotated training data. Ideally, a
 
computer would be able to acquire language like a child by being
 
exposed to linguistic input in the context of a relevant but ambiguous
 
perceptual environment. As a step in this direction, we present a
 
system that learns to sportscast simulated robot soccer games by
 
example. The training data consists of textual human commentaries on
 
Robocup simulation games. A set of possible alternative meanings for
 
each comment is automatically constructed from game event traces. Our
 
previously developed systems for learning to parse and generate
 
natural language (KRISP and WASP) were augmented to learn from this
 
data and then commentate novel games. Using this approach, the system
 
has learned to sportscast in both English and Korean. The system has
 
been evaluated based on its ability to properly match sentences to the
 
events being described, parse sentences into correct meanings, and
 
generate accurate linguistic descriptions of events. Human evaluation
 
was also conducted on the overall quality of the generated sportscasts
 
and compared to human-generated commentaries, demonstrating that its
 
sportscasts are on par with those generated by humans.
 
 
 
Biographical Sketch:
 
 
 
Raymond J. Mooney is a Professor in the Department of Computer
 
Sciences at the University of Texas at Austin. He received his
 
Ph.D. in 1988 from the University of Illinois at Urbana-Champaign. He
 
is an author of over 150 published research papers, primarily in the
 
areas of machine learning and natural language processing. He is the
 
current President of the International Machine Learning Society, was
 
program co-chair for the 2006 AAAI Conference on Artificial
 
Intelligence, general chair of the 2005 Human Language Technology
 
Conference and Conference on Empirical Methods in Natural Language
 
Processing, and co-chair of the 1990 International Conference on
 
Machine Learning. He is a Fellow of the American Association for
 
Artificial Intelligence and recipient of best paper awards from the
 
National Conference on Artificial Intelligence, the SIGKDD
 
International Conference on Knowledge Discovery and Data Mining, the
 
International Conference on Machine Learning, and the Annual Meeting
 
of the Association for Computational Linguistics. His recent research
 
has focused on learning for natural-language processing, connecting
 
language and perception, statistical relational learning, and transfer
 
learning.
 

Revision as of 18:21, 6 June 2021



CLIP NEWS

  • News about CLIP researchers on the UMIACS website
  • Please follow us on Twitter @umdclip