Events: Difference between revisions

Computational Linguistics and Information Processing

 
(185 intermediate revisions by 10 users not shown)
Line 1: Line 1:
== Colloquia ==
<center>[[Image:colloq.jpg|center|504px|x]]</center>


''Titles and abstracts appear after the calendar.''  Talks are held at 11AM in AV Williams 3258 unless otherwise noted.  All are welcome.  Typically, external speakers have slots for one-on-one meetings with Maryland researchers.  Contact the host if you'd like to have a meeting.
== CLIP Colloquium ==


If you would like to get on the cl-colloquium@umiacs.umd.edu list, please e-mail [mailto:jessica@cs.umd.edu Jessica Touchard].
The CLIP Colloquium is a weekly speaker series organized and hosted by CLIP Lab. The talks are open to everyone. Most talks are held on Wednesday at 11AM online unless otherwise noted. Typically, external speakers have slots for one-on-one meetings with Maryland researchers.


=== Google Calendar for CLIP Speakers ===
If you would like to get on the clip-talks@umiacs.umd.edu list or for other questions about the colloquium series, e-mail [mailto:rudinger@umd.edu Rachel Rudinger], the current organizer.


{{#widget:Google Calendar
For up-to-date information, see the [https://talks.cs.umd.edu/lists/7 UMD CS Talks page].  (You can also subscribe to the calendar there.)
|id=lqah25nfftkqi2msv25trab8pk@group.calendar.google.com
|color=B1440E
|title=CLIP Events
|view=AGENDA
|height=300
}}


== Spring 2011 Speakers ==
=== Colloquium Recordings ===
* [[Colloqium Recording (Fall 2020)|Fall 2020]]
* [[Colloqium Recording (Spring 2021)|Spring 2021]]
* [[Colloqium Recording (Fall 2021)|Fall 2021]]
* [[Colloqium Recording (Spring 2022)|Spring 2022]]


=== May 11, Dave Blei ===
=== Previous Talks ===
* [https://talks.cs.umd.edu/lists/7?range=past Past talks, 2013 - present]
* [[CLIP Colloquium (Spring 2012)|Spring 2012]]  [[CLIP Colloquium (Fall 2011)|Fall 2011]]  [[CLIP Colloquium (Spring 2011)|Spring 2011]]  [[CLIP Colloquium (Fall 2010)|Fall 2010]]


=== May 4, Sinead Williamson: Nonparametric Bayesian models for dependent data ===
== CLIP NEWS  ==


A  priori assumptions about the number of parameters required to model
* News about CLIP researchers on the UMIACS website [http://www.umiacs.umd.edu/about-us/news]
our data are often unrealistic. Bayesian nonparametric models
* Please follow us on Twitter @ClipUmd[https://twitter.com/ClipUmd?lang=en]
circumvent this problem by assigning prior mass to a countably
infinite set of parameters, only a finite (but random) number of which
will contribute to a given data set. Over recent years, a number of
authors have presented dependent nonparametric models -- distributions
over collections of random measures associated with values in some
covariate space. While the properties of these random measures are
allowed to vary across the covariate space, the marginal distribution
at each covariate value is given by a known nonparametric
distribution. Such distributions are useful for modelling data that
vary with some covariate: in image segmentation, proximal pixels are
likely to be assigned to the same segment; in modelling documents,
topics are likely to increase and decrease in popularity over time.
 
Most dependent nonparametric models in the literature have Dirichlet
process-distributed marginals. While the Dirichlet process is
undeniably the most commonly used discrete nonparametric Bayesian
prior, this ignores a wide range of interesting models. In my PhD, I
have focused on dependent nonparametric models beyond the Dirichlet
process -- in particular, on dependent nonparametric models based on
the Indian buffet process, a distribution over binary matrices with an
infinite number of columns. In this talk, I will give a general
introduction to dependent nonparametric models, and describe some of
the work I have done in this area.
 
Bio: Sinead Williamson is a PhD student working with Zoubin Ghahramani at
the University of Cambridge, UK. Her main research interests are
dependent nonparametric processes and nonparametric latent variable
models. She will be visiting the University of Maryland for six months
before starting a postdoc at Carnegie Mellon University in the Fall.
 
=== April 27, Michele Gelfand ===
 
=== April 22, Eugene Agichtein ===
 
=== April 20, Lillian Lee: Language as Influence(d) ===
 
 
What effect does language have on people, and what effect do people have
on language?
 
You might say in response, "Who are you to discuss these problems?"
and you would be right to do so; these are Major Questions that
science has been tackling for many years.  But as a field, I think
natural language processing and computational linguistics have much to
contribute to the conversation, and I hope to encourage the community
to further address these issues.  To this end, I'll describe two
efforts I've been involved in.
 
The first project uncovers previously unexamined contextual biases
that people may have when determining which opinions to focus on,
using Amazon.com helpfulness votes on reviews as a case study to
evaluate competing theories from sociology and social psychology.  The
second project considers linguistic style matching between
conversational participants, using a novel setting to study factors
that affect the degree to which people tend to instantly adapt to each
other's conversational styles.
 
Joint work with Cristian Danescu-Niculescu-Mizil, Jon Kleinberg, and
Gueorgi Kossinets.
 
Bio:
 
Lillian Lee is a professor of computer science at Cornell
University. She is the recipient of the inaugural Best Paper Award at
HLT-NAACL 2004 (joint with Regina Barzilay), a citation in "Top Picks:
Technology Research Advances of 2004" by Technology Research News
(also joint with Regina Barzilay), and an Alfred P. Sloan Research
Fellowship, and her group's work has been featured in the New York
Times. [http://www.cs.cornell.edu/home/llee Homepage]
 
=== April 13, Leora Morgenstern ===
 
The DARPA Machine Reading Program (MRP)  is focused on developing reading systems that serve as a bridge between the informal information found in natural language texts and the powerful AI systems that use formal knowledge. Central to this effort is the integration of knowledge representation and reasoning techniques into standard information retrieval technology.
 
In this talk, I discuss the knowledge representation components, including the core ontologies and the domain-specific reasoning system, for the MRP reading systems.  I focus on the spatiotemporal reasoning that serves as the cornerstone for the central challenge of Phase 3 of the Machine Reading Program: building geographical timelines from news reports.
 
Bio
 
Leora Morgenstern is currently PI of the DARPA Machine Reading evaluation and knowledge infrastructure team at SAIC. Previous to joining SAIC, she spent most of her career at the IBM T.J. Watson Research Center, where she combined foundational AI research with developing cutting-edge and highly profitable applications for Fortune-500 companies. She is noted in particular for her contributions in applying her research in semantic networks, nonmonotonic inheritance networks, and business rules for applications in knowledge management, customer relationship management, and decision support. 
 
Dr. Morgenstern is the author of over forty scholarly publications and holds three patents, which have won several IBM awards due to their value to industry. She has served on the editorial boards of JAIR, AMAI, and ETAI. She has edited several special issues of journals, the most recent of which was a volume of Artificial Intelligence (January 2011) dedicated to John McCarthy's leadership in the field of knowledge representation. Together with John McCarthy and Vladimir Lifschitz, she founded the biannual symposium on Logical Formalizations of Commonsense Reasoning, and has served several times as program co-chair of this symposium. She developed and continues to maintain the Commonsense Problem Page, a website devoted to the pursuit of research in formal commonsense knowledge and reasoning.
 
=== April 11, Giacomo Inches: Investigating the statistical properties of user-generated documents ===
 
The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of the different services that take place online. We analyzed the properties of some of the established services over the Internet (Kongregate, Twitter, Myspace and Slashdot) and compared them with a consolidated collection of standard information retrieval documents (from the Wall Street Journal, Associated Press and Financial Times, as part of the TREC ad-hoc collection). We investigate features such as document similarity, term burstiness, emoticons and Part-Of-Speech analysis, highlighting their similarities and differences.
 
Giacomo Inches is a Ph.D. student in the Information Retrieval group within the Informatics Faculty at the University of Lugano (Università della Svizzera italiana, USI), Switzerland. His research focuses on the analysis of short user-generated documents, such as Twitter posts, chat logs, SMS messages and police report archives, using IR, text mining and machine learning techniques. He is currently working on the SNF ChatMiner project ("Mining of conversational content for topic identification and author identification."). In prior scientific work he investigated image classification and worked in the field of database systems (RIA, web engineering).  Giacomo received his B.Sc. and M.Sc. from the Politecnico di Milano, Italy, and holds a Diplom in Informatik from the University of Erlangen-Nuremberg, Germany.
 
=== April 6, Rachel Pottinger ===
 
=== March 30, Sujith Ravi: Deciphering Natural Language ===
 
Current research in natural language processing (NLP) relies heavily on supervised techniques, which require labeled training data. But such data does not exist for all languages and domains. Using human annotation to create new resources is not a scalable solution, which raises a key research challenge: How can we circumvent the problem of limited labeled resources for NLP applications?
 
Interestingly, cryptanalysts and archaeologists have tackled similar challenges in the past for solving decipherment problems. Our work draws inspiration from these successes and we present a novel, unified decipherment-based approach for solving natural language problems without labeled (parallel) data. In this talk, we show how NLP problems can be modeled as decipherment tasks. For example, in statistical language translation one can view the foreign-language text as a cipher for English.
 
Combining techniques from classical cryptography and statistical NLP, we then develop novel decipherment methods to tackle a wide variety of problems ranging from letter substitution decipherment to sequence labeling tasks (such as part-of-speech tagging) to language translation. We also introduce novel unsupervised algorithms that explicitly search for minimized models during decipherment and outperform existing state-of-the-art systems on several NLP tasks.
 
Along the way, we show experimental results on several tasks and finally, we demonstrate the first successful attempt at automatic language translation without the use of bilingual resources. Unlike conventional approaches, these decipherment methods can be easily extended to multiple domains and languages (especially resource-poor languages), thereby helping to spread the impact and benefits of NLP research.
 
 
 
Bio:
 
Sujith Ravi is a Ph.D. candidate in Computer Science at the University of Southern California/Information Sciences Institute, working with Kevin Knight. He received his M.S. (2006) degree in Computer Science from USC, and a B.Tech (2004) degree in Computer Science from the National Institute of Technology, Trichy in India. He has also held summer research positions at Google Research and Yahoo Research. His research interests lie in natural language processing, machine learning, computational decipherment and artificial intelligence. His current research focuses on unsupervised and semi-supervised methods with applications in machine translation, transliteration, sequence labeling, large-scale information extraction, syntactic parsing, and information retrieval in discourse. Beyond that, his research experience also includes work on cross-disciplinary areas such as theoretical computer science, computational advertising and computer-aided education. During his graduate student career at USC, he received several awards including an Outstanding Research Assistant Award, an Outstanding Teaching Assistant Award, and an Outstanding Academic Achievement Award.
 
=== March 16, Mark Liberman: Problems and opportunities in corpus phonetics ===
 
Techniques developed for speech and language technology can now be
applied as research tools in an increasing number of areas, some of
them perhaps unexpected: sociolinguistics, psycholinguistics, language
teaching, clinical diagnosis and treatment, political science -- and
even theoretical phonetics and phonology. Some applications are
straightforward, and the short-term prospects for work in this field
are excellent, but there are many interesting problems for which
satisfactory solutions are not yet available. In contrast to
traditional speech-technology application areas, in many of these
cases the obvious solutions have not been tried.
 
Bio (from Wikipedia): Mark has a dual appointment at the University of Pennsylvania, as Trustee Professor of Phonetics in the Department of Linguistics, and as a professor in the Department of Computer and Information Sciences. He is the founder and director of the Linguistic Data Consortium.  His main research interests lie in phonetics, prosody, and other aspects of speech communication.  Liberman is also the founder of (and frequent contributor to) Language Log, a blog with a broad cast of dozens of professional linguists. The concept of the eggcorn was first proposed in one of his posts there.
 
=== March 9, Asad Sayeed: Finding Target-Relevant Sentiment Words ===
 
Major indicators of the presence of an opinion and its polarity are the
words immediately surrounding a potential opinion "target".  But not all
the words near the target are likely to be relevant to finding an
opinion.  Furthermore, prior polarity lexica are only of limited value
in finding these words given corpora in specialized domains such as the
information technology (IT) business press.  There is no ready-made
labeled data for this genre and no existing lexica for domain-specific
polarity words.
 
This implementation-level talk describes some work in progress in
identifying polarity words in an IT business corpus through
crowdsourcing, identifying some of the challenges found in multiple
failed attempts.  We found that annotating at a fine-grained level with
trained individuals is slow, costly, and unreliable given articles that
are sometimes quite long.  In order to crowdsource the task, however,
we had to find ways to ask the question that do not require the user to
think too hard about exactly what an opinion is and to reduce the
propensity to cheat on a difficult question.
 
We built a CrowdFlower-based interface that uses a drag-and-drop
process to classify words in context.  We will demonstrate the interface
during the talk and show samples of the results, which we are still in
the process of gathering.  We will also show some of the
implementation-level challenges of adapting the CrowdFlower interface to
a non-standard UI paradigm.
 
If there is time, we will also discuss one way in which we plan to use
the data: a CRF-based model of the syntactic relationship between
sentiment words and target mentions, which we developed in FACTORIE
and Scala.
 
Bio:
Asad Sayeed is a PhD candidate in computer science and member of the University of Maryland CLIP lab.  He is working on his dissertation in syntactically fine-grained sentiment analysis.
 
=== March 2, Ned Talley: An Unsupervised View of NIH Grants - Latent Categories and Clusters in an Interactive Format ===
 
The U.S. National Institutes of Health (NIH) consists of twenty-five Institutes and Centers that award ~80,000 grants each year.  The Institutes have distinct missions and research priorities, but there is substantial overlap in the types of research they support, which creates a funding landscape that can be difficult for researchers and research policy professionals to navigate. We have created a publicly accessible database (https://app.nihmaps.org) in which NIH grants are topic modeled using Latent Dirichlet Allocation and clustered using a force-directed algorithm that places grants as nodes in two-dimensional space, where they can be accessed in an online map-like format.
 
Ned Talley is an NIH Program Director who manages grants on synaptic transmission, synaptic plasticity, and advanced microscopy and imaging.  For the past two years he has also been focused on NIH grants informatics, in order to address unmet needs at NIH, and to match these needs with burgeoning technologies in artificial intelligence, information retrieval, and information visualization.  He has directed this project through collaborations with investigators from University of Southern California, UC Irvine, Indiana University, and University of Massachusetts.
 
=== February 16, Ophir Frieder: Humane Computing ===
 
Humane Computing is the design, development, and implementation of computing systems that directly focus on improving the human condition or experience.  In that light, three efforts are presented, namely, improving foreign name search technology, spam detection algorithms for peer-to-peer file sharing systems, and novel techniques for urinary tract infection treatment.
 
The first effort is in support of the Yizkor Books project of the Archives Section of the United States Holocaust Memorial Museum.  Yizkor Books are commemorative, firsthand accounts of communities that perished before, during, and after the Holocaust.  Users of such volumes include historians, archivists, educators, and survivors.  Since Yizkor collections are written in 13 different languages, searching them is difficult.  In this effort, novel foreign name search approaches which favorably compare against the state of the art are developed.  By segmenting names, fusing individual results, and filtering via a threshold, our approach statistically significantly improves on traditional Soundex and n-gram based search techniques used in the search of such texts.  Thus, previously unsuccessful searches are now supported.
 
In the second effort, spam characteristics in peer-to-peer file sharing systems are determined.  Using these characteristics, an approach that does not rely on external information or user feedback is developed.  Cost reduction techniques are employed resulting in a statistically significant reduction of spam.  Thus, the user search experience is improved.
 
Finally, a novel “self start”, patient-specific approach for the treatment of recurrent urinary tract infections is presented.  Using conventional data mining techniques, an approach that improves patient care, reduces bacterial mutation, and lowers treatment cost is presented.  Thus, an approach that provides better, in terms of patient comfort, quicker, in terms of outbreak duration, and more economical care for female patients that suffer from recurrent urinary tract infections is described.
 
 
Biography
Ophir Frieder is the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing and is Chair of the Department of Computer Science at Georgetown University. His research interests focus on scalable information retrieval systems spanning search and retrieval and communications issues.  He is a Fellow of the AAAS, ACM, and IEEE.
 
=== February 9, Naomi Feldman: Using a developing lexicon to constrain phonetic category acquisition ===
 
 
Variability in the acoustic signal makes speech sound category
learning a difficult problem.  Despite this difficulty, human learners
are able to acquire phonetic categories at a young age, between six
and twelve months.  Learners at this age also show evidence of
attending to larger units of speech, particularly in word segmentation
tasks.  This work investigates how word-level information can help
make the phonetic category learning problem easier.  A hierarchical
Bayesian model is constructed that learns to categorize speech sounds
and words simultaneously from a corpus of segmented acoustic tokens.
No lexical information is given to the model a priori; it is simply
allowed to begin learning a set of word types at the same time that it
learns to categorize speech sounds.  Simulations compare this model to
a purely distributional learner that does not have feedback from a
developing lexicon.  Results show that whereas a distributional
learner mistakenly merges several sets of overlapping categories, an
interactive model successfully disambiguates these categories.  An
artificial language learning experiment with human learners
demonstrates that people can make use of the type of word-level cues
required for interactive learning.  Together, these results suggest
that phonetic category learning can be better understood in
conjunction with other contemporaneous learning processes and that
simultaneous learning of multiple layers of linguistic structure can
potentially make the language acquisition problem more tractable.
 
Bio: Naomi was a graduate student in the Department of Cognitive and Linguistic Sciences at Brown University working with Jim Morgan and Tom Griffiths. She's interested in speech perception and language acquisition, especially the relationship between phonetic category learning, phonological development, and perceptual changes during infancy.  In January 2011, she became an assistant professor in the Department of Linguistics at the University of Maryland.
 
=== February 2, Ahn Jae-wook: Exploratory user interfaces for personalized information access ===
 
Personalized information access systems aim to provide tailored information to users according to their various tasks, interests, or contexts.  They have long relied on the ability of algorithms to estimate user interests and generate personalized information.  They observe user behaviors, build mental models of the users, and apply the user model for customizing the information.  This process can be done even without any explicit user intervention.  However, we can add users into the loop of the personalization process, so that the systems can catch user interests even more precisely and the users can flexibly control the behavior of the systems.
 
In order to exploit the benefits of user interfaces for personalized information access, we have investigated various aspects of exploratory information access systems.  Exploratory information access systems can combine the strengths of algorithms and user interfaces.  Users can learn and investigate their information need beyond the simple lookup search strategy.  By adding the idea of exploration to personalized information access, we could devise advanced user interfaces for personalization.  Specifically, we have tried to understand how we could let users learn, manipulate, and control the core component of many personalized systems: user models.  In this presentation, I am going to introduce several ideas about how to present and control user models using different user interfaces.  The example studies include an open/editable user model, tab-based user model and query control, a reference point-based visualization that incorporates the user model and the query spaces, and a named-entity based searching/browsing user interface.  The results and the lessons of the user studies are discussed.
 
Bio: Jae-wook Ahn defended his Ph.D. dissertation at the School of Information Sciences, University of Pittsburgh, in September 2010.  He worked with his Ph.D. mentor Dr. Peter Brusilovsky and Dr. Daqing He.  He is currently a research associate in the Department of Computer Science and the Human Computer Interaction Lab, working with Dr. Ben Shneiderman.
 
== Fall 2010 Speakers ==
 
* Roger Levy
* Earl Wagner
* Eugene Charniak
* Dave Newman
* Ray Mooney
 
=== October 20, Kristy Hollingshead: Search Errors and Model Errors in Pipeline Systems ===
 
Pipeline systems, in which data is sequentially processed in stages with the output of one stage providing input to the next, are ubiquitous in the field of natural language processing (NLP) as well as many other research areas. The popularity of the pipeline system architecture may be attributed to the utility of pipelines in improving scalability by reducing search complexity and increasing efficiency of the system. However, pipelines can suffer from the well-known problem of "cascading errors," where errors earlier in the pipeline propagate to later stages in the pipeline. In this talk I will make a distinction between two different types of cascading errors in pipeline systems. The first I will term "search errors," where there exists a higher-scoring candidate (according to the model), but that candidate has been excluded from the search space. The second type of error that I will address might be termed "model errors," where the highest-scoring candidate (according to the model) is not the best candidate (according to some gold standard). Statistical NLP models are imperfect by nature, resulting in model errors. Interestingly, the same pipeline framework that causes search errors can also resolve (or work around) model errors; in this talk I will demonstrate several techniques for detecting and resolving search and model errors, which can result in improved efficiency with no loss in accuracy. I will briefly mention the technique of pipeline iteration, introduced in my ACL'07 paper, and introduce some related results from my dissertation. I will then focus on work done with my PhD advisor Brian Roark on chart cell constraints, as published in our COLING'08 and NAACL'09 papers; this work provably reduces the complexity of a context-free parser to quadratic performance in the worst case (observably linear) with a slight gain in accuracy using the Charniak parser.
While much of this talk will be on parsing pipelines, I am currently extending some of this work to MT pipelines and would welcome discussion along those lines.
 
Kristy Hollingshead earned her PhD in Computer Science and Engineering this year from the Center for Spoken Language Understanding (CSLU) at the Oregon Health & Science University (OHSU). She received her B.A. in English-Creative Writing from the University of Colorado in 2000 and her M.S. in Computer Science from OHSU in 2004. Her research interests in natural language processing include parsing, machine translation, evaluation metrics, and assistive technologies. She is also interested in general techniques for improving system efficiency, to allow for richer contextual information to be extracted for use in downstream stages of a pipeline system. Kristy was a National Science Foundation Graduate Research Fellow from 2004-2007.
 
=== October 27, Stanley Kok: Structure Learning in Markov Logic Networks ===
 
Statistical learning handles uncertainty in a robust and principled way.
Relational learning (also known as inductive logic programming)
models domains involving multiple relations. Recent years have seen a
surge of interest in the statistical relational learning (SRL) community
in combining the two, driven by the realization that many (if not most)
applications require both and by the growing maturity of the two fields.
 
Markov logic networks (MLNs) are statistical relational models that have
gained traction within the AI community in recent years because of their
robustness to noise and their ability to compactly model complex domains.
MLNs combine probability and logic by attaching weights to first-order
formulas, and viewing these as templates for features of Markov networks.
Learning the structure of an MLN consists of learning both formulas and
their weights.
 
To obtain weighted MLN formulas, we could rely on human experts
to specify them. However, this approach is error-prone and requires
painstaking knowledge engineering. Further, it will not work on domains
where there is no human expert. The ideal solution is to automatically
learn MLN structure from data. However, this is a challenging task because
of its super-exponential search space. In this talk, we present a series of
algorithms that efficiently and accurately learn MLN structure.
 
=== November 1, Owen Rambow: Relating Language to Cognitive State ===
 
In the 80s and 90s of the last century, in subdisciplines such as planning,
text generation, and dialog systems, there was considerable interest in
modeling the cognitive states of interacting autonomous agents.  Theories
such as Speech Act Theory (Austin 1962), the belief-desire-intentions model
of Bratman (1987), and Rhetorical Structure Theory (Mann and Thompson 1988)
together provide a framework in which to link cognitive state with language
use.  However, in general natural language processing (NLP), little use was
made of such theories, presumably because of the difficulty at the time of
some underlying tasks (such as syntactic parsing).  In this talk, I propose
that it is time to again think about the explicit modeling of cognitive
state for participants in discourse.  In fact, that is the natural way to
formulate what NLP is all about.  The perspective of cognitive state can
provide a context in which many disparate NLP tasks can be classified and
related.  I will present two NLP projects at Columbia which relate to the
modeling of cognitive state:
 
Discourse participants need to model each other's cognitive states, and
language makes this possible by providing special morphological, syntactic,
and lexical markers.  I present results in automatically determining the
degree of belief of a speaker in the propositions in his or her utterance.
 
Bio: PhD from University of Pennsylvania, 1994, working on German syntax.
My office mate was Philip Resnik.  I worked at CoGentex, Inc (a small
company) and AT&T Labs -- Research until 2002, and since then at Columbia as
a Research Scientist.  My research interests cover both the nuts-and-bolts
of languages, specifically syntax, and how language is used in context.
 
=== November 10, Bob Carpenter: Whence Linguistic Data? ===
 
The empirical approach to linguistic theory involves collecting
data and annotating it according to a coding standard.  The
ability of multiple annotators to consistently annotate new
data reflects the applicability of the theory.    In this
talk, I'll introduce a generative probabilistic model of the
annotation process for categorical data.  Given a collection of
annotated data, we can infer the true labels of items, the prevalence
of some phenomenon (e.g. a given intonation or syntactic alternation),
the accuracy and category bias of each annotator, and the codability
of the theory as measured by the mean accuracy and bias of annotators
and their variability.  Hierarchical model extensions allow us to
model item labeling difficulty and take into account annotator
background and experience.  I'll demonstrate the efficacy of the
approach using expert and non-expert pools of annotators for simple
linguistic labeling tasks such as textual inference, morphological
tagging, and named-entity extraction.  I'll discuss applications
such as monitoring an annotation effort, selecting items with active
learning, and generating a probabilistic gold standard for machine
learning training and evaluation.
 
=== November 15, William Webber: Information retrieval effectiveness: measurably going nowhere? ===
 
Information retrieval works by heuristics; correctness cannot be
formally proved, but must be empirically assessed.  Test
collections make this evaluation automated and repeatable.
Collection-based evaluation has been standard for half a century.
The IR community prides itself on the rigour of the
experimental tradition that has been built upon this
foundation;  it is notoriously difficult to publish in the
field without a thorough experimental validation.  No
attention, however, has been paid to the question of whether
methodological rigour in evaluation has led to verifiable
improvement.  In this talk, we present a survey of retrieval results published
over the past decade, which fails to find evidence that
retrieval effectiveness is in fact improving.  Rather, each
experiment's impressive leap forward is preceded by a few
careful steps back.
 
Bio:
 
William Webber is a Research Associate in the Department of Computer
Science and Software Engineering at the University of Melbourne,
Australia.  He has recently completed his PhD thesis, "Measurement in
Information Retrieval Evaluation", under the supervision of Professors
Alistair Moffat and Justin Zobel.
 
=== December 8, Michael Paul: Summarizing Contrastive Viewpoints in Opinionated Text ===
 
Performing multi-document summarization of opinionated text has unique
challenges because it is important to recognize that the same information
may be presented in different ways from different viewpoints. In this talk,
we will present a special kind of contrastive summarization approach
intended to highlight this phenomenon and to help users digest conflicting
opinions. To do this, we introduce a new graph-based algorithm, Comparative
LexRank, to score sentences in a summary based on a combination of both
representativeness of the collection and comparability between opposing
viewpoints. We then address the issue of how to automatically discover and
extract viewpoints from unlabeled text, and we experiment with a novel
two-dimensional topic model for the task of unsupervised clustering of
documents by viewpoint. Finally, we discuss how these two stages can be
combined to both automatically extract and summarize viewpoints in an
interesting way. Results are presented on two political opinion data sets.
 
This project was joint work with ChengXiang Zhai and Roxana Girju.
 
Bio:
Michael Paul is a first-year Ph.D. student in Computer Science at the Johns
Hopkins University and a member of the Center for Language and Speech
Processing. He earned a B.S. from the University of Illinois at
Urbana-Champaign in 2009. He is currently a Graduate Research Fellow of the
National Science Foundation and a Dean's Fellow of the Whiting School of
Engineering.

Latest revision as of 18:22, 3 November 2023

CLIP Colloquium

The CLIP Colloquium is a weekly speaker series organized and hosted by CLIP Lab. The talks are open to everyone. Most talks are held on Wednesday at 11AM online unless otherwise noted. Typically, external speakers have slots for one-on-one meetings with Maryland researchers.

If you would like to get on the clip-talks@umiacs.umd.edu list or for other questions about the colloquium series, e-mail Rachel Rudinger, the current organizer.

For up-to-date information, see the UMD CS Talks page. (You can also subscribe to the calendar there.)

Colloquium Recordings

Previous Talks

CLIP NEWS

  • News about CLIP researchers on the UMIACS website [1]
  • Please follow us on Twitter @ClipUmd[2]