CLuE PI Meeting 2009: Program
Registration and Breakfast (7:30am - 8:00am)
Morning Keynote Session (8:00am - 10:05am)
Welcome and Kickoff (8:00 - 8:10)
- Jimmy Lin is an Associate Professor in the College of Information Studies (the iSchool), at the University of Maryland, College Park. He directs the recently-formed Cloud Computing Center, an interdisciplinary group which explores the many aspects of cloud computing as it impacts technology, people, and society. Jimmy's research lies at the intersection of natural language processing and information retrieval, with a recent emphasis on large-data issues. He received his Ph.D. from MIT in electrical engineering and computer science.
- Jim French is a Program Director in the Division of Information and Intelligent Systems, Directorate for Computer and Information Sciences and Engineering at the National Science Foundation (NSF). Jim worked with Google and IBM to start the Cluster Exploratory (CLuE) program and currently manages it. He is also part of the Information Integration and Informatics (III) and Data-intensive Computing (DC) programs. Jim is a Research Associate Professor of Computer Science at the University of Virginia presently on a leave of absence while working at the NSF.
Introductions by Google (8:10 - 8:20)
- Maggie Johnson is Director of Education and University Relations for Google. She manages all technical training and leadership development programs for Google engineers and operations staff, as well as Google’s educational outreach efforts. She also manages the university relations area, building strategic partnerships with faculty and labs globally. Prior to Google, Maggie was a faculty member and Director of Undergraduate Studies in the Department of Computer Science at Stanford University.
- Jerry Zhao is a staff software engineer at Google and the tech lead for project MapReduce. Before joining Google, he was a post-doc researcher at the International Computer Science Institute in Berkeley. He also taught at San Francisco State University. Jerry received his PhD in Computer Science from University of Southern California, and his BS from Fudan University. His research interests include wireless sensor networks, mobile networks, and distributed systems in general.
Introductions by IBM (8:20 - 8:30)
- Jim Spohrer has been director of IBM Global University Programs since 2009. Jim founded IBM's first Service Research group in 2003 at the Almaden Research Center with a focus on STEM (Science Technology Engineering and Math) for service sector innovations. In 2000, Jim became the founding CTO of IBM’s first Venture Capital Relations group in Silicon Valley. In the mid 1990’s, he lead Apple Computer’s Learning Technologies group, where he was awarded DEST (Distinguished Engineer Scientist and Technologist) status. Jim received a Ph.D. in Computer Science/Artificial Intelligence from Yale University and a B.S. in Physics from MIT.
- Jay Subrahmonia, Director, Advanced Customer Solutions, joined IBM's Software Group in September, 2003. She manages a team whose mission is to work with leading edge customers to develop first of a kind cloud solutions. Her team spread across a number of world wide Cloud Labs works with customers to pilot cloud solutions. Prior to joining Software Group, she managed a number of initiatives in IBM's Research Center—the Digital Rights Management (DRM) group at the Almaden Research Center and the Pen technologies group at the Watson Research Center in New York. She received a PhD in electrical engineering from Brown University and BTech in Electrical Engineering from IIT Mumbai.
IBM keynote: Impact of Cloud Computing on Research in Extreme Scale Analytics (8:30 - 9:15)
- Information technology is going through a fundamental change, influenced primarily by (1) Flexible provisioning and scalability of Cloud Computing, (2) Accelerated pace of analytics around semi-structured and unstructured data in the context of semantically rich data objects in the main stream data processing, (3) Much increased human interaction with the web due to the use of GPS enabled mobile devices, and its application in our daily lives, from social networking to conducting financial transactions, (4) Web Scale programming community with Web 2.0, search and open software, (5) Rise of SaaS (Software As A Service).
- Continuous arrival of huge amount of data (order of PB per day) from numerous sources requires continuous discovery of information. High Performance Computing (HPC) has been effective in creating a platform aimed at a set of target applications. Likewise, high scale cloud analytics will be primarily driven by a set of applications. Applications such as healthcare record management and analytics, pandemic, climate analysis and carbon trading are getting national level attention. There is a major shift in investing in smarter planet applications, such as those associated with green initiatives, instrumented cities, and smart power grids. Many of these applications play a significant role in major economies, and as such attract the attention of governments and commercial entities, and often the two are intertwined.
- High scale cloud analytics platforms have their roots in the commercial domain (e.g., Hadoop in internet companies). As a result the expected level of simplification, productivity, and widespread use is much higher in high scale cloud computing platforms. That includes more effective programming models (e.g., declarative query languages and much simpler exploitation of mining and machine learning). Even classic analytics applications such as bioinformatics and astronomy most likely will benefit from the emerging cloud platforms. Web scale solutions require new approaches to information integration and composition, such as Web 2.0 mash-ups. Variability of incoming information requires semi-structured repositories with flexible schema and the associated query languages.
- There is particular emphasis on breaking the complexity barrier of today's solutions through simplification. The lifetime cost of ownership of solutions is dominated by the human time spent in building, operating, and evolving these solutions. Much increased compute power in cloud computing enables us to reduce this complexity by reducing the use of fragile and complex machine optimized programs in favor of simpler and more stable and scalable ones. Flexibility and much quicker provisioning of cloud computing combined with much reduced cost per terra byte/flop are key factors in much faster deployment of solutions.
- IBM-Google cloud is an example of an effort that provides cloud services for universities. High scale cloud platforms play an increasingly bigger role in the strategy of NSF and other government funding agencies. Such high scale platforms inevitably should be shared across universities and research institutions, addressing the needs of the applications mentioned above. The level of investment in these large systems makes sharing a necessity. Further, many of the above-mentioned applications require collaboration across multiple research institutions. In addition to building solutions, university participation in shaping the research cloud platform is fundamental.
- Bio: Hamid Pirahesh, Ph.D., is an IBM fellow, ACM Fellow and is a senior manager responsible for the exploratory database research department at IBM Almaden Research Center. Pirahesh is an IBM master inventor, and is a member of IBM Academy. On the IBM product division side, he is a core member of IBM Information Management Architecture board, and has direct responsibilities in various aspects of IBM information management products, including DB2.
Topic-Partitioned Search Engine Indexes (9:15 - 9:40)
Jamie Callan, Jaime Arguello, Anagha Kulkarni
Carnegie Mellon University
- Web search engines typically divide their indexes into tiers, which are each partitioned into shards that are distributed across tens or hundreds of computers. When a tier must be searched, all of its shards are searched in parallel. This project adapts federated search techniques to create a more selective approach to searching large search engine indexes. Each shard of a topic-partitioned index covers specific content areas. When a new query arrives, only the most related shards are searched. Experimental results indicate that topic-partitioned indexes reduce search costs by an order of magnitude without reducing search accuracy.
Indexing Geospatial Data with MapReduce (9:40 - 10:05)
Naphtali Rishe, Vagelis Hristidis, Raju Rangaswami, Ouri Wolfson, Howard Ho, Ariel Cary, Zhengguo Sun, Lester Melendes
Florida International University
- The amount of information in spatial databases is growing as more data is made available. Spatial databases mainly store two types of data: raster data (satellite/aerial digital images), and vector data (points, lines, polygons). The complexity and nature of spatial databases makes them ideal for applying parallel processing. MapReduce is an emerging massively parallel computing model, proposed by Google. We have utilized the MapReduce model to solve spatial problems, including bulk-construction of R-Trees and aerial image quality computation, which involve vector and raster data, respectively. Our algorithms were executed on the Google/IBM cluster. The cluster supports the Hadoop framework – an open source implementation of MapReduce. The MapReduce index of GeoSpatial data has contributed to our TerraFly project. We have interconnected IBM's JAQL and systemT with a Hadoop implementation of MapReduce to extract the United States Governmental Organizational Hierarchy and inter-dependencies from federal publications for use in IBM's MIDAS Government domain applications. We are also working on MapReducing indexing of moving objects. Most work was performed at Florida International University under the NSF CLuE funding. Some of our work was performed on location at IBM Almaden Research Center where our student team member interned.
Morning Coffee Break (10:05am - 10:30am)
Morning Regular Session (10:30am - 12:10am)
Scalable Graph Processing in Data Center Environments (10:30 - 10:55)
Ben Zhao, Xifeng Yan, Divyakant Agrawal, Amr El Abbadi
University of California, Santa Barbara
- Searching and mining large graphs today is critical to a variety of application domains, ranging from community detection in social networks and blog spaces, to searches for functional modules in biological pathways. These tasks often involve complex queries that repeatedly access huge amounts of graph links, exposing issues unexplored in traditional keyword queries. The highly interconnected nature of graph data means their operations tend to "crawl" across many links, resulting in extremely large memory footprints that stretch the capabilities of today's commodity servers. This connectivity also generates extremely high network traffic when graphs are distributed across clusters.
- The MAGIC (Massive Graphs in Clusters) project at UCSB is studying several new challenges in processing massive graphs, including efficient graph processing primitives, distributed infrastructures for scalable graph processing, and techniques for privacy-preserving graph processing. Initial progress has been made in all three components of the project. In this talk, we will outline our key goals, report on initial progress, and highlight upcoming challenges.
Large-Scale Data Cleaning Using Hadoop (10:55 - 11:20)
Chen Li, Michael Carey, Alexander Behm, Shengyue Ji, Rares Vernica
University of California, Irvine
- We study research challenges related to data cleaning on large text repositories using Hadoop. This problem is becoming increasingly more important in applications such as Web search and record linkage, which need to deal with a variety of data inconsistencies in structures, representations, or semantics. Important technical topics include approximate string search in large text dictionaries and fuzzy joins between string collections. Existing algorithms developed on single-machine environments do not scale well on large amounts of data. We will present several interesting research directions of supporting data-cleaning operations using Hadoop, and report our initial research results.
Cluster Computing for Statistical Machine Translation (11:20 - 11:45)
Stephan Vogel, Qin Gao, Noah Smith, Kevin Gimpel, Alok Parlikar, Andreas Zollmann
Carnegie Mellon University
- Statistical Machine Translation (SMT) has become a large data endeavor. Translation models for large scale systems are trained on millions of parallel sentence pairs. Language models are trained on even larger corpora. Handling these amounts of data has pushed traditional SMT approaches to the limit, not only because of long training times, but also because of the size of the resulting models. In this talk we will present our work on reformulating the SMT processing pipeline under a MapReduce framework. We will describe the components needed for training and decoding, how they are currently realized, the benefits, but also the problems we still face.
Research and Education with MapReduce/Hadoop: Data-Intensive Text Processing and Beyond (11:45 - 12:10)
Jimmy Lin, Tamer Elsayed, Chris Dyer, Philip Resnik, Doug Oard
University of Maryland
- As one of the original pilot institutions in the Google/IBM Academic Cloud Computing Initiative, the University of Maryland has been experimenting with "cloud computing" courses that attempt to integrate research and education. We have grappled not only with the problem of scaling up our research to handle ever-growing datasets in a distributed environment, but also how to train our future generation of colleagues in being able to think at "web-scale". This talk overviews ongoing efforts in cloud computing, discussing success stories that might be of interest to both researchers and educators. Original research that has come out of these courses include work in statistical machine translation, genomic sequence alignment, and reference resolution in large email archives.
Lunch (12:10pm - 1:00pm)
Afternoon Keynote Session (1:00pm - 3:00pm)
Google keynote: Datacenter-Scale Computing (1:00 - 1:45)
Luiz André Barroso
- A model of computing that involves applications and services offered remotely by large-scale datacenters has been increasing in popularity, due in large part to the efficiencies achievable by co-locating vast computing and storage capabilities geographically and by amortizing them over many users and applications. Achieving such efficiencies however requires further understanding of this new computing platform; how to design it and how to best program it. In this talk I will provide an overview of this new class of Warehouse-scale computing systems and briefly describe some of the key challenges in realizing their efficiency potential.
- Bio: Luiz André Barroso is a Distinguished Engineer at Google, where he has worked across several engineering areas, ranging from applications and software infrastructure to the design of Google's computing platform. Prior to working at Google, he was a member of the research staff at Compaq and Digital Equipment Corporation, where his group did some of the pioneering work on computer architectures for commercial workloads. That work included the design of Piranha, a system based on an aggressive chip-multiprocessing, which helped inspire some of the multi-core CPUs that are now in the mainstream. Luiz has a Ph.D. degree in computer engineering from the University of Southern California and B.S. and M.S. degrees in electrical engineering from the Pontifícia Universidade Católica, Rio de Janeiro.
A Performance and Usability Comparison of Hadoop and Relational Database Systems (1:45 - 2:10)
Sam Madden, Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Michael Stonebraker
MIT, Brown, University of Wisconsin, Microsoft, Yale
- There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model. In this talk, I will describe and compare both paradigms and evaluate both kinds of systems in terms of performance and development complexity. I will also describe a benchmark we developed consisting of a collection of tasks that we have run on an open source version of MR (Hadoop) as well as on two parallel DBMSs. For each task, we measured each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of the DBMSs was strikingly better. I will speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
HadoopDB An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads (2:10 - 2:35)
Daniel Abadi, Azza Abouzeid, Kamil Bajda-Pawlikowski
- HadoopDB is a hybrid of DBMS and MapReduce technologies that targets analytical workloads (large-scale data warehousing). Superior performance on structured data analysis is achieved by using database servers deployed across a shared-nothing cluster while scalability and fault-tolerance comes with Hadoop, which serves as a task coordination layer. A prototype of HadoopDB was build at Yale University out of open source components and attempts to fill the gap in the market for a free and open source parallel DBMS.
Towards Interactive Visualization in the Cloud (2:35 - 3:00)
Bill Howe, Huy Vo, Claudio Silva, Juliana Friere, YingYi Bu
University of Washington, University of Utah
- We are exploring Hadoop's role in support of large-scale query-driven visualization. The goal is to allow users to explore terabytes and petabytes of data with the same level of interactivity they expect when exploring gigabytes—but with significantly less code. Our method is to use Hadoop for "data preparation" and index-building, but to use aggressive caching, pre-fetching, and speculative processing to mask latency and provide a seamless experience for the user. Further, to reduce programming effort, we are designing declarative abstractions—think ruby-on-rails—for establishing the multi-tier pipeline of indexes and data structures needed to support efficient visualization applications. These pipelines span the client-cloud interface, interfacing with the local GPU on one end and massively parallel data analysis engines on the other.
- The domain we use as a test case is computational oceanography, where decade-scale simulations of ocean circulation must be processed and visualized ad hoc. These datasets occupy tens of terabytes and exhibit a "mesh" structure that is not always amenable to relational-style query processing, making it a natural fit for the flexibility of "raw" MapReduce. Using an algebra for mesh datasets called GridFields and a scientific workflow system called VisTrails, we are reducing the effort required to express and manage these exploration tasks as well as improve their performance.
Afternoon Coffee Break (3:00pm - 3:30pm)
Afternoon Regular Session (3:30pm - 5:10pm)
Scaling the Sky with MapReduce/Hadoop (3:30 - 3:55)
Andrew Connolly, Jeff Gardner, Simon Krughoff
University of Washington
- Astronomy is addressing many fundamental questions about nature of our universe through a series of ambitious wide-field imaging missions and surveys. These projects—funded by NSF, NASA and others—will investigate the properties of dark matter, the nature of dark energy, the evolution of large scale structure as well as searching for potentially hazardous asteroids. These surveys pose many fundamental challenges: how do we determine the interdependencies between the observable properties of stars and galaxies in order to better understand the physics of the formation of the universe when the data contain 1000s of dimensions and tens of billions of sources; how do we combine disjoint images, by collating data from several distinct surveys with different wavelengths, scales, resolutions and times in order to identify moving or variable sources. In this talk we explore how the emerging map-reduce paradigm for data-intensive computing might revolutionize the way that the sciences approach access to and analysis of a new generation of massive data streams.
Commodity Computing in Genomics Research (3:55 - 4:20)
Mihai Pop, Mike Schatz
University of Maryland
- In the next few years the data generated by DNA sequencing instruments around the world will exceed petabytes, surpassing the amounts of data generated by large-scale physics experiments such as the Hadron Collider. Even today, the terabytes of data generated every few days by each sequencing instrument test the limits of existing network and computational infrastructures. Our project is aimed at evaluating whether cloud computing technologies and the MapReduce/Hadoop infrastructure can enable the analysis of the large data-sets being generated. We will report on initial results in two specific applications: human genotyping and genome assembly using next generation sequencing data.
Relaxed Synchronization and Eager Scheduling in MapReduce (4:20 - 4:45)
Ananth Grama, Suresh Jagannathan
- MapReduce has rapidly emerged as a programming paradigm for large-scale distributed environments. While the underlying programming model based on maps and folds has been shown to be effective in specific domains, significant questions relating to scalability and application scope remain unresolved. This work targets questions of performance and application scope through relaxed semantics of underlying map and reduce constructs. Specifically, it investigates the notion of partial synchronizations combined with eager scheduling to overcome global synchronization overheads associated with reduce operations. It demonstrates the use of these partial synchronizations on complex problems in sparse network analysis. Significant performance improvements are reported for moderate sized distributed testbeds that lead us to believe the technique holds significant potential for applications with highly irregular data and computation profiles.
Dynamic Provisioning of Data Intensive Applications (4:45 - 5:10)
Chaitanya Baru, Sriram Krishnan
San Diego Supercomputer Center/University of California, San Diego
- In this study, we reassess how large scientific data archives are implemented by examining how data sets can be hosted and served to a broad community of users using on-demand and dynamic approaches to provisioning data sets. Such data archives typically witness "hot spots" in the data, where some parts of a data set are more frequently accessed than others. The hot spots can change over time, for example, a data set may have a hot spot immediately upon its release; after a particular natural event, upon publication of a new result, etc. There may also be different modes of access (workloads) experienced by different parts of the data set. Different data sets and different parts of a data set can be provisioned differently based on the nature of hot spots and workloads. Using our experience in hosting large airborne LiDAR (Light Detection And Ranging) data sets via the OpenTopography.org portal, we are conducting a number of experiments to compare the performance and price/performance of database systems versus Hadoop-based implementations for such large geospatial data sets. In this talk, we will describe the characteristics of the data sets, nature of the workloads, current implementation, and our on-going and planned experiments.
Break (5:10pm - 5:30pm)
Poster Reception (5:30pm-7:30pm)
Hors d'oeuvre and drinks hosted by Google and IBM.
- The ClueWeb09 Dataset: Jamie Callan, Mark Hoy, Changkuk Yoo, Le Zhao (Carnegie Mellon University)
- Scalable Graph Processing in Data Center Environments: Ben Zhao, Xifeng Yan, Divyakant Agrawal, Amr El Abbadi (University of California, Santa Barbara)
- Large-Scale Data Cleaning Using Hadoop: Chen Li, Michael Carey, Alexander Behm, Shengyue Ji, Rares Vernica (University of California, Irvine)
- Cluster Computing for Statistical Machine Translation: Stephan Vogel, Qin Gao, Noah Smith, Kevin Gimpel, Alok Parlikar, Andreas Zollmann (Carnegie Mellon University)
- Identity Resolution in Email Collections: Tamer Elsayed, Douglas W. Oard (University of Maryland)
- A Performance and Usability Comparison of Hadoop and Relational Database Systems: Sam Madden, Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Michael Stonebraker (MIT, Brown, University of Wisconsin, Microsoft, Yale)
- HadoopDB An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads: Daniel Abadi, Azza Abouzeid, Kamil Bajda-Pawlikowski (Yale University)
- Towards Interactive Visualization in the Cloud: Huy Vo, Claudio Silva, Juliana Freire, Bill Howe, YingYi Bu (University of Washington, University of Utah)
- Indexing Geospatial Data with MapReduce: Naphtali Rishe, Vagelis Hristidis, Raju Rangaswami, Ouri Wolfson, Howard Ho, Ariel Cary, Zhengguo Sun, Lester Melendes (Florida International University)
- Streaming Astronomy: Andrew Connolly, Jeff Gardner, Simon Krughoff (University of Washington)
- Single Image Super-Resolution Using Trillions of Examples: Sean Arietta, Jason Lawrence (University of Virginia)
- Commodity Computing in Genomics Research: Mihai Pop, Mike Schatz (University of Maryland)
- Relaxed Synchronization and Eager Scheduling in MapReduce: Ananth Grama, Suresh Jagannathan (Purdue University)
- Dynamic Provisioning of Data Intensive Applications: Chaitanya Baru, Sriram Krishnan (San Diego Supercomputer Center/University of California, San Diego)
- Finding Word Relationships for Information Retrieval: James Allan, W. Bruce Croft, David A. Smith (University of Massachusetts, Amherst)