1. ADAPT (An Approach to Digital Archiving and Preservation Technology)
2. CloudBurst and Crossbow
3. Google/IBM Academic Cloud Computing Initiative (ACCI)
4. Ivory: A Hadoop toolkit for Web-scale information retrieval
5. Large-Data Statistical Machine Translation
6. NodeXL: Scalable Network Analysis and Visualization
ADAPT (An Approach to Digital Archiving and Preservation Technology)
The ADAPT project (An Approach to Digital Archiving and Preservation Technology) is developing technologies for building a scalable and reliable infrastructure for the long-term access and preservation of digital assets. Our approach uses a distributed object architecture that operates on different levels of abstraction, built around cloud technologies and web services. Long-term preservation of digital objects requires systematic methodologies that address the following requirements.
- Each preserved digital object should encapsulate information regarding content, structure, context, provenance, and access to enable the long-term maintenance and lifecycle management of the digital object.
- Efficient management of technology evolution, both hardware and software, and the appropriate handling of technology obsolescence (for example, format obsolescence).
- Efficient risk management and disaster recovery mechanisms, whether the threat is technology degradation and failure, natural disasters such as fires, floods, and hurricanes, human-induced operational errors, or security failures and breaches.
- Efficient mechanisms to ensure the authenticity and integrity of content, context, and structure of archived information throughout the preservation period.
- Ability for information discovery and content access and presentation, with an automatic enforcement of authorization and IP rights, throughout the lifecycle of each object.
- Scalability in terms of ingestion rate, capacity, and processing power to manage and preserve large-scale heterogeneous collections of complex objects, and in the speed at which users can discover and retrieve information.
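To make the first and fourth requirements concrete, here is a minimal Python sketch of such an encapsulated object with a fixity check. The names (`PreservedObject`, `verify_integrity`) and the use of SHA-256 are illustrative assumptions, not part of ADAPT's actual design:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PreservedObject:
    # Hypothetical encapsulation of the five metadata categories listed above.
    content: bytes
    structure: dict    # e.g., relationships among component files
    context: dict      # e.g., collection membership, creation environment
    provenance: list   # append-only event history
    access: dict       # authorization and IP-rights policy
    checksum: str = ""

    def __post_init__(self):
        # Record a content digest at ingest time.
        if not self.checksum:
            self.checksum = hashlib.sha256(self.content).hexdigest()

    def verify_integrity(self) -> bool:
        # Fixity check: content must still hash to the recorded digest.
        return hashlib.sha256(self.content).hexdigest() == self.checksum

    def record_event(self, event: str) -> None:
        # Provenance grows monotonically; past events are never rewritten.
        self.provenance.append(event)

obj = PreservedObject(b"report bytes", {"parts": []}, {"collection": "demo"}, [], {"public": True})
obj.record_event("ingested")
print(obj.verify_integrity())  # True while the content is unchanged
```

Periodically re-running the fixity check against replicated copies is one standard way to detect the technology degradation and bit-rot failures the third requirement refers to.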
CloudBurst and Crossbow
Next-generation DNA sequencing machines are generating an enormous amount of sequence data, placing unprecedented demands on traditional single-processor read mapping algorithms. CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics. It is modeled after the short read mapping program RMAP, and reports either all alignments or the unambiguous best alignment for each read with any number of mismatches or differences. This level of sensitivity could be prohibitively time-consuming, but CloudBurst uses the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes.
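The seed-and-extend idea behind this kind of parallel read mapping can be illustrated with a toy in-memory MapReduce. This is a sketch under simplifying assumptions, not CloudBurst's actual implementation: the seed length and parameters here are arbitrary, Hadoop's shuffle is simulated with a dictionary, and real CloudBurst ships flanking reference sequence through the shuffle rather than reading a global:

```python
from collections import defaultdict

SEED = 4          # illustrative seed length
MAX_MISMATCH = 1  # allowed mismatches per alignment

def mapper(kind, name, seq):
    # Reference: emit every seed with its offset. Read: emit non-overlapping
    # seeds; a read with <= MAX_MISMATCH errors must match at least one of
    # its MAX_MISMATCH + 1 non-overlapping seeds exactly.
    if kind == "ref":
        for i in range(len(seq) - SEED + 1):
            yield seq[i:i + SEED], ("ref", name, i)
    else:
        for j in range(0, len(seq) - SEED + 1, SEED):
            yield seq[j:j + SEED], ("read", name, j, seq)

def reducer(seed, values):
    # Pair reference and read occurrences of the same seed, then "extend"
    # by counting mismatches over the full read length.
    refs = [v for v in values if v[0] == "ref"]
    reads = [v for v in values if v[0] == "read"]
    for _, rname, i in refs:
        for _, qname, j, qseq in reads:
            start = i - j
            if start < 0 or start + len(qseq) > len(REFERENCE):
                continue
            window = REFERENCE[start:start + len(qseq)]
            mismatches = sum(a != b for a, b in zip(window, qseq))
            if mismatches <= MAX_MISMATCH:
                yield qname, (start, mismatches)

REFERENCE = "ACGTACGTTTGACGTA"
READS = {"r1": "ACGTTTGA", "r2": "ACGAACGT"}

# Simulate the shuffle: group all mapper outputs by seed.
groups = defaultdict(list)
for kind, name, seq in [("ref", "chr1", REFERENCE)] + [("read", n, s) for n, s in READS.items()]:
    for key, val in mapper(kind, name, seq):
        groups[key].append(val)

hits = set()
for seed, vals in groups.items():
    for read, aln in reducer(seed, vals):
        hits.add((read, aln))
print(sorted(hits))  # [('r1', (4, 0)), ('r2', (0, 1))]
```

The point of the decomposition is that each seed's group can be processed by a different reducer, which is what lets Hadoop spread the alignment work across many nodes.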
Crossbow is our new high-throughput pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast and memory-efficient short read aligner, and SoapSNP, a highly accurate Bayesian genotyping algorithm, within Hadoop to distribute and accelerate the computation with many nodes. The pipeline is extremely efficient, and can accurately analyze an entire genome in one day on a small 10-node local cluster, or in one afternoon and for less than $250 in the Amazon EC2 cloud.
Michael Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363-1369, 2009.
Google/IBM Academic Cloud Computing Initiative (ACCI)
In October 2007, Google and IBM announced the Academic Cloud Computing Initiative (ACCI), which granted several U.S. universities access to a large computer cluster running Hadoop, an open source implementation of the MapReduce programming model. One goal of the project was to bring large-scale distributed processing into the classroom and to teach students how to think at "web scale". To this end, the University of Maryland offered "cloud computing" courses at the advanced undergraduate/introductory graduate level in Spring 2008 and Fall 2008. Cloud9 is a MapReduce library for Hadoop, originally developed as a teaching tool for these courses; it has since evolved to provide shared APIs for research in text processing.
Ivory: A Hadoop toolkit for Web-scale information retrieval
Ivory is a Hadoop toolkit for Web-scale information retrieval research that features a retrieval engine based on Markov Random Fields, appropriately named SMRF (Searching with Markov Random Fields). This open-source project began in Spring 2009 and represents a collaboration between the University of Maryland and Yahoo! Research. Ivory takes full advantage of the Hadoop distributed environment (the MapReduce programming model and the underlying distributed file system) for both indexing and retrieval. Ivory was specifically designed to work with Hadoop "out of the box" on the ClueWeb09 collection, a 1 billion page (25 TB) Web crawl distributed by Carnegie Mellon University. Ivory is meant to serve as a reference implementation of indexing and retrieval algorithms that can operate at the multi-terabyte scale—some of its algorithms are described in a SIGIR 2009 tutorial on Data-Intensive Text Processing.
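Indexing in this setting reduces to the classic inverted-index construction pattern on MapReduce. The sketch below shows that generic pattern, not Ivory's actual on-disk index format; the documents and the in-memory grouping stand in for a Hadoop job over a real collection:

```python
from collections import Counter, defaultdict

def mapper(doc_id, text):
    # Emit one (term, (doc_id, term_frequency)) pair per distinct term.
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def reducer(term, postings):
    # Postings sorted by document id, as an on-disk index would store them.
    return term, sorted(postings)

docs = {
    1: "cloud computing with hadoop",
    2: "hadoop indexing at web scale",
    3: "markov random fields for retrieval",
}

# Simulate the shuffle: group mapper outputs by term.
groups = defaultdict(list)
for doc_id, text in docs.items():
    for term, posting in mapper(doc_id, text):
        groups[term].append(posting)

index = dict(reducer(t, p) for t, p in groups.items())
print(index["hadoop"])  # [(1, 1), (2, 1)]
```

Because each term's postings list is assembled by an independent reducer, the same pattern scales from three toy documents to a billion-page crawl given enough nodes.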
Large-Data Statistical Machine Translation
In recent years, the quantity of training data available for statistical machine translation has increased far more rapidly than the performance of individual computers, thus presenting an impediment to progress. Parallelization of the model building algorithms that process this data on clusters is fraught with challenges such as synchronization, data exchange, and fault tolerance. However, the MapReduce programming paradigm has recently emerged as one solution to these issues: a powerful functional abstraction hides system-level details from the researcher, allowing programs to be transparently distributed across potentially very large clusters of commodity hardware. We've been working on MapReduce implementations of parameter estimation algorithms for large-data statistical machine translation.
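As a concrete instance, the expected-count computation (E-step) of IBM Model 1 decomposes over sentence pairs, which is exactly what makes it MapReduce-friendly: mappers compute per-sentence expected counts and reducers sum them by word pair. The sketch below runs the whole EM loop in memory over a two-sentence toy bitext; it is an illustration of the idea, not our Hadoop implementation:

```python
from collections import defaultdict

def em_iteration(bitext, t):
    # "Map" step: each sentence pair independently computes expected
    # alignment counts under the current translation table t(f|e).
    counts = defaultdict(float)
    totals = defaultdict(float)
    for f_sent, e_sent in bitext:
        for f in f_sent:
            z = sum(t[(f, e)] for e in e_sent)  # normalizer over alignments
            for e in e_sent:
                c = t[(f, e)] / z
                counts[(f, e)] += c  # "reduce" step: sum counts keyed by (f, e)
                totals[e] += c
    # M-step: renormalize the summed counts into the new table.
    return {(f, e): counts[(f, e)] / totals[e] for (f, e) in counts}

bitext = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
vocab_f = {f for f_sent, _ in bitext for f in f_sent}
t = defaultdict(lambda: 1.0 / len(vocab_f))  # uniform initialization
for _ in range(10):
    t = em_iteration(bitext, t)
print(round(t[("das", "the")], 2))
```

After a few iterations the co-occurrence signal dominates: "das" (which appears with "the" in both pairs) captures nearly all of t(f|"the"). On a cluster, fault tolerance and data exchange for the count-summing step come for free from the MapReduce runtime.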
Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce. Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008, pages 199-207, June 2008, Columbus, Ohio.
NodeXL: Scalable Network Analysis and Visualization
The NodeXL Template for Microsoft Excel 2007 is a free and open source extension to the widely used spreadsheet application that provides a range of basic network analysis and visualization features. NodeXL uses a highly structured workbook template that includes multiple worksheets to store all the information needed to represent a network graph. Visualization features allow users to display a range of network graph representations and map data attributes to visual properties including shape, color, size, transparency, and location. Initial goals for NodeXL were to simplify usage and support educational applications; the development effort is now turning to scalability, for which cloud computing solutions will play an important role.
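As a rough illustration of this worksheet model, vertices and edges can be held as row tables, with a rule that rescales a data attribute into a visual property. The field names and the linear size mapping below are hypothetical, not NodeXL's actual schema:

```python
# One row per vertex and per edge, mirroring the multi-worksheet layout.
vertices = [
    {"name": "alice", "followers": 120},
    {"name": "bob",   "followers": 30},
    {"name": "carol", "followers": 75},
]
edges = [("alice", "bob"), ("bob", "carol")]

def map_attribute_to_size(rows, attr, lo=5.0, hi=20.0):
    # Linearly rescale a data attribute into a vertex-size range --
    # the kind of attribute-to-visual-property mapping described above.
    vals = [row[attr] for row in rows]
    vmin, vmax = min(vals), max(vals)
    for row in rows:
        row["size"] = lo + (hi - lo) * (row[attr] - vmin) / (vmax - vmin)
    return rows

for v in map_attribute_to_size(vertices, "followers"):
    print(v["name"], round(v["size"], 1))  # alice 20.0, bob 5.0, carol 12.5
```

Keeping the mapping as data alongside the graph is what lets the same workbook drive many different visual encodings of one network.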