User:Dsommer


 



AMOS: A Modular Open-Source Assembler


1. AMOS overview

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal: to produce A Modular, Open-Source whole genome assembler. It is open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve, in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

[docs/tutorials/readme-first.html AMOS Getting Started]

[docs/tutorials/programmers-guide.html Programmer's guide]

AMOS Documentation Project

Quick links:

  • [#download Download]
  • SourceForge project page
  • [#libamos API documentation]
  • [#specifications File format specs]
  • [recentcvs.html Recent CVS Activity]

Module documentation:

  • [docs/converters File conversion utilities]
  • [hawkeye Hawkeye] - assembly viewer
  • [forensics amosvalidate] - assembly validation
  • [docs/pipeline/minimus.html minimus] - basic genome assembler for small datasets
  • [docs/pipeline/minimus2.html minimus2] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
  • [docs/pipeline/AMOScmp.html AMOScmp] - comparative assembler
  • [docs/pipeline/AMOScmp-shortReads.html AMOScmp-shortReads] - comparative assembler for short reads (Solexa, 454)
  • [docs/pipeline/AMOScmp-shortReads-alignmentTrimmed.html AMOScmp-shortReads-alignmentTrimmed] - comparative assembler for short reads that uses alignment-based trimming
  • Figaro - statistical vector trimmer
  • [#libamos libAMOS] - the API
  • [#libalign libAlign]
  • [#libslice libSlice]
  • [#bambus Bambus]
  • [#autoeditor AutoEditor]

2. Table of contents

  1. [#overview AMOS overview]
  2. [#toc Table of contents]
  3. [#collaborators Collaborators]
  4. [#acknowledgements Acknowledgements]
  5. [#infrastructure Infrastructure]
    1. [#libamos libAMOS API]
    2. [#specifications Specifications]
  6. [#modules Modules and projects]
    1. [#pipeline Assembly pipeline]
    2. [#overlap Overlap detection]
    3. [#contig Contig construction]
    4. [#consensus Consensus]
    5. [#scaffold Scaffolding]
    6. [#error Error correction]
    7. [#validation Validation]
    8. [#utilities Utilities]
  7. [#download Download]
  8. [#join Join the consortium]
  9. [#bugs Bug reports]


3. Consortium members

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. [#join Please contact us] if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.


4. Acknowledgements

The AMOS consortium would like to thank the following organizations for their funding and/or support:

The National Institutes of Health - grants R01-LM06845, N01-AI-15447

The National Science Foundation - grants IIS-9902923, IIS-9820497

Department of Homeland Security - cooperative agreement W81XWH-05-2-0051

SourceForge.net


5. Infrastructure

The principal benefit of the AMOS project is its modular design, but many isolated components can only work together on top of a robust infrastructure. In response to this need, TIGR has developed numerous C++ classes for the efficient storage of assembly data types. These assembly objects can be written to and read from a central data repository, allowing separate modules to build on and improve existing assemblies in discrete steps. This allows an assembly pipeline to run its steps in any order, and data snapshots to be preserved at any time. To convey the assembly data outside of the C++ classes, we have implemented an ASCII message format modeled on that used by Celera Assembler*. This message format will be the unifying standard for all external module communication, and allows the data snapshots to be output in a concise text format. The API (application programming interface) for the AMOS foundation classes and the specification for the AMOS message format can be found in the sections below.

  * "A Whole-Genome Assembly of Drosophila." Myers E, Sutton G, et al., Science, 2000. 287(5461):2196-204.


5.1. Application programming interface

The AMOS API describes the programming interface for all of the AMOS foundation classes. Currently these classes are implemented in C++, but they could be ported to other languages as long as the API is preserved. The implementation can be found in the latest distribution under the src/AMOS project directory. These classes comprise the libAMOS.a library. This library contains the tools necessary to handle and manipulate AMOS messages, data banks, and internal assembly data structures such as sequencing reads, contigs, and scaffolds. The C++ source code for libAMOS is freely available for download [#download here].

[docs/api AMOS infrastructure API]


5.2. Specifications

The AMOS file types and message formats are defined in various specification documents, which can be found by following the link below. These documents also explain how to use messages for module communication and give general recommendations on development procedure.

[docs/specs AMOS specification documents]
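To give a feel for how a module might read a brace-delimited ASCII message, here is a minimal Python sketch. The layout it assumes (a three-letter type code after the opening brace, `tag:value` fields, and multi-line fields closed by a lone `.`) is an assumption about the general shape of such messages, and `parse_message` is a hypothetical helper rather than part of AMOS. The specification documents remain the authority on the real format.

```python
def parse_message(text):
    """Parse one flat message of the assumed form into (type, fields)."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].startswith("{") or lines[-1] != "}":
        raise ValueError("not a brace-delimited message")
    msg_type = lines[0][1:]          # e.g. "RED" (assumed type code)
    fields = {}
    i = 1
    last = len(lines) - 1            # index of the closing "}"
    while i < last:
        tag, _, value = lines[i].partition(":")
        i += 1
        if value == "":
            # Assumed multi-line field: collect lines up to a lone "."
            chunk = []
            while i < last and lines[i] != ".":
                chunk.append(lines[i])
                i += 1
            i += 1                   # skip the "." terminator
            value = "\n".join(chunk)
        fields[tag] = value
    return msg_type, fields


# Hypothetical message; identifiers and values are made up for illustration.
example = """{RED
iid:1
eid:read_0001
seq:
ACGTACGT
.
}"""

msg_type, fields = parse_message(example)
print(msg_type, fields)
```

The sketch does not handle nested messages or validation; it only shows why a plain-text format makes snapshots easy to inspect and exchange between modules.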


6. Modules and projects

The following sections list all modules currently in development and those modules that are already in production. Because AMOS is in a constant state of development, there is an ever expanding list of ongoing projects, and this section attempts to outline the basic function of each project along with its status and parent organization. Status descriptions are (in order of occurrence): planning, development, testing, production and antiquated. These status descriptions appear to the right of the module name. Clicking on a module name will redirect you to the project homepage (if applicable).

6.1. Assembly pipeline

[docs/pipeline/runAmos.html runAmos]

STATUS: production

runAmos is the command executor for all of the AMOS pipelines.


[docs/pipeline/minimus.html minimus]

STATUS: production

minimus is a lightweight assembly tool for performing small assembly tasks for which the complexity of a full assembler is unnecessary. Some such tasks, commonly needed during genome finishing, include joining together two overlapping contigs, adding reads to an existing contig, and refining the multiple alignment of the reads within a contig. We use a standard 3-step assembly process known as overlap-layout-consensus, which is explained further on the minimus website. minimus is freely available as part of the AMOS distribution, which can be downloaded [#download here].

Examples of a flu assembly and a Zebrafish gene can be found in the test/minimus directory created when the AMOS distribution is untarred. Documentation on the examples is included with the distribution in the /docs subdirectory.
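As a rough illustration of the "joining together two overlapping contigs" task, the Python sketch below finds the longest exact suffix-prefix overlap between two sequences and merges them. It is a toy under stated assumptions: real overlaps tolerate mismatches and gaps, and the function names and the `min_len` parameter are hypothetical, not part of minimus.

```python
def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def join_contigs(a, b, min_len=3):
    """Merge `b` onto the end of `a` if they overlap by at least `min_len` bases."""
    n = longest_overlap(a, b, min_len)
    if n == 0:
        return None              # no acceptable overlap, nothing to join
    return a + b[n:]             # keep the shared bases only once


left = "ACGTTGCA"
right = "TGCAGGTC"
merged = join_contigs(left, right)
print(merged)
```

Here the two sequences share the four bases `TGCA`, so the merged sequence is twelve bases long.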

Supported in part by DHS cooperative agreement W81XWH-05-2-0051.


[docs/pipeline/minimus2.html minimus2]

STATUS: production

minimus2 is a modified version of the minimus pipeline designed for merging one or two sequence sets.

[docs/pipeline/AMOScmp.html AMOScmp]

STATUS: production

AMOScmp is a comparative assembly pipeline. With the rapid growth in the number of sequenced genomes has come an increase in the number of organisms for which two or more closely-related species have been sequenced. This has created the possibility of building a comparative genome assembly algorithm, which can assemble a newly sequenced genome by mapping it onto a reference genome. Methods are described in our paper (below) and on the AMOScmp website. The MUMmer whole genome alignment package is required for the mapping step of this pipeline, and is freely available from the [#mummer MUMmer homepage]. AMOScmp is freely available as part of the AMOS distribution, which can be downloaded [#download here].
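The idea of building a layout from reference mappings can be sketched in a few lines of Python. This is only an illustration, assuming each read already has a single alignment interval on the reference (for example, derived from an aligner's output); `layout_contigs` and the tuple layout are hypothetical, and the actual pipeline is described in the paper and on the AMOScmp website.

```python
def layout_contigs(alignments):
    """Group reads into contigs by merging overlapping reference intervals.

    alignments: list of (read_id, ref_start, ref_end), half-open coordinates.
    """
    contigs = []
    for read_id, start, end in sorted(alignments, key=lambda a: (a[1], a[2])):
        if contigs and start < contigs[-1]["end"]:
            # Read overlaps the current contig on the reference: extend it.
            contigs[-1]["reads"].append(read_id)
            contigs[-1]["end"] = max(contigs[-1]["end"], end)
        else:
            # Gap in reference coverage: start a new contig.
            contigs.append({"start": start, "end": end, "reads": [read_id]})
    return contigs


# Made-up alignment intervals for illustration.
alignments = [("r3", 250, 400), ("r1", 0, 120), ("r2", 100, 220), ("r4", 380, 500)]
contigs = layout_contigs(alignments)
for c in contigs:
    print(c)
```

Reads `r1` and `r2` overlap on the reference and form one contig; the uncovered stretch between positions 220 and 250 separates them from `r3` and `r4`.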

Related publications
  • "Comparative Genome Assembly." Pop M, Phillippy A, Delcher AL, Salzberg SL, Briefings in Bioinformatics, 2004. 5(3):237-48.


[docs/pipeline/AMOScmp-shortReads.html AMOScmp-shortReads]

STATUS: production

AMOScmp-shortReads is a modified version of AMOScmp designed for assembling short reads.

[docs/pipeline/AMOScmp-shortReads-alignmentTrimmed.html AMOScmp-shortReads-alignmentTrimmed]

STATUS: production

AMOScmp-shortReads-alignmentTrimmed is a modified version of AMOScmp designed for alignment-based trimming and assembly of short reads.

6.2. Overlap detection

UMD overlapper

STATUS: testing

Most assemblers use exact k-mer matches to identify reads that potentially overlap. The UMD overlapper is designed to reduce the number of overlaps the assembler must process by eliminating many repeat-induced overlaps. The algorithm also uses minimizers, a technique that reduces the number of k-mers considered in the initial phase of overlapping by an order of magnitude.
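The minimizer idea can be illustrated with a short Python sketch. It assumes the common (w, k) formulation, in which every window of `w` consecutive k-mers contributes its lexicographically smallest k-mer; the function name and parameters are illustrative and are not taken from the UMD overlapper's code.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) minimizers of `seq`.

    Each window of `w` consecutive k-mers contributes its lexicographically
    smallest k-mer (the leftmost one on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        chosen.add((start + offset, window[offset]))
    return sorted(chosen)


picked = minimizers("ACGTACGGT", k=3, w=3)
print(picked)
```

In this toy input only three of the seven k-mers are kept. Two reads that share an identical stretch of at least `w + k - 1` bases select the same minimizer from it, which is why indexing minimizers alone can still find candidate overlaps.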

Related publications
  • "A preprocessor for shotgun assembly of large genomes." Roberts M, Hunt BR, Yorke JA, Bolanos R, Delcher A, Journal of Computational Biology, 2004. 11(4):734-752
  • "Reducing storage requirements for biological sequence comparison." Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Bioinformatics, 2004, 20(18):3363-3369.


KI overlapper

STATUS: testing

The Karolinska Institutet overlapper is designed to handle the problems created by sequencing errors. Instead of exact k-mer matches - the approach used by most existing assemblers - the KI overlapper uses a q-gram based method to identify "near hits" - k-mers that differ at a small number of positions. This approach allows this overlapper to identify overlaps otherwise missed by other overlappers.
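One simple way to find "near hits" is a pigeonhole filter: if two k-mers differ at no more than d positions, then cutting both into d + 1 pieces leaves at least one piece identical. The Python sketch below uses that filter and then verifies candidates by Hamming distance. It is a stand-in chosen for brevity, not the KI overlapper's q-gram method, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def near_hits(kmers, max_mismatches):
    """Pairs of equal-length k-mers differing at <= max_mismatches positions."""
    pieces = max_mismatches + 1
    index = defaultdict(set)
    for kmer in kmers:
        size = len(kmer) // pieces
        for p in range(pieces):
            lo = p * size
            hi = len(kmer) if p == pieces - 1 else lo + size
            index[(p, kmer[lo:hi])].add(kmer)   # bucket by (piece no., piece)
    hits = set()
    for bucket in index.values():
        # Only k-mers sharing an exact piece are compared in full.
        for a, b in combinations(sorted(bucket), 2):
            if hamming(a, b) <= max_mismatches:
                hits.add((a, b))
    return sorted(hits)


pairs = near_hits(["ACGTAC", "ACGTAG", "ACCTAG", "TTTTTT"], max_mismatches=1)
print(pairs)
```

`ACGTAC` and `ACCTAG` differ at two positions, so they are not reported, while both are one mismatch away from `ACGTAG`.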

Related publications
  • "Correcting errors in shotgun sequences." Tammi MT, Arner E, Kindlund E, Andersson B, Nucleic Acids Research, 2003. 31(15):4663-72.
  • "TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences." Tammi MT, Arner E, Andersson B, Computer Methods and Programs in Biomedicine, 2003. 70(1):47-59.
  • "Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs." Tammi MT, Arner E, Britton T, Andersson B, Bioinformatics, 2002. 18(3):379-88.


6.3. Contig construction

UMD Contigger

STATUS: development

The UMD contigger uses the set of read overlaps generated during the overlap stage in order to identify unambiguous contigs - maximally consistent tilings of reads. These contigs represent stretches of the genome that can be unambiguously assembled and form a convenient backbone for further processing.
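A simplified picture of "unambiguous" tilings: treat reads as nodes in a directed overlap graph and chain together reads only while each link is the sole way out of one read and the sole way into the next. The Python sketch below does exactly that. It is an assumption-laden toy (no orientation, no containment, no overlap quality), and `unambiguous_chains` is a hypothetical name, not the UMD contigger's interface.

```python
def unambiguous_chains(overlaps):
    """overlaps: dict mapping each read to the list of reads that follow it."""
    preds = {}
    for u, succs in overlaps.items():
        preds.setdefault(u, [])
        for v in succs:
            preds.setdefault(v, []).append(u)

    def unique_next(u):
        succs = overlaps.get(u, [])
        if len(succs) == 1 and len(preds[succs[0]]) == 1:
            return succs[0]
        return None

    def unique_prev(v):
        if len(preds[v]) == 1 and len(overlaps.get(preds[v][0], [])) == 1:
            return preds[v][0]
        return None

    chains = []
    for read in sorted(preds):
        if unique_prev(read) is not None:
            continue                      # interior of a chain, not a start
        chain = [read]
        nxt = unique_next(read)
        while nxt is not None:
            chain.append(nxt)
            nxt = unique_next(nxt)
        chains.append(chain)
    return chains


# Made-up overlap graph: r3 is followed by two different reads (a branch).
overlaps = {"r1": ["r2"], "r2": ["r3"], "r3": ["r4", "r5"],
            "r4": [], "r5": ["r6"], "r6": []}
chains = unambiguous_chains(overlaps)
print(chains)
```

The branch after `r3` ends the first chain; such branches are the kind of ambiguity that repeats typically introduce.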


6.4. Consensus

libSlice

STATUS: production

libSlice is a C++ library that provides the user with a parametric implementation of the Churchill-Waterman algorithm for computing the consensus base from a column in a multiple alignment of reads. This task is an essential part of any consensus module. The implementation can be found in the latest distribution under the src/Slice project directory. These C structs comprise the libSlice.a library.
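To make the column-consensus task concrete, here is a small Python sketch of a simplified Bayesian base call. It assumes a uniform prior over A, C, G, T, independent reads, and Phred-style quality values whose errors are spread evenly over the three other bases. It is written in the spirit of a probabilistic consensus only; it is not the Churchill-Waterman formulation as implemented in libSlice, and `consensus_base` is a hypothetical name.

```python
def consensus_base(column):
    """column: list of (base, phred_quality) observations for one column.

    Returns (best_base, posterior_probability) under a uniform prior.
    """
    alphabet = "ACGT"
    likelihood = {}
    for candidate in alphabet:
        p = 1.0
        for base, qual in column:
            err = 10 ** (-qual / 10)          # Phred quality -> error prob.
            p *= (1 - err) if base == candidate else err / 3
        likelihood[candidate] = p
    total = sum(likelihood.values())
    best = max(alphabet, key=lambda b: likelihood[b])
    return best, likelihood[best] / total


# Made-up column: two confident A calls and one low-quality C call.
best, posterior = consensus_base([("A", 30), ("A", 20), ("C", 10)])
print(best, round(posterior, 4))
```

The low-quality `C` barely moves the result, which shows why quality values matter more than a simple majority vote.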


libAlign

STATUS: production

libAlign is a robust multi-alignment library for consensus generation. It can efficiently handle large inputs and is able to identify and correctly align slightly misplaced and/or low-similarity reads in the input. The implementation can be found in the latest distribution under the src/Align project directory. These classes comprise the libAlign.a library and depend on the [#libslice libSlice] library.


6.5. Scaffolding

[docs/bambus/ Bambus]

STATUS: production

Bambus is the first general purpose scaffolder that is publicly available as an open source package. While most other scaffolders are closely tied to a specific assembly program, Bambus accepts the output from most current assemblers and provides the user with great flexibility in choosing the scaffolding parameters. In particular, Bambus is able to accept contig linking data other than specified by mate-pairs. Such sources of information include alignment to a reference genome (Bambus can directly use the output of MUMmer), physical mapping data, or information about gene synteny.

Related publications
  • "Hierarchical scaffolding with Bambus." Pop M, Kosack DS, Salzberg SL, Genome Research, 2004. 14(1):149-59.


6.6. Error correction

AutoEditor

STATUS: production

AutoEditor is a tool developed at TIGR that combines the trace information with the tiling of reads within a contig in order to identify and correct sequencing errors. Unlike other methods for error correction, AutoEditor will only modify a base if supporting evidence is found in the traces, thus greatly reducing the possibility of introducing errors. In our tests AutoEditor corrected up to 90% of the sequencing errors present in the data, leading to a corresponding reduction in the manual labor required during the finishing stages.

Related publications
  • "Automated correction of genome sequence errors." Gajer P, Schatz M, Salzberg SL, Nucleic Acids Research, 2004. 32(2):562-9.


UMD error corrector

STATUS: testing

In conjunction with the UMD overlapper, the UMD error corrector identifies and corrects potential sequencing errors by detecting bases in a multiple alignment of reads that are supported by only one of the reads. The algorithm uses a heuristic rule called the 4-3 rule that examines overlapping sets of 4 reads at 3 positions in order to identify differences corresponding to distinct copies of a repeat.
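The first half of that description, spotting a base supported by only one read, can be sketched as follows in Python. The sketch deliberately does not implement the 4-3 rule, whose details are in the publication below; `suspect_reads` and its `min_depth` threshold are hypothetical.

```python
from collections import Counter


def suspect_reads(column, min_depth=4):
    """column maps read id -> base observed at one alignment column.

    Returns {read_id: majority_base} for reads whose base no other read
    supports. Columns shallower than `min_depth` are left alone.
    """
    if len(column) < min_depth:
        return {}
    counts = Counter(column.values())
    majority, _ = counts.most_common(1)[0]
    return {rid: majority for rid, base in column.items()
            if counts[base] == 1 and base != majority}


# Made-up column: r3 disagrees with three other reads.
fixes = suspect_reads({"r1": "A", "r2": "A", "r3": "G", "r4": "A"})
print(fixes)
```

A difference seen in two or more reads is not flagged here, since it may reflect a distinct repeat copy rather than an error.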

Related publications
  • "A preprocessor for shotgun assembly of large genomes." Roberts M, Hunt BR, Yorke JA, Bolanos R, Delcher A, Journal of Computational Biology, 2004. 11(4):734-752.


KI error corrector

STATUS: testing

Sequencing errors in combination with repeated regions cause major problems in shotgun sequencing, mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. The Karolinska Institutet error corrector implements a new strategy to correct errors in shotgun sequence data using defined nucleotide positions, DNPs. The method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is performed using a novel pattern matching algorithm.

Related publications
  • "Correcting errors in shotgun sequences." Tammi MT, Arner E, Kindlund E, Andersson B, Nucleic Acids Research, 2003. 31(15):4663-72.
  • "TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences." Tammi MT, Arner E, Andersson B, Computer Methods and Programs in Biomedicine, 2003. 70(1):47-59.
  • "Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs." Tammi MT, Arner E, Britton T, Andersson B, Bioinformatics, 2002. 18(3):379-88.


6.7. Validation

[/forensics/ amosvalidate]

STATUS: production

amosvalidate is a validation pipeline for genome assemblies. This pipeline includes a collection of methods for ascertaining the quality of an assembly, and examines multiple measures of assembly quality to pinpoint potential mis-assemblies. Validation techniques include mate-pair validation, repeat analysis, coverage analysis, identification of correlated read polymorphisms, and read alignment breakpoint analysis. Regions of the assembly exhibiting multiple signatures of mis-assembly are flagged as suspicious and output by amosvalidate for further examination.
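Mate-pair validation, the first technique listed, can be pictured with a short Python sketch: compare the distance between mated reads in the assembly with the insert size expected for their library, and flag large deviations. The three-standard-deviation cutoff, the labels, and the function name are assumptions for illustration and are not amosvalidate's actual parameters or output.

```python
def flag_mate_pairs(pairs, mean, sd, max_dev=3.0):
    """pairs: list of (pair_id, left_pos, right_pos) placed on one contig.

    Flags pairs whose implied insert size lies more than `max_dev` standard
    deviations from the library mean.
    """
    flagged = []
    for pair_id, left, right in pairs:
        size = abs(right - left)
        z = (size - mean) / sd
        if z > max_dev:
            flagged.append((pair_id, "stretched"))
        elif z < -max_dev:
            flagged.append((pair_id, "compressed"))
    return flagged


# Made-up placements for a library with mean 3000 and sd 300.
pairs = [("m1", 100, 3150), ("m2", 500, 5200), ("m3", 900, 1800)]
flagged = flag_mate_pairs(pairs, mean=3000, sd=300)
print(flagged)
```

A single flagged pair means little by itself; a cluster of them over one region is the kind of signature that, combined with other evidence, would mark the region as suspicious.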


Benchmark

STATUS: production

TIGR has assembled a set of benchmark assembly genomes. Each benchmark set comes with the sequence of the finished genome, random shotgun reads, closure reads, and ancillary library and insert information. Each sequence is categorized as matching or non-matching, based on its mapping to the finished genome. Sequences that match the finished genome at 90% identity for over 80% of their trimmed length (as aligned by [#mummer MUMmer]) are included in the matching set, while all other reads are grouped into the non-matching set. Ancillary information is presented in Trace Archive XML format. Please refer to the benchmark website for a more detailed description and the actual data.
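The matching rule can be written down directly. In the Python sketch below, treating "at 90% identity" as an inclusive threshold and "over 80%" as a strict one is an interpretation of the wording above, and `is_matching` and its parameter names are hypothetical.

```python
def is_matching(aligned_length, trimmed_length, percent_identity):
    """Apply the 90% identity / over 80% of trimmed length rule to one read."""
    if trimmed_length <= 0:
        return False
    coverage = aligned_length / trimmed_length
    return percent_identity >= 90.0 and coverage > 0.80


# Made-up alignment summaries for three reads of trimmed length 1000.
print(is_matching(850, 1000, 95.2))   # long, high-identity alignment
print(is_matching(700, 1000, 99.0))   # too little of the read aligned
print(is_matching(900, 1000, 85.0))   # identity too low
```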


6.8. [docs/utilities/ Utilities]

[docs/converters ASM File Converters]

STATUS: development - production

The ASM File converters are a collection of utilities for converting sequence and assembly data between the most widely used data formats as well as to and from the AMOS message format. Examples of the data handled by these utilities are: Trace Archive data and ancillary information, .ACE assembly format, TIGR Assembler input and output formats, Celera Assembler message format, and Arachne input and output formats.


[docs/utilities/amoslib.html amoslib]

STATUS: production

amoslib is a Perl module for handling AMOS message files.


MUMmer

STATUS: production

MUMmer is a whole-genome alignment software suite developed and maintained by TIGR. It is extremely useful for mapping sequencing reads and assembly contigs to finished sequence. This can provide valuable information for assembly quality assessment and for comparative genomics.


[hawkeye Hawkeye]

STATUS: production

Hawkeye serves as a visualization tool for AMOS assembly data. It works off of an AMOS data bank and produces an interactive display of the assembly data for quality assessment and assembly investigation. It was developed with the Qt GUI toolkit.

 

7. Download

The AMOS source is freely available for download from the File Release Section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI-certified open source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please check the homepage of the software you wish to download to verify that it is included.

8. Join the consortium

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence by email to:

amos-help(at)lists(dot)sourceforge(dot)net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users(at)lists(dot)sourceforge(dot)net

9. Bug reports and support

For AMOS bug reports or support requests, please browse our SourceForge project page or email us at:

amos-help(at)lists(dot)sourceforge(dot)net