Cbcb:Pop-Lab:How do I run the new Bambus

Revision as of 21:24, 20 January 2010 by Tgibbons (talk | contribs)

The new Bambus (aka Bambus 2) consists of four executables that run in order on a supplied AMOS bank. Important: Bambus 2 is still an early beta, so it is advisable to back up the bnk directory before using it. Program documentation is also available on the command line by typing <command> -h.
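The backup advice above can be sketched as a small shell helper. This is only an illustration: the function name backup_bank and the timestamped naming scheme are assumptions, not part of AMOS or Bambus 2, and it relies on the bank being an ordinary directory that can be copied with cp.

```shell
#!/bin/sh
# Hypothetical helper (not part of AMOS): copy a bank directory to a
# timestamped backup before any Bambus 2 program modifies it.
# Prints the path of the backup on success.
backup_bank() {
    bank=${1%/}    # strip a trailing slash, if any
    if [ ! -d "$bank" ]; then
        echo "no such bank directory: $bank" >&2
        return 1
    fi
    backup="$bank.backup.$(date +%Y%m%d%H%M%S)"
    cp -R "$bank" "$backup" && echo "$backup"
}
```

For example, backup_bank data.bnk would create a copy named data.bnk.backup.<timestamp> next to the original bank.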

The first program is clk. It finds all mated reads within contigs and converts the mate distances to be relative to contigs rather than reads. The second program is Bundler. Bundler joins the contig link messages generated by clk, where possible, to create consensus links between contigs. It can output multiple contig links for a pair of contigs. The third program is MarkRepeats. It identifies repeats using two methods: the first uses shortest paths, and the second examines the a-stat on a component-by-component basis. The final program is OrientContigs. OrientContigs uses the contig links to orient and order the contigs into scaffolds, and also performs some simplification by joining contigs. Each of the programs is covered in more detail below. To get more help on running any program, use -h.

1. clk

- Modifies the bank to create contig edges.
- Example: clk -b[ank] data.bnk

2. Bundler

- Bundle together contig edges to create contig links.
- Example: Bundler -b[ank] data.bnk [-t[ype] comma separated list of edge types]
  - The -t[ype] option allows only certain contig edges to be processed. ALL means use any type. The types are defined in src/AMOS/Link_AMOS.hh

3. MarkRepeats

- Run shortest paths and connected component repeat detection algorithms. This requires AMOS to be built with the Boost graph library available to it.
- Example: MarkRepeats -b[ank] data.bnk [-redundancy X -aggressive]
  - The -redundancy option ignores links containing fewer than X edges.
  - The -aggressive option marks contigs as repetitive based on a global a-stat calculation rather than a connected-component one.

4. OrientContigs

- Orient and order the contigs based on the links. This program uses a greedy algorithm to orient and order contigs relative to an arbitrary start contig. Edges that contradict the current scaffold are marked bad
  and ignored for the rest of the analysis. They are still output but don't affect any subsequent calculations.
  
  The output includes a dot-formatted file, an NCBI AGP scaffold formatted file, and XML files formatted to be compatible with Bambus 1 tools.
  
  Note that this program does not currently linearize the scaffolds but maintains them as a graph. It also recursively simplifies common patterns in the graph. Currently the patterns are
  bubbles and straight lines. For example, the straight line A->B->C will be simplified to just A. The bubble

      A->B->D
       \>C/>

  (A links to both B and C, which both link to D) will become A as well. This simplification is repeated until the graph is stable. Note that the
  simplification updates the bank in a destructive way by removing contigs and replacing them (as well as their edges) with updated contigs.
  The marking of the edges as BAD or GOOD also destructively updates the bank. Therefore it is necessary to make a backup of the bank before running this program.
- Example: OrientContigs -b[ank] <bank_name> -prefix asm [-all -noreduce -redundancy X -repeats Y -aggressive]
  - The -prefix option specifies the prefix to use for all output files.
  - The -all option specifies whether disconnected contigs should be output as their own scaffolds or skipped.
  - The -noreduce option turns off the graph simplification described above.
  - The -redundancy option ignores links containing fewer than X edges.
  - The -repeats option reads a file of repeats (Y) that lists one contig ID per line. Repeat contigs and their links are not used for ordering or orienting any other data in the graph. Repeats are currently not resolved and are output as single-contig scaffolds. Known repeats may be specified here, or the repeats identified by MarkRepeats (above) may be used.
  - The -aggressive option will not mark edges that move a contig more than 3 standard deviations away as bad and will instead try to reconcile the positions.
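
The four steps above can be chained in a small wrapper script. The following is a minimal sketch, assuming the four programs are on the PATH and that the short -b form of -b[ank] is accepted. The function names run_step and run_bambus2 and the DRY_RUN variable are hypothetical helpers introduced here for illustration. Optional flags are left out, and the bank should be backed up before running it, since several steps modify the bank in place.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of AMOS): run the four Bambus 2 programs
# in order on one bank, stopping at the first step that fails.

# Run a command, or only print it when DRY_RUN=1 is set.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: run_bambus2 BANK PREFIX
run_bambus2() {
    bank=$1
    prefix=$2
    run_step clk -b "$bank" &&
    run_step Bundler -b "$bank" &&
    run_step MarkRepeats -b "$bank" &&
    run_step OrientContigs -b "$bank" -prefix "$prefix"
}
```

Setting DRY_RUN=1 before calling run_bambus2 data.bnk asm prints the four commands without executing them, which is a cheap way to check the order and arguments before touching the bank.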