Comparative assemblies

From Cbcb

AMOScmp pipeline

Short reads (Solexa)

Modified parameters

 Modified AMOScmp pipeline: ~dpuiu/bin/AMOScmp
 
 Alignment:
 * Lower the nucmer alignment/cluster sizes: the defaults are 20/65; drop to 16/16 (Solexa_read_len/2)
   Can go as low as 14/14; 12/12 gives too many spurious alignments:
   -D MINMATCH=20 -D MINCLUSTER=20 
 * Run nucmer multiple times:
    all reads:       given alignment/cluster size
    unaligned reads: smaller alignment/cluster size
    unaligned reads: even smaller alignment/cluster size
    ...
 * Use promer instead of nucmer: alignment/cluster sizes of 6/11 (in AA)
 
 Layout:
 * Drop casm-layout min ovl from 10 to 5: 
   -D MINOVL=5 
 * Drop casm-layout majority from 70 to 50: 
   -D MAJORITY=50 
 
 Consensus:
 * Drop make-consensus alignment wiggle from 15 to 2:
   -D ALIGNWIGGLE=2
 * Use make-consensus -x option ???
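
The repeated-alignment idea above (all reads first, then only the still-unaligned reads at smaller sizes) can be sketched in Python as follows. This is a minimal illustration, not the modified AMOScmp script: `iterative_align` and the `align` callback are hypothetical names, and in practice the callback would presumably wrap a nucmer run and parse its output. The toy aligner exists only to make the sketch runnable.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Sizes = Tuple[int, int]  # (min match length, min cluster length)


def iterative_align(
    reads: Iterable[str],
    sizes: List[Sizes],
    align: Callable[[Set[str], int, int], Set[str]],
) -> Dict[str, Sizes]:
    """Run `align` once per size pair, passing only the reads that no
    earlier pass aligned. Returns read id -> sizes of the aligning pass."""
    unaligned = set(reads)
    aligned_at: Dict[str, Sizes] = {}
    for minmatch, mincluster in sizes:
        if not unaligned:
            break
        # Hypothetical callback: returns the ids of the reads it aligned.
        hits = align(unaligned, minmatch, mincluster) & unaligned
        for read_id in hits:
            aligned_at[read_id] = (minmatch, mincluster)
        unaligned -= hits
    return aligned_at


# Toy stand-in for the aligner (an assumption, not nucmer): a read "aligns"
# if its longest exact match to the reference is at least minmatch.
longest_match = {"r1": 30, "r2": 17, "r3": 14, "r4": 9}


def toy_align(read_ids: Set[str], minmatch: int, mincluster: int) -> Set[str]:
    return {r for r in read_ids if longest_match[r] >= minmatch}


result = iterative_align(longest_match, [(20, 20), (16, 16), (14, 14)], toy_align)
```

Reads that never align (here `r4`) are simply absent from the result, so they can be reported or passed to a different aligner such as promer.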

Read trimming

 * Quality trimming: too stringent
 * Align to reference using nucmer (small -c <n> -l <n>); trim reads to alignment coordinates
 * Identify zero-coverage (0 cvg) regions; don't trim reads adjacent to these regions
 * Update the read clear ranges (clr); run AMOScmp
 
  Example:
     $ show-coords -c -l -o -r -H ${PREFIX}.delta | ${SCRIPTDIR}/getNucmerCoverage.pl -M 0 > ${PREFIX}.0cvg
     $ delta2clr.pl -zero_cvg ${PREFIX}.0cvg -read_len ${READLEN} < ${PREFIX}.delta > ${PREFIX}.clr
     $ awk '{print $1}' ${PREFIX}.clr > ${PREFIX}.seqs
     $ updateClrRanges -i ${PREFIX}.bnk ${PREFIX}.clr
     $ dumpreads -I ${PREFIX}.seqs ${BANK} > ${PREFIX}.seq
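
The trimming logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the behaviour of getNucmerCoverage.pl or delta2clr.pl: the function names are hypothetical, coordinates are 1-based and inclusive, all alignments are on the forward strand, and a read that abuts a zero-coverage region is left completely untrimmed.

```python
from typing import Dict, List, Tuple

# (read_id, ref_start, ref_end, read_start, read_end); 1-based, inclusive
Alignment = Tuple[str, int, int, int, int]
Interval = Tuple[int, int]


def zero_coverage(ref_len: int, alns: List[Alignment]) -> List[Interval]:
    """Return the reference intervals that no alignment covers."""
    gaps: List[Interval] = []
    next_uncovered = 1  # first reference position not yet known to be covered
    for _, start, end, _, _ in sorted(alns, key=lambda a: a[1]):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= ref_len:
        gaps.append((next_uncovered, ref_len))
    return gaps


def clear_ranges(
    alns: List[Alignment], gaps: List[Interval], read_len: int
) -> Dict[str, Interval]:
    """Trim each read to its aligned span, unless the alignment touches a
    zero-coverage region, in which case the full read is kept."""
    gap_starts = {start for start, _ in gaps}
    gap_ends = {end for _, end in gaps}
    clr: Dict[str, Interval] = {}
    for read_id, ref_start, ref_end, read_start, read_end in alns:
        touches_gap = (ref_start - 1) in gap_ends or (ref_end + 1) in gap_starts
        clr[read_id] = (1, read_len) if touches_gap else (read_start, read_end)
    return clr


# Made-up alignments of 32 bp reads against a 100 bp reference.
alns = [
    ("a", 1, 32, 1, 32),
    ("b", 20, 49, 3, 32),
    ("c", 40, 60, 1, 21),
    ("d", 71, 100, 2, 31),
]
gaps = zero_coverage(100, alns)
clr = clear_ranges(alns, gaps, read_len=32)
```

Reads `c` and `d` flank the uncovered stretch 61-70, so they keep their full length and can still extend into the gap during assembly; read `b` is trimmed to its aligned span.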

Contig merging

 * Identify adjacent contig end overlaps
 * Overlaps might be too short to be identified by alignment programs
 * Programs that do alignment & sequence merging:
     * EMBOSS merger: does not handle long sequences
     * fastaMerge.pl
       Input: multiFasta file; contigs must be ordered and oriented; only checks adjacent contig ends
 
       Example: 
         $ fastaMerge.pl -min 5 -max 30 -id 0.8 ${PREFIX}.fasta -debug 1 > ${PREFIX}.merge.fasta
 
         ctg1_id ctg2_id ovl_len ovl_id
         20      21      10      1
         34      35      18      1
         36      37      9       0.88
         ...
     Result on the 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2 assembly: number of contigs reduced from 2053 to 1927
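
The adjacent-end check described above can be sketched in Python as follows. This is not the fastaMerge.pl implementation: the function names are hypothetical, the overlap is compared without gaps, and the longest overlap that meets the identity cutoff is taken. The parameters mirror the `-min`, `-max` and `-id` values from the example; `min_len` is assumed to be at least 1.

```python
from typing import List, Optional, Tuple


def end_overlap(
    left: str, right: str, min_len: int, max_len: int, min_id: float
) -> Optional[Tuple[int, float]]:
    """Longest ungapped overlap between the end of `left` and the start of
    `right` whose identity is at least `min_id`, as (length, identity)."""
    longest = min(max_len, len(left), len(right))
    for n in range(longest, min_len - 1, -1):
        matches = sum(a == b for a, b in zip(left[-n:], right[:n]))
        identity = matches / n
        if identity >= min_id:
            return n, identity
    return None


def merge_adjacent(
    contigs: List[str], min_len: int = 5, max_len: int = 30, min_id: float = 0.8
) -> List[str]:
    """Merge neighbouring contigs (already ordered and oriented) whenever
    their adjacent ends overlap."""
    if not contigs:
        return []
    merged = [contigs[0]]
    for ctg in contigs[1:]:
        hit = end_overlap(merged[-1], ctg, min_len, max_len, min_id)
        if hit:
            # Keep the left contig's copy of the overlapping bases.
            merged[-1] += ctg[hit[0]:]
        else:
            merged.append(ctg)
    return merged


# Made-up contigs: the first two share the 6 bp end "ACGTTC".
contigs = ["GGGGGGGGGGACGTTC", "ACGTTCAAAAAAAAAA", "CCCCCCCCCCTTTTTT"]
overlap = end_overlap(contigs[0], contigs[1], 5, 30, 0.8)
merged = merge_adjacent(contigs)
```

With overlaps this short and a relaxed identity cutoff, chance matches are likely, which is presumably why only adjacent ends of ordered and oriented contigs are checked.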

Multiple references

 * Find the most similar genome: the one to which the largest number of reads align
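
A minimal Python sketch of this selection, assuming the alignments have already been reduced to (read id, reference id) pairs; the function name and the reference ids `refA`/`refB` are hypothetical. Each read is counted once per reference, however many alignments it has there.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def most_similar_reference(
    hits: Iterable[Tuple[str, str]]
) -> Tuple[str, Dict[str, int]]:
    """Pick the reference with the most distinct aligned reads.

    `hits` holds one (read_id, reference_id) pair per alignment."""
    reads_per_ref: Dict[str, Set[str]] = defaultdict(set)
    for read_id, ref_id in hits:
        reads_per_ref[ref_id].add(read_id)
    if not reads_per_ref:
        raise ValueError("no alignments given")
    counts = {ref: len(reads) for ref, reads in reads_per_ref.items()}
    best = max(counts, key=counts.get)
    return best, counts


# Made-up alignment hits; r1 aligns twice to refA but is counted once.
hits = [
    ("r1", "refA"),
    ("r1", "refA"),
    ("r2", "refA"),
    ("r1", "refB"),
    ("r3", "refA"),
]
best, counts = most_similar_reference(hits)
```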