Comparative assemblies


Latest revision as of 13:51, 27 February 2008

AMOScmp pipeline

Short reads (Solexa)

Modified parameters

 Modified AMOScmp pipeline: ~dpuiu/bin/AMOScmp
 
 Alignment:
 * Lower nucmer alignment/cluster sizes: defaults are 20/65; drop to 16/16 (Solexa_read_len/2)
   Can go as low as 14/14; 12/12 gives too many spurious alignments:
   -D MINMATCH=20 -D MINCLUSTER=20 
 * Run nucmer multiple times:
    all reads:       given alignment/cluster size
    unaligned reads: smaller alignment/cluster size
    unaligned reads: smaller alignment/cluster size
    ...
 * Use promer instead of nucmer: alignment/cluster sizes of 6/11 (in AA)
 
 Layout:
 * Drop casm-layout minimum overlap (min ovl) from 10 to 5: 
   -D MINOVL=5 
 * Drop casm-layout majority from 70 to 50: 
   -D MAJORITY=50 
 
 Consensus:
 * Drop make-consensus alignment wiggle from 15 to 2
   -D ALIGNWIGGLE=2
 * Use make-consensus -x option ???
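
The "run nucmer multiple times" idea above can be sketched as a loop over progressively smaller alignment/cluster sizes, where each pass sees only the reads that are still unaligned. This is a minimal sketch, not the modified AMOScmp pipeline itself: the function name multi_pass_align and the align callback (standing in for one aligner run) are hypothetical, and the schedule in the demo simply reuses the sizes mentioned above.

```python
def multi_pass_align(reads, schedule, align):
    """Align all reads with the first setting, then retry only the
    still-unaligned reads with each later (smaller) setting.

    reads    : iterable of read ids
    schedule : list of (minmatch, mincluster) pairs, largest first
    align    : hypothetical callback (reads, minmatch, mincluster) -> set of
               aligned read ids; in practice it would wrap one aligner run
    """
    remaining = set(reads)
    aligned_at = {}  # read id -> setting at which it first aligned
    for minmatch, mincluster in schedule:
        if not remaining:
            break
        hits = align(remaining, minmatch, mincluster) & remaining
        for read_id in hits:
            aligned_at[read_id] = (minmatch, mincluster)
        remaining -= hits
    return aligned_at, remaining


if __name__ == "__main__":
    # Toy stand-in for an aligner: each read "aligns" once the minimum
    # match size is no larger than its longest exact match (made-up values).
    longest_match = {"r1": 20, "r2": 16, "r3": 14, "r4": 10}

    def toy_align(reads, minmatch, mincluster):
        return {r for r in reads if longest_match[r] >= minmatch}

    aligned, unaligned = multi_pass_align(
        longest_match, [(20, 20), (16, 16), (14, 14)], toy_align
    )
    print(aligned)
    print(unaligned)
```

Stopping the schedule at 14/14 follows the note above that 12/12 gives too many spurious alignments.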

Read trimming

 * Quality trimming: too stringent
 * Align to reference using nucmer (small -c <n> -l <n>); trim reads to alignment coordinates
 * Identify zero-coverage (0 cvg) regions; don't trim reads adjacent to these regions
 * Update the reads' clear ranges (clr); run AMOScmp
 
  Example:
     $ show-coords -c -l -o -r -H  $(PREFIX).delta | $(SCRIPTDIR)/getNucmerCoverage.pl -M 0  > $(PREFIX).0cvg
     $ delta2clr.pl -zero_cvg $(PREFIX).0cvg -read_len $(READLEN) < $(PREFIX).delta > $(PREFIX).clr
     $ awk '{print $1}' $(PREFIX).clr > $(PREFIX).seqs
     $ updateClrRanges -i $(PREFIX).bnk $(PREFIX).clr
     $ dumpreads -I $(PREFIX).seqs $(BANK) > $(PREFIX).seq
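
The trimming rule above (trim each read to its alignment coordinates, but leave reads next to zero-coverage regions untrimmed) can be illustrated as follows. This is a sketch under stated assumptions, not a description of what delta2clr.pl does internally: the Hit record, the function names, the choice of "within one read length" as the meaning of "adjacent", the one-hit-per-read simplification, and the 0-based half-open clear range are all assumptions made for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    ref_start: int   # 1-based, inclusive, on the reference
    ref_end: int
    read_start: int  # 1-based, inclusive, on the read
    read_end: int


def near_zero_cvg(hit, zero_cvg, margin):
    """True if the hit lies within `margin` bases of a zero-coverage interval."""
    lo, hi = sorted((hit.ref_start, hit.ref_end))
    return any(lo - margin <= z_end and hi + margin >= z_start
               for z_start, z_end in zero_cvg)


def clear_ranges(hits, zero_cvg, read_len):
    """Map read id -> (begin, end) clear range, 0-based half-open (assumed).

    Reads near a zero-coverage region keep their full length so that they
    can still extend into the gap; all others are trimmed to the aligned part.
    Assumes one hit per read; a later hit overwrites an earlier one.
    """
    clr = {}
    for h in hits:
        if near_zero_cvg(h, zero_cvg, read_len):
            clr[h.read_id] = (0, read_len)
        else:
            lo, hi = sorted((h.read_start, h.read_end))
            clr[h.read_id] = (lo - 1, hi)
    return clr


if __name__ == "__main__":
    hits = [Hit("a", 100, 127, 3, 30), Hit("b", 480, 505, 1, 26)]
    print(clear_ranges(hits, zero_cvg=[(510, 600)], read_len=32))
```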

Contig merging

 * Identify adjacent contig end overlaps
 * Overlaps might be too short to be identified by alignment programs
 * Programs that do alignment & sequence merging:
     * EMBOSS merger: does not handle long sequences
     * fastaMerge.pl
       Input: multi-FASTA file; contigs must be ordered and oriented; only checks adjacent contig ends
 
       Example: 
         $ fastaMerge.pl -min 5 -max 30 -id 0.8 $(PREFIX).fasta -debug 1 > $(PREFIX).merge.fasta
 
         ctg1_id ctg2_id ovl_len ovl_id
         20      21      10      1
         34      35      18      1
         36      37      9       0.88
         ...
     2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2 assembly: # contigs 2053 -> 1927
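
To make the adjacent-end merging concrete, here is a small sketch that checks only the end of one contig against the start of the next, in the given order and orientation. It is not fastaMerge.pl's actual algorithm: it assumes an ungapped suffix/prefix comparison and picks the longest overlap that passes the identity threshold. The function names are hypothetical; the defaults mirror the -min 5 -max 30 -id 0.8 values in the example above.

```python
def end_overlap(left, right, min_len, max_len, min_id):
    """Longest ungapped overlap between the end of `left` and the start of
    `right` with identity >= min_id; returns (length, identity) or None."""
    longest = min(max_len, len(left), len(right))
    for n in range(longest, min_len - 1, -1):
        matches = sum(a == b for a, b in zip(left[-n:], right[:n]))
        if matches / n >= min_id:
            return n, matches / n
    return None


def merge_adjacent(contigs, min_len=5, max_len=30, min_id=0.8):
    """contigs: ordered, oriented list of (id, sequence) pairs.
    Joins each contig to its predecessor when their ends overlap; the
    overlapping bases are taken from the left contig."""
    merged = []
    for cid, seq in contigs:
        if merged:
            prev_id, prev_seq = merged[-1]
            ovl = end_overlap(prev_seq, seq, min_len, max_len, min_id)
            if ovl:
                merged[-1] = (prev_id + "+" + cid, prev_seq + seq[ovl[0]:])
                continue
        merged.append((cid, seq))
    return merged


if __name__ == "__main__":
    contigs = [
        ("c1", "AAAACCCCGGGGTTTT"),
        ("c2", "GGGGTTTTACACAC"),
        ("c3", "TGTGTGTGTG"),
    ]
    for cid, seq in merge_adjacent(contigs):
        print(cid, seq)
```

With a minimum length of 5 and identity of 0.8, a 4-out-of-5 match already qualifies, so very short overlaps found this way are worth checking by hand.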

Multiple references

 * Find the most similar genome: the reference to which the largest number of reads align
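
One simple way to apply this criterion is to count, for each candidate reference, how many distinct reads have at least one alignment to it, and pick the reference with the highest count. The sketch below assumes the alignments have already been reduced to (read id, reference id) pairs; the function name and the reference names in the demo are hypothetical.

```python
from collections import defaultdict


def best_reference(alignments):
    """alignments: iterable of (read_id, reference_id) pairs.
    Returns (reference_id, number_of_distinct_reads) for the reference hit
    by the most distinct reads; ties go to the reference seen first.
    Raises ValueError if there are no alignments."""
    reads_per_ref = defaultdict(set)
    for read_id, ref_id in alignments:
        reads_per_ref[ref_id].add(read_id)
    counts = ((ref, len(reads)) for ref, reads in reads_per_ref.items())
    return max(counts, key=lambda pair: pair[1])


if __name__ == "__main__":
    pairs = [
        ("r1", "refA"), ("r2", "refA"), ("r2", "refA"),  # repeat hit counts once
        ("r1", "refB"), ("r2", "refB"), ("r3", "refB"),
    ]
    print(best_reference(pairs))
```

Counting distinct reads rather than alignments keeps a read that hits a repeat several times from inflating one reference's score.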