Bumblebee: Difference between revisions

From Cbcb
Jump to navigation Jump to search
 
(230 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Data =
= Data =


* ~ 500B genome
* ~ 500Mbp genome (diploid); 250Mbp halpoid
 
* Complete mitochondrion genomes:
* Complete mitochondrion genomes:
   NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
   NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
   NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
   NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
   only 88% identity; no rearrangements, only snps, short indels
   only 88% identity; no rearrangements, only snps, short indels
* NCBI TaxId
  http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=132113


== Traces ==
== Traces ==


* 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
* 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
 
<pre style="background:yellow">
   Lane  Insert           ReadLen    #Mates         #Reads         Coverage  Orientation    Comments
   Lane  Insert ReadLen    #Mates         #Reads       Coverage  Orientation    Comments                         #RepeatsReads
   1      3K(2..6,avg 4K124        34,944,099                    14X        outie          865,687(1.2%) reads have qual==0
   1      4K     124        34,944,099                    14X        outie          865,687(1.2%) reads qual==0     30,576,297 (44%)
   2      8K(7..9,avg 8K)  124        32,540,640                   13X       outie
   2      8K     124        32,540,640                    13X        outie                                            26,972,424 (41%)
  3      400    124        34,745,750                    .                                                           8,319,335 (12%)
  5      400    124        34,601,239                    .                                                           8,398,759 (12%)
  6      400    124        34,553,857                    .
   7      400    124        34,682,612                   .
  8      400    124       12,975,839                    .
   
   
   3     400              124        34,745,750                   .                           gDNA ; originally thought to be 500bp insert instead of 400
   9      3K     124        29,615,036                   .         outie            new3kb ; received 05/18/2010
  5      400              124        34,601,239                    .
  6      400              124        34,553,857                    .
  7      400              124        34,682,612                    .
  8      400              124        12,975,839                    .
    
    
   3-8                                 151,559,297     303,118,594   75X
   1-8                       219,044,036     438,088,072   108X
   1-8                                219,044,036     438,088,072  108X
   1-9                      248,659,072     497,318,144


  3-8    400    124        151,559,297    303,118,594  75X        innie
  1,9    3-4K    124        64,559,135      129,118,270  32X        outie
  2      8K      124        32,540,640      65,081,280    13X        outie
</pre>
   Two new 3kb libraries to come; ~ 70% are true pair ends
   Two new 3kb libraries to come; ~ 70% are true pair ends


* Adaptors
== Adaptors ==
   >circularizarion
   >circ
   CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
   CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
   >circularizarion.revcomp
   >circ.revcomp
   TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG  
   TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG  
   >5
   >5                              
   GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
   GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
   >3
   >3
   CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
   CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[AGATCTCGGTGGTCGCCGTATCATTA][A+]
 
                0        1        2        3        4
                123456789012345678901234567890123456789012
  circ          CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
                    .............  ..  .............
  circ.revcomp  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
 
  CGGCATTCCTGCTGAACC revcmp GGTTCAGCAGGAATGCCG
             
                0        1        2        3        4
                123456789012345678901234567890123456789012
                            GGTTCAGCAGGAATGCCG
  5            GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 
                CGGCATTCCTGCTGAACC
  3            CGGCATTCCTGCTGAACCGAGATCGGAAGAG
 
 
  #18mers (5')
  CGTAATAACTTCGTATAG
  TCGTATAACTTCGTATAA
  GATCGGAAGAGCGGTTCA
  CGGCATTCCTGCTGAACC
 
  #18mers (5',3')
  CGTAATAACTTCGTATAG
  TTATACGAAGTTATACGA
  TCGTATAACTTCGTATAA
  CTATACGAAGTTATTACG 
  GATCGGAAGAGCGGTTCA
  GCAGGAATGCCGAGACCG
  CGGCATTCCTGCTGAACC
  CGTGTAGGGAAAGAGTGT
 
  #31 mers (5') # jellyfish uses 31mers
  CGTAATAACTTCGTATAGCATACATTATACG
  TCGTATAACTTCGTATAATGTATGCTATACG
  GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
  CGGCATTCCTGCTGAACCGAGATCGGAAGAG
 
== Frequent kmers ==
                                                        s_1  s_2    s_3
    1  CGTAATAACTTCGTATAGCATA|CGTATAACTTCGTATAATGTAT    15.2  4.59  2.829    # not assembled  circAdapter
    2  AATGTATGCTATACGAAGTTAT|ATAACTTCGTATAGCATACATT    5.56  19.02  0.525    # not assembled  circAdapter
    3  CGTTATAACGTATTACGTTATA|TGTATGCTATACGAAGTTATTA    5.2  4.56  .       
    4  TACATTATACGAAGTTATACGA|TATATGGTATGACGTTATAACG    3.39  1.87  0.365   
    5  ATATAACGTAATACGTTATAAC|TATAACGTATTACGTTATATGG    2.86  2.71  1.595   
    6  ACGTTATAACGTCATCCCATAT|ATACCATATAACGTAATACGTT    2.11  2.62  .
    7  TACGTTATAACGTCATACCATA|TCCATTTTACACTGCAAATGTA    1.98  2.04  1.749 
    8  CTTAAACGTCGAAATATCTCCA|TTACGTTATATGGGATGACGTT    0.87  1.16  1.136 
    9  GAGCGCTTAAACGTCGAAATAT|ATGAAATCTAGCCGTCATTTCC    0.73  1.02  1.442 
    10  ACGTAATACGTTATAACGTCAT|ATTTCCTTCCCGTTGGAGATAT    0.67  1.1    1.477 
    11  CTGCTGAACCGCTCTTCCGATC|GATCGGAAGAGCGGTTCAGCAG    0.61  3.92  0.045    # not assembled  5Adapter
    12  ACGGTCACAGCTTGTCTGTAAG|TGTAAAATGGAGTTCAACGAGC    0.31  1.73  0.798 
    13  ACGGTGGTGCATTACATTTGCA|GGCGAATGGCGCCTGATGCGGT    0.28  0.87  0.949    # not assembled
    14  CAGAAAAGCATCTTACGGATGG|TACATACCTCGCTCTGCTAATC    0.04  2.47  .        # not assembled
 


* Location
* Location
   /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
   /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
   /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 reads
  /fs/szattic-asmg4/Bees/Bombus_impatiens/new3kb/s_1_?_sequence.txt    : new3kb lib received 05/18/2010
   /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 new3kbr test reads


== Tasks to figure out ==
== Tasks to figure out ==
Line 116: Line 186:


* Data processing pipeline
* Data processing pipeline
   s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFrg.amos
   s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFastq.amos
   s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos
   s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos


Line 128: Line 198:
== D Kelly's trimming ==
== D Kelly's trimming ==


* Reads
* Method: http://github.com/davek44/error_correction
   .                        elem      min  q1  q2  q3  max  mean    n50  sum        
* Read lens stats
   s_1_1_sequence.cor.txt    5436814   30  71  90  115  124  89.70  100  487673256  
   lib                      reads        min  q1  q2  q3  max  mean    n50  sum           #repeatReads
   s_1_2_sequence.cor.txt    5864225   30  77  92  110  124  93.27  98  546956864  
   s_1_1_sequence.cor.txt    5,436,814   30  71  90  115  124  89.70  100  487673256     1,071,779 (20%)
   s_2_1_sequence.cor.txt    1053858   30  80  96  118  124  97.16  102  102395785  
   s_1_2_sequence.cor.txt    5,864,225   30  77  92  110  124  93.27  98  546956864     1,208,767 (20%)
   s_2_2_sequence.cor.txt    1041846   30  78  93  110  124  93.51  99  97426679    
   s_2_1_sequence.cor.txt    1,053,858   30  80  96  118  124  97.16  102  102395785     222,290  (21%)
   s_3_1_sequence.cor.txt    33668302   30  105  124  124  124  111.04  124  3738530761  
   s_2_2_sequence.cor.txt    1,041,846   30  78  93  110  124  93.51  99  97426679     208,029  (20%)
   s_3_2_sequence.cor.txt    32554322   30  82  108  124  124  100.42  120  3269039697   
   s_3_1_sequence.cor.txt    33,668,302   30  105  124  124  124  111.04  124  3738530761             (12%)
   s_5_1_sequence.cor.txt    33579744   30  109  124  124  124  111.65  124  3749192427   
   s_3_2_sequence.cor.txt    32,554,322   30  82  108  124  124  100.42  120  3269039697   
   s_5_2_sequence.cor.txt    32465412   30  84  112  124  124  102.13  124  3315553171   
   s_5_1_sequence.cor.txt    33,579,744   30  109  124  124  124  111.65  124  3749192427   
   s_6_1_sequence.cor.txt    33535725   30  109  124  124  124  111.60  124  3742602285   
   s_5_2_sequence.cor.txt    32,465,412   30  84  112  124  124  102.13  124  3315553171   
   s_6_2_sequence.cor.txt    32390877   30  83  109  124  124  100.92  123  3268875471   
   s_6_1_sequence.cor.txt    33,535,725   30  109  124  124  124  111.60  124  3742602285   
   s_7_1_sequence.cor.txt    33674235   30  112  124  124  124  112.19  124  3777917379   
   s_6_2_sequence.cor.txt    32,390,877   30  83  109  124  124  100.92  123  3268875471   
   s_7_2_sequence.cor.txt    32518568   30  84  110  124  124  101.60  124  3303807321   
   s_7_1_sequence.cor.txt    33,674,235   30  112  124  124  124  112.19  124  3777917379   
   s_8_1_sequence.cor.txt    11777821   30  113  124  124  124  112.30  124  1322658923   
   s_7_2_sequence.cor.txt    32,518,568   30  84  110  124  124  101.60  124  3303807321   
   s_8_2_sequence.cor.txt    12176364   30  84  110  124  124  101.51  124  1236034348  
   s_8_1_sequence.cor.txt    11,777,821   30  113  124  124  124  112.30  124  1322658923   
   total                    301,738,113  . .    .    .    .    .      .    31958664367 
   s_8_2_sequence.cor.txt    12,176,364   30  84  110  124  124  101.51  124  1236034348  
  total (64bp+)            275,763,594 .    .    .    .    .      .    30693543885
    
   s_9_1_sequence.cor.txt   21,273,603  30  87  124 124  124  104.52  124 2223545139    #new3kb added 05/18/2010
* Mates
   s_9_2_sequence.cor.txt   20,769,027   30   86   124  124  124  103.68  124  2153436097    #new3kb added 05/18/2010
  s_1.mates                5320163   
  s_2.mates                1034086    
   s_3.mates                31857304    
   s_5.mates                31820553 
  s_6.mates                31748713 
  s_7.mates                31891468 
  s_8.mates                11168258 
  total                    144,840,545


* Files 
  s_1,2                    13,396,743 
   /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
   s_1..8                    301,738,113  .  .    .    .    .    .       .   31958664367 
   /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg       
   s_2..8 (64bp+)            275,763,594  .  .    .    .    .    .       .    30693543885 


  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_?_[12]_sequence.cor.txt      # fastq ; s_[12] m


= Assembly =
* Read/Mate counts
  lib                      reads        mates            range
  s_1                      11301039    5320163   
  s_2                      2095704      1034086          11301040..13396743     
  s_3                      66222624    31857304 
  s_5                      66045156    31820553       
  s_6                      65926602    31748713 
  s_7                      66192803    31891468 
  s_8                      23954185    11168258 
  s_9                      42042630    13245740
  s_1,2                    13,396,743  6,354,249
  s_1-8                    301,738,113  144,840,545
  s_1-9                    343,780,743  158,086,285


== CA.bog s_1 ==
* Files   
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)
  /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg


* Input  :  31*2M s_1 reads trimmed to 100bp & formatted by fastqToCA
= Assembly =


* Spec file
* SOAP version 1.04:  /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/
  unitigger =                      bog
* CA version  6.0     /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/
  ovlOverlapper =                  mer
  obtMerThreshold =                400
  ovlMerThreshold =                120
  doOverlapTrimming =              0


* Unitigger : max utg len=852bp
== SOAPdenovo s_1..8 CBCB-corr (Illinois) ==


* Consensus after unitigger : 3 out of 129 jobs failed
* From: Kim Walden 04/30/2010
    
* Input:  s_1,2,3,5,6,7,8 reads
* Location
* Error correction: yes; CBCB method
   /scratch1/Bombus_impatiens/Assembly/31M
* Kim's e-mail
  ''I ran it in a stepwise fashion, not adding any of the available options and only changing K=31.''
   ''I then ran their more recent GapCloser algorithm, not the version that is part of the package when using option -F in "scaff".  ''
  ''I used their "all" assembly once, and I noticed that it automatically uses the -F option during scaffolding''
   ''Also, I used your data from the ftp site without checking for redundancy in the 3kb and 8kb lanes; did Daniela do the same?''


== CA.utg s_1,2 filtered ==
* Summary
                          elements min_len  max_len  mean    n50        sum
  contigs(>100)          103858  100      85850    2227    7541      231387390      # prior to scaffolding;
  scaffolds&contigs(>100) 12663    100      4648588  19282  1040109    244171008      # post-scaffolding and gap-filling
  scaffolds(span)        3736    252      4648588  64874  1047164    242369898      # same prior data minus contigs that didn't assemble into scaffolds.
  scaffolds(len; no N's)  3736    172      4585806  64034  1027043    239233801      # computed by CBCB
     
* Locations
  ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/Bimp_Salz_corr_050310_NewGapCloser.fasta
  # backup
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/


* Input: 7*2M s_1 & s_2 reads formatted by convert-fasta-to-v2.pl
  # online
* Filter:  
  http://www.life.illinois.edu/robertson/private/
** only one read from the mate contains the adaptor in the 64..124bp range
  Login: robertson
** both reads from the mate have good quality in the 0..64bp range
  password: dna


* Spec file
== SOAPdenovo s_1..8 CBCB-corr (CBCB) ==
  unitigger            = utg 
  ovlOverlapper        = ovl 
  obtMerThreshold      = 400
  ovlMerThreshold      = 200
  doOverlapTrimming    =  0    => reads < 64 bp atre trimmed by the gatekeeper; no need to trim


* ovlStore stats
* Stats:
                          elem    min      max      mean    n50        sum
  contigs(all)            9943851  31      85850    60      45        594773555
  contigs(>100)          106213  100      85850    2177    7084      231246190
  scaffolds+contigs(>100) 12473    100      5933710  20428  1086614    254800874    # based on scafSeq file 
  scaffolds(span)        3664    236      5933710  69090  1115496    253148288    # based on scafSeq file
  scaffolds(len)          3664    134      5522831  63411  1059378    232338949
  readsAssembled          208389968                                                  # based on readOnContig file
  readsFiltered          289681090                                                  # based on readOnContig & log file
  readsTotal              301738113                                                  # based on config file
  readsDeleted            441343                                                      # based on log file
* Run gapClosing on assembly:
  stillOpen 14784
  closed    42379


   #ovls/read
* Contig cvg: GC bias:
   reads      min    q1    q2    q3    max        mean      n50        sum           
[[Image:Bombus.contig10000.png]]
  14220044   0     1      5      31    514        31.30      131        445090932
   cat asm.K31.contig100.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.709501493345893
   cat asm.K31.contig1K.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3   => -0.923151349747983


* final stats
  #100+ ctgs
   .                    elem      min    q1    q2    q3    max        mean      n50        sum             
   .                    elem      min    q1    q2    q3    max        mean      n50        sum             
   ctg                 2          1004  1004  1055  1055   1055       1029.50    1055       2059         
   len                 106213    100    140    437    2420   85850      2177       7084       231246190     
   deg                  1433143    64     115    129    164    1028      145.79     145        208938949 
   gc                  106213     0.00  33.13  38.00  43.44  100.00     38.74      39        4115066
   assembled            7557472
   cvg                  106213    0.6    39.0  50.8  58.9   160.8      47.38      54        5032329
  singletons          6662572
 
* Mate redundancy
 
  cat 9-terminator/asm.posmap.frgdeg  | sort -nk2 -nk3 | perl ~/bin/posmap2ovl.pl | p 'chop $F[1]; chop $F[2]; print "$F[1] $F[2]\n"; print "$F[2] $F[1]\n";' | count.pl -m 2 | more
 
* Location
   /scratch1/Bombus_impatiens/Assembly/14M.redo


== SOAPdenovo s_3..8 orig (CBCB) ==
  #1K+ ctgs
 
* Kmer=23
* Input:  s_3,5,6,7,8 original reads
* Error correction: no
* Summary
   .                    elem      min    q1    q2    q3    max        mean      n50        sum             
   .                    elem      min    q1    q2    q3    max        mean      n50        sum             
   scf                 571,114    100    106    115    141    156066    162.19    137        92,628,126        
   len                 41965      1000  1860  3295  6338  85850      5110      7726       214421699     
   ctg                  54049561  24    24    28    42    2049      35.04     35         1,893,660,606   
   gc                  41965      19.18  34.31  37.96  41.79  59.25      38.19     38         1602711
   gapSeq              49518     28    123   143    181    424        152.97    158        7,574,733
   cvg                  41965     8.0   42.0  49.9  55.8  93.8      47.74      51        2003339


* Longest scaff:  
* Run soap2 to map reads back to the assembly: "-l 32 -v 2"
   scaffold672    156066 35.63
   readsAligned            173063062


   cat asm.scafSeq | ~/bin/scafffasta2scaff.pl | tail -21 | getSummary.pl -i 2  -t ctg
* Locations
   cat asm.scafSeq | ~/bin/scafffasta2scaff.pl | tail -21 | head -20| getSummary.pl -i 3 -t gap
   mulberry:/scratch0/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/      # original
  .             elem      min    q1    q2    q3    max        mean      n50        sum           
   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/   # backup
  ctg          21        481    2943  5480  9853   19523      7268.48    12202      152638       
  gap          20        1      166    183    246    302        171.40    233        3428


== SOAPdenovo s_3..8 corr (CBCB) ==
== SOAPdenovo s_1..9 CBCB-corr (CBCB) ** ==


* Kmer=31
* Uses the new 3Kbp library : s_9 (reads have been reversed)
* Input: s_3,5,6,7,8 reads  
* Error correction: yes
* Summary
                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                106,217    100    140    434    2418  102317    2,177      7099      23,1261,742     


== SOAPdenovo s_3..8 corr (Illinois) ==
* Stats:
<pre style="background:yellow">
  .                    elem      min    q1    q2    q3    max        mean      n50        sum
  scf                  11178      100    111    135    390    5655980    23014      1205321    257251549
  ctg                  10856652  31    .      .      .      85850      57        43        627095607
  ctg100+              106741    100    .      .     .     85850**    2165      6939      231167576


* Kmer=31
   scfSeqContigs        67900      100    340   1788   4373   85819      3437       7586       233372689
* Input:  s_3,5,6,7,8 reads
   scfSeqContigsClosed  20651      100    128    488    7991  600179     11715      64083      241921308
* Error correction: yes; UIUC method
</pre>
* Summary
   .                    elem      min   q1    q2    q3    max        mean      n50        sum           
   scf                  17,550                                            13,460                236,225,169
   ctg                  4004102    31                          97,618    1916       6020       230,333,709
   ctg100bp+            120167     100


== SOAPdenovo s_1..8 corr (Illinois) ==
* Location
 
  mulberry:/scratch2/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/      # original
* Kmer=31
   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/   # backup
* Input: s_1,2,3,5,6,7,8 reads
* Error correction: yes; CBCB method
* Summary
   .                   elem      min    q1    q2    q3    max        mean      n50        sum    
  ctg200bp+            43603      201    752    2786  9518  167534    7637.63    19895      333,023,446       


* Locations
== CA s_1..8 corr OBT (CBCB) ==
  ftp://stan.cropsci.uiuc.edu/download/bombus1-k64-contigs.fa.min200
  ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC/
 
== CA s_1..8 corr OBT (CBCB) * ==


=== Issues ===
=== Issues ===
Line 284: Line 359:
   
   
   ovlMemory  =              8GB
   ovlMemory  =              8GB
  doExtendClearRanges =      0          # fast run


=== Gatekeeper ===
=== Gatekeeper ===
* ~ 58X cvg
   gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
   gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
   LOAD                  STATS         
   LOAD                  STATS         
Line 311: Line 389:
   7                    LIB           
   7                    LIB           
    
    
   LibraryName          numActiveFRG  numDeletedFRG  numMatedFRG  readLength  clearLength
   LibraryName          numActiveFRG  numDeletedFRG  numMatedFRG  readLength  clearLength   #egrep 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT'
   GLOBAL                274282535    27455578      247151634    30524746650  29445986650   
   GLOBAL                274282535    27455578      247151634    30524746650  29445986650   
   LegacyUnmatedReads    0            0              0            0            0             
   LegacyUnmatedReads    0            0              0            0            0             
    
    
   s_1                  10409320      891719        9039810      996306280    994802849  
   s_1                  10409320      891719        9039810      996306280    994802849     #1134233(11%)
   s_2                  2054575      41129          1987728      197829588    197593453  
   s_2                  2054575      41129          1987728      197829588    197593453     #224648(11%)
   s_3                  59894531      6328093        53925476    2382786529  2127257586  
   s_3                  59894531      6328093        53925476    2382786529  2127257586     #3037085(5%) 
   s_5                  60127274      5917882        54512448    2459696212  2209078913  
   s_5                  60127274      5917882        54512448    2459696212  2209078913     #5%
   s_6                  59813485      6113117        54062058    2396359740  2152656620  
   s_6                  59813485      6113117        54062058    2396359740  2152656620     #5%
   s_7                  60278536      5914267        54647258    2476326572  2237423989  
   s_7                  60278536      5914267        54647258    2476326572  2237423989     #5%
   s_8                  21704814      2249371        18976856    2435572545  2347304056  
   s_8                  21704814      2249371        18976856    2435572545  2347304056     #5%


   gatekeeper -dumplibraries -tabular asm.gkpStore
   gatekeeper -dumplibraries -tabular asm.gkpStore
Line 335: Line 413:
   s_1,2 lib : ids:1..13,396,744
   s_1,2 lib : ids:1..13,396,744
   overlap jobs that need to be redone : 1036 out of 11476 < 10%
   overlap jobs that need to be redone : 1036 out of 11476 < 10%
   ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'  
   ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'


=== Meryl ===  
=== Meryl ===  
   meryl -Dh -s 0-mercounts/asm-C-ms22-cm0  
   meryl -Dh -s 0-mercounts/asm-C-ms22-cm0  
   Found 23686053415 mers.
   Found 23686053415 mers.
   Found 265670791 distinct mers.
   Found 265670791 distinct mers.
   Found 8682767 unique mers.
   Found 8682767 unique mers.
  Largest mercount is 21,987,616; 668 mers are too big for histogram.
   ...
   ...
   #freq  count  %
   #freq  count  %
   1      8682767 0.0327  0.0004
   1      8682767 0.0327  0.0004
   2      4288602 0.0488  0.0007
   2      4288602 0.0488  0.0007
  3      3041005 0.0603  0.0011
  ...
   72*    3803015 0.7150  0.3229
   72*    3803015 0.7150  0.3229
   73      3797615 0.7293 0.3346
   ...
  1048158 1      1.0000  0.9274
 
  fasta2tab.pl /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/0-mercounts/asm.nmers.ovl.fasta  | sort -n -r | head -5
  21987616        ATATAACGTAATACGTTATAAC
  20992872        CGTTATAACGTATTACGTTATA
  20510821        ACGTTATAACGTATTACGTTAT
  20484950        CCATATAACGTAATACGTTATA
  20452950        CATATAACGTAATACGTTATAA


===  OBT/Overlap ===
  cat egrep.sh
  egrep -c 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/seq_obt/*clean.seq > egrep.count
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2
  ...
 
===  OBT ===


* job count  
* job count  
Line 365: Line 460:


* use OBTCHIMERA as CLR if want to redo the assembly !!!
* use OBTCHIMERA as CLR if want to redo the assembly !!!
=== Overlap ===
* job count : same as OBT


* Overlap counts
* Overlap counts
Line 383: Line 482:
   Lib    6  3'  43903838  50   
   Lib    6  3'  43903838  50   
   Lib    7  5'  16221438  50   
   Lib    7  5'  16221438  50   
   Lib    7  3'  16221438  50   
   Lib    7  3'  16221438  50
 
* 199542773 out of 257786946 (77%) usable reads had overlaps     
* 7988792  out of 11443537 (69%) usable s_1,2 reads had overlaps
 
=== Frg/Ovl correction ===


* Frgcorr jobs: 151
* Frgcorr jobs: 151
Line 390: Line 494:
* Ovlcorr jobs: 151
* Ovlcorr jobs: 151
   wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
   wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
=== OverlapStore ===


* Failures
* Failures
Line 409: Line 515:
   wc -l 4-unitigger/asm.partitioningInfo
   wc -l 4-unitigger/asm.partitioningInfo
   110
   110
* Failures
* Failures
   grep FAILED 5-consensus/asm*err
   grep FAILED 5-consensus/asm*err
   5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
   5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
   Fix:
    
   # update layout
  # Fix:  
   tigStore -g asm.gkpStore -t asm.tigStore 2 -u 62979707 -d layout > utg.62979707.layout
   # update layout (CA 6.1)
   ~/bin/updateUTGlayout.pl utg.62979707.layout >  utg.62979707.layout.update
  mkdir utg
   tigStore -g asm.gkpStore -t asm.tigStore 2 -u 62979707 -R utg.62979707.layout.update   
 
  # create success files
  # update layout (CA 6.1)
   touch 5-consensus/asm_090.success 5-consensus/consensus.success
   grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -u $F[0] -d layout > utg/$F[0].layout\n";'  | ~/bin/scheduler.pl
   grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "~/bin/updateUTGlayout.pl utg/$F[0].layout >  utg/$F[0].layout.update\n ";'              | ~/bin/scheduler.pl
   grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg/$F[0].layout.update\n" ;'            | ~/bin/scheduler.pl -c 1
   grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | awk '{print $1}' | sort -u | p 'print "touch $F[0].success\n";'                                              | ~/bin/scheduler.pl
  # create success file
  touch 5-consensus/consensus.success
 
   # relaunch runCA
   # relaunch runCA


Line 437: Line 551:


* Failure:  
* Failure:  
** buildPosMap : segFault on ctglen
** buildPosMap : segFault :ids too long???
  Fix: split asm file(optional) and reassign new accession numbers
  cd split/
  ~/bin/splitAsm.pl < asm.asm => ... CCO.asm, SCF.asm
  cat CCO.asm CLK.asm SCF.asm SLK.asm | ~/bin/newAsmIds.pl | buildPosMap -o asm


=== Stats ===
=== Stats ===
* [[Media:Bb.CA.s_1-8.cor.qc|Bb.qc]]
* Stats
   .                    elem      min    q1    q2    q3    max        mean      n50        sum             
   .                    elem      min    q1    q2    q3    max        mean      n50        sum             
   scf                  15009      74    4786  9789  20021  415318    16916.33  29133      253897199       
   scf                  15009      74    4786  9789  20021  415318    16916.33  29133      253897199       
Line 446: Line 567:
   utg                  2786974    54    90    110    142    25644      197.04    277        549140918       
   utg                  2786974    54    90    110    142    25644      197.04    277        549140918       
   singletons          61129865
   singletons          61129865
   seqs                301738113
   usableSeqs          257786946
 
* Stats (100bp+)
                        elem    min      max      mean    n50        sum
contigs(>100)          99429    100      138630    2148    7972      213595680
scaffolds(span)        14992    100      415318    16935  29133      253895676
scaffolds(len)        14991    100      303946    14499  24091      217357577
 
* Library stats (totals)
  lib    ctg      sur      deg      singl    total
  s_1    2251597  162800  3119665  3988095  9522157
  s_2    424821    41107    563656    891795    1921379
  s_3    28880189  1078891  13600909  12735637  56295626
  s_5    28983834  1077334  13561390  12962731  56585289
  s_6    28663974  1062939  13559494  12896855  56183262
  s_7    28784505  1058364  13713200  13020274  56576343
  s_8    10455533  383065  5229814  4634478  20702874
  all    128444453 4864500  63348128  61129865  257786930
 
* Library stats (percentages)
  lib    ctg  sur  deg  singl  total
  s_1    24  2    33  42    100
  s_2    22  2    29  46    100
  s_3    51  2    24  23    100
  s_5    51  2    24  23    100
  s_6    51  2    24  23    100
  s_7    51  2    24  23    100
  s_8    51  2    25  22    100
 
=== Mate redundancy ===
 
* Criteria:
  Given 2 mates 
    1: a1,a2
    2: b1,b1
  We have the following 2 overlaps
    1: a1,b1  a2,b2   
    2: a1,b2  a2,b1 
  At least 75% of one read overlaps the other read
 
* Location:
  cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
 
  wc -l s_1.mates asm.posmap.frgutg.s_1.*
  5320163 s_1.mates                                    # input mates (from FRG file; 2 reads/mate)
  1232734 asm.posmap.frgutg.s_1.mates                  # assembled mates (both reads assembled)
  366262 asm.posmap.frgutg.s_1.redundant_mates        # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy
  312206 asm.posmap.frgutg.s_1.redundant_mate_pairs    # redundant assembled mate pairs
 
  wc -l s_2.mates asm.posmap.frgutg.s_2.*
  1034086 s_2.mates                                    # input mates (from FRG file; 2 reads/mate)
  255905 asm.posmap.frgutg.s_2.mates                  # assembled mates (both reads assembled)
  110570 asm.posmap.frgutg.s_2.redundant_mates        # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy
  1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs    # redundant assembled mate pairs
    11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters
 
=== Runtimes ===
  #    step            hours
  0    gatekeeper      5.36 
  1    initialTrim    2.07 
  2    meryl          9.90 
  9    overlap.sh      11.31 
  11    find            0.00 
  16    overlapStore    13.79 
  17    consolidate    5.12 
  18    merge-trimming  1.41 
  19    chimera(1)      10.04 
  20    overlap.sh      49.86 
  21    find            0.00 
  22    overlapStore    16.68 
  23    frgcorr.sh      5.57 
  24    cat-corrects    0.10 
  25    ovlcorr.sh      4.64 
  26    cat-erates      0.90 
  27    overlapStore    6.67 
  28    buildUnitigs    13.88 
  29    gatekeeper      1.31 
  30    consensus.sh    3.41 
  31    cgw(1)          65.00 
  43    gatekeeper      4.39 
  44    consensus.sh    4.70 
  46    terminator      6.26 
  47    asmOutputFasta  7.75 
  48    dumpSingletons  3.68 
  49    buildPosMap(1)  3.79


=== Location ===
=== Location ===
   ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/ # assembly
   ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/                             # original assembly
   ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR : dumped OBT clear ranges, 0/1 undel/del
   ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR   # dumped OBT clear ranges, 0/1 undel/del
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor                        # backup


== CA s_1..8 corr OBT redo (CBCB) * ==
== CA s_1..8 corr OBT redo (CBCB) ==
 
=== Issues ===
* Fixed outie s_1,2 lib : clear ranges will be different
* s_3..8 : stay the same


!!! Fixed outie s_1,2 lib
=== runCA spec ===
* CA spec
   doOverlapTrimming =        0        !!! already done in the prev assembly
   doOverlapTrimming =        0        !!! already done in the prev assembly
  saveOverlaps  =            1        !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted
   ovlOverlapper =            ovl      # same
   ovlOverlapper =            ovl      # same
   unitigger =                bog      # same   
   unitigger =                bog      # same   
    
    
   BatchSize =               2000000    # must be the same so overlaps are not redone
   BatchSize =               2000000    # must be the same so overlaps are not redone
   BlockSize =               2000000    # must be the same
   BlockSize =               2000000    # must be the same
   ...
   ...
  doExtendClearRanges =      1          # initially set to 2; seems to take too long => reset to 1
=== Input ===


* Input: modified frg files: dumped from the prev assembly gatekeeper; undelete act:D fragments so the ids stay the save; reverse s_1,2 reads
* Modified frg files:  
** dumped "CA s_1..8 corr OBT" assembly gatekeeper  
** undelete act:D fragments so the ids stay the save
** reverse s_1,2 reads


* Gatekeeper : same
=== Gatekeeper ===


* Overlap: do only some of them
* Make sure the iids match


* Frgcorr; failed on 144 out of 151 jobs
=== Overlap ===


* Location
* Rerun only the overlap jobs for the s_1,2 reads
   /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg_obt/*frg
 
=== OverlapStore ===
 
*  To build faster in the future
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job 
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                              # 3 min/job
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)
 
=== Frgcorr ===
 
* Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)
 
=== Ovlcorr ===
 
* The error rate on overlaps with error went down by ~ 1.28%
 
=== Bog  ===
 
  cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
  elem      min    q1    q2    q3    max        mean      n50        sum           
  60970504  47    85    150    150    93191      126.17    150        7692801690
 
* Unitig stats
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep -c ^unitig_num_frags                                    # 60,970,523  : total unitigs
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -c ' 1$'                      # 59,776,572  : unitigs with only 1 read
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -v ' 1$' | getSummary.pl -i 2  # stats for unitig with 2+ reads
                elem        min    q1    q2    q3    max        mean      n50        sum           
  1read/utg    59,776,572                                                                59,776,572
  2+reads/utg  1,193,951    2      3      6      28    74715      165.84    4712      198,010,374
 
=== Bog consensus ===
 
* job count :
  wc -l 4-unitigger/asm.partitioningInfo
  110
* Failures: 1 failed unitig
  grep FAILED 5-consensus/asm*err
  5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED. 
  Fix: same as before
 
=== Cgw ===
 
* Failure:
** after 7-0-CGW/asm.ckp.4
  tail runCA.log
  ERROR:  Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)
  37980:Scaffold 6409:
  37981-  Idx    Chunk    (  Start, Variance)  (    End, Variance)      Len
  37982-    0: 60983787    (      0,        0)  (  31646,      823)    31646
  37983-    1:  69691    (  35389,      934)  (  31650,      837)    3739
  37984-    2: 59907028    (  35103,      942)  (  42035,    1122)    6932
  37985-    3: 60799298    (  41640,    1132)  (  64897,    1737)    23257
  37986-    4: 60975976    (  64924,    1784)  (  88439,    2395)    23515
  37987-    5: 60988396    (  88431,    2416)  (  122023,    3289)    33592
  37988-    6: 60988698    (  121551,    3324)  (  123589,    3377)    2038
  37989-    7: 59905266*  (  122331,  -163479)  (  121859,  -163491)      472
  37990-    8: 60988599    (  122275,  170138)  (  134564,  170457)    12289
 
  ./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.
  ./AS_REZ/GapFillREZ.c:8339:      "ERROR:  Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n",  #message
 
  Fix: modfy the code; comment assertion statemants
  cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c
  // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX);
  // assert (min_right_variance >= 0.0);
 
* Preliminary stats
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
 
  #7-0-CGW/rez.i36.log :
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1820      84    162    4583  82337  3161523    125884.07  667487    229109005      # ids from 0..13113
  ctg                  26453      64    962    3756  10485  255820    8759.59    21800      231717378     
  ctg200+              23343      200    1760  4772  11873  255820    9908.34    21845      231290344
  scf.gaps            877        1      4      12    36    373        27.71      64        24298
 
  #7-2-CGW/rez.i01.log :
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  4903      66    1283  6908  39466  1148626    47212.57  209756    231483215     
  ctg                  23480      64    1370  4397  11757  255820    9858.75    23476      231483360
 
  7-2-CGW/rez.i06.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  3818      64    268    6813  61948  1198480    60723.65  257457    231842914     
  ctg                  24005      64    1165  4161  11439  255820    9659.47    23819      231875677
 
  #7-2-CGW/rez.i11.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  3733      64    268    6880  65584  1391034    62111.99  258823    231864045     
  ctg                  24004      64    1165  4161  11439  255820    9659.98    23819      231878256   
 
  #7-2-CGW/stone.i02.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1854      83    153    4039  52506  2983045    125375.58  860583    232446325     
  ctg                  23863      64    674    3833  11368  255820    9761.94    25892      232949194     
 
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1847      84    152    4440  62397  3337984    125470.25  835387    231743549     
  ctg                  26088      64    429    3213  10218  297786    8941.40    25521      233263147     
 
* TigStore
  tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail
  maID            isPresent      isDeleted      ptID    svID    fileOffset      covStat
  60970540        1              0              0      3      58023924696    84
 
  tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1
  maID            isPresent      isDeleted      ptID    svID    fileOffset
  61001100        1              0              0      32      16307420
 
* Checkpoints
  # 7-0-CGW             
                                                                    #times  iterMax
  CHECKPOINT_AFTER_BUILDING_SCAFFOLDS    "ckp01-ABS"             
  CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"              5        10
  CHECKPOINT_AFTER_CLEANING_SCAFFOLDS    "ckp03-ACD"              1
  CHECKPOINT_DURING_1ST_SCAFF_MERGE      "ckp04-1SM-partial"      12              # we should see more iterations if the scaffolding goes well; also ECR would take longer
  CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"              1
  CHECKPOINT_AFTER_STONES                "ckp06-AS"         
  CHECKPOINT_DURING_2ND_SCAFF_MERGE      "ckp07-2SM-partial" 
  CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"              1
  CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  # 7-2-CGW             
                                                                    #times  iterMax
  CHECKPOINT_AFTER_BUILDING_SCAFFOLDS    "ckp01-ABS"             
  CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"              9        10
  CHECKPOINT_AFTER_CLEANING_SCAFFOLDS    "ckp03-ACD"              1
  ...
  CHECKPOINT_DURING_1ST_SCAFF_MERGE      "ckp04-1SM-partial"      10
  CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"              1
  CHECKPOINT_AFTER_STONES                "ckp06-AS"                1
  CHECKPOINT_DURING_2ND_SCAFF_MERGE      "ckp07-2SM-partial"      1
  CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"              1
  CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  ...
  CHECKPOINT_AFTER_PARTIAL_STONES        "ckp10-PS"                1
  CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS" 
  CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC" 
  CHECKPOINT_AFTER_RESOLVE_SURROGATES    "ckp13-RS" 
  CHECKPOINT_AFTER_OUTPUT                "ckp14-FIN"
 
* Long running time : implemented 2 cgw options -ic, -im ....
* Running times
 
  CGW-0: START Wed Apr 28 22:30:12 ; END Mon May  3 03:34:57
  ECR-1: START Mon May  3 03:34:58 ; END Mon May  3 12:28:54       
  CGW-2: START Mon May  3 12:28:54 ; End Wed May 12 22:05:00
 
* Library insert histograms
  # Example
  cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist
  cat 7-0-CGW/stat/scaffold_final.distlib_2.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_2.cgm.hist
  gnuplot
    set xrange [0:10]
    plot "scaffold_final.distlib_1.cgm.hist" w l, "scaffold_final.distlib_2.cgm.hist" w l,
 
 
* Failure:
** after 7-2-CGW/asm.ckp.42 (after running about 2hrs)
  tail runCA.log
 
  * Screwed up scaffold 3790: Chunk 60994012 has bad mean
  * Scaffold 3790 iteration 1  1 bad CIs out of 33 LSE:2664.55 improvement:1
  * Reinserting chunk 60994012 in scaffold 3790
  * improvement = 0.0721104 LSE = 2664.55
 
  ----------------------------------------
  Failure message:
  scaffolder failed
 
=== ECR ===
 
* Count scaffold gaps : total/closed
  cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap'  # 24635
  cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap'      # 2803
 
=== Terminator ===
 
* Get id range
  grep -m 1 ^7 asm.posmap.utglen asm.posmap.ctglen asm.posmap.deglen asm.posmap.scflen
  asm.posmap.utglen:7180000000000 124   
  asm.posmap.deglen:7180001635156 133   
  asm.posmap.ctglen:7180001635158 124
  asm.posmap.scflen:7180002817601 161   
  =>
  utg: 7180000000000..7180001635155
  ctg/deg:7180001635156..7180002817600
  scf:7180002817601..7180002819521
 
=== Stats ===
 
* Assembly stats
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1921      65    148    3389  55043  3122263    149121.08  982222    286461602     
  ctg                  91934      64    100    119    180    255828    2621.59    25123      241013378     
  deg                  1090511    54    103    124    152    7068      139.41    135        152024633     
  utg                  1635156    54    97    118    141    94720      267.81    1636      437917658
 
* Assembly stats(100bp+)
                          elem    min      max      mean    n50        sum
  contigs(>100)          69270    100      255828    3451    25383      239118680
  scaffolds(span)        1870    100      3122263  153185  982222    286457118
  scaffolds(len)          1870    100      2990211  128880  870182    241007381
 
=== Location ===
 
  ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/          # original assembly
   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo      # backup


== CA s_1..8 corr noOBT (CBCB) * ==
== CA s_1, 2 partial ,3..8 corr OBT (CBCB) ** ==


=== Issues ===
=== Issues ===
* Use only s_2 lib reads which seem 8K apart :  ~ 14% of the s_2 reads aligned by soap2 to SOAPdenovo Illinois assembly within 8+-2Kbp of one another
* Alignments (very similar results)
  soap2-index  Bimp.fasta                #  soap2-index -> 2bwt-builder
  bowtie-build Bimp.fasta  Bimp.fasta
  soap2 -D Bimp.fasta.index -a s_2.1.fastq -b s_2.2.fastq -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2.mated.soap2 -2 s_2.single.soap2
  bowtie  Bimp.fasta      -1 s_2.1.fastq -2 s_2.2.fastq -l 32 -p 8 -n 2 -I 6000 -X 10000 >  s_2.mated.bowtie


* Used outie s_1,2 lib ; needs to be redone
  bowtie  Bimp.fasta        s_2_0.fastq -l 32 -p 8 -n 2                                  >  s_2_0.bowtie
  wc -l s_2.mated.*
  288880 s_2.mated.soap2
  288668 s_2.mated.bowtie
  intersect.pl s_2.mated.*
  288368
 
=== runCA spec ===
  doExtendClearRanges=0


=== runCA spec ===  
=== Input ===
  doOverlapTrimming =              0
  ovlOverlapper =                  ovl
  unitigger =                      utg  # cmd unitigger


===  Overlap ===
* Gatekeeper: same as prev assembly
  cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o
* Overlaps: filter out all overlaps for most of the s_2 reads
  Lib    0  5'  212707959  51 
  Lib    0  3'  212707959  51 
  Lib    1  5'  7387977    39 
  Lib    1  3'  7387977    39 
  Lib    2  5'  1413291    40  ???
  Lib    2  3'  1413291    45  ???
  Lib    3  5'  46756517  51 
  Lib    3  3'  46756517  51 
  Lib    4  5'  46786172  52 
  Lib    4  3'  46786172  51 
  Lib    5  5'  46531611  52 
  Lib    5  3'  46531611  51 
  Lib    6  5'  46891503  52 
  Lib    6  3'  46891503  51 
  Lib    7  5'  16940888  52 
  Lib    7  3'  16940888  51 


===  Bog ===  
=== Bog ===
*  Utg len stats:
  cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
   elem      min    q1    q2    q3    max        mean      n50        sum             
   elem      min    q1    q2    q3    max        mean      n50        sum             
   71125093   66     95     150    150    23390     130.41    150       9275136839      
   61930057   47    85    150    150    113529     125.85     150       7793894671      # max went up from 93191->113529
 
=== Bog consensus ===
 
* Failures
  grep FAILED 5-consensus/asm*err : 113 unitigs between 5.. 33547 frags
  5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
 
* Fix:
  grep    FAILED 5-consensus/asm*err | p 'print "$F[2]\n";' | perl ./fixLayout.pl | scheduler.pl
  grep -c FAILED 5-consensus/*err | grep -v ':0$' | p '/(.+).err/; print "touch $1.success\n";' | sort -u | scheduler.pl
 
=== Cgw ===
 
* Preliminary stats
  #7-0-CGW/stone.i02.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1847      84    152    4440  62397  3337984    125470.25  835387    231743549     
  ctg                  26088      64    429    3213  10218  297786    8941.40   25521      233263147
 
  cat 7-0-CGW/cgw.out
  ...
  Processed 61,930,044 unitigs with 301,738,113 fragments
 
=== Terminator ===
 
* Message counts
  CTP    91502
  SCF    1896
  CLK    3288923
 
cat CLK.asm | grep ^ori: | count.pl
ori:O  979139
ori:A  869982
ori:N  857732
ori:I  582070
 
=== Stats ===
 
* Assembly stats
<pre style="background:yellow">
  .                    elem      min    q1    q2    q3    max        mean      n50        sum
  scf                  1896      76    150    4044  67922  4021294**  151760    1017298**  287738041
  ctg                  92307      63    100    119    186    297795**  2612.99    24781**    241197400
  deg                  1086650    54    103    124    152    6822      139.41    135        151484589
  utg                  1574225    54    98    119    143    123354    274.33    2027      431859732
</pre>
 
* Assembly stats(100bp+)
  .                      elem    min      max      mean    n50        sum           
  contigs(>100)          69944    100      297795    3421    25048      239324761     
  scaffolds(span)        1869    100      4021294  153951  1017298    287735536     
  scaffolds(len)          1869    100      3429736  128833  860270    240790206
 
* Library stats
  .                    badLong  badOuttie  badSame  bothChaff  bothDegen  bothSurrogate  diffScaffold  good  oneChaff  oneDegen  oneSurrogate 
  HWI-EAS385_0001:1    0.15    0.06      0.01    19        10.18     0              0.45          46.9  21.41    1.59      0.19         
  HWI-EAS385_0001:2    0.13    0.01      0        84.86      0.02      0              0.27          14.2  0        0.36      0.02         
  HWI-EAS385_0001:3    0.06    0.01      0        14.45      9.03      0.04          0.05          60.11  15.17    0.68      0.16         
  HWI-EAS385_0001:5    0.06    0.01      0        14.84      8.93      0.04          0.05          59.98  15.02    0.65      0.16         
  HWI-EAS385_0001:6    0.06    0.01      0        14.76      9.04      0.04          0.05          59.7  15.29    0.65      0.16         
  HWI-EAS385_0001:7    0.06    0.01      0        14.73      9.12      0.04          0.05          59.51  15.44    0.64      0.16         
  HWI-EAS385_0001:8    0.06    0.01      0       14.05      10.1      0.03          0.05          58.66  15.92    0.7      0.16         
 
* Issues
 
** Gap sizes vary between  -297,562 bp and 772,064  bp.  They should be no longer than ~ 8Kbp (max insert size)
  cat 9-terminator/split/SCF.asm | grep -A 3 "^{CTP" | grep ^mea: | sed 's/mea://' | sort -n | head
  -297562
  -216729
  ...
  162239    # scf7180002754932.314143..476382  : 26 s_1,3,5,6,7 mate pairs link the two sides of the gap
  772064    # scf7180002754479.867224..1639288 : 2  s_1 mate pairs link the two sides of the gap
 
  ~/bin/posmap2scaff.pl 9-terminator/asm.posmap.ctgscf | grep -v "^>" | sort -nk4 | tail
  7180002653372 EB 89    69161
  7180002699293 EB 78    92310
  7180002731234 BE 64936 162239
  7180002726839 BE 46924 772064
 
  grep ^mea: SCF.asm | sed 's/mea://' | getSummary.pl -i 0 -t scf.gap | pretty
  .          elem  min          q1      q2      q3      max        mean     n50    sum           
  scf.gap    91502  -297562      19      222      667      772064      -141    772064  -12910029
  scf.gap-  13788  -297562      -2166  -89      -12      0          -4284    0      -59071991
  scf.gap+  77699  0            84      325      775      772064      594      1194    46161963
 
=== Location ===
  mulberry:/scratch0/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/      # original (to be deleted)
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/  # backup
 
== Other ==
 
=== SOAPdenovo on CA contigs ===
 
  cat asm.ctg.fasta | ~/bin/assignFastaIds.pl | awk '{print $1}' > asm.contig
  lenseq asm.contig > asm.contig.len
 
  echo Edge_num `wc -l asm.contig.len` `wc -l asm.contig.len` > asm.ContigIndex
  echo "index  length  reverseComplement" >> asm.ContigIndex
  lenseq asm.K31.contig | awk '{print $1,$2,1}' >> asm.ContigIndex
 
  SOAPdenovo map  -s SOAPdenovo.config -g asm > SOAPdenovo.log        # ok
  SOAPdenovo scaff -g asm -F >> SOAPdenovo.log                        # fails: a nucmer of files are missing
 
= Assembly comparison =
 
== CA.contigs vs SOAPdenovo.contigs ==


* Filter contigs >=100 bp
* Alignment
  nucmer --mum -c 100  CA.fasta SOAPdenovo.fasta
  show-coords -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | p ' $F[19]="." unless($F[19]); print $F[19],"\n";' | count.pl | pretty -o
  [CONTAINS]    80413 
  [CONTAINED]    4646 
  [IDENTITY]    1960 
  [BEGIN]        4152   
  [END]          4034 
  .              157    # partial alignments


* Location:
  # get "broken" contigs
   walnut:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.noOBT
  show-coords -H        /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t all
  show-coords -L 500 -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t 500+
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  all                  118        155    4630  9817  29042  144499    22498.97  54311      2654878   
  500+                16        1844  8205  16551  28426  66880      20514.88  28426      328238 
 
* Location  
  mulberry:/scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs
 
== CA.contigs vs SOAPdenovo.scaffolds ==
 
* All ctgs & scf
* Alignment
  #ref/qry all
    .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  SOAPdenovo.scf(span)    12473      100    113    144    498    5933710    20428      1086614    254800874   
  SOAPdenovo.scf(len)    12473      100    113    143    341    5522831    18759.84  1034925    233991535
  CA.ctg                  92307      63    100    119    186    297795    2612      24781      241197400
 
  #best hits
  nucmer --mum -c 40 CA.fasta SOAPdenovo.fasta
  delta-filter -1 40 -i 95 -o 50 -1 SOAPdenovo-vs-CA.40.delta >! SOAPdenovo-vs-CA.40.filter-1.delta
 
= Genbank submission =
 
* Project registration : http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi
  http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=20FF4536-24A4-4D8C-94FA-0755E1326E33
  Project ID: 61101
 
  Submitting Organization:              University of Illinois & University of Maryland
  Sequencing Center:                    Keck Center for Comparative and Functional Genomics, University of Illinois  ??? BCUI ???  Biotechnology Center, Univ. Illinois (BCUI)
  Consortium Name:
  Organism Name:                        Bombus impatiens
  Strain/isolate/breed:
  Locus Tag Prefix:                    BIMP      #3+ letters
  Source of DNA used for sequencing:    haploid males
  Sequencing Method:                    wgs      ???
  Sequencing Technology:                Illumina
  Estimated Genome Size:                250Mb    # the haploid genome size
  Brief description of the importnace:  #2 pollinator in the US
  Genome Coverage                      127x
  Sequencing Technology                Illumina IIx
  Comments to the staff:               
    DNA
    whole genome sequencing
    single genome;
    no annotation
    assembly name BIMP_1.0
    assembly method: Celera Assembler 6.1
    plan to update: ?
    expect to release
    agp:
 
* Sequin:
  Manuscript:       
   Authors:     Hugh Robertson, Matt Hudson, Kim Walden, Aarti Venkat, Tom Newman, Daniela Puiu, Tanja Magoc, David Kelley, Aleksey Zimin, Steven Salzberg, Gene Robinson
 
----
 
* Best assembly (CA)
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/
  ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bombus_impatiens/Assembly/CA/
 
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  ctg                  92307      63    100    119    186    297795    2613      24781      241197400     
  ctg.contam          82        1331  10485  19924  31977  112638    25648      32851      2103095 
  ctg.clean            92225      63    100    119    185    297795    2593      24628      239094305   
  ctgNs                76        75    536    3124  6834  44990      5326      10261      404781       
 
  scf                  1896      76    150    4044  67922  4021294    151761    1017298    287738041
  scf.ctg.contam      24        14315  45008  71878  141921 1116872    141140    164885    3387362       
 
  deg                  1086650    54    103    124    152    6822      139        135        151484589 
  deg.contam          1129      65    114    141    207    1035      198        211        223588       
  deg.clean            1085521    54    103    124    152    6822      139        135        151261001     
  deg.Ns              37        98    145    234    426    914        313        437        11574           
 
* Contaminant list: ???
  /fs/szattic-asmg4/tmagoc/bees_bacteria/IDs
 
== 1st submission ==
 
* Submit all contigs (should have deleted the singletons < 200bp)
 
* Directories & files
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.0.*
 
  server:      ftp-trace.ncbi.nlm.nih.gov
  user:        cbcb_trc
  directory:  test/
  files names: BIMP_1.1.sqn , BIMP_1.1.agp
 
  number contigs:  92,231
  number scaffolds: 1,895
 
* NCBI email (Dec 23rd)
  there are 3 contigs that need to be trimmed:
 
  Contig name Length Span(s) Apparent source
  7180002728824 11695 72..148 Escherichia coli
  7180002730077 25748 22703..25611 Escherichia coli
  7180002740828 30102 134..1688 Escherichia coli
 
  Since these are internal hits, consider whether the contig needs to split (if the E.coli was the 'joining' element).
 
* NOTE: When I blastn these 3 contigs to nr (distant blast) , all 3 contigs had multiple good hits to other bacteria , so probably they are all contaminated and should be removed completely, not just trimmed
 
== 2nd submission ==
 
* Jan 24th
  number of scaffolds: 1191
  number of contigs:  91524
 
* Directories & files
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.1.*
 
== NCBI release ==
 
* [http://www.ncbi.nlm.nih.gov/nuccore/AEQM00000000 AEQM00000000] WGS
* [ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/ ftp] GCA_000188095 to come by end of Feb 2011

Latest revision as of 19:51, 17 May 2011

Data

  • ~ 500Mbp genome (diploid); 250Mbp halpoid
  • Complete mitochondrion genomes:
 NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
 NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
 only 88% identity; no rearrangements, only snps, short indels
  • NCBI TaxId
 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=132113

Traces

  • 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
  Lane   Insert  ReadLen    #Mates          #Reads        Coverage   Orientation     Comments                         #RepeatsReads
  1      4K      124        34,944,099                    14X        outie           865,687(1.2%) reads qual==0      30,576,297 (44%)
  2      8K      124        32,540,640                    13X        outie                                            26,972,424 (41%)
 
  3      400     124        34,745,750                    .                                                           8,319,335 (12%)
  5      400     124        34,601,239                    .                                                           8,398,759 (12%)
  6      400     124        34,553,857                    . 
  7      400     124        34,682,612                    . 
  8      400     124        12,975,839                    . 
 
  9      3K      124        29,615,036                    .         outie            new3kb ; received 05/18/2010
   
  1-8                       219,044,036     438,088,072   108X
  1-9                       248,659,072     497,318,144

  3-8    400     124        151,559,297     303,118,594   75X        innie
  1,9    3-4K    124        64,559,135      129,118,270   32X        outie
  2      8K      124        32,540,640      65,081,280    13X        outie

 Two new 3kb libraries to come; ~ 70% are true pair ends

Adaptors

 >circ
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >circ.revcomp
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 >5                               
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 >3
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[AGATCTCGGTGGTCGCCGTATCATTA][A+]
               0        1         2         3         4 
               123456789012345678901234567890123456789012
 circ          CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
                   .............   ..   ............. 
 circ.revcomp  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 CGGCATTCCTGCTGAACC revcmp GGTTCAGCAGGAATGCCG
              
               0        1         2         3         4 
               123456789012345678901234567890123456789012
                           GGTTCAGCAGGAATGCCG
 5             GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 
               CGGCATTCCTGCTGAACC
 3             CGGCATTCCTGCTGAACCGAGATCGGAAGAG


 #18mers (5')
 CGTAATAACTTCGTATAG
 TCGTATAACTTCGTATAA
 GATCGGAAGAGCGGTTCA
 CGGCATTCCTGCTGAACC
 #18mers (5',3')
 CGTAATAACTTCGTATAG
 TTATACGAAGTTATACGA 
 TCGTATAACTTCGTATAA
 CTATACGAAGTTATTACG  
 GATCGGAAGAGCGGTTCA
 GCAGGAATGCCGAGACCG
 CGGCATTCCTGCTGAACC
 CGTGTAGGGAAAGAGTGT
 #31 mers (5') # jellyfish uses 31mers
 CGTAATAACTTCGTATAGCATACATTATACG
 TCGTATAACTTCGTATAATGTATGCTATACG
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 CGGCATTCCTGCTGAACCGAGATCGGAAGAG

Frequent kmers

                                                        s_1   s_2    s_3
    1  CGTAATAACTTCGTATAGCATA|CGTATAACTTCGTATAATGTAT    15.2  4.59   2.829    # not assembled   circAdapter
    2  AATGTATGCTATACGAAGTTAT|ATAACTTCGTATAGCATACATT    5.56  19.02  0.525    # not assembled   circAdapter
    3  CGTTATAACGTATTACGTTATA|TGTATGCTATACGAAGTTATTA    5.2   4.56   .         
    4  TACATTATACGAAGTTATACGA|TATATGGTATGACGTTATAACG    3.39  1.87   0.365     
    5  ATATAACGTAATACGTTATAAC|TATAACGTATTACGTTATATGG    2.86  2.71   1.595     
    6  ACGTTATAACGTCATCCCATAT|ATACCATATAACGTAATACGTT    2.11  2.62   .
    7  TACGTTATAACGTCATACCATA|TCCATTTTACACTGCAAATGTA    1.98  2.04   1.749  
    8  CTTAAACGTCGAAATATCTCCA|TTACGTTATATGGGATGACGTT    0.87  1.16   1.136  
    9  GAGCGCTTAAACGTCGAAATAT|ATGAAATCTAGCCGTCATTTCC    0.73  1.02   1.442  
   10  ACGTAATACGTTATAACGTCAT|ATTTCCTTCCCGTTGGAGATAT    0.67  1.1    1.477  
   11  CTGCTGAACCGCTCTTCCGATC|GATCGGAAGAGCGGTTCAGCAG    0.61  3.92   0.045    # not assembled   5Adapter
   12  ACGGTCACAGCTTGTCTGTAAG|TGTAAAATGGAGTTCAACGAGC    0.31  1.73   0.798  
   13  ACGGTGGTGCATTACATTTGCA|GGCGAATGGCGCCTGATGCGGT    0.28  0.87   0.949    # not assembled
   14  CAGAAAAGCATCTTACGGATGG|TACATACCTCGCTCTGCTAATC    0.04  2.47   .        # not assembled


  • Location
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
 /fs/szattic-asmg4/Bees/Bombus_impatiens/new3kb/s_1_?_sequence.txt     : new3kb lib received 05/18/2010

 /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 new3kbr test reads

Tasks to figure out

  1. Erroneous reads/bases, which we need to correct or discard
  2. GC bias, so we can compute a-stats properly
  3. Redundancy in the long paired ends, which are lane 1 and lane 2.
  • Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
    • Some reads will begin in the circularization adaptor and thus will have only one usable read
    • Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
    • Most reads will have at least 36bp from each end before hitting the adaptor.
    • Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
  • A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
    • Thus their mate distance is +400bp rather than -3kb or -8kb.
    • It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
    • It is possible to remove these reads by removing
  1. mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
  2. mate pairs where one read begins with the adapter sequence.
    • We are working on this, up to now our paired assembly stages have been disappointing.
  • Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
    • There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
    • Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
    • But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
    • Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
    • You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.

Trimming

  • Quality trimming: trim all bases with fastq quality eq "B" (0)
  cat *fastq | ~/bin/fastq2clq.pl 
  • Adaptor trimming: Align all subsets to adaptors
 C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
 5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 id  len    gc%
 C   42     30.95  
 3   52     55.77  
 5   67     59.70  
 nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq  ...
 delta-filter -l 16 -q   ...
 cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
  • Adaptor/primers median positions in the reads
 Lib       adaptorPos     5'primerPos 3'primerPos 
 s_1       34-75          0-36        0-19
 s_2       2-40           0-36        0-19
  • Mate stats
 Lib        #mates    adaptor.bothMates   adaptor.noMate   adaptor.oneMate   adaptor.oneMate(filtered)   
 s_1        34944099  269156              9804247          24870696          6048164(17.3%)               
 s_2        32540640  91528               16181288         16267824          1061858( 3.2%)   
 total      67484739                                                         7110022(10.5%)
  • Filtered read clr stats
 Lib        #mates     #reads       min    q1     q2     q3     max        mean       n50        sum            
 s_1        6048164    12096328     64     80     95     115    124        96.56      101        1,168,064,632     
 s_2        1061858    2123716      64     80     96     115    124        96.59      101          205,122,226 
 total      7110022    14220044     64     80     95     115    124        96.57      101        1,373,186,858 
  • Filtered read GC% stats
 Lib        #reads       min    q1     q2     q3     max        mean       n50                    
 s_1        12096322     0.00   31.93  36.26  42.35  100.00     37.45      38        
 s_2        2123715      0.00   31.88  36.46  42.48  100.00     37.56      38         
  • Other frequent kmers
 26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces
 10mer : AAAAAAAAAA                               TTTTTTTTTT                 : ~32% of the traces    
 53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG                : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
  • Data processing pipeline
 s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFastq.amos
 s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos
  • Location:
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt                   # original fastq files
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_[1235678]_[12]_sequence.cor.txt    # corrected fastq files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/s_[12]/all/s_[12].64.rev.frg                    # adaptor free s_[12] mated reads
 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[1235678].frg                  # adaptor free & corrected s_[1235678] reads

D Kelly's trimming

 lib                       reads        min  q1   q2   q3   max  mean    n50  sum           #repeatReads
 s_1_1_sequence.cor.txt    5,436,814    30   71   90   115  124  89.70   100  487673256     1,071,779 (20%)
 s_1_2_sequence.cor.txt    5,864,225    30   77   92   110  124  93.27   98   546956864     1,208,767 (20%)
 s_2_1_sequence.cor.txt    1,053,858    30   80   96   118  124  97.16   102  102395785     222,290   (21%)
 s_2_2_sequence.cor.txt    1,041,846    30   78   93   110  124  93.51   99   97426679      208,029   (20%)
 s_3_1_sequence.cor.txt    33,668,302   30   105  124  124  124  111.04  124  3738530761              (12%)
 s_3_2_sequence.cor.txt    32,554,322   30   82   108  124  124  100.42  120  3269039697   
 s_5_1_sequence.cor.txt    33,579,744   30   109  124  124  124  111.65  124  3749192427   
 s_5_2_sequence.cor.txt    32,465,412   30   84   112  124  124  102.13  124  3315553171   
 s_6_1_sequence.cor.txt    33,535,725   30   109  124  124  124  111.60  124  3742602285   
 s_6_2_sequence.cor.txt    32,390,877   30   83   109  124  124  100.92  123  3268875471   
 s_7_1_sequence.cor.txt    33,674,235   30   112  124  124  124  112.19  124  3777917379   
 s_7_2_sequence.cor.txt    32,518,568   30   84   110  124  124  101.60  124  3303807321   
 s_8_1_sequence.cor.txt    11,777,821   30   113  124  124  124  112.30  124  1322658923   
 s_8_2_sequence.cor.txt    12,176,364   30   84   110  124  124  101.51  124  1236034348 

 s_9_1_sequence.cor.txt    21,273,603   30   87   124  124  124  104.52  124  2223545139     #new3kb added 05/18/2010
 s_9_2_sequence.cor.txt    20,769,027   30   86   124  124  124  103.68  124  2153436097     #new3kb added 05/18/2010
 s_1,2                     13,396,743   
 s_1..8                    301,738,113  .  .    .    .    .    .       .    31958664367  
 s_2..8 (64bp+)            275,763,594  .  .    .    .    .    .       .    30693543885  


  • Read/Mate counts
 lib                       reads        mates             range
 s_1                       11301039     5320163    
 s_2                       2095704      1034086           11301040..13396743      
 s_3                       66222624     31857304   
 s_5                       66045156     31820553         
 s_6                       65926602     31748713   
 s_7                       66192803     31891468   
 s_8                       23954185     11168258   
 s_9                       42042630     13245740

 s_1,2                     13,396,743   6,354,249
 s_1-8                     301,738,113  144,840,545
 s_1-9                     343,780,743  158,086,285
  • Files
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)

 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg

Assembly

  • SOAP version 1.04: /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/
  • CA version 6.0 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/

SOAPdenovo s_1..8 CBCB-corr (Illinois)

  • From: Kim Walden 04/30/2010
  • Input: s_1,2,3,5,6,7,8 reads
  • Error correction: yes; CBCB method
  • Kim's e-mail
 I ran it in a stepwise fashion, not adding any of the available options and only changing K=31.
 I then ran their more recent GapCloser algorithm, not the version that is part of the package when using option -F in "scaff".  
 I used their "all" assembly once, and I noticed that it automatically uses the -F option during scaffolding
 Also, I used your data from the ftp site without checking for redundancy in the 3kb and 8kb lanes; did Daniela do the same?
  • Summary
                         elements min_len  max_len   mean    n50        sum
 contigs(>100)           103858   100      85850     2227    7541       231387390      # prior to scaffolding; 
 scaffolds&contigs(>100) 12663    100      4648588   19282   1040109    244171008      # post-scaffolding and gap-filling
 scaffolds(span)         3736     252      4648588   64874   1047164    242369898      # same prior data minus contigs that didn't assemble into scaffolds.
 scaffolds(len; no N's)  3736     172      4585806   64034   1027043    239233801      # computed by CBCB
     
  • Locations
 ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/Bimp_Salz_corr_050310_NewGapCloser.fasta

 # backup
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/
 # online
 http://www.life.illinois.edu/robertson/private/
 Login: robertson
 password: dna

SOAPdenovo s_1..8 CBCB-corr (CBCB)

  • Stats:
                         elem     min      max       mean    n50        sum 
 contigs(all)            9943851  31       85850     60      45         594773555

 contigs(>100)           106213   100      85850     2177    7084       231246190
 scaffolds+contigs(>100) 12473    100      5933710   20428   1086614    254800874    # based on scafSeq file  
 scaffolds(span)         3664     236      5933710   69090   1115496    253148288    # based on scafSeq file
 scaffolds(len)          3664     134      5522831   63411   1059378    232338949 

 readsAssembled          208389968                                                   # based on readOnContig file
 readsFiltered           289681090                                                   # based on readOnContig & log file
 readsTotal              301738113                                                   # based on config file

 readsDeleted            441343                                                      # based on log file

  • Run gapClosing on assembly:
 stillOpen 14784
 closed    42379
  • Contig cvg: GC bias:
Bombus.contig10000.png

 cat asm.K31.contig100.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.709501493345893
 cat asm.K31.contig1K.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3   => -0.923151349747983
 #100+ ctgs
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 len                  106213     100    140    437    2420   85850      2177       7084       231246190      
 gc                   106213     0.00   33.13  38.00  43.44  100.00     38.74      39         4115066
 cvg                  106213     0.6    39.0   50.8   58.9   160.8      47.38      54         5032329
 #1K+ ctgs
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 len                  41965      1000   1860   3295   6338   85850      5110       7726       214421699      
 gc                   41965      19.18  34.31  37.96  41.79  59.25      38.19      38         1602711
 cvg                  41965      8.0    42.0   49.9   55.8   93.8       47.74      51         2003339
  • Run soap2 to map reads back to the assembly: "-l 32 -v 2"
 readsAligned            173063062
  • Locations
 mulberry:/scratch0/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/       # original
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/   # backup

SOAPdenovo s_1..9 CBCB-corr (CBCB) **

  • Uses the new 3Kbp library : s_9 (reads have been reversed)
  • Stats:
  .                    elem       min    q1     q2     q3     max        mean       n50        sum
  scf                  11178      100    111    135    390    5655980    23014      1205321    257251549
  ctg                  10856652   31     .      .      .      85850      57         43         627095607
  ctg100+              106741     100    .      .      .      85850**    2165       6939       231167576

  scfSeqContigs        67900      100    340    1788   4373   85819      3437       7586       233372689
  scfSeqContigsClosed  20651      100    128    488    7991   600179     11715      64083      241921308
  • Location
 mulberry:/scratch2/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/      # original
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/   # backup

CA s_1..8 corr OBT (CBCB)

Issues

  • Used outie s_1,2 lib ; needs to be redone

runCA spec

 doOverlapTrimming =        1
 ovlOverlapper =            ovl
 unitigger =                bog         # cmd buildUnitigs

 *BatchSize =               2000000
 *BlockSize =               2000000

 ovlMemory   =              8GB
 doExtendClearRanges =      0          # fast run

Gatekeeper

  • ~ 58X cvg
 gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
 LOAD                  STATS         
 
 7                     libInput      
 7                     libLoaded     
 0                     libErrors     
 
 301738113             frgInput      
 301738113             frgLoaded     
 25974519              frgErrors     
 
 144840545             lkgInput      
 124575590             lkgLoaded     
 20264955              lkgErrors     
 0                     lkgWarnings   
 
 275763594             numRandom     
 301738113             numNormal     
 
 
 GLOBAL                STATS         
 
 301738113             FRG           
 7                     LIB           
 
 LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength   clearLength    #egrep 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT'
 GLOBAL                274282535     27455578       247151634    30524746650  29445986650  
 LegacyUnmatedReads    0             0              0            0            0            
 
 s_1                   10409320      891719         9039810      996306280    994802849      #1134233(11%)
 s_2                   2054575       41129          1987728      197829588    197593453      #224648(11%)
 s_3                   59894531      6328093        53925476     2382786529   2127257586     #3037085(5%)  
 s_5                   60127274      5917882        54512448     2459696212   2209078913     #5%
 s_6                   59813485      6113117        54062058     2396359740   2152656620     #5%
 s_7                   60278536      5914267        54647258     2476326572   2237423989     #5%
 s_8                   21704814      2249371        18976856     2435572545   2347304056     #5%
 gatekeeper -dumplibraries -tabular asm.gkpStore
 UID    IID  Orientation  Mean  StdDev  NumFeatures  
 s_1    1    I            4000  400     1582823088   
 s_2    2    I            8000  800     1582823088   
 s_3    3    I            400   40      1582823088   
 s_5    4    I            400   40      1582823088   
 s_6    5    I            400   40      1582823088   
 s_7    6    I            400   40      1582823088   
 s_8    7    I            400   40      1582823088   
 s_1,2 lib : ids:1..13,396,744
 overlap jobs that need to be redone : 1036 out of 11476 < 10%
 ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'

Meryl

 meryl -Dh -s 0-mercounts/asm-C-ms22-cm0 
 Found 23686053415 mers.
 Found 265670791 distinct mers.
 Found 8682767 unique mers.
 Largest mercount is 21,987,616; 668 mers are too big for histogram.
 ...
 #freq   count   %
 1       8682767 0.0327  0.0004
 2       4288602 0.0488  0.0007
 3       3041005 0.0603  0.0011
 ...
 72*     3803015 0.7150  0.3229
 ...
 1048158 1       1.0000  0.9274
 fasta2tab.pl  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/0-mercounts/asm.nmers.ovl.fasta   | sort -n -r | head -5
 21987616        ATATAACGTAATACGTTATAAC
 20992872        CGTTATAACGTATTACGTTATA
 20510821        ACGTTATAACGTATTACGTTAT
 20484950        CCATATAACGTAATACGTTATA
 20452950        CATATAACGTAATACGTTATAA
 cat egrep.sh 
 egrep -c 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/seq_obt/*clean.seq > egrep.count
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2
 ...

OBT

  • job count
 n=((301738113/2000000)+0.5)=151
 OBT/OVL jobs = n*n/2 =151*76=11476
 ~ 20 min/OBT job =~ 0.33 hr/OBT job  => 0.33*11476/32=5 days
  • OBT trimming
 key             sum             count           mean
 CLR             31958664367     301738113       105.915239043733
 OBTINITIAL      30779687358     301610679       102.051052900551
 OBTMERGE        30052401719     301610679       99.6397137483318
 OBTCHIMERA      30052138325     301610679       99.638840456972
 LATEST          30052138325     301610679       99.638840456972
  • use OBTCHIMERA as CLR if want to redo the assembly !!!

Overlap

  • job count : same as OBT
  • Overlap counts
 cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o 
 Lib    0  5'  199542773  50  
 Lib    0  3'  199542773  50  
 Lib    1  5'  6682072    38  
 Lib    1  3'  6682072    37  
 Lib    2  5'  1306720    37  ???
 Lib    2  3'  1306720    42  ???
 Lib    3  5'  43868663   50  
 Lib    3  3'  43868663   50  
 Lib    4  5'  43939366   51  
 Lib    4  3'  43939366   50  
 Lib    5  5'  43620676   50  
 Lib    5  3'  43620676   50  
 Lib    6  5'  43903838   51  
 Lib    6  3'  43903838   50  
 Lib    7  5'  16221438   50  
 Lib    7  3'  16221438   50
  • 199542773 out of 257786946 (77%) usable reads had overlaps
  • 7988792 out of 11443537 (69%) usable s_1,2 reads had overlaps

Frg/Ovl correction

  • Frgcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
  • Ovlcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist

OverlapStore

  • Failures
    • overlapStore: too many files (max 10240) ; fix: recompile the code
    • chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report

Bog

  • Run time: 14 hr
  • default unitigger : does not seem to work on such big datasets
  • Utg len stats:
 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 63916892   47     85     150    150    22861      125.23     150        8004296170     

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
 
 # Fix: 
 # update layout (CA 6.1)
 mkdir utg
 
 # update layout (CA 6.1)
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -u  $F[0] -d layout > utg/$F[0].layout\n";'  | ~/bin/scheduler.pl
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "~/bin/updateUTGlayout.pl utg/$F[0].layout >  utg/$F[0].layout.update\n ";'              | ~/bin/scheduler.pl 
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg/$F[0].layout.update\n" ;'             | ~/bin/scheduler.pl  -c 1

 grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | awk '{print $1}' | sort -u | p 'print "touch $F[0].success\n";'                                               | ~/bin/scheduler.pl 
 # create success file
 touch 5-consensus/consensus.success
 
 # relaunch runCA

Cgw

  • Failures
 grep -B 5 "^Failure message:" runCA.log 
 

Cgw consensus

  • job count : 84
 wc -l 7-CGW/asm.partitionInfo
  • ctg count: 1698486
 cat 7-CGW/asm.partitionInfo | /nfshomes/dpuiu/bin/sum.pl -i 3 -t ctgs | pretty -o

Terminator

  • Failure:
    • buildPosMap : segFault :ids too long???
 Fix: split asm file(optional) and reassign new accession numbers
 cd split/
 ~/bin/splitAsm.pl < asm.asm => ... CCO.asm, SCF.asm
 cat CCO.asm CLK.asm SCF.asm SLK.asm | ~/bin/newAsmIds.pl | buildPosMap -o asm

Stats

 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  15009      74     4786   9789   20021  415318     16916.33   29133      253897199      
 ctg                  145036     64     93     116    1035   138630     1498.67    7726       217361459      
 deg                  1507280    54     98     118    146    1434       148.28     137        223502287      
 utg                  2786974    54     90     110    142    25644      197.04     277        549140918      
 singletons           61129865
 usableSeqs           257786946
  • Stats (100bp+)
                       elem     min      max       mean    n50        sum 
contigs(>100)          99429    100      138630    2148    7972       213595680
scaffolds(span)        14992    100      415318    16935   29133      253895676
scaffolds(len)         14991    100      303946    14499   24091      217357577 
  • Library stats (totals)
 lib    ctg       sur      deg       singl     total
 s_1    2251597   162800   3119665   3988095   9522157
 s_2    424821    41107    563656    891795    1921379
 s_3    28880189  1078891  13600909  12735637  56295626
 s_5    28983834  1077334  13561390  12962731  56585289
 s_6    28663974  1062939  13559494  12896855  56183262
 s_7    28784505  1058364  13713200  13020274  56576343
 s_8    10455533  383065   5229814   4634478   20702874
 all    128444453 4864500  63348128  61129865  257786930
  • Library stats (percentages)
 lib    ctg  sur  deg  singl  total
 s_1    24   2    33   42     100
 s_2    22   2    29   46     100
 s_3    51   2    24   23     100
 s_5    51   2    24   23     100
 s_6    51   2    24   23     100
 s_7    51   2    24   23     100
 s_8    51   2    25   22     100

Mate redundancy

  • Criteria:
 Given 2 mates  
   1: a1,a2 
   2: b1,b1
 We have the following 2 overlaps
   1: a1,b1  a2,b2    
   2: a1,b2  a2,b1  
 At least 75% of one read overlaps the other read
  • Location:
 cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
 wc -l s_1.mates asm.posmap.frgutg.s_1.*
 5320163 s_1.mates                                     # input mates (from FRG file; 2 reads/mate)
 1232734 asm.posmap.frgutg.s_1.mates                   # assembled mates (both reads assembled) 
  366262 asm.posmap.frgutg.s_1.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy
  312206 asm.posmap.frgutg.s_1.redundant_mate_pairs    # redundant assembled mate pairs
 
 wc -l s_2.mates asm.posmap.frgutg.s_2.* 
 1034086 s_2.mates                                     # input mates (from FRG file; 2 reads/mate)
  255905 asm.posmap.frgutg.s_2.mates                   # assembled mates (both reads assembled) 
  110570 asm.posmap.frgutg.s_2.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy
 1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs    # redundant assembled mate pairs
   11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters

Runtimes

 #     step            hours
 0     gatekeeper      5.36   
 1     initialTrim     2.07   
 2     meryl           9.90   
 9     overlap.sh      11.31  
 11    find            0.00   
 16    overlapStore    13.79  
 17    consolidate     5.12   
 18    merge-trimming  1.41   
 19    chimera(1)      10.04  
 20    overlap.sh      49.86  
 21    find            0.00   
 22    overlapStore    16.68  
 23    frgcorr.sh      5.57   
 24    cat-corrects    0.10   
 25    ovlcorr.sh      4.64   
 26    cat-erates      0.90   
 27    overlapStore    6.67   
 28    buildUnitigs    13.88  
 29    gatekeeper      1.31   
 30    consensus.sh    3.41   
 31    cgw(1)          65.00  
 43    gatekeeper      4.39   
 44    consensus.sh    4.70   
 46    terminator      6.26   
 47    asmOutputFasta  7.75   
 48    dumpSingletons  3.68   
 49    buildPosMap(1)  3.79

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/                             # original assembly
 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR   # dumped OBT clear ranges, 0/1 undel/del
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor                        # backup

CA s_1..8 corr OBT redo (CBCB)

Issues

  • Fixed outie s_1,2 lib : clear ranges will be different
  • s_3..8 : stay the same

runCA spec

 doOverlapTrimming =        0         !!! already done in the prev assembly
 saveOverlaps  =            1         !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted
 ovlOverlapper =            ovl       # same
 unitigger =                bog       # same   
 
 BatchSize =                2000000    # must be the same so overlaps are not redone
 BlockSize =                2000000    # must be the same
 ...
 doExtendClearRanges =      1          # initially set to 2; seems to take too long => reset to 1 

Input

  • Modified frg files:
    • dumped "CA s_1..8 corr OBT" assembly gatekeeper
    • undelete act:D fragments so the ids stay the save
    • reverse s_1,2 reads

Gatekeeper

  • Make sure the iids match

Overlap

  • Rerun only the overlap jobs for the s_1,2 reads

OverlapStore

  • To build faster in the future
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)

Frgcorr

  • Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)

Ovlcorr

  • The error rate on overlaps with error went down by ~ 1.28%

Bog

 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 60970504   47     85     150    150    93191      126.17     150        7692801690
  • Unitig stats
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep -c ^unitig_num_frags                                    # 60,970,523  : total unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -c ' 1$'                       # 59,776,572  : unitigs with only 1 read
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -v ' 1$' | getSummary.pl -i 2  # stats for unitig with 2+ reads

               elem         min    q1     q2     q3     max        mean       n50        sum            
 1read/utg     59,776,572                                                                59,776,572
 2+reads/utg   1,193,951    2      3      6      28     74715      165.84     4712       198,010,374

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures: 1 failed unitig
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED.  

 Fix: same as before

Cgw

  • Failure:
    • after 7-0-CGW/asm.ckp.4
 tail runCA.log
 ERROR:  Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)

 37980:Scaffold 6409:
 37981-  Idx    Chunk     (   Start, Variance)  (     End, Variance)      Len
 37982-    0: 60983787    (       0,        0)  (   31646,      823)    31646
 37983-    1:   69691     (   35389,      934)  (   31650,      837)     3739
 37984-    2: 59907028    (   35103,      942)  (   42035,     1122)     6932
 37985-    3: 60799298    (   41640,     1132)  (   64897,     1737)    23257
 37986-    4: 60975976    (   64924,     1784)  (   88439,     2395)    23515
 37987-    5: 60988396    (   88431,     2416)  (  122023,     3289)    33592
 37988-    6: 60988698    (  121551,     3324)  (  123589,     3377)     2038
 37989-    7: 59905266*   (  122331,  -163479)  (  121859,  -163491)      472
 37990-    8: 60988599    (  122275,   170138)  (  134564,   170457)    12289
 ./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.
 ./AS_REZ/GapFillREZ.c:8339:       "ERROR:  Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n",  #message
 Fix: modfy the code; comment assertion statemants
 cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c
 // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX);
 // assert (min_right_variance >= 0.0);
  • Preliminary stats
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
 #7-0-CGW/rez.i36.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1820       84     162    4583   82337  3161523    125884.07  667487     229109005       # ids from 0..13113
 ctg                  26453      64     962    3756   10485  255820     8759.59    21800      231717378      
 ctg200+              23343      200    1760   4772   11873  255820     9908.34    21845      231290344 
 scf.gaps             877        1      4      12     36     373        27.71      64         24298
 #7-2-CGW/rez.i01.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  4903       66     1283   6908   39466  1148626    47212.57   209756     231483215      
 ctg                  23480      64     1370   4397   11757  255820     9858.75    23476      231483360
 7-2-CGW/rez.i06.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3818       64     268    6813   61948  1198480    60723.65   257457     231842914      
 ctg                  24005      64     1165   4161   11439  255820     9659.47    23819      231875677 
 #7-2-CGW/rez.i11.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3733       64     268    6880   65584  1391034    62111.99   258823     231864045      
 ctg                  24004      64     1165   4161   11439  255820     9659.98    23819      231878256     
 #7-2-CGW/stone.i02.log
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1854       83     153    4039   52506  2983045    125375.58  860583     232446325      
 ctg                  23863      64     674    3833   11368  255820     9761.94    25892      232949194       
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549      
 ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147      
  • TigStore
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail
 maID            isPresent       isDeleted       ptID    svID    fileOffset      covStat
 60970540        1               0               0       3       58023924696     84
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1 
 maID            isPresent       isDeleted       ptID    svID    fileOffset
 61001100        1               0               0       32      16307420
  • Checkpoints
 # 7-0-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               5        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       12              # we should see more iterations if the scaffolding goes well; also ECR would take longer
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"           
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"  
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1

 # 7-2-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               9        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1 
  ...
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       10
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"                1
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"       1
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  ...
 CHECKPOINT_AFTER_PARTIAL_STONES         "ckp10-PS"                1
 CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS"  
 CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC"   
 CHECKPOINT_AFTER_RESOLVE_SURROGATES     "ckp13-RS"   
 CHECKPOINT_AFTER_OUTPUT                 "ckp14-FIN"
  • Long running time : implemented 2 cgw options -ic, -im ....
  • Running times
 CGW-0: START Wed Apr 28 22:30:12 ; END Mon May  3 03:34:57
 ECR-1: START Mon May  3 03:34:58 ; END Mon May  3 12:28:54        
 CGW-2: START Mon May  3 12:28:54 ; End Wed May 12 22:05:00
  • Library insert histograms
 # Example
 cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist
 cat 7-0-CGW/stat/scaffold_final.distlib_2.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_2.cgm.hist
 gnuplot
   set xrange [0:10]
   plot "scaffold_final.distlib_1.cgm.hist" w l, "scaffold_final.distlib_2.cgm.hist" w l, 


  • Failure:
    • after 7-2-CGW/asm.ckp.42 (after running about 2hrs)
 tail runCA.log
 * Screwed up scaffold 3790: Chunk 60994012 has bad mean
 * Scaffold 3790 iteration 1  1 bad CIs out of 33 LSE:2664.55 improvement:1
 * Reinserting chunk 60994012 in scaffold 3790
 * improvement = 0.0721104 LSE = 2664.55
 ----------------------------------------
 Failure message:
 scaffolder failed

ECR

  • Count scaffold gaps : total/closed
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap'   # 24635
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap'      # 2803

Terminator

  • Get id range
 grep -m 1 ^7 asm.posmap.utglen asm.posmap.ctglen asm.posmap.deglen asm.posmap.scflen 
 asm.posmap.utglen:7180000000000 124    
 asm.posmap.deglen:7180001635156 133     
 asm.posmap.ctglen:7180001635158 124
 asm.posmap.scflen:7180002817601 161    
 =>
 utg: 7180000000000..7180001635155
 ctg/deg:7180001635156..7180002817600
 scf:7180002817601..7180002819521

Stats

  • Assembly stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1921       65     148    3389   55043  3122263    149121.08  982222     286461602      
 ctg                  91934      64     100    119    180    255828     2621.59    25123      241013378      
 deg                  1090511    54     103    124    152    7068       139.41     135        152024633      
 utg                  1635156    54     97     118    141    94720      267.81     1636       437917658
  • Assembly stats(100bp+)
                         elem     min      max       mean    n50        sum 
 contigs(>100)           69270    100      255828    3451    25383      239118680 
 scaffolds(span)         1870     100      3122263   153185  982222     286457118 
 scaffolds(len)          1870     100      2990211   128880  870182     241007381

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/           # original assembly
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo      # backup

CA s_1, 2 partial ,3..8 corr OBT (CBCB) **

Issues

  • Use only s_2 lib reads which seem 8K apart : ~ 14% of the s_2 reads aligned by soap2 to SOAPdenovo Illinois assembly within 8+-2Kbp of one another
  • Alignments (very similar results)
 soap2-index  Bimp.fasta                 #  soap2-index -> 2bwt-builder
 bowtie-build Bimp.fasta  Bimp.fasta 
 soap2 -D Bimp.fasta.index -a s_2.1.fastq -b s_2.2.fastq -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2.mated.soap2 -2 s_2.single.soap2
 bowtie   Bimp.fasta       -1 s_2.1.fastq -2 s_2.2.fastq -l 32 -p 8 -n 2 -I 6000 -X 10000 >  s_2.mated.bowtie
 bowtie   Bimp.fasta         s_2_0.fastq -l 32 -p 8 -n 2                                  >  s_2_0.bowtie

 wc -l s_2.mated.*
 288880 s_2.mated.soap2
 288668 s_2.mated.bowtie

 intersect.pl s_2.mated.*
 288368

runCA spec

 doExtendClearRanges=0 

Input

  • Gatekeeper: same as prev assembly
  • Overlaps: filter out all overlaps for most of the s_2 reads

Bog

 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 61930057   47     85     150    150    113529     125.85     150        7793894671      # max went up from 93191->113529

Bog consensus

  • Failures
 grep FAILED 5-consensus/asm*err : 113 unitigs between 5.. 33547 frags
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
  • Fix:
 grep    FAILED 5-consensus/asm*err | p 'print "$F[2]\n";' | perl ./fixLayout.pl | scheduler.pl 
 grep -c FAILED 5-consensus/*err | grep -v ':0$' | p '/(.+).err/; print "touch $1.success\n";' | sort -u | scheduler.pl

Cgw

  • Preliminary stats
 #7-0-CGW/stone.i02.log
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549      
 ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147
 cat 7-0-CGW/cgw.out
 ...
 Processed 61,930,044 unitigs with 301,738,113 fragments

Terminator

  • Message counts
 CTP     91502
 SCF     1896
 CLK     3288923

cat CLK.asm | grep ^ori: | count.pl ori:O 979139 ori:A 869982 ori:N 857732 ori:I 582070

Stats

  • Assembly stats
  .                    elem       min    q1     q2     q3     max        mean       n50        sum
  scf                  1896       76     150    4044   67922  4021294**  151760     1017298**  287738041
  ctg                  92307      63     100    119    186    297795**   2612.99    24781**    241197400
  deg                  1086650    54     103    124    152    6822       139.41     135        151484589
  utg                  1574225    54     98     119    143    123354     274.33     2027       431859732
  • Assembly stats(100bp+)
 .                       elem     min      max       mean    n50        sum            
 contigs(>100)           69944    100      297795    3421    25048      239324761      
 scaffolds(span)         1869     100      4021294   153951  1017298    287735536      
 scaffolds(len)          1869     100      3429736   128833  860270     240790206
  • Library stats
 .                    badLong  badOuttie  badSame  bothChaff  bothDegen  bothSurrogate  diffScaffold  good   oneChaff  oneDegen  oneSurrogate  
 HWI-EAS385_0001:1    0.15     0.06       0.01     19         10.18      0              0.45          46.9   21.41     1.59      0.19          
 HWI-EAS385_0001:2    0.13     0.01       0        84.86      0.02       0              0.27          14.2   0         0.36      0.02          
 HWI-EAS385_0001:3    0.06     0.01       0        14.45      9.03       0.04           0.05          60.11  15.17     0.68      0.16          
 HWI-EAS385_0001:5    0.06     0.01       0        14.84      8.93       0.04           0.05          59.98  15.02     0.65      0.16          
 HWI-EAS385_0001:6    0.06     0.01       0        14.76      9.04       0.04           0.05          59.7   15.29     0.65      0.16          
 HWI-EAS385_0001:7    0.06     0.01       0        14.73      9.12       0.04           0.05          59.51  15.44     0.64      0.16          
 HWI-EAS385_0001:8    0.06     0.01       0        14.05      10.1       0.03           0.05          58.66  15.92     0.7       0.16          
  • Issues
    • Gap sizes vary between -297,562 bp and 772,064 bp. They should be no longer than ~ 8Kbp (max insert size)
 cat 9-terminator/split/SCF.asm | grep -A 3 "^{CTP" | grep ^mea: | sed 's/mea://' | sort -n | head
 -297562
 -216729
 ...
 162239    # scf7180002754932.314143..476382  : 26 s_1,3,5,6,7 mate pairs link the two sides of the gap
 772064    # scf7180002754479.867224..1639288 : 2  s_1 mate pairs link the two sides of the gap
 ~/bin/posmap2scaff.pl 9-terminator/asm.posmap.ctgscf | grep -v "^>" | sort -nk4 | tail
 7180002653372 EB 89    69161
 7180002699293 EB 78    92310
 7180002731234 BE 64936 162239
 7180002726839 BE 46924 772064
 grep ^mea: SCF.asm | sed 's/mea://' | getSummary.pl -i 0 -t scf.gap | pretty
 .          elem   min          q1      q2       q3       max         mean     n50     sum            
 scf.gap    91502  -297562      19      222      667      772064      -141     772064  -12910029
 scf.gap-   13788  -297562      -2166   -89      -12      0           -4284    0       -59071991
 scf.gap+   77699  0            84      325      775      772064      594      1194    46161963

Location

 mulberry:/scratch0/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/       # original (to be deleted)
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/   # backup

Other

SOAPdenovo on CA contigs

 cat asm.ctg.fasta | ~/bin/assignFastaIds.pl | awk '{print $1}' > asm.contig
 lenseq asm.contig > asm.contig.len
 echo Edge_num `wc -l asm.contig.len` `wc -l asm.contig.len` > asm.ContigIndex
 echo "index   length  reverseComplement" >> asm.ContigIndex
 lenseq asm.K31.contig | awk '{print $1,$2,1}' >> asm.ContigIndex
 SOAPdenovo map   -s SOAPdenovo.config -g asm > SOAPdenovo.log        # ok

 SOAPdenovo scaff -g asm -F >> SOAPdenovo.log                         # fails: a nucmer of files are missing

Assembly comparison

CA.contigs vs SOAPdenovo.contigs

  • Filter contigs >=100 bp
  • Alignment
 nucmer --mum -c 100  CA.fasta SOAPdenovo.fasta

 show-coords -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | p ' $F[19]="." unless($F[19]); print $F[19],"\n";' | count.pl | pretty -o
 [CONTAINS]     80413  
 [CONTAINED]    4646   
 [IDENTITY]     1960   
 [BEGIN]        4152    
 [END]          4034   
 .              157     # partial alignments
 # get "broken" contigs
 show-coords -H        /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t all
 show-coords -L 500 -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t 500+
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 all                  118        155    4630   9817   29042  144499     22498.97   54311      2654878     
 500+                 16         1844   8205   16551  28426  66880      20514.88   28426      328238  
  • Location
 mulberry:/scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs

CA.contigs vs SOAPdenovo.scaffolds

  • All ctgs & scf
  • Alignment
 #ref/qry all
    .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 SOAPdenovo.scf(span)    12473      100    113    144    498    5933710    20428      1086614    254800874    
 SOAPdenovo.scf(len)     12473      100    113    143    341    5522831    18759.84   1034925    233991535
 CA.ctg                  92307      63     100    119    186    297795     2612       24781      241197400
 #best hits
 nucmer --mum -c 40 CA.fasta SOAPdenovo.fasta
 delta-filter -1 40 -i 95 -o 50 -1 SOAPdenovo-vs-CA.40.delta >! SOAPdenovo-vs-CA.40.filter-1.delta

Genbank submission

 http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=20FF4536-24A4-4D8C-94FA-0755E1326E33
 Project ID: 61101
 Submitting Organization:              University of Illinois & University of Maryland
 Sequencing Center:                    Keck Center for Comparative and Functional Genomics, University of Illinois  ??? BCUI ???  Biotechnology Center, Univ. Illinois (BCUI)
 Consortium Name:
 Organism Name:                        Bombus impatiens
 Strain/isolate/breed:
 Locus Tag Prefix:                     BIMP       #3+ letters
 Source of DNA used for sequencing:    haploid males
 Sequencing Method:                    wgs       ???
 Sequencing Technology:                Illumina
 Estimated Genome Size:                250Mb     # the haploid genome size 
 Brief description of the importnace:  #2 pollinator in the US
 Genome Coverage                       127x
 Sequencing Technology                 Illumina IIx 
 Comments to the staff:                
   DNA 
   whole genome sequencing 
   single genome; 
   no annotation
   assembly name BIMP_1.0 
   assembly method: Celera Assembler 6.1 
   plan to update: ?
   expect to release 
   agp:
  • Sequin:
 Manuscript:         
 Authors:     Hugh Robertson, Matt Hudson, Kim Walden, Aarti Venkat, Tom Newman, Daniela Puiu, Tanja Magoc, David Kelley, Aleksey Zimin, Steven Salzberg, Gene Robinson

  • Best assembly (CA)
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/
 ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bombus_impatiens/Assembly/CA/
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg                  92307      63     100    119    186    297795     2613       24781      241197400      
 ctg.contam           82         1331   10485  19924  31977  112638     25648      32851      2103095   
 ctg.clean            92225      63     100    119    185    297795     2593       24628      239094305     
 ctgNs                76         75     536    3124   6834   44990      5326       10261      404781         
 scf                  1896       76     150    4044   67922  4021294    151761     1017298    287738041
 scf.ctg.contam       24         14315  45008  71878  141921 1116872    141140     164885     3387362        
 deg                  1086650    54     103    124    152    6822       139        135        151484589   
 deg.contam           1129       65     114    141    207    1035       198        211        223588         
 deg.clean            1085521    54     103    124    152    6822       139        135        151261001      
 deg.Ns               37         98     145    234    426    914        313        437        11574             
  • Contaminant list: ???
 /fs/szattic-asmg4/tmagoc/bees_bacteria/IDs

1st submission

  • Submit all contigs (should have deleted the singletons < 200bp)
  • Directories & files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/ 
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.0.*
 server:      ftp-trace.ncbi.nlm.nih.gov
 user:        cbcb_trc
 directory:   test/
 files names: BIMP_1.1.sqn , BIMP_1.1.agp
 number contigs:   92,231 
 number scaffolds: 1,895
  • NCBI email (Dec 23rd)
 there are 3 contigs that need to be trimmed:
 Contig name	Length	Span(s)	Apparent source
 7180002728824	11695	72..148	Escherichia coli
 7180002730077	25748	22703..25611	Escherichia coli
 7180002740828	30102	134..1688	Escherichia coli
 Since these are internal hits, consider whether the contig needs to split (if the E.coli was the 'joining' element).
  • NOTE: When I blastn these 3 contigs to nr (distant blast) , all 3 contigs had multiple good hits to other bacteria , so probably they are all contaminated and should be removed completely, not just trimmed

2nd submission

  • Jan 24th
  number of scaffolds: 1191
  number of contigs:   91524
  • Directories & files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.1.*

NCBI release