Bumblebee: Difference between revisions
		
		
		
		Jump to navigation
		Jump to search
		
| (227 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
= Data =  | = Data =  | ||
* ~   | * ~ 500Mbp genome (diploid); 250Mbp halpoid  | ||
* Complete mitochondrion genomes:  | * Complete mitochondrion genomes:  | ||
   NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome  |    NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome  | ||
   NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome  |    NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome  | ||
   only 88% identity; no rearrangements, only snps, short indels  |    only 88% identity; no rearrangements, only snps, short indels  | ||
* NCBI TaxId  | |||
  http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=132113  | |||
== Traces ==  | == Traces ==  | ||
* 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)  | * 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)  | ||
<pre style="background:yellow">  | |||
   Lane   Insert   |    Lane   Insert  ReadLen    #Mates          #Reads        Coverage   Orientation     Comments                         #RepeatsReads  | ||
   1        |    1      4K      124        34,944,099                    14X        outie           865,687(1.2%) reads qual==0      30,576,297 (44%)  | ||
   2      8K(  |    2      8K      124        32,540,640                    13X        outie                                            26,972,424 (41%)  | ||
  3      400     124        34,745,750                    .                                                           8,319,335 (12%)  | |||
  5      400     124        34,601,239                    .                                                           8,398,759 (12%)  | |||
  6      400     124        34,553,857                    .   | |||
   7      400     124        34,682,612                    .   | |||
  8      400     124        12,975,839                    .   | |||
   9      3K      124        29,615,036                    .         outie            new3kb ; received 05/18/2010  | |||
   1-8                       219,044,036     438,088,072   108X  | |||
   1-  |    1-9                       248,659,072     497,318,144  | ||
  3-8    400     124        151,559,297     303,118,594   75X        innie  | |||
  1,9    3-4K    124        64,559,135      129,118,270   32X        outie  | |||
  2      8K      124        32,540,640      65,081,280    13X        outie  | |||
</pre>  | |||
   Two new 3kb libraries to come; ~ 70% are true pair ends  |    Two new 3kb libraries to come; ~ 70% are true pair ends  | ||
== Adaptors ==  | |||
   >  |    >circ  | ||
   CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA  |    CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA  | ||
   >  |    >circ.revcomp  | ||
   TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG    |    TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG    | ||
   >5  |    >5                                 | ||
   GATCGGAAGAGCGGTTCAGCAGGAATGCCGA  | |||
   >3  |    >3  | ||
   CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT  |    CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[AGATCTCGGTGGTCGCCGTATCATTA][A+]  | ||
                0        1         2         3         4   | |||
                123456789012345678901234567890123456789012  | |||
  circ          CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA  | |||
                    .............   ..   .............   | |||
  circ.revcomp  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG   | |||
  CGGCATTCCTGCTGAACC revcmp GGTTCAGCAGGAATGCCG  | |||
                0        1         2         3         4   | |||
                123456789012345678901234567890123456789012  | |||
                            GGTTCAGCAGGAATGCCG  | |||
  5             GATCGGAAGAGCGGTTCAGCAGGAATGCCGA  | |||
                CGGCATTCCTGCTGAACC  | |||
  3             CGGCATTCCTGCTGAACCGAGATCGGAAGAG  | |||
  #18mers (5')  | |||
  CGTAATAACTTCGTATAG  | |||
  TCGTATAACTTCGTATAA  | |||
  GATCGGAAGAGCGGTTCA  | |||
  CGGCATTCCTGCTGAACC  | |||
  #18mers (5',3')  | |||
  CGTAATAACTTCGTATAG  | |||
  TTATACGAAGTTATACGA   | |||
  TCGTATAACTTCGTATAA  | |||
  CTATACGAAGTTATTACG    | |||
  GATCGGAAGAGCGGTTCA  | |||
  GCAGGAATGCCGAGACCG  | |||
  CGGCATTCCTGCTGAACC  | |||
  CGTGTAGGGAAAGAGTGT  | |||
  #31 mers (5') # jellyfish uses 31mers  | |||
  CGTAATAACTTCGTATAGCATACATTATACG  | |||
  TCGTATAACTTCGTATAATGTATGCTATACG  | |||
  GATCGGAAGAGCGGTTCAGCAGGAATGCCGA  | |||
  CGGCATTCCTGCTGAACCGAGATCGGAAGAG  | |||
== Frequent kmers ==  | |||
                                                         s_1   s_2    s_3  | |||
     1  CGTAATAACTTCGTATAGCATA|CGTATAACTTCGTATAATGTAT    15.2  4.59   2.829    # not assembled   circAdapter  | |||
     2  AATGTATGCTATACGAAGTTAT|ATAACTTCGTATAGCATACATT    5.56  19.02  0.525    # not assembled   circAdapter  | |||
     3  CGTTATAACGTATTACGTTATA|TGTATGCTATACGAAGTTATTA    5.2   4.56   .           | |||
     4  TACATTATACGAAGTTATACGA|TATATGGTATGACGTTATAACG    3.39  1.87   0.365       | |||
     5  ATATAACGTAATACGTTATAAC|TATAACGTATTACGTTATATGG    2.86  2.71   1.595       | |||
     6  ACGTTATAACGTCATCCCATAT|ATACCATATAACGTAATACGTT    2.11  2.62   .  | |||
     7  TACGTTATAACGTCATACCATA|TCCATTTTACACTGCAAATGTA    1.98  2.04   1.749    | |||
     8  CTTAAACGTCGAAATATCTCCA|TTACGTTATATGGGATGACGTT    0.87  1.16   1.136    | |||
     9  GAGCGCTTAAACGTCGAAATAT|ATGAAATCTAGCCGTCATTTCC    0.73  1.02   1.442    | |||
    10  ACGTAATACGTTATAACGTCAT|ATTTCCTTCCCGTTGGAGATAT    0.67  1.1    1.477    | |||
    11  CTGCTGAACCGCTCTTCCGATC|GATCGGAAGAGCGGTTCAGCAG    0.61  3.92   0.045    # not assembled   5Adapter  | |||
    12  ACGGTCACAGCTTGTCTGTAAG|TGTAAAATGGAGTTCAACGAGC    0.31  1.73   0.798    | |||
    13  ACGGTGGTGCATTACATTTGCA|GGCGAATGGCGCCTGATGCGGT    0.28  0.87   0.949    # not assembled  | |||
    14  CAGAAAAGCATCTTACGGATGG|TACATACCTCGCTCTGCTAATC    0.04  2.47   .        # not assembled  | |||
* Location  | * Location  | ||
   /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs  |    /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs  | ||
   /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 reads  |   /fs/szattic-asmg4/Bees/Bombus_impatiens/new3kb/s_1_?_sequence.txt     : new3kb lib received 05/18/2010  | ||
   /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 new3kbr test reads  | |||
== Tasks to figure out ==  | == Tasks to figure out ==  | ||
| Line 116: | Line 186: | ||
* Data processing pipeline  | * Data processing pipeline  | ||
   s_[12]    : /nfshomes/dpuiu/bin/  |    s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFastq.amos  | ||
   s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos  |    s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos  | ||
| Line 128: | Line 198: | ||
== D Kelly's trimming ==  | == D Kelly's trimming ==  | ||
*   | * Method: http://github.com/davek44/error_correction  | ||
* Read lens stats  | |||
   s_1_1_sequence.cor.txt      |    lib                       reads        min  q1   q2   q3   max  mean    n50  sum           #repeatReads  | ||
   s_1_2_sequence.cor.txt      |    s_1_1_sequence.cor.txt    5,436,814    30   71   90   115  124  89.70   100  487673256     1,071,779 (20%)  | ||
   s_2_1_sequence.cor.txt      |    s_1_2_sequence.cor.txt    5,864,225    30   77   92   110  124  93.27   98   546956864     1,208,767 (20%)  | ||
   s_2_2_sequence.cor.txt      |    s_2_1_sequence.cor.txt    1,053,858    30   80   96   118  124  97.16   102  102395785     222,290   (21%)  | ||
   s_3_1_sequence.cor.txt      |    s_2_2_sequence.cor.txt    1,041,846    30   78   93   110  124  93.51   99   97426679      208,029   (20%)  | ||
   s_3_2_sequence.cor.txt      |    s_3_1_sequence.cor.txt    33,668,302   30   105  124  124  124  111.04  124  3738530761              (12%)  | ||
   s_5_1_sequence.cor.txt      |    s_3_2_sequence.cor.txt    32,554,322   30   82   108  124  124  100.42  120  3269039697      | ||
   s_5_2_sequence.cor.txt      |    s_5_1_sequence.cor.txt    33,579,744   30   109  124  124  124  111.65  124  3749192427      | ||
   s_6_1_sequence.cor.txt      |    s_5_2_sequence.cor.txt    32,465,412   30   84   112  124  124  102.13  124  3315553171      | ||
   s_6_2_sequence.cor.txt      |    s_6_1_sequence.cor.txt    33,535,725   30   109  124  124  124  111.60  124  3742602285      | ||
   s_7_1_sequence.cor.txt      |    s_6_2_sequence.cor.txt    32,390,877   30   83   109  124  124  100.92  123  3268875471      | ||
   s_7_2_sequence.cor.txt      |    s_7_1_sequence.cor.txt    33,674,235   30   112  124  124  124  112.19  124  3777917379      | ||
   s_8_1_sequence.cor.txt      |    s_7_2_sequence.cor.txt    32,518,568   30   84   110  124  124  101.60  124  3303807321      | ||
   s_8_2_sequence.cor.txt      |    s_8_1_sequence.cor.txt    11,777,821   30   113  124  124  124  112.30  124  1322658923      | ||
   s_8_2_sequence.cor.txt    12,176,364   30   84   110  124  124  101.51  124  1236034348    | |||
   s_9_1_sequence.cor.txt    21,273,603   30   87   124  124  124  104.52  124  2223545139     #new3kb added 05/18/2010  | |||
   s_9_2_sequence.cor.txt    20,769,027   30   86   124  124  124  103.68  124  2153436097     #new3kb added 05/18/2010  | |||
  s_1,2                     13,396,743     | |||
   s_1..8                    301,738,113  .  .    .    .    .    .       .    31958664367    | |||
   s_2..8 (64bp+)            275,763,594  .  .    .    .    .    .       .    30693543885    | |||
* Read/Mate counts  | |||
  lib                       reads        mates             range  | |||
  s_1                       11301039     5320163      | |||
  s_2                       2095704      1034086           11301040..13396743        | |||
  s_3                       66222624     31857304     | |||
  s_5                       66045156     31820553           | |||
  s_6                       65926602     31748713     | |||
  s_7                       66192803     31891468     | |||
  s_8                       23954185     11168258     | |||
  s_9                       42042630     13245740  | |||
  s_1,2                     13,396,743   6,354,249  | |||
  s_1-8                     301,738,113  144,840,545  | |||
  s_1-9                     343,780,743  158,086,285  | |||
* Files      | |||
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)  | |||
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg  | |||
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg  | |||
= Assembly =  | |||
*   | * SOAP version 1.04:  /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/  | ||
* CA version  6.0     /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/  | |||
== SOAPdenovo s_1..8 CBCB-corr (Illinois) ==  | |||
*   | * From: Kim Walden 04/30/2010  | ||
* Input:  s_1,2,3,5,6,7,8 reads   | |||
* Error correction: yes; CBCB method  | |||
* Kim's e-mail  | |||
  ''I ran it in a stepwise fashion, not adding any of the available options and only changing K=31.''  | |||
   ''I then ran their more recent GapCloser algorithm, not the version that is part of the package when using option -F in "scaff".  ''  | |||
  ''I used their "all" assembly once, and I noticed that it automatically uses the -F option during scaffolding''  | |||
   ''Also, I used your data from the ftp site without checking for redundancy in the 3kb and 8kb lanes; did Daniela do the same?''  | |||
* Summary   | |||
                          elements min_len  max_len   mean    n50        sum  | |||
  contigs(>100)           103858   100      85850     2227    7541       231387390      # prior to scaffolding;   | |||
  scaffolds&contigs(>100) 12663    100      4648588   19282   1040109    244171008      # post-scaffolding and gap-filling  | |||
  scaffolds(span)         3736     252      4648588   64874   1047164    242369898      # same prior data minus contigs that didn't assemble into scaffolds.  | |||
  scaffolds(len; no N's)  3736     172      4585806   64034   1027043    239233801      # computed by CBCB  | |||
* Locations  | |||
  ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/Bimp_Salz_corr_050310_NewGapCloser.fasta  | |||
  # backup  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/  | |||
  # online  | |||
  http://www.life.illinois.edu/robertson/private/  | |||
  Login: robertson  | |||
  password: dna  | |||
== SOAPdenovo s_1..8 CBCB-corr (CBCB) ==  | |||
*   | * Stats:  | ||
                          elem     min      max       mean    n50        sum   | |||
  contigs(all)            9943851  31       85850     60      45         594773555  | |||
  contigs(>100)           106213   100      85850     2177    7084       231246190  | |||
  scaffolds+contigs(>100) 12473    100      5933710   20428   1086614    254800874    # based on scafSeq file    | |||
  scaffolds(span)         3664     236      5933710   69090   1115496    253148288    # based on scafSeq file  | |||
  scaffolds(len)          3664     134      5522831   63411   1059378    232338949   | |||
  readsAssembled          208389968                                                   # based on readOnContig file  | |||
  readsFiltered           289681090                                                   # based on readOnContig & log file  | |||
  readsTotal              301738113                                                   # based on config file  | |||
  readsDeleted            441343                                                      # based on log file  | |||
* Run gapClosing on assembly:  | |||
  stillOpen 14784  | |||
  closed    42379  | |||
* Contig cvg: GC bias:  | |||
 [[Image:Bombus.contig10000.png]]  | |||
   cat asm.K31.contig100.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.709501493345893  | |||
   cat asm.K31.contig1K.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3   => -0.923151349747983  | |||
  #100+ ctgs  | |||
   .                    elem       min    q1     q2     q3     max        mean       n50        sum               |    .                    elem       min    q1     q2     q3     max        mean       n50        sum               | ||
   len                  106213     100    140    437    2420   85850      2177       7084       231246190        | |||
   gc                   106213     0.00   33.13  38.00  43.44  100.00     38.74      39         4115066  | |||
   cvg                  106213     0.6    39.0   50.8   58.9   160.8      47.38      54         5032329  | |||
  #1K+ ctgs  | |||
   .                    elem       min    q1     q2     q3     max        mean       n50        sum               |    .                    elem       min    q1     q2     q3     max        mean       n50        sum               | ||
   len                  41965      1000   1860   3295   6338   85850      5110       7726       214421699        | |||
   gc                   41965      19.18  34.31  37.96  41.79  59.25      38.19      38         1602711  | |||
   cvg                  41965      8.0    42.0   49.9   55.8   93.8       47.74      51         2003339  | |||
*   | * Run soap2 to map reads back to the assembly:  "-l 32 -v 2"   | ||
   readsAligned            173063062  | |||
* Locations  | |||
   mulberry:/scratch0/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/       # original  | |||
   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/   # backup  | |||
== SOAPdenovo   | == SOAPdenovo s_1..9 CBCB-corr (CBCB) ** ==  | ||
*   | * Uses the new 3Kbp library : s_9 (reads have been reversed)  | ||
=  | * Stats:  | ||
<pre style="background:yellow">  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum  | |||
  scf                  11178      100    111    135    390    5655980    23014      1205321    257251549  | |||
  ctg                  10856652   31     .      .      .      85850      57         43         627095607  | |||
  ctg100+              106741     100    .      .      .      85850**    2165       6939       231167576  | |||
   scfSeqContigs        67900      100    340    1788   4373   85819      3437       7586       233372689  | |||
   scfSeqContigsClosed  20651      100    128    488    7991   600179     11715      64083      241921308  | |||
</pre>  | |||
* Location  | |||
  mulberry:/scratch2/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/      # original  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/   # backup  | |||
== CA s_1..8 corr OBT (CBCB) ==  | |||
== CA s_1..8 corr OBT (CBCB)   | |||
=== Issues ===  | === Issues ===  | ||
| Line 284: | Line 359: | ||
   ovlMemory   =              8GB  |    ovlMemory   =              8GB  | ||
  doExtendClearRanges =      0          # fast run  | |||
=== Gatekeeper ===  | === Gatekeeper ===  | ||
* ~ 58X cvg  | |||
   gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...  |    gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...  | ||
   LOAD                  STATS            |    LOAD                  STATS            | ||
| Line 311: | Line 389: | ||
   7                     LIB              |    7                     LIB              | ||
   LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength   clearLength   |    LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength   clearLength    #egrep 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT'  | ||
   GLOBAL                274282535     27455578       247151634    30524746650  29445986650     |    GLOBAL                274282535     27455578       247151634    30524746650  29445986650     | ||
   LegacyUnmatedReads    0             0              0            0            0               |    LegacyUnmatedReads    0             0              0            0            0               | ||
   s_1                   10409320      891719         9039810      996306280    994802849   |    s_1                   10409320      891719         9039810      996306280    994802849      #1134233(11%)  | ||
   s_2                   2054575       41129          1987728      197829588    197593453   |    s_2                   2054575       41129          1987728      197829588    197593453      #224648(11%)  | ||
   s_3                   59894531      6328093        53925476     2382786529   2127257586   |    s_3                   59894531      6328093        53925476     2382786529   2127257586     #3037085(5%)    | ||
   s_5                   60127274      5917882        54512448     2459696212   2209078913   |    s_5                   60127274      5917882        54512448     2459696212   2209078913     #5%  | ||
   s_6                   59813485      6113117        54062058     2396359740   2152656620   |    s_6                   59813485      6113117        54062058     2396359740   2152656620     #5%  | ||
   s_7                   60278536      5914267        54647258     2476326572   2237423989   |    s_7                   60278536      5914267        54647258     2476326572   2237423989     #5%  | ||
   s_8                   21704814      2249371        18976856     2435572545   2347304056   |    s_8                   21704814      2249371        18976856     2435572545   2347304056     #5%  | ||
   gatekeeper -dumplibraries -tabular asm.gkpStore  |    gatekeeper -dumplibraries -tabular asm.gkpStore  | ||
| Line 335: | Line 413: | ||
   s_1,2 lib : ids:1..13,396,744  |    s_1,2 lib : ids:1..13,396,744  | ||
   overlap jobs that need to be redone : 1036 out of 11476 < 10%  |    overlap jobs that need to be redone : 1036 out of 11476 < 10%  | ||
   ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'    |    ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'  | ||
=== Meryl ===    | === Meryl ===    | ||
   meryl -Dh -s 0-mercounts/asm-C-ms22-cm0    |    meryl -Dh -s 0-mercounts/asm-C-ms22-cm0    | ||
   Found 23686053415 mers.  |    Found 23686053415 mers.  | ||
   Found 265670791 distinct mers.  |    Found 265670791 distinct mers.  | ||
   Found 8682767 unique mers.  |    Found 8682767 unique mers.  | ||
  Largest mercount is 21,987,616; 668 mers are too big for histogram.  | |||
   ...  |    ...  | ||
   #freq   count   %  |    #freq   count   %  | ||
   1       8682767 0.0327  0.0004  |    1       8682767 0.0327  0.0004  | ||
   2       4288602 0.0488  0.0007  |    2       4288602 0.0488  0.0007  | ||
  3       3041005 0.0603  0.0011  | |||
  ...  | |||
   72*     3803015 0.7150  0.3229  |    72*     3803015 0.7150  0.3229  | ||
   ...  | |||
  1048158 1       1.0000  0.9274  | |||
  fasta2tab.pl  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/0-mercounts/asm.nmers.ovl.fasta   | sort -n -r | head -5  | |||
  21987616        ATATAACGTAATACGTTATAAC  | |||
  20992872        CGTTATAACGTATTACGTTATA  | |||
  20510821        ACGTTATAACGTATTACGTTAT  | |||
  20484950        CCATATAACGTAATACGTTATA  | |||
  20452950        CATATAACGTAATACGTTATAA  | |||
===  OBT  |   cat egrep.sh   | ||
  egrep -c 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/seq_obt/*clean.seq > egrep.count  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2  | |||
  ...  | |||
===  OBT ===  | |||
* job count    | * job count    | ||
| Line 365: | Line 460: | ||
* use OBTCHIMERA as CLR if want to redo the assembly !!!  | * use OBTCHIMERA as CLR if want to redo the assembly !!!  | ||
=== Overlap ===  | |||
* job count : same as OBT   | |||
* Overlap counts  | * Overlap counts  | ||
| Line 383: | Line 482: | ||
   Lib    6  3'  43903838   50     |    Lib    6  3'  43903838   50     | ||
   Lib    7  5'  16221438   50     |    Lib    7  5'  16221438   50     | ||
   Lib    7  3'  16221438   50     |    Lib    7  3'  16221438   50  | ||
* 199542773 out of 257786946 (77%) usable reads had overlaps         | |||
* 7988792   out of 11443537  (69%) usable s_1,2 reads had overlaps  | |||
=== Frg/Ovl correction ===  | |||
* Frgcorr jobs: 151  | * Frgcorr jobs: 151  | ||
| Line 390: | Line 494: | ||
* Ovlcorr jobs: 151  | * Ovlcorr jobs: 151  | ||
   wc -l 3-overlapcorrection/cat-corrects.frgcorrlist  |    wc -l 3-overlapcorrection/cat-corrects.frgcorrlist  | ||
=== OverlapStore ===  | |||
* Failures  | * Failures  | ||
| Line 409: | Line 515: | ||
   wc -l 4-unitigger/asm.partitioningInfo  |    wc -l 4-unitigger/asm.partitioningInfo  | ||
   110  |    110  | ||
* Failures  | * Failures  | ||
   grep FAILED 5-consensus/asm*err  |    grep FAILED 5-consensus/asm*err  | ||
   5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.  |    5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.  | ||
   Fix:  | |||
   # update layout  |   # Fix:    | ||
   tigStore -g asm.gkpStore -t asm.tigStore 2 -u   |    # update layout (CA 6.1)  | ||
   ~/bin/updateUTGlayout.pl utg  |   mkdir utg  | ||
   tigStore -g asm.gkpStore -t asm.tigStore 2   | |||
  # update layout (CA 6.1)  | |||
   grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -u  $F[0] -d layout > utg/$F[0].layout\n";'  | ~/bin/scheduler.pl  | |||
   grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "~/bin/updateUTGlayout.pl utg/$F[0].layout >  utg/$F[0].layout.update\n ";'              | ~/bin/scheduler.pl   | |||
   grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg/$F[0].layout.update\n" ;'             | ~/bin/scheduler.pl  -c 1  | |||
   grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | awk '{print $1}' | sort -u | p 'print "touch $F[0].success\n";'                                               | ~/bin/scheduler.pl   | |||
  # create success file  | |||
  touch 5-consensus/consensus.success  | |||
   # relaunch runCA  |    # relaunch runCA  | ||
| Line 438: | Line 552: | ||
* Failure:    | * Failure:    | ||
** buildPosMap : segFault :ids too long???  | ** buildPosMap : segFault :ids too long???  | ||
  Fix: split asm file(optional) and reassign new accession numbers  | |||
  cd split/  | |||
  ~/bin/splitAsm.pl < asm.asm => ... CCO.asm, SCF.asm  | |||
  cat CCO.asm CLK.asm SCF.asm SLK.asm | ~/bin/newAsmIds.pl | buildPosMap -o asm  | |||
=== Stats ===  | === Stats ===  | ||
* [[Media:Bb.CA.s_1-8.cor.qc|Bb.qc]]  | |||
* Stats  | |||
   .                    elem       min    q1     q2     q3     max        mean       n50        sum               |    .                    elem       min    q1     q2     q3     max        mean       n50        sum               | ||
   scf                  15009      74     4786   9789   20021  415318     16916.33   29133      253897199         |    scf                  15009      74     4786   9789   20021  415318     16916.33   29133      253897199         | ||
| Line 453: | Line 567: | ||
   utg                  2786974    54     90     110    142    25644      197.04     277        549140918         |    utg                  2786974    54     90     110    142    25644      197.04     277        549140918         | ||
   singletons           61129865  |    singletons           61129865  | ||
   usableSeqs           257786946  | |||
* Stats (100bp+)  | |||
                        elem     min      max       mean    n50        sum   | |||
 contigs(>100)          99429    100      138630    2148    7972       213595680  | |||
 scaffolds(span)        14992    100      415318    16935   29133      253895676  | |||
 scaffolds(len)         14991    100      303946    14499   24091      217357577   | |||
* Library stats (totals)  | |||
  lib    ctg       sur      deg       singl     total  | |||
  s_1    2251597   162800   3119665   3988095   9522157  | |||
  s_2    424821    41107    563656    891795    1921379  | |||
  s_3    28880189  1078891  13600909  12735637  56295626  | |||
  s_5    28983834  1077334  13561390  12962731  56585289  | |||
  s_6    28663974  1062939  13559494  12896855  56183262  | |||
  s_7    28784505  1058364  13713200  13020274  56576343  | |||
  s_8    10455533  383065   5229814   4634478   20702874  | |||
  all    128444453 4864500  63348128  61129865  257786930  | |||
* Library stats (percentages)  | |||
  lib    ctg  sur  deg  singl  total  | |||
  s_1    24   2    33   42     100  | |||
  s_2    22   2    29   46     100  | |||
  s_3    51   2    24   23     100  | |||
  s_5    51   2    24   23     100  | |||
  s_6    51   2    24   23     100  | |||
  s_7    51   2    24   23     100  | |||
  s_8    51   2    25   22     100  | |||
=== Mate redundancy ===  | |||
* Criteria:   | |||
  Given 2 mates    | |||
    1: a1,a2   | |||
    2: b1,b1  | |||
  We have the following 2 overlaps  | |||
    1: a1,b1  a2,b2      | |||
    2: a1,b2  a2,b1    | |||
  At least 75% of one read overlaps the other read  | |||
* Location:  | |||
  cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/  | |||
  wc -l s_1.mates asm.posmap.frgutg.s_1.*  | |||
  5320163 s_1.mates                                     # input mates (from FRG file; 2 reads/mate)  | |||
  1232734 asm.posmap.frgutg.s_1.mates                   # assembled mates (both reads assembled)   | |||
   366262 asm.posmap.frgutg.s_1.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy  | |||
   312206 asm.posmap.frgutg.s_1.redundant_mate_pairs    # redundant assembled mate pairs  | |||
  wc -l s_2.mates asm.posmap.frgutg.s_2.*   | |||
  1034086 s_2.mates                                     # input mates (from FRG file; 2 reads/mate)  | |||
   255905 asm.posmap.frgutg.s_2.mates                   # assembled mates (both reads assembled)   | |||
   110570 asm.posmap.frgutg.s_2.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy  | |||
  1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs    # redundant assembled mate pairs  | |||
    11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters  | |||
=== Runtimes ===  | |||
  #     step            hours  | |||
  0     gatekeeper      5.36     | |||
  1     initialTrim     2.07     | |||
  2     meryl           9.90     | |||
  9     overlap.sh      11.31    | |||
  11    find            0.00     | |||
  16    overlapStore    13.79    | |||
  17    consolidate     5.12     | |||
  18    merge-trimming  1.41     | |||
  19    chimera(1)      10.04    | |||
  20    overlap.sh      49.86    | |||
  21    find            0.00     | |||
  22    overlapStore    16.68    | |||
  23    frgcorr.sh      5.57     | |||
  24    cat-corrects    0.10     | |||
  25    ovlcorr.sh      4.64     | |||
  26    cat-erates      0.90     | |||
  27    overlapStore    6.67     | |||
  28    buildUnitigs    13.88    | |||
  29    gatekeeper      1.31     | |||
  30    consensus.sh    3.41     | |||
  31    cgw(1)          65.00    | |||
  43    gatekeeper      4.39     | |||
  44    consensus.sh    4.70     | |||
  46    terminator      6.26     | |||
  47    asmOutputFasta  7.75     | |||
  48    dumpSingletons  3.68     | |||
  49    buildPosMap(1)  3.79  | |||
=== Location ===  | === Location ===  | ||
   ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/ # assembly  |    ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/                             # original assembly  | ||
   ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR   |    ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR   # dumped OBT clear ranges, 0/1 undel/del  | ||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor                        # backup  | |||
== CA s_1..8 corr OBT redo (CBCB)   | == CA s_1..8 corr OBT redo (CBCB) ==  | ||
=== Issues ===  | === Issues ===  | ||
* Fixed outie s_1,2 lib : clear ranges will be different  | |||
* s_3..8 : stay the same  | |||
=== runCA spec ===  | === runCA spec ===  | ||
   doOverlapTrimming =        0         !!! already done in the prev assembly  |    doOverlapTrimming =        0         !!! already done in the prev assembly  | ||
  saveOverlaps  =            1         !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted  | |||
   ovlOverlapper =            ovl       # same  |    ovlOverlapper =            ovl       # same  | ||
   unitigger =                bog       # same      |    unitigger =                bog       # same      | ||
   BatchSize =   |    BatchSize =                2000000    # must be the same so overlaps are not redone  | ||
   BlockSize =   |    BlockSize =                2000000    # must be the same  | ||
   ...  |    ...  | ||
  doExtendClearRanges =      1          # initially set to 2; seems to take too long => reset to 1   | |||
=== Input ===  | === Input ===  | ||
Modified frg files: dumped   | * Modified frg files:    | ||
** dumped "CA s_1..8 corr OBT" assembly gatekeeper    | |||
** undelete act:D fragments so the ids stay the save  | |||
** reverse s_1,2 reads  | |||
=== Gatekeeper ===  | === Gatekeeper ===  | ||
* Make sure the iids match  | |||
=== Overlap ===  | === Overlap ===  | ||
Rerun only   | * Rerun only the overlap jobs for the s_1,2 reads  | ||
=== OverlapStore ===  | |||
*  To build faster in the future  | |||
 e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job    | |||
 e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job   | |||
 total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)  | |||
=== Frgcorr ===    | === Frgcorr ===    | ||
* Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)  | |||
=== Ovlcorr ===  | |||
* The error rate on overlaps with error went down by ~ 1.28%  | |||
=== Bog  ===  | |||
  cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1  | |||
  elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  60970504   47     85     150    150    93191      126.17     150        7692801690  | |||
* Unitig stats  | |||
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep -c ^unitig_num_frags                                    # 60,970,523  : total unitigs  | |||
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -c ' 1$'                       # 59,776,572  : unitigs with only 1 read  | |||
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -v ' 1$' | getSummary.pl -i 2  # stats for unitig with 2+ reads  | |||
                elem         min    q1     q2     q3     max        mean       n50        sum              | |||
  1read/utg     59,776,572                                                                59,776,572  | |||
  2+reads/utg   1,193,951    2      3      6      28     74715      165.84     4712       198,010,374  | |||
=== Bog consensus ===   | |||
* job count :   | |||
  wc -l 4-unitigger/asm.partitioningInfo  | |||
  110  | |||
* Failures: 1 failed unitig  | |||
  grep FAILED 5-consensus/asm*err  | |||
  5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED.    | |||
  Fix: same as before  | |||
=== Cgw ===  | |||
* Failure:   | |||
** after 7-0-CGW/asm.ckp.4  | |||
  tail runCA.log  | |||
  ERROR:  Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)  | |||
  37980:Scaffold 6409:  | |||
  37981-  Idx    Chunk     (   Start, Variance)  (     End, Variance)      Len  | |||
  37982-    0: 60983787    (       0,        0)  (   31646,      823)    31646  | |||
  37983-    1:   69691     (   35389,      934)  (   31650,      837)     3739  | |||
  37984-    2: 59907028    (   35103,      942)  (   42035,     1122)     6932  | |||
  37985-    3: 60799298    (   41640,     1132)  (   64897,     1737)    23257  | |||
  37986-    4: 60975976    (   64924,     1784)  (   88439,     2395)    23515  | |||
  37987-    5: 60988396    (   88431,     2416)  (  122023,     3289)    33592  | |||
  37988-    6: 60988698    (  121551,     3324)  (  123589,     3377)     2038  | |||
  37989-    7: 59905266*   (  122331,  -163479)  (  121859,  -163491)      472  | |||
  37990-    8: 60988599    (  122275,   170138)  (  134564,   170457)    12289  | |||
  ./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.  | |||
  ./AS_REZ/GapFillREZ.c:8339:       "ERROR:  Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n",  #message  | |||
  Fix: modfy the code; comment assertion statemants  | |||
  cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c  | |||
  // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX);  | |||
  // assert (min_right_variance >= 0.0);  | |||
* Preliminary stats  | |||
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf  | |||
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg  | |||
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+  | |||
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps  | |||
  #7-0-CGW/rez.i36.log :  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  1820       84     162    4583   82337  3161523    125884.07  667487     229109005       # ids from 0..13113  | |||
  ctg                  26453      64     962    3756   10485  255820     8759.59    21800      231717378        | |||
  ctg200+              23343      200    1760   4772   11873  255820     9908.34    21845      231290344   | |||
  scf.gaps             877        1      4      12     36     373        27.71      64         24298  | |||
  #7-2-CGW/rez.i01.log :  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  4903       66     1283   6908   39466  1148626    47212.57   209756     231483215        | |||
  ctg                  23480      64     1370   4397   11757  255820     9858.75    23476      231483360  | |||
  7-2-CGW/rez.i06.log   | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  3818       64     268    6813   61948  1198480    60723.65   257457     231842914        | |||
  ctg                  24005      64     1165   4161   11439  255820     9659.47    23819      231875677   | |||
  #7-2-CGW/rez.i11.log   | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  3733       64     268    6880   65584  1391034    62111.99   258823     231864045        | |||
  ctg                  24004      64     1165   4161   11439  255820     9659.98    23819      231878256       | |||
  #7-2-CGW/stone.i02.log  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  1854       83     153    4039   52506  2983045    125375.58  860583     232446325        | |||
  ctg                  23863      64     674    3833   11368  255820     9761.94    25892      232949194         | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549        | |||
  ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147        | |||
* TigStore  | |||
  tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail  | |||
  maID            isPresent       isDeleted       ptID    svID    fileOffset      covStat  | |||
  60970540        1               0               0       3       58023924696     84  | |||
  tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1   | |||
  maID            isPresent       isDeleted       ptID    svID    fileOffset  | |||
  61001100        1               0               0       32      16307420  | |||
* Checkpoints   | |||
  # 7-0-CGW                | |||
                                                                    #times   iterMax  | |||
  CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"                 | |||
  CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               5        10  | |||
  CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1  | |||
  CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       12              # we should see more iterations if the scaffolding goes well; also ECR would take longer  | |||
  CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1  | |||
  CHECKPOINT_AFTER_STONES                 "ckp06-AS"             | |||
  CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"    | |||
  CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1  | |||
  CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1  | |||
  # 7-2-CGW                | |||
                                                                    #times   iterMax  | |||
  CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"                 | |||
  CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               9        10  | |||
  CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1   | |||
   ...  | |||
  CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       10  | |||
  CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1  | |||
  CHECKPOINT_AFTER_STONES                 "ckp06-AS"                1  | |||
  CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"       1  | |||
  CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1  | |||
  CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1  | |||
   ...  | |||
  CHECKPOINT_AFTER_PARTIAL_STONES         "ckp10-PS"                1  | |||
  CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS"    | |||
  CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC"     | |||
  CHECKPOINT_AFTER_RESOLVE_SURROGATES     "ckp13-RS"     | |||
  CHECKPOINT_AFTER_OUTPUT                 "ckp14-FIN"  | |||
* Long running time : implemented 2 cgw options -ic, -im ....  | |||
* Running times  | |||
  CGW-0: START Wed Apr 28 22:30:12 ; END Mon May  3 03:34:57  | |||
  ECR-1: START Mon May  3 03:34:58 ; END Mon May  3 12:28:54          | |||
  CGW-2: START Mon May  3 12:28:54 ; End Wed May 12 22:05:00  | |||
* Library insert histograms  | |||
  # Example  | |||
  cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist  | |||
  cat 7-0-CGW/stat/scaffold_final.distlib_2.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_2.cgm.hist  | |||
  gnuplot  | |||
    set xrange [0:10]  | |||
    plot "scaffold_final.distlib_1.cgm.hist" w l, "scaffold_final.distlib_2.cgm.hist" w l,   | |||
* Failure:   | |||
** after 7-2-CGW/asm.ckp.42 (after running about 2hrs)  | |||
  tail runCA.log  | |||
  * Screwed up scaffold 3790: Chunk 60994012 has bad mean  | |||
  * Scaffold 3790 iteration 1  1 bad CIs out of 33 LSE:2664.55 improvement:1  | |||
  * Reinserting chunk 60994012 in scaffold 3790  | |||
  * improvement = 0.0721104 LSE = 2664.55  | |||
  ----------------------------------------  | |||
  Failure message:  | |||
  scaffolder failed  | |||
=== ECR ===  | |||
* Count scaffold gaps : total/closed  | |||
  cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap'   # 24635  | |||
  cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap'      # 2803  | |||
=== Terminator ===  | |||
* Get id range  | |||
  grep -m 1 ^7 asm.posmap.utglen asm.posmap.ctglen asm.posmap.deglen asm.posmap.scflen   | |||
  asm.posmap.utglen:7180000000000 124      | |||
  asm.posmap.deglen:7180001635156 133       | |||
  asm.posmap.ctglen:7180001635158 124  | |||
  asm.posmap.scflen:7180002817601 161      | |||
  =>  | |||
  utg: 7180000000000..7180001635155  | |||
  ctg/deg:7180001635156..7180002817600  | |||
  scf:7180002817601..7180002819521  | |||
=== Stats ===  | |||
* Assembly stats  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  1921       65     148    3389   55043  3122263    149121.08  982222     286461602        | |||
  ctg                  91934      64     100    119    180    255828     2621.59    25123      241013378        | |||
  deg                  1090511    54     103    124    152    7068       139.41     135        152024633        | |||
  utg                  1635156    54     97     118    141    94720      267.81     1636       437917658  | |||
* Assembly stats(100bp+)  | |||
                          elem     min      max       mean    n50        sum   | |||
  contigs(>100)           69270    100      255828    3451    25383      239118680   | |||
  scaffolds(span)         1870     100      3122263   153185  982222     286457118   | |||
  scaffolds(len)          1870     100      2990211   128880  870182     241007381  | |||
=== Location ===  | === Location ===  | ||
   ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/ # assembly  |    ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/           # original assembly  | ||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo      # backup  | |||
== CA s_1..8 corr   | == CA s_1, 2 partial ,3..8 corr OBT (CBCB) ** ==  | ||
=== Issues ===  | === Issues ===  | ||
* Use only s_2 lib reads which seem 8K apart :  ~ 14% of the s_2 reads aligned by soap2 to SOAPdenovo Illinois assembly within 8+-2Kbp of one another  | |||
* Alignments (very similar results)  | |||
  soap2-index  Bimp.fasta                 #  soap2-index -> 2bwt-builder  | |||
  bowtie-build Bimp.fasta  Bimp.fasta   | |||
  soap2 -D Bimp.fasta.index -a s_2.1.fastq -b s_2.2.fastq -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2.mated.soap2 -2 s_2.single.soap2  | |||
  bowtie   Bimp.fasta       -1 s_2.1.fastq -2 s_2.2.fastq -l 32 -p 8 -n 2 -I 6000 -X 10000 >  s_2.mated.bowtie  | |||
  bowtie   Bimp.fasta         s_2_0.fastq -l 32 -p 8 -n 2                                  >  s_2_0.bowtie  | |||
  wc -l s_2.mated.*  | |||
  288880 s_2.mated.soap2  | |||
  288668 s_2.mated.bowtie  | |||
  intersect.pl s_2.mated.*  | |||
  288368  | |||
=== runCA spec ===  | |||
  doExtendClearRanges=0   | |||
===   | === Input ===  | ||
* Gatekeeper: same as prev assembly  | |||
* Overlaps: filter out all overlaps for most of the s_2 reads  | |||
===    | === Bog  ===  | ||
  cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1  | |||
   elem       min    q1     q2     q3     max        mean       n50        sum               |    elem       min    q1     q2     q3     max        mean       n50        sum               | ||
   61930057   47     85     150    150    113529     125.85     150        7793894671      # max went up from 93191->113529  | |||
=== Bog consensus ===  | |||
* Location:  | * Failures  | ||
  grep FAILED 5-consensus/asm*err : 113 unitigs between 5.. 33547 frags  | |||
  5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.  | |||
* Fix:  | |||
  grep    FAILED 5-consensus/asm*err | p 'print "$F[2]\n";' | perl ./fixLayout.pl | scheduler.pl   | |||
  grep -c FAILED 5-consensus/*err | grep -v ':0$' | p '/(.+).err/; print "touch $1.success\n";' | sort -u | scheduler.pl  | |||
=== Cgw ===  | |||
* Preliminary stats  | |||
  #7-0-CGW/stone.i02.log  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549        | |||
  ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147  | |||
  cat 7-0-CGW/cgw.out  | |||
  ...  | |||
  Processed 61,930,044 unitigs with 301,738,113 fragments  | |||
=== Terminator ===  | |||
* Message counts  | |||
  CTP     91502  | |||
  SCF     1896  | |||
  CLK     3288923  | |||
cat CLK.asm | grep ^ori: | count.pl  | |||
ori:O   979139  | |||
ori:A   869982  | |||
ori:N   857732  | |||
ori:I   582070  | |||
=== Stats ===  | |||
* Assembly stats  | |||
<pre style="background:yellow">  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum  | |||
  scf                  1896       76     150    4044   67922  4021294**  151760     1017298**  287738041  | |||
  ctg                  92307      63     100    119    186    297795**   2612.99    24781**    241197400  | |||
  deg                  1086650    54     103    124    152    6822       139.41     135        151484589  | |||
  utg                  1574225    54     98     119    143    123354     274.33     2027       431859732  | |||
</pre>  | |||
* Assembly stats(100bp+)  | |||
  .                       elem     min      max       mean    n50        sum              | |||
  contigs(>100)           69944    100      297795    3421    25048      239324761        | |||
  scaffolds(span)         1869     100      4021294   153951  1017298    287735536        | |||
  scaffolds(len)          1869     100      3429736   128833  860270     240790206  | |||
* Library stats  | |||
  .                    badLong  badOuttie  badSame  bothChaff  bothDegen  bothSurrogate  diffScaffold  good   oneChaff  oneDegen  oneSurrogate    | |||
  HWI-EAS385_0001:1    0.15     0.06       0.01     19         10.18      0              0.45          46.9   21.41     1.59      0.19            | |||
  HWI-EAS385_0001:2    0.13     0.01       0        84.86      0.02       0              0.27          14.2   0         0.36      0.02            | |||
  HWI-EAS385_0001:3    0.06     0.01       0        14.45      9.03       0.04           0.05          60.11  15.17     0.68      0.16            | |||
  HWI-EAS385_0001:5    0.06     0.01       0        14.84      8.93       0.04           0.05          59.98  15.02     0.65      0.16            | |||
  HWI-EAS385_0001:6    0.06     0.01       0        14.76      9.04       0.04           0.05          59.7   15.29     0.65      0.16            | |||
  HWI-EAS385_0001:7    0.06     0.01       0        14.73      9.12       0.04           0.05          59.51  15.44     0.64      0.16            | |||
  HWI-EAS385_0001:8    0.06     0.01       0        14.05      10.1       0.03           0.05          58.66  15.92     0.7       0.16            | |||
* Issues  | |||
** Gap sizes vary between  -297,562 bp and 772,064  bp.  They should be no longer than ~ 8Kbp (max insert size)   | |||
  cat 9-terminator/split/SCF.asm | grep -A 3 "^{CTP" | grep ^mea: | sed 's/mea://' | sort -n | head  | |||
  -297562  | |||
  -216729  | |||
  ...  | |||
  162239    # scf7180002754932.314143..476382  : 26 s_1,3,5,6,7 mate pairs link the two sides of the gap  | |||
  772064    # scf7180002754479.867224..1639288 : 2  s_1 mate pairs link the two sides of the gap  | |||
  ~/bin/posmap2scaff.pl 9-terminator/asm.posmap.ctgscf | grep -v "^>" | sort -nk4 | tail  | |||
  7180002653372 EB 89    69161  | |||
  7180002699293 EB 78    92310  | |||
  7180002731234 BE 64936 162239  | |||
  7180002726839 BE 46924 772064  | |||
  grep ^mea: SCF.asm | sed 's/mea://' | getSummary.pl -i 0 -t scf.gap | pretty  | |||
  .          elem   min          q1      q2       q3       max         mean     n50     sum              | |||
  scf.gap    91502  -297562      19      222      667      772064      -141     772064  -12910029  | |||
  scf.gap-   13788  -297562      -2166   -89      -12      0           -4284    0       -59071991  | |||
  scf.gap+   77699  0            84      325      775      772064      594      1194    46161963  | |||
=== Location ===  | |||
  mulberry:/scratch0/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/       # original (to be deleted)  | |||
   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/   # backup  | |||
== Other ==  | |||
=== SOAPdenovo on CA contigs ===  | |||
  cat asm.ctg.fasta | ~/bin/assignFastaIds.pl | awk '{print $1}' > asm.contig  | |||
  lenseq asm.contig > asm.contig.len  | |||
  echo Edge_num `wc -l asm.contig.len` `wc -l asm.contig.len` > asm.ContigIndex  | |||
  echo "index   length  reverseComplement" >> asm.ContigIndex  | |||
  lenseq asm.K31.contig | awk '{print $1,$2,1}' >> asm.ContigIndex  | |||
  SOAPdenovo map   -s SOAPdenovo.config -g asm > SOAPdenovo.log        # ok  | |||
  SOAPdenovo scaff -g asm -F >> SOAPdenovo.log                         # fails: a nucmer of files are missing  | |||
= Assembly comparison =  | |||
== CA.contigs vs SOAPdenovo.contigs ==  | |||
* Filter contigs >=100 bp  | |||
* Alignment  | |||
  nucmer --mum -c 100  CA.fasta SOAPdenovo.fasta  | |||
  show-coords -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | p ' $F[19]="." unless($F[19]); print $F[19],"\n";' | count.pl | pretty -o  | |||
  [CONTAINS]     80413    | |||
  [CONTAINED]    4646     | |||
  [IDENTITY]     1960     | |||
  [BEGIN]        4152      | |||
  [END]          4034     | |||
  .              157     # partial alignments  | |||
  # get "broken" contigs  | |||
  show-coords -H        /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t all  | |||
  show-coords -L 500 -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t 500+  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  all                  118        155    4630   9817   29042  144499     22498.97   54311      2654878       | |||
  500+                 16         1844   8205   16551  28426  66880      20514.88   28426      328238    | |||
* Location   | |||
  mulberry:/scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs  | |||
== CA.contigs vs SOAPdenovo.scaffolds ==  | |||
* All ctgs & scf  | |||
* Alignment  | |||
  #ref/qry all  | |||
     .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  SOAPdenovo.scf(span)    12473      100    113    144    498    5933710    20428      1086614    254800874      | |||
  SOAPdenovo.scf(len)     12473      100    113    143    341    5522831    18759.84   1034925    233991535  | |||
  CA.ctg                  92307      63     100    119    186    297795     2612       24781      241197400  | |||
  #best hits  | |||
  nucmer --mum -c 40 CA.fasta SOAPdenovo.fasta  | |||
  delta-filter -1 40 -i 95 -o 50 -1 SOAPdenovo-vs-CA.40.delta >! SOAPdenovo-vs-CA.40.filter-1.delta  | |||
= Genbank submission =  | |||
* Project registration : http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi  | |||
  http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=20FF4536-24A4-4D8C-94FA-0755E1326E33  | |||
  Project ID: 61101  | |||
  Submitting Organization:              University of Illinois & University of Maryland  | |||
  Sequencing Center:                    Keck Center for Comparative and Functional Genomics, University of Illinois  ??? BCUI ???  Biotechnology Center, Univ. Illinois (BCUI)  | |||
  Consortium Name:  | |||
  Organism Name:                        Bombus impatiens  | |||
  Strain/isolate/breed:  | |||
  Locus Tag Prefix:                     BIMP       #3+ letters  | |||
  Source of DNA used for sequencing:    haploid males  | |||
  Sequencing Method:                    wgs       ???  | |||
  Sequencing Technology:                Illumina  | |||
  Estimated Genome Size:                250Mb     # the haploid genome size   | |||
  Brief description of the importnace:  #2 pollinator in the US  | |||
  Genome Coverage                       127x  | |||
  Sequencing Technology                 Illumina IIx   | |||
  Comments to the staff:                  | |||
    DNA   | |||
    whole genome sequencing   | |||
    single genome;   | |||
    no annotation  | |||
    assembly name BIMP_1.0   | |||
    assembly method: Celera Assembler 6.1   | |||
    plan to update: ?  | |||
    expect to release   | |||
    agp:  | |||
* Sequin:  | |||
  Manuscript:           | |||
  Authors:     Hugh Robertson, Matt Hudson, Kim Walden, Aarti Venkat, Tom Newman, Daniela Puiu, Tanja Magoc, David Kelley, Aleksey Zimin, Steven Salzberg, Gene Robinson  | |||
----  | |||
* Best assembly (CA)  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/  | |||
  ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bombus_impatiens/Assembly/CA/  | |||
  .                    elem       min    q1     q2     q3     max        mean       n50        sum              | |||
  ctg                  92307      63     100    119    186    297795     2613       24781      241197400        | |||
  ctg.contam           82         1331   10485  19924  31977  112638     25648      32851      2103095     | |||
  ctg.clean            92225      63     100    119    185    297795     2593       24628      239094305       | |||
  ctgNs                76         75     536    3124   6834   44990      5326       10261      404781           | |||
  scf                  1896       76     150    4044   67922  4021294    151761     1017298    287738041  | |||
  scf.ctg.contam       24         14315  45008  71878  141921 1116872    141140     164885     3387362          | |||
  deg                  1086650    54     103    124    152    6822       139        135        151484589     | |||
  deg.contam           1129       65     114    141    207    1035       198        211        223588           | |||
  deg.clean            1085521    54     103    124    152    6822       139        135        151261001        | |||
  deg.Ns               37         98     145    234    426    914        313        437        11574               | |||
* Contaminant list: ???  | |||
  /fs/szattic-asmg4/tmagoc/bees_bacteria/IDs  | |||
== 1st submission ==  | |||
* Submit all contigs (should have deleted the singletons < 200bp)  | |||
* Directories & files  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/   | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.0.*  | |||
  server:      ftp-trace.ncbi.nlm.nih.gov  | |||
  user:        cbcb_trc  | |||
  directory:   test/  | |||
  files names: BIMP_1.1.sqn , BIMP_1.1.agp  | |||
  number contigs:   92,231   | |||
  number scaffolds: 1,895  | |||
* NCBI email (Dec 23rd)  | |||
  there are 3 contigs that need to be trimmed:  | |||
  Contig name	Length	Span(s)	Apparent source  | |||
  7180002728824	11695	72..148	Escherichia coli  | |||
  7180002730077	25748	22703..25611	Escherichia coli  | |||
  7180002740828	30102	134..1688	Escherichia coli  | |||
  Since these are internal hits, consider whether the contig needs to split (if the E.coli was the 'joining' element).  | |||
* NOTE: When I blastn these 3 contigs to nr (distant blast) , all 3 contigs had multiple good hits to other bacteria , so probably they are all contaminated and should be removed completely, not just trimmed  | |||
== 2nd submission ==  | |||
* Jan 24th  | |||
   number of scaffolds: 1191  | |||
   number of contigs:   91524  | |||
* Directories & files  | |||
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.1.*  | |||
== NCBI release ==  | |||
* [http://www.ncbi.nlm.nih.gov/nuccore/AEQM00000000 AEQM00000000] WGS  | |||
* [ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/ ftp] GCA_000188095 to come by end of Feb 2011  | |||
Latest revision as of 19:51, 17 May 2011
Data
- ~ 500Mbp genome (diploid); 250Mbp halpoid
 
- Complete mitochondrion genomes:
 
NC_011923.1 15468 14.67 Bombus hypocrita sapporoensis mitochondrion, complete genome NC_010967.1 16434 13.22 Bombus ignitus mitochondrion, complete genome only 88% identity; no rearrangements, only snps, short indels
- NCBI TaxId
 
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=132113
Traces
- 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
 
Lane Insert ReadLen #Mates #Reads Coverage Orientation Comments #RepeatsReads 1 4K 124 34,944,099 14X outie 865,687(1.2%) reads qual==0 30,576,297 (44%) 2 8K 124 32,540,640 13X outie 26,972,424 (41%) 3 400 124 34,745,750 . 8,319,335 (12%) 5 400 124 34,601,239 . 8,398,759 (12%) 6 400 124 34,553,857 . 7 400 124 34,682,612 . 8 400 124 12,975,839 . 9 3K 124 29,615,036 . outie new3kb ; received 05/18/2010 1-8 219,044,036 438,088,072 108X 1-9 248,659,072 497,318,144 3-8 400 124 151,559,297 303,118,594 75X innie 1,9 3-4K 124 64,559,135 129,118,270 32X outie 2 8K 124 32,540,640 65,081,280 13X outie
Two new 3kb libraries to come; ~ 70% are true pair ends
Adaptors
>circ CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA >circ.revcomp TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG >5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGA >3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[AGATCTCGGTGGTCGCCGTATCATTA][A+]
               0        1         2         3         4 
               123456789012345678901234567890123456789012
 circ          CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
                   .............   ..   ............. 
 circ.revcomp  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 CGGCATTCCTGCTGAACC revcmp GGTTCAGCAGGAATGCCG
              
               0        1         2         3         4 
               123456789012345678901234567890123456789012
                           GGTTCAGCAGGAATGCCG
 5             GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 
               CGGCATTCCTGCTGAACC
 3             CGGCATTCCTGCTGAACCGAGATCGGAAGAG
#18mers (5') CGTAATAACTTCGTATAG TCGTATAACTTCGTATAA GATCGGAAGAGCGGTTCA CGGCATTCCTGCTGAACC
#18mers (5',3') CGTAATAACTTCGTATAG TTATACGAAGTTATACGA TCGTATAACTTCGTATAA CTATACGAAGTTATTACG GATCGGAAGAGCGGTTCA GCAGGAATGCCGAGACCG CGGCATTCCTGCTGAACC CGTGTAGGGAAAGAGTGT
#31 mers (5') # jellyfish uses 31mers CGTAATAACTTCGTATAGCATACATTATACG TCGTATAACTTCGTATAATGTATGCTATACG GATCGGAAGAGCGGTTCAGCAGGAATGCCGA CGGCATTCCTGCTGAACCGAGATCGGAAGAG
Frequent kmers
                                                        s_1   s_2    s_3
    1  CGTAATAACTTCGTATAGCATA|CGTATAACTTCGTATAATGTAT    15.2  4.59   2.829    # not assembled   circAdapter
    2  AATGTATGCTATACGAAGTTAT|ATAACTTCGTATAGCATACATT    5.56  19.02  0.525    # not assembled   circAdapter
    3  CGTTATAACGTATTACGTTATA|TGTATGCTATACGAAGTTATTA    5.2   4.56   .         
    4  TACATTATACGAAGTTATACGA|TATATGGTATGACGTTATAACG    3.39  1.87   0.365     
    5  ATATAACGTAATACGTTATAAC|TATAACGTATTACGTTATATGG    2.86  2.71   1.595     
    6  ACGTTATAACGTCATCCCATAT|ATACCATATAACGTAATACGTT    2.11  2.62   .
    7  TACGTTATAACGTCATACCATA|TCCATTTTACACTGCAAATGTA    1.98  2.04   1.749  
    8  CTTAAACGTCGAAATATCTCCA|TTACGTTATATGGGATGACGTT    0.87  1.16   1.136  
    9  GAGCGCTTAAACGTCGAAATAT|ATGAAATCTAGCCGTCATTTCC    0.73  1.02   1.442  
   10  ACGTAATACGTTATAACGTCAT|ATTTCCTTCCCGTTGGAGATAT    0.67  1.1    1.477  
   11  CTGCTGAACCGCTCTTCCGATC|GATCGGAAGAGCGGTTCAGCAG    0.61  3.92   0.045    # not assembled   5Adapter
   12  ACGGTCACAGCTTGTCTGTAAG|TGTAAAATGGAGTTCAACGAGC    0.31  1.73   0.798  
   13  ACGGTGGTGCATTACATTTGCA|GGCGAATGGCGCCTGATGCGGT    0.28  0.87   0.949    # not assembled
   14  CAGAAAAGCATCTTACGGATGG|TACATACCTCGCTCTGCTAATC    0.04  2.47   .        # not assembled
- Location
 
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs /fs/szattic-asmg4/Bees/Bombus_impatiens/new3kb/s_1_?_sequence.txt : new3kb lib received 05/18/2010 /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt : 88 new3kbr test reads
Tasks to figure out
- Erroneous reads/bases, which we need to correct or discard
 - GC bias, so we can compute a-stats properly
 - Redundancy in the long paired ends, which are lane 1 and lane 2.
 
- Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
- Some reads will begin in the circularization adaptor and thus will have only one usable read
 - Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
 - Most reads will have at least 36bp from each end before hitting the adaptor.
 - Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
 
 
- A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
- Thus their mate distance is +400bp rather than -3kb or -8kb.
 - It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
 - It is possible to remove these reads by removing
 
 
- mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
 - mate pairs where one read begins with the adapter sequence.
 
- We are working on this, up to now our paired assembly stages have been disappointing.
 
- Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
- There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
 - Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
 - But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
 - Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
 - You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.
 
 
Trimming
- Quality trimming: trim all bases with fastq quality eq "B" (0)
 
cat *fastq | ~/bin/fastq2clq.pl
- Adaptor trimming: Align all subsets to adaptors
 
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA 3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT 5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
id len gc% C 42 30.95 3 52 55.77 5 67 59.70
nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq ... delta-filter -l 16 -q ... cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
- Adaptor/primers median positions in the reads
 
Lib adaptorPos 5'primerPos 3'primerPos s_1 34-75 0-36 0-19 s_2 2-40 0-36 0-19
- Mate stats
 
Lib #mates adaptor.bothMates adaptor.noMate adaptor.oneMate adaptor.oneMate(filtered) s_1 34944099 269156 9804247 24870696 6048164(17.3%) s_2 32540640 91528 16181288 16267824 1061858( 3.2%) total 67484739 7110022(10.5%)
- Filtered read clr stats
 
Lib #mates #reads min q1 q2 q3 max mean n50 sum s_1 6048164 12096328 64 80 95 115 124 96.56 101 1,168,064,632 s_2 1061858 2123716 64 80 96 115 124 96.59 101 205,122,226 total 7110022 14220044 64 80 95 115 124 96.57 101 1,373,186,858
- Filtered read GC% stats
 
Lib #reads min q1 q2 q3 max mean n50 s_1 12096322 0.00 31.93 36.26 42.35 100.00 37.45 38 s_2 2123715 0.00 31.88 36.46 42.48 100.00 37.56 38
- Other frequent kmers
 
26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces 10mer : AAAAAAAAAA TTTTTTTTTT : ~32% of the traces 53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
- Data processing pipeline
 
s_[12] : /nfshomes/dpuiu/bin/fastq2adaptorFreeFastq.amos s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos
- Location:
 
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt # original fastq files /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_[1235678]_[12]_sequence.cor.txt # corrected fastq files
/fs/szattic-asmg5/Bees/Bombus_impatiens/s_[12]/all/s_[12].64.rev.frg # adaptor free s_[12] mated reads /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[1235678].frg # adaptor free & corrected s_[1235678] reads
D Kelly's trimming
- Method: http://github.com/davek44/error_correction
 - Read lens stats
 
lib reads min q1 q2 q3 max mean n50 sum #repeatReads s_1_1_sequence.cor.txt 5,436,814 30 71 90 115 124 89.70 100 487673256 1,071,779 (20%) s_1_2_sequence.cor.txt 5,864,225 30 77 92 110 124 93.27 98 546956864 1,208,767 (20%) s_2_1_sequence.cor.txt 1,053,858 30 80 96 118 124 97.16 102 102395785 222,290 (21%) s_2_2_sequence.cor.txt 1,041,846 30 78 93 110 124 93.51 99 97426679 208,029 (20%) s_3_1_sequence.cor.txt 33,668,302 30 105 124 124 124 111.04 124 3738530761 (12%) s_3_2_sequence.cor.txt 32,554,322 30 82 108 124 124 100.42 120 3269039697 s_5_1_sequence.cor.txt 33,579,744 30 109 124 124 124 111.65 124 3749192427 s_5_2_sequence.cor.txt 32,465,412 30 84 112 124 124 102.13 124 3315553171 s_6_1_sequence.cor.txt 33,535,725 30 109 124 124 124 111.60 124 3742602285 s_6_2_sequence.cor.txt 32,390,877 30 83 109 124 124 100.92 123 3268875471 s_7_1_sequence.cor.txt 33,674,235 30 112 124 124 124 112.19 124 3777917379 s_7_2_sequence.cor.txt 32,518,568 30 84 110 124 124 101.60 124 3303807321 s_8_1_sequence.cor.txt 11,777,821 30 113 124 124 124 112.30 124 1322658923 s_8_2_sequence.cor.txt 12,176,364 30 84 110 124 124 101.51 124 1236034348 s_9_1_sequence.cor.txt 21,273,603 30 87 124 124 124 104.52 124 2223545139 #new3kb added 05/18/2010 s_9_2_sequence.cor.txt 20,769,027 30 86 124 124 124 103.68 124 2153436097 #new3kb added 05/18/2010
s_1,2 13,396,743 s_1..8 301,738,113 . . . . . . . 31958664367 s_2..8 (64bp+) 275,763,594 . . . . . . . 30693543885
- Read/Mate counts
 
lib reads mates range s_1 11301039 5320163 s_2 2095704 1034086 11301040..13396743 s_3 66222624 31857304 s_5 66045156 31820553 s_6 65926602 31748713 s_7 66192803 31891468 s_8 23954185 11168258 s_9 42042630 13245740 s_1,2 13,396,743 6,354,249 s_1-8 301,738,113 144,840,545 s_1-9 343,780,743 158,086,285
- Files
 
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt # adaptor free corrected reads (long inserts) /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt # corrected reads (short inserts) /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg
Assembly
- SOAP version 1.04: /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/
 - CA version 6.0 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/
 
SOAPdenovo s_1..8 CBCB-corr (Illinois)
- From: Kim Walden 04/30/2010
 - Input: s_1,2,3,5,6,7,8 reads
 - Error correction: yes; CBCB method
 - Kim's e-mail
 
I ran it in a stepwise fashion, not adding any of the available options and only changing K=31. I then ran their more recent GapCloser algorithm, not the version that is part of the package when using option -F in "scaff". I used their "all" assembly once, and I noticed that it automatically uses the -F option during scaffolding Also, I used your data from the ftp site without checking for redundancy in the 3kb and 8kb lanes; did Daniela do the same?
- Summary
 
                         elements min_len  max_len   mean    n50        sum
 contigs(>100)           103858   100      85850     2227    7541       231387390      # prior to scaffolding; 
 scaffolds&contigs(>100) 12663    100      4648588   19282   1040109    244171008      # post-scaffolding and gap-filling
 scaffolds(span)         3736     252      4648588   64874   1047164    242369898      # same prior data minus contigs that didn't assemble into scaffolds.
 scaffolds(len; no N's)  3736     172      4585806   64034   1027043    239233801      # computed by CBCB
     
- Locations
 
ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/Bimp_Salz_corr_050310_NewGapCloser.fasta # backup /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/
# online http://www.life.illinois.edu/robertson/private/ Login: robertson password: dna
SOAPdenovo s_1..8 CBCB-corr (CBCB)
- Stats:
 
elem min max mean n50 sum contigs(all) 9943851 31 85850 60 45 594773555 contigs(>100) 106213 100 85850 2177 7084 231246190 scaffolds+contigs(>100) 12473 100 5933710 20428 1086614 254800874 # based on scafSeq file scaffolds(span) 3664 236 5933710 69090 1115496 253148288 # based on scafSeq file scaffolds(len) 3664 134 5522831 63411 1059378 232338949 readsAssembled 208389968 # based on readOnContig file readsFiltered 289681090 # based on readOnContig & log file readsTotal 301738113 # based on config file readsDeleted 441343 # based on log file
- Run gapClosing on assembly:
 
stillOpen 14784 closed 42379
- Contig cvg: GC bias:
 
 cat asm.K31.contig100.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.709501493345893
 cat asm.K31.contig1K.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3   => -0.923151349747983
#100+ ctgs . elem min q1 q2 q3 max mean n50 sum len 106213 100 140 437 2420 85850 2177 7084 231246190 gc 106213 0.00 33.13 38.00 43.44 100.00 38.74 39 4115066 cvg 106213 0.6 39.0 50.8 58.9 160.8 47.38 54 5032329
#1K+ ctgs . elem min q1 q2 q3 max mean n50 sum len 41965 1000 1860 3295 6338 85850 5110 7726 214421699 gc 41965 19.18 34.31 37.96 41.79 59.25 38.19 38 1602711 cvg 41965 8.0 42.0 49.9 55.8 93.8 47.74 51 2003339
- Run soap2 to map reads back to the assembly: "-l 32 -v 2"
 
readsAligned 173063062
- Locations
 
mulberry:/scratch0/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/ # original /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/ # backup
SOAPdenovo s_1..9 CBCB-corr (CBCB) **
- Uses the new 3Kbp library : s_9 (reads have been reversed)
 
- Stats:
 
. elem min q1 q2 q3 max mean n50 sum scf 11178 100 111 135 390 5655980 23014 1205321 257251549 ctg 10856652 31 . . . 85850 57 43 627095607 ctg100+ 106741 100 . . . 85850** 2165 6939 231167576 scfSeqContigs 67900 100 340 1788 4373 85819 3437 7586 233372689 scfSeqContigsClosed 20651 100 128 488 7991 600179 11715 64083 241921308
- Location
 
mulberry:/scratch2/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/ # original /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/ # backup
CA s_1..8 corr OBT (CBCB)
Issues
- Used outie s_1,2 lib ; needs to be redone
 
runCA spec
doOverlapTrimming = 1 ovlOverlapper = ovl unitigger = bog # cmd buildUnitigs *BatchSize = 2000000 *BlockSize = 2000000 ovlMemory = 8GB doExtendClearRanges = 0 # fast run
Gatekeeper
- ~ 58X cvg
 
gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ... LOAD STATS 7 libInput 7 libLoaded 0 libErrors 301738113 frgInput 301738113 frgLoaded 25974519 frgErrors 144840545 lkgInput 124575590 lkgLoaded 20264955 lkgErrors 0 lkgWarnings 275763594 numRandom 301738113 numNormal GLOBAL STATS 301738113 FRG 7 LIB LibraryName numActiveFRG numDeletedFRG numMatedFRG readLength clearLength #egrep 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' GLOBAL 274282535 27455578 247151634 30524746650 29445986650 LegacyUnmatedReads 0 0 0 0 0 s_1 10409320 891719 9039810 996306280 994802849 #1134233(11%) s_2 2054575 41129 1987728 197829588 197593453 #224648(11%) s_3 59894531 6328093 53925476 2382786529 2127257586 #3037085(5%) s_5 60127274 5917882 54512448 2459696212 2209078913 #5% s_6 59813485 6113117 54062058 2396359740 2152656620 #5% s_7 60278536 5914267 54647258 2476326572 2237423989 #5% s_8 21704814 2249371 18976856 2435572545 2347304056 #5%
gatekeeper -dumplibraries -tabular asm.gkpStore UID IID Orientation Mean StdDev NumFeatures s_1 1 I 4000 400 1582823088 s_2 2 I 8000 800 1582823088 s_3 3 I 400 40 1582823088 s_5 4 I 400 40 1582823088 s_6 5 I 400 40 1582823088 s_7 6 I 400 40 1582823088 s_8 7 I 400 40 1582823088
s_1,2 lib : ids:1..13,396,744 overlap jobs that need to be redone : 1036 out of 11476 < 10% ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'
Meryl
meryl -Dh -s 0-mercounts/asm-C-ms22-cm0 Found 23686053415 mers. Found 265670791 distinct mers. Found 8682767 unique mers. Largest mercount is 21,987,616; 668 mers are too big for histogram. ... #freq count % 1 8682767 0.0327 0.0004 2 4288602 0.0488 0.0007 3 3041005 0.0603 0.0011 ... 72* 3803015 0.7150 0.3229 ... 1048158 1 1.0000 0.9274
fasta2tab.pl /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/0-mercounts/asm.nmers.ovl.fasta | sort -n -r | head -5 21987616 ATATAACGTAATACGTTATAAC 20992872 CGTTATAACGTATTACGTTATA 20510821 ACGTTATAACGTATTACGTTAT 20484950 CCATATAACGTAATACGTTATA 20452950 CATATAACGTAATACGTTATAA
cat egrep.sh egrep -c 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/seq_obt/*clean.seq > egrep.count /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2 ...
OBT
- job count
 
n=((301738113/2000000)+0.5)=151 OBT/OVL jobs = n*n/2 =151*76=11476 ~ 20 min/OBT job =~ 0.33 hr/OBT job => 0.33*11476/32=5 days
- OBT trimming
 
key sum count mean CLR 31958664367 301738113 105.915239043733 OBTINITIAL 30779687358 301610679 102.051052900551 OBTMERGE 30052401719 301610679 99.6397137483318 OBTCHIMERA 30052138325 301610679 99.638840456972 LATEST 30052138325 301610679 99.638840456972
- use OBTCHIMERA as CLR if want to redo the assembly !!!
 
Overlap
- job count : same as OBT
 
- Overlap counts
 
 cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o 
 Lib    0  5'  199542773  50  
 Lib    0  3'  199542773  50  
 Lib    1  5'  6682072    38  
 Lib    1  3'  6682072    37  
 Lib    2  5'  1306720    37  ???
 Lib    2  3'  1306720    42  ???
 Lib    3  5'  43868663   50  
 Lib    3  3'  43868663   50  
 Lib    4  5'  43939366   51  
 Lib    4  3'  43939366   50  
 Lib    5  5'  43620676   50  
 Lib    5  3'  43620676   50  
 Lib    6  5'  43903838   51  
 Lib    6  3'  43903838   50  
 Lib    7  5'  16221438   50  
 Lib    7  3'  16221438   50
- 199542773 out of 257786946 (77%) usable reads had overlaps
 - 7988792 out of 11443537 (69%) usable s_1,2 reads had overlaps
 
Frg/Ovl correction
- Frgcorr jobs: 151
 
wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
- Ovlcorr jobs: 151
 
wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
OverlapStore
- Failures
- overlapStore: too many files (max 10240) ; fix: recompile the code
 - chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report
 
 
Bog
- Run time: 14 hr
 - default unitigger : does not seem to work on such big datasets
 - Utg len stats:
 
cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1 elem min q1 q2 q3 max mean n50 sum 63916892 47 85 150 150 22861 125.23 150 8004296170
Bog consensus
- job count :
 
wc -l 4-unitigger/asm.partitioningInfo 110
- Failures
 
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
 
 # Fix: 
 # update layout (CA 6.1)
 mkdir utg
 
 # update layout (CA 6.1)
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -u  $F[0] -d layout > utg/$F[0].layout\n";'  | ~/bin/scheduler.pl
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "~/bin/updateUTGlayout.pl utg/$F[0].layout >  utg/$F[0].layout.update\n ";'              | ~/bin/scheduler.pl 
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg/$F[0].layout.update\n" ;'             | ~/bin/scheduler.pl  -c 1
 grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | awk '{print $1}' | sort -u | p 'print "touch $F[0].success\n";'                                               | ~/bin/scheduler.pl 
 # create success file
 touch 5-consensus/consensus.success
 
 # relaunch runCA
Cgw
- Failures
 
grep -B 5 "^Failure message:" runCA.log
Cgw consensus
- job count : 84
 
wc -l 7-CGW/asm.partitionInfo
- ctg count: 1698486
 
cat 7-CGW/asm.partitionInfo | /nfshomes/dpuiu/bin/sum.pl -i 3 -t ctgs | pretty -o
Terminator
- Failure:
- buildPosMap : segFault :ids too long???
 
 
Fix: split asm file(optional) and reassign new accession numbers cd split/ ~/bin/splitAsm.pl < asm.asm => ... CCO.asm, SCF.asm cat CCO.asm CLK.asm SCF.asm SLK.asm | ~/bin/newAsmIds.pl | buildPosMap -o asm
Stats
- Bb.qc
 - Stats
 
. elem min q1 q2 q3 max mean n50 sum scf 15009 74 4786 9789 20021 415318 16916.33 29133 253897199 ctg 145036 64 93 116 1035 138630 1498.67 7726 217361459 deg 1507280 54 98 118 146 1434 148.28 137 223502287 utg 2786974 54 90 110 142 25644 197.04 277 549140918 singletons 61129865 usableSeqs 257786946
- Stats (100bp+)
 
elem min max mean n50 sum contigs(>100) 99429 100 138630 2148 7972 213595680 scaffolds(span) 14992 100 415318 16935 29133 253895676 scaffolds(len) 14991 100 303946 14499 24091 217357577
- Library stats (totals)
 
lib ctg sur deg singl total s_1 2251597 162800 3119665 3988095 9522157 s_2 424821 41107 563656 891795 1921379 s_3 28880189 1078891 13600909 12735637 56295626 s_5 28983834 1077334 13561390 12962731 56585289 s_6 28663974 1062939 13559494 12896855 56183262 s_7 28784505 1058364 13713200 13020274 56576343 s_8 10455533 383065 5229814 4634478 20702874 all 128444453 4864500 63348128 61129865 257786930
- Library stats (percentages)
 
lib ctg sur deg singl total s_1 24 2 33 42 100 s_2 22 2 29 46 100 s_3 51 2 24 23 100 s_5 51 2 24 23 100 s_6 51 2 24 23 100 s_7 51 2 24 23 100 s_8 51 2 25 22 100
Mate redundancy
- Criteria:
 
Given 2 mates 1: a1,a2 2: b1,b1 We have the following 2 overlaps 1: a1,b1 a2,b2 2: a1,b2 a2,b1 At least 75% of one read overlaps the other read
- Location:
 
cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
wc -l s_1.mates asm.posmap.frgutg.s_1.* 5320163 s_1.mates # input mates (from FRG file; 2 reads/mate) 1232734 asm.posmap.frgutg.s_1.mates # assembled mates (both reads assembled) 366262 asm.posmap.frgutg.s_1.redundant_mates # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy 312206 asm.posmap.frgutg.s_1.redundant_mate_pairs # redundant assembled mate pairs wc -l s_2.mates asm.posmap.frgutg.s_2.* 1034086 s_2.mates # input mates (from FRG file; 2 reads/mate) 255905 asm.posmap.frgutg.s_2.mates # assembled mates (both reads assembled) 110570 asm.posmap.frgutg.s_2.redundant_mates # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy 1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs # redundant assembled mate pairs 11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters
Runtimes
# step hours 0 gatekeeper 5.36 1 initialTrim 2.07 2 meryl 9.90 9 overlap.sh 11.31 11 find 0.00 16 overlapStore 13.79 17 consolidate 5.12 18 merge-trimming 1.41 19 chimera(1) 10.04 20 overlap.sh 49.86 21 find 0.00 22 overlapStore 16.68 23 frgcorr.sh 5.57 24 cat-corrects 0.10 25 ovlcorr.sh 4.64 26 cat-erates 0.90 27 overlapStore 6.67 28 buildUnitigs 13.88 29 gatekeeper 1.31 30 consensus.sh 3.41 31 cgw(1) 65.00 43 gatekeeper 4.39 44 consensus.sh 4.70 46 terminator 6.26 47 asmOutputFasta 7.75 48 dumpSingletons 3.68 49 buildPosMap(1) 3.79
Location
ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/ # original assembly ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR # dumped OBT clear ranges, 0/1 undel/del /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor # backup
CA s_1..8 corr OBT redo (CBCB)
Issues
- Fixed outie s_1,2 lib : clear ranges will be different
 - s_3..8 : stay the same
 
runCA spec
doOverlapTrimming = 0 !!! already done in the prev assembly saveOverlaps = 1 !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted ovlOverlapper = ovl # same unitigger = bog # same BatchSize = 2000000 # must be the same so overlaps are not redone BlockSize = 2000000 # must be the same ... doExtendClearRanges = 1 # initially set to 2; seems to take too long => reset to 1
Input
- Modified frg files:
- dumped "CA s_1..8 corr OBT" assembly gatekeeper
 - undelete act:D fragments so the ids stay the save
 - reverse s_1,2 reads
 
 
Gatekeeper
- Make sure the iids match
 
Overlap
- Rerun only the overlap jobs for the s_1,2 reads
 
OverlapStore
- To build faster in the future
 
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)
Frgcorr
- Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)
 
Ovlcorr
- The error rate on overlaps with error went down by ~ 1.28%
 
Bog
cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1 elem min q1 q2 q3 max mean n50 sum 60970504 47 85 150 150 93191 126.17 150 7692801690
- Unitig stats
 
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep -c ^unitig_num_frags                                    # 60,970,523  : total unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -c ' 1$'                       # 59,776,572  : unitigs with only 1 read
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -v ' 1$' | getSummary.pl -i 2  # stats for unitig with 2+ reads
               elem         min    q1     q2     q3     max        mean       n50        sum            
 1read/utg     59,776,572                                                                59,776,572
 2+reads/utg   1,193,951    2      3      6      28     74715      165.84     4712       198,010,374
Bog consensus
- job count :
 
wc -l 4-unitigger/asm.partitioningInfo 110
- Failures: 1 failed unitig
 
grep FAILED 5-consensus/asm*err 5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED. Fix: same as before
Cgw
- Failure:
- after 7-0-CGW/asm.ckp.4
 
 
tail runCA.log ERROR: Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700) 37980:Scaffold 6409: 37981- Idx Chunk ( Start, Variance) ( End, Variance) Len 37982- 0: 60983787 ( 0, 0) ( 31646, 823) 31646 37983- 1: 69691 ( 35389, 934) ( 31650, 837) 3739 37984- 2: 59907028 ( 35103, 942) ( 42035, 1122) 6932 37985- 3: 60799298 ( 41640, 1132) ( 64897, 1737) 23257 37986- 4: 60975976 ( 64924, 1784) ( 88439, 2395) 23515 37987- 5: 60988396 ( 88431, 2416) ( 122023, 3289) 33592 37988- 6: 60988698 ( 121551, 3324) ( 123589, 3377) 2038 37989- 7: 59905266* ( 122331, -163479) ( 121859, -163491) 472 37990- 8: 60988599 ( 122275, 170138) ( 134564, 170457) 12289
./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed. ./AS_REZ/GapFillREZ.c:8339: "ERROR: Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n", #message
Fix: modfy the code; comment assertion statemants cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX); // assert (min_right_variance >= 0.0);
- Preliminary stats
 
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
#7-0-CGW/rez.i36.log : . elem min q1 q2 q3 max mean n50 sum scf 1820 84 162 4583 82337 3161523 125884.07 667487 229109005 # ids from 0..13113 ctg 26453 64 962 3756 10485 255820 8759.59 21800 231717378 ctg200+ 23343 200 1760 4772 11873 255820 9908.34 21845 231290344 scf.gaps 877 1 4 12 36 373 27.71 64 24298
#7-2-CGW/rez.i01.log : . elem min q1 q2 q3 max mean n50 sum scf 4903 66 1283 6908 39466 1148626 47212.57 209756 231483215 ctg 23480 64 1370 4397 11757 255820 9858.75 23476 231483360
7-2-CGW/rez.i06.log . elem min q1 q2 q3 max mean n50 sum scf 3818 64 268 6813 61948 1198480 60723.65 257457 231842914 ctg 24005 64 1165 4161 11439 255820 9659.47 23819 231875677
#7-2-CGW/rez.i11.log . elem min q1 q2 q3 max mean n50 sum scf 3733 64 268 6880 65584 1391034 62111.99 258823 231864045 ctg 24004 64 1165 4161 11439 255820 9659.98 23819 231878256
#7-2-CGW/stone.i02.log . elem min q1 q2 q3 max mean n50 sum scf 1854 83 153 4039 52506 2983045 125375.58 860583 232446325 ctg 23863 64 674 3833 11368 255820 9761.94 25892 232949194
. elem min q1 q2 q3 max mean n50 sum scf 1847 84 152 4440 62397 3337984 125470.25 835387 231743549 ctg 26088 64 429 3213 10218 297786 8941.40 25521 233263147
- TigStore
 
tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail maID isPresent isDeleted ptID svID fileOffset covStat 60970540 1 0 0 3 58023924696 84
tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1 maID isPresent isDeleted ptID svID fileOffset 61001100 1 0 0 32 16307420
- Checkpoints
 
 # 7-0-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               5        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       12              # we should see more iterations if the scaffolding goes well; also ECR would take longer
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"           
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"  
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
 # 7-2-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               9        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1 
  ...
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       10
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"                1
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"       1
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  ...
 CHECKPOINT_AFTER_PARTIAL_STONES         "ckp10-PS"                1
 CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS"  
 CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC"   
 CHECKPOINT_AFTER_RESOLVE_SURROGATES     "ckp13-RS"   
 CHECKPOINT_AFTER_OUTPUT                 "ckp14-FIN"
- Long running time : implemented 2 cgw options -ic, -im ....
 - Running times
 
CGW-0: START Wed Apr 28 22:30:12 ; END Mon May 3 03:34:57 ECR-1: START Mon May 3 03:34:58 ; END Mon May 3 12:28:54 CGW-2: START Mon May 3 12:28:54 ; End Wed May 12 22:05:00
- Library insert histograms
 
# Example cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist cat 7-0-CGW/stat/scaffold_final.distlib_2.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_2.cgm.hist gnuplot set xrange [0:10] plot "scaffold_final.distlib_1.cgm.hist" w l, "scaffold_final.distlib_2.cgm.hist" w l,
- Failure:
- after 7-2-CGW/asm.ckp.42 (after running about 2hrs)
 
 
tail runCA.log
* Screwed up scaffold 3790: Chunk 60994012 has bad mean * Scaffold 3790 iteration 1 1 bad CIs out of 33 LSE:2664.55 improvement:1 * Reinserting chunk 60994012 in scaffold 3790 * improvement = 0.0721104 LSE = 2664.55
---------------------------------------- Failure message: scaffolder failed
ECR
- Count scaffold gaps : total/closed
 
cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap' # 24635 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap' # 2803
Terminator
- Get id range
 
grep -m 1 ^7 asm.posmap.utglen asm.posmap.ctglen asm.posmap.deglen asm.posmap.scflen asm.posmap.utglen:7180000000000 124 asm.posmap.deglen:7180001635156 133 asm.posmap.ctglen:7180001635158 124 asm.posmap.scflen:7180002817601 161 => utg: 7180000000000..7180001635155 ctg/deg:7180001635156..7180002817600 scf:7180002817601..7180002819521
Stats
- Assembly stats
 
. elem min q1 q2 q3 max mean n50 sum scf 1921 65 148 3389 55043 3122263 149121.08 982222 286461602 ctg 91934 64 100 119 180 255828 2621.59 25123 241013378 deg 1090511 54 103 124 152 7068 139.41 135 152024633 utg 1635156 54 97 118 141 94720 267.81 1636 437917658
- Assembly stats(100bp+)
 
elem min max mean n50 sum contigs(>100) 69270 100 255828 3451 25383 239118680 scaffolds(span) 1870 100 3122263 153185 982222 286457118 scaffolds(len) 1870 100 2990211 128880 870182 241007381
Location
ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/ # original assembly /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo # backup
CA s_1, 2 partial ,3..8 corr OBT (CBCB) **
Issues
- Use only s_2 lib reads which seem 8K apart : ~ 14% of the s_2 reads aligned by soap2 to SOAPdenovo Illinois assembly within 8+-2Kbp of one another
 
- Alignments (very similar results)
 
soap2-index Bimp.fasta # soap2-index -> 2bwt-builder bowtie-build Bimp.fasta Bimp.fasta
soap2 -D Bimp.fasta.index -a s_2.1.fastq -b s_2.2.fastq -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2.mated.soap2 -2 s_2.single.soap2 bowtie Bimp.fasta -1 s_2.1.fastq -2 s_2.2.fastq -l 32 -p 8 -n 2 -I 6000 -X 10000 > s_2.mated.bowtie
bowtie Bimp.fasta s_2_0.fastq -l 32 -p 8 -n 2 > s_2_0.bowtie wc -l s_2.mated.* 288880 s_2.mated.soap2 288668 s_2.mated.bowtie intersect.pl s_2.mated.* 288368
runCA spec
doExtendClearRanges=0
Input
- Gatekeeper: same as prev assembly
 - Overlaps: filter out all overlaps for most of the s_2 reads
 
Bog
cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1 elem min q1 q2 q3 max mean n50 sum 61930057 47 85 150 150 113529 125.85 150 7793894671 # max went up from 93191->113529
Bog consensus
- Failures
 
grep FAILED 5-consensus/asm*err : 113 unitigs between 5.. 33547 frags 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED. Could not align fragment 264046761.
- Fix:
 
grep FAILED 5-consensus/asm*err | p 'print "$F[2]\n";' | perl ./fixLayout.pl | scheduler.pl grep -c FAILED 5-consensus/*err | grep -v ':0$' | p '/(.+).err/; print "touch $1.success\n";' | sort -u | scheduler.pl
Cgw
- Preliminary stats
 
#7-0-CGW/stone.i02.log . elem min q1 q2 q3 max mean n50 sum scf 1847 84 152 4440 62397 3337984 125470.25 835387 231743549 ctg 26088 64 429 3213 10218 297786 8941.40 25521 233263147
cat 7-0-CGW/cgw.out ... Processed 61,930,044 unitigs with 301,738,113 fragments
Terminator
- Message counts
 
CTP 91502 SCF 1896 CLK 3288923
cat CLK.asm | grep ^ori: | count.pl ori:O 979139 ori:A 869982 ori:N 857732 ori:I 582070
Stats
- Assembly stats
 
. elem min q1 q2 q3 max mean n50 sum scf 1896 76 150 4044 67922 4021294** 151760 1017298** 287738041 ctg 92307 63 100 119 186 297795** 2612.99 24781** 241197400 deg 1086650 54 103 124 152 6822 139.41 135 151484589 utg 1574225 54 98 119 143 123354 274.33 2027 431859732
- Assembly stats(100bp+)
 
. elem min max mean n50 sum contigs(>100) 69944 100 297795 3421 25048 239324761 scaffolds(span) 1869 100 4021294 153951 1017298 287735536 scaffolds(len) 1869 100 3429736 128833 860270 240790206
- Library stats
 
. badLong badOuttie badSame bothChaff bothDegen bothSurrogate diffScaffold good oneChaff oneDegen oneSurrogate HWI-EAS385_0001:1 0.15 0.06 0.01 19 10.18 0 0.45 46.9 21.41 1.59 0.19 HWI-EAS385_0001:2 0.13 0.01 0 84.86 0.02 0 0.27 14.2 0 0.36 0.02 HWI-EAS385_0001:3 0.06 0.01 0 14.45 9.03 0.04 0.05 60.11 15.17 0.68 0.16 HWI-EAS385_0001:5 0.06 0.01 0 14.84 8.93 0.04 0.05 59.98 15.02 0.65 0.16 HWI-EAS385_0001:6 0.06 0.01 0 14.76 9.04 0.04 0.05 59.7 15.29 0.65 0.16 HWI-EAS385_0001:7 0.06 0.01 0 14.73 9.12 0.04 0.05 59.51 15.44 0.64 0.16 HWI-EAS385_0001:8 0.06 0.01 0 14.05 10.1 0.03 0.05 58.66 15.92 0.7 0.16
- Issues
 
- Gap sizes vary between -297,562 bp and 772,064 bp. They should be no longer than ~ 8Kbp (max insert size)
 
 cat 9-terminator/split/SCF.asm | grep -A 3 "^{CTP" | grep ^mea: | sed 's/mea://' | sort -n | head
 -297562
 -216729
 ...
 162239    # scf7180002754932.314143..476382  : 26 s_1,3,5,6,7 mate pairs link the two sides of the gap
 772064    # scf7180002754479.867224..1639288 : 2  s_1 mate pairs link the two sides of the gap
~/bin/posmap2scaff.pl 9-terminator/asm.posmap.ctgscf | grep -v "^>" | sort -nk4 | tail 7180002653372 EB 89 69161 7180002699293 EB 78 92310 7180002731234 BE 64936 162239 7180002726839 BE 46924 772064
grep ^mea: SCF.asm | sed 's/mea://' | getSummary.pl -i 0 -t scf.gap | pretty . elem min q1 q2 q3 max mean n50 sum scf.gap 91502 -297562 19 222 667 772064 -141 772064 -12910029 scf.gap- 13788 -297562 -2166 -89 -12 0 -4284 0 -59071991 scf.gap+ 77699 0 84 325 775 772064 594 1194 46161963
Location
mulberry:/scratch0/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/ # original (to be deleted) /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/ # backup
Other
SOAPdenovo on CA contigs
 cat asm.ctg.fasta | ~/bin/assignFastaIds.pl | awk '{print $1}' > asm.contig
 lenseq asm.contig > asm.contig.len
 echo Edge_num `wc -l asm.contig.len` `wc -l asm.contig.len` > asm.ContigIndex
 echo "index   length  reverseComplement" >> asm.ContigIndex
 lenseq asm.K31.contig | awk '{print $1,$2,1}' >> asm.ContigIndex
SOAPdenovo map -s SOAPdenovo.config -g asm > SOAPdenovo.log # ok SOAPdenovo scaff -g asm -F >> SOAPdenovo.log # fails: a nucmer of files are missing
Assembly comparison
CA.contigs vs SOAPdenovo.contigs
- Filter contigs >=100 bp
 - Alignment
 
nucmer --mum -c 100 CA.fasta SOAPdenovo.fasta show-coords -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | p ' $F[19]="." unless($F[19]); print $F[19],"\n";' | count.pl | pretty -o [CONTAINS] 80413 [CONTAINED] 4646 [IDENTITY] 1960 [BEGIN] 4152 [END] 4034 . 157 # partial alignments
 # get "broken" contigs
 show-coords -H        /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t all
 show-coords -L 500 -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t 500+
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 all                  118        155    4630   9817   29042  144499     22498.97   54311      2654878     
 500+                 16         1844   8205   16551  28426  66880      20514.88   28426      328238  
- Location
 
mulberry:/scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs
CA.contigs vs SOAPdenovo.scaffolds
- All ctgs & scf
 - Alignment
 
 #ref/qry all
    .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 SOAPdenovo.scf(span)    12473      100    113    144    498    5933710    20428      1086614    254800874    
 SOAPdenovo.scf(len)     12473      100    113    143    341    5522831    18759.84   1034925    233991535
 CA.ctg                  92307      63     100    119    186    297795     2612       24781      241197400
#best hits nucmer --mum -c 40 CA.fasta SOAPdenovo.fasta delta-filter -1 40 -i 95 -o 50 -1 SOAPdenovo-vs-CA.40.delta >! SOAPdenovo-vs-CA.40.filter-1.delta
Genbank submission
- Project registration : http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi
 
http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=20FF4536-24A4-4D8C-94FA-0755E1326E33 Project ID: 61101
Submitting Organization: University of Illinois & University of Maryland Sequencing Center: Keck Center for Comparative and Functional Genomics, University of Illinois ??? BCUI ??? Biotechnology Center, Univ. Illinois (BCUI) Consortium Name: Organism Name: Bombus impatiens Strain/isolate/breed: Locus Tag Prefix: BIMP #3+ letters Source of DNA used for sequencing: haploid males Sequencing Method: wgs ??? Sequencing Technology: Illumina Estimated Genome Size: 250Mb # the haploid genome size Brief description of the importnace: #2 pollinator in the US Genome Coverage 127x Sequencing Technology Illumina IIx Comments to the staff: DNA whole genome sequencing single genome; no annotation assembly name BIMP_1.0 assembly method: Celera Assembler 6.1 plan to update: ? expect to release agp:
- Sequin:
 
Manuscript: Authors: Hugh Robertson, Matt Hudson, Kim Walden, Aarti Venkat, Tom Newman, Daniela Puiu, Tanja Magoc, David Kelley, Aleksey Zimin, Steven Salzberg, Gene Robinson
- Best assembly (CA)
 
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/ ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bombus_impatiens/Assembly/CA/
. elem min q1 q2 q3 max mean n50 sum ctg 92307 63 100 119 186 297795 2613 24781 241197400 ctg.contam 82 1331 10485 19924 31977 112638 25648 32851 2103095 ctg.clean 92225 63 100 119 185 297795 2593 24628 239094305 ctgNs 76 75 536 3124 6834 44990 5326 10261 404781
scf 1896 76 150 4044 67922 4021294 151761 1017298 287738041 scf.ctg.contam 24 14315 45008 71878 141921 1116872 141140 164885 3387362
deg 1086650 54 103 124 152 6822 139 135 151484589 deg.contam 1129 65 114 141 207 1035 198 211 223588 deg.clean 1085521 54 103 124 152 6822 139 135 151261001 deg.Ns 37 98 145 234 426 914 313 437 11574
- Contaminant list: ???
 
/fs/szattic-asmg4/tmagoc/bees_bacteria/IDs
1st submission
- Submit all contigs (should have deleted the singletons < 200bp)
 
- Directories & files
 
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/ /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/ /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.0.*
server: ftp-trace.ncbi.nlm.nih.gov user: cbcb_trc directory: test/ files names: BIMP_1.1.sqn , BIMP_1.1.agp
number contigs: 92,231 number scaffolds: 1,895
- NCBI email (Dec 23rd)
 
there are 3 contigs that need to be trimmed:
Contig name Length Span(s) Apparent source 7180002728824 11695 72..148 Escherichia coli 7180002730077 25748 22703..25611 Escherichia coli 7180002740828 30102 134..1688 Escherichia coli
Since these are internal hits, consider whether the contig needs to split (if the E.coli was the 'joining' element).
- NOTE: When I blastn these 3 contigs to nr (distant blast) , all 3 contigs had multiple good hits to other bacteria , so probably they are all contaminated and should be removed completely, not just trimmed
 
2nd submission
- Jan 24th
 
number of scaffolds: 1191 number of contigs: 91524
- Directories & files
 
/fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.1.*
NCBI release
- AEQM00000000 WGS
 - ftp GCA_000188095 to come by end of Feb 2011