Bumblebee: Difference between revisions

From Cbcb
Jump to navigation Jump to search
No edit summary
 
(336 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Data =
= Data =
* ~ 500Mbp genome (diploid); 250Mbp halpoid
 
* Complete mitochondrion genomes:
  NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
  NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
  only 88% identity; no rearrangements, only snps, short indels
* NCBI TaxId
  http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=132113
== Traces ==
* 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
<pre style="background:yellow">
  Lane  Insert  ReadLen    #Mates          #Reads        Coverage  Orientation    Comments                        #RepeatsReads
  1      4K      124        34,944,099                    14X        outie          865,687(1.2%) reads qual==0      30,576,297 (44%)
  2      8K      124        32,540,640                    13X        outie                                            26,972,424 (41%)
  3      400    124        34,745,750                    .                                                          8,319,335 (12%)
  5      400    124        34,601,239                    .                                                          8,398,759 (12%)
  6      400    124        34,553,857                    .
  7      400    124        34,682,612                    .
  8      400    124        12,975,839                    .
  9      3K      124        29,615,036                    .        outie            new3kb ; received 05/18/2010
 
  1-8                      219,044,036    438,088,072  108X
  1-9                      248,659,072    497,318,144
  3-8    400    124        151,559,297    303,118,594  75X        innie
  1,9    3-4K    124        64,559,135      129,118,270  32X        outie
  2      8K      124        32,540,640      65,081,280    13X        outie
</pre>
  Two new 3kb libraries to come; ~ 70% are true pair ends
== Adaptors ==
  >circ
  CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
  >circ.revcomp
  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
  >5                             
  GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
  >3
  CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[AGATCTCGGTGGTCGCCGTATCATTA][A+]
                0        1        2        3        4
                123456789012345678901234567890123456789012
  circ          CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
                    .............  ..  .............
  circ.revcomp  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
  CGGCATTCCTGCTGAACC revcmp GGTTCAGCAGGAATGCCG
             
                0        1        2        3        4
                123456789012345678901234567890123456789012
                            GGTTCAGCAGGAATGCCG
  5            GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 
                CGGCATTCCTGCTGAACC
  3            CGGCATTCCTGCTGAACCGAGATCGGAAGAG
  #18mers (5')
  CGTAATAACTTCGTATAG
  TCGTATAACTTCGTATAA
  GATCGGAAGAGCGGTTCA
  CGGCATTCCTGCTGAACC
  #18mers (5',3')
  CGTAATAACTTCGTATAG
  TTATACGAAGTTATACGA
  TCGTATAACTTCGTATAA
  CTATACGAAGTTATTACG 
  GATCGGAAGAGCGGTTCA
  GCAGGAATGCCGAGACCG
  CGGCATTCCTGCTGAACC
  CGTGTAGGGAAAGAGTGT
  #31 mers (5') # jellyfish uses 31mers
  CGTAATAACTTCGTATAGCATACATTATACG
  TCGTATAACTTCGTATAATGTATGCTATACG
  GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
  CGGCATTCCTGCTGAACCGAGATCGGAAGAG
== Frequent kmers ==
                                                        s_1  s_2    s_3
    1  CGTAATAACTTCGTATAGCATA|CGTATAACTTCGTATAATGTAT    15.2  4.59  2.829    # not assembled  circAdapter
    2  AATGTATGCTATACGAAGTTAT|ATAACTTCGTATAGCATACATT    5.56  19.02  0.525    # not assembled  circAdapter
    3  CGTTATAACGTATTACGTTATA|TGTATGCTATACGAAGTTATTA    5.2  4.56  .       
    4  TACATTATACGAAGTTATACGA|TATATGGTATGACGTTATAACG    3.39  1.87  0.365   
    5  ATATAACGTAATACGTTATAAC|TATAACGTATTACGTTATATGG    2.86  2.71  1.595   
    6  ACGTTATAACGTCATCCCATAT|ATACCATATAACGTAATACGTT    2.11  2.62  .
    7  TACGTTATAACGTCATACCATA|TCCATTTTACACTGCAAATGTA    1.98  2.04  1.749 
    8  CTTAAACGTCGAAATATCTCCA|TTACGTTATATGGGATGACGTT    0.87  1.16  1.136 
    9  GAGCGCTTAAACGTCGAAATAT|ATGAAATCTAGCCGTCATTTCC    0.73  1.02  1.442 
    10  ACGTAATACGTTATAACGTCAT|ATTTCCTTCCCGTTGGAGATAT    0.67  1.1    1.477 
    11  CTGCTGAACCGCTCTTCCGATC|GATCGGAAGAGCGGTTCAGCAG    0.61  3.92  0.045    # not assembled  5Adapter
    12  ACGGTCACAGCTTGTCTGTAAG|TGTAAAATGGAGTTCAACGAGC    0.31  1.73  0.798 
    13  ACGGTGGTGCATTACATTTGCA|GGCGAATGGCGCCTGATGCGGT    0.28  0.87  0.949    # not assembled
    14  CAGAAAAGCATCTTACGGATGG|TACATACCTCGCTCTGCTAATC    0.04  2.47  .        # not assembled
* Location
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
  /fs/szattic-asmg4/Bees/Bombus_impatiens/new3kb/s_1_?_sequence.txt    : new3kb lib received 05/18/2010
  /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 new3kbr test reads
== Tasks to figure out ==
# Erroneous reads/bases, which we need to correct or discard
# GC bias, so we can compute a-stats properly
# Redundancy in the long paired ends, which are lane 1 and lane 2.
 
* Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument. 
** Some reads will begin in the circularization adaptor and thus will have only one usable read
** Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
** Most reads will have at least 36bp from each end before hitting the adaptor.
** Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
* A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
** Thus their mate distance is +400bp rather than -3kb or -8kb.
** It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
** It is possible to remove these reads by removing
# mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
# mate pairs where one read begins with the adapter sequence.
** We are working on this, up to now our paired assembly stages have been disappointing.
* Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
** There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
** Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
** But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
** Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
** You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.
== Trimming ==
* [[Media:S_1_sequence.qualSummary.png|s_1 library quality plots]]
* Quality trimming: trim all bases with fastq quality eq "B" (0)
  cat *fastq | ~/bin/fastq2clq.pl
* Adaptor trimming:  Align all subsets  to adaptors
  C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
  3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
  5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
  id  len    gc%
  C  42    30.95 
  3  52    55.77 
  5  67    59.70 
  nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq  ...
  delta-filter -l 16 -q  ...
  cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
* Adaptor/primers median positions in the reads
  Lib      adaptorPos    5'primerPos 3'primerPos
  s_1      34-75          0-36        0-19
  s_2      2-40          0-36        0-19
* Mate stats
  Lib        #mates    adaptor.bothMates  adaptor.noMate  adaptor.oneMate  adaptor.oneMate(filtered) 
  s_1        34944099  269156              9804247          24870696          6048164(17.3%)             
  s_2        32540640  91528              16181288        16267824          1061858( 3.2%) 
  total      67484739                                                        7110022(10.5%)
* Filtered read clr stats
  Lib        #mates    #reads      min    q1    q2    q3    max        mean      n50        sum           
  s_1        6048164    12096328    64    80    95    115    124        96.56      101        1,168,064,632   
  s_2        1061858    2123716      64    80    96    115    124        96.59      101          205,122,226
  total      7110022    14220044    64    80    95    115    124        96.57      101        1,373,186,858
* Filtered read GC% stats
  Lib        #reads      min    q1    q2    q3    max        mean      n50                   
  s_1        12096322    0.00  31.93  36.26  42.35  100.00    37.45      38       
  s_2        2123715      0.00  31.88  36.46  42.48  100.00    37.56      38       
* Other frequent kmers
  26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces
  10mer : AAAAAAAAAA                              TTTTTTTTTT                : ~32% of the traces   
  53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG                : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
* Data processing pipeline
  s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFastq.amos
  s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos


* Location:
* Location:
   /fs/szattic-asmg4/Bees/Bombus_impatiens
   /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt                  # original fastq files
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_[1235678]_[12]_sequence.cor.txt    # corrected fastq files
 
  /fs/szattic-asmg5/Bees/Bombus_impatiens/s_[12]/all/s_[12].64.rev.frg                    # adaptor free s_[12] mated reads
  /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[1235678].frg                  # adaptor free & corrected s_[1235678] reads
 
== D Kelly's trimming ==
 
* Method: http://github.com/davek44/error_correction
* Read lens stats
  lib                      reads        min  q1  q2  q3  max  mean    n50  sum          #repeatReads
  s_1_1_sequence.cor.txt    5,436,814    30  71  90  115  124  89.70  100  487673256    1,071,779 (20%)
  s_1_2_sequence.cor.txt    5,864,225    30  77  92  110  124  93.27  98  546956864    1,208,767 (20%)
  s_2_1_sequence.cor.txt    1,053,858    30  80  96  118  124  97.16  102  102395785    222,290  (21%)
  s_2_2_sequence.cor.txt    1,041,846    30  78  93  110  124  93.51  99  97426679      208,029  (20%)
  s_3_1_sequence.cor.txt    33,668,302  30  105  124  124  124  111.04  124  3738530761              (12%)
  s_3_2_sequence.cor.txt    32,554,322  30  82  108  124  124  100.42  120  3269039697 
  s_5_1_sequence.cor.txt    33,579,744  30  109  124  124  124  111.65  124  3749192427 
  s_5_2_sequence.cor.txt    32,465,412  30  84  112  124  124  102.13  124  3315553171 
  s_6_1_sequence.cor.txt    33,535,725  30  109  124  124  124  111.60  124  3742602285 
  s_6_2_sequence.cor.txt    32,390,877  30  83  109  124  124  100.92  123  3268875471 
  s_7_1_sequence.cor.txt    33,674,235  30  112  124  124  124  112.19  124  3777917379 
  s_7_2_sequence.cor.txt    32,518,568  30  84  110  124  124  101.60  124  3303807321 
  s_8_1_sequence.cor.txt    11,777,821  30  113  124  124  124  112.30  124  1322658923 
  s_8_2_sequence.cor.txt    12,176,364  30  84  110  124  124  101.51  124  1236034348
  s_9_1_sequence.cor.txt    21,273,603  30  87  124  124  124  104.52  124  2223545139    #new3kb added 05/18/2010
  s_9_2_sequence.cor.txt    20,769,027  30  86  124  124  124  103.68  124  2153436097    #new3kb added 05/18/2010
 
  s_1,2                    13,396,743 
  s_1..8                    301,738,113  .  .    .    .    .    .      .    31958664367 
  s_2..8 (64bp+)            275,763,594  .  .    .    .    .    .      .    30693543885 
 
 
* Read/Mate counts
  lib                      reads        mates            range
  s_1                      11301039    5320163   
  s_2                      2095704      1034086          11301040..13396743     
  s_3                      66222624    31857304 
  s_5                      66045156    31820553       
  s_6                      65926602    31748713 
  s_7                      66192803    31891468 
  s_8                      23954185    11168258 
  s_9                      42042630    13245740
  s_1,2                    13,396,743  6,354,249
  s_1-8                    301,738,113  144,840,545
  s_1-9                    343,780,743  158,086,285


* There are 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
* Files   
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)
  /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg


* Tasks to figure out:
= Assembly =
1. Erroneous reads/bases, which we need to correct or discard
2. GC bias, so we can compute a-stats properly
3. Redundancy in the long paired ends, which are lane 1 and lane 2.


* Lane 1: 3Kbp insert; 34,944,099 reads
* SOAP version 1.04: /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/
* Lane 3: 8Kbp insert; 32,540,640 reads
* CA version  6.0    /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/


* Formatting:
== SOAPdenovo s_1..8 CBCB-corr (Illinois) ==
   /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/fastqToCA \
 
     -insertsize 3000 300 \
* From: Kim Walden 04/30/2010
     -libraryname 1 \
* Input:  s_1,2,3,5,6,7,8 reads
    -type illumina \
* Error correction: yes; CBCB method
     -fastq /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt > 1.frg
* Kim's e-mail
  ''I ran it in a stepwise fashion, not adding any of the available options and only changing K=31.''
  ''I then ran their more recent GapCloser algorithm, not the version that is part of the package when using option -F in "scaff".  ''
  ''I used their "all" assembly once, and I noticed that it automatically uses the -F option during scaffolding''
  ''Also, I used your data from the ftp site without checking for redundancy in the 3kb and 8kb lanes; did Daniela do the same?''
 
* Summary
                          elements min_len  max_len  mean    n50        sum
  contigs(>100)          103858  100      85850    2227    7541      231387390      # prior to scaffolding;
  scaffolds&contigs(>100) 12663    100      4648588  19282  1040109    244171008      # post-scaffolding and gap-filling
  scaffolds(span)        3736    252      4648588  64874  1047164    242369898      # same prior data minus contigs that didn't assemble into scaffolds.
  scaffolds(len; no N's)  3736    172      4585806  64034  1027043    239233801      # computed by CBCB
     
* Locations
  ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/Bimp_Salz_corr_050310_NewGapCloser.fasta
  # backup
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/
 
  # online
  http://www.life.illinois.edu/robertson/private/
  Login: robertson
  password: dna
 
== SOAPdenovo s_1..8 CBCB-corr (CBCB) ==
 
* Stats:
                          elem    min      max      mean    n50        sum
  contigs(all)            9943851  31      85850    60      45        594773555
  contigs(>100)          106213  100      85850    2177    7084      231246190
  scaffolds+contigs(>100) 12473    100      5933710  20428  1086614    254800874    # based on scafSeq file 
  scaffolds(span)        3664    236      5933710  69090  1115496    253148288    # based on scafSeq file
  scaffolds(len)          3664    134      5522831  63411  1059378    232338949
  readsAssembled          208389968                                                  # based on readOnContig file
  readsFiltered          289681090                                                  # based on readOnContig & log file
  readsTotal              301738113                                                  # based on config file
  readsDeleted            441343                                                      # based on log file
* Run gapClosing on assembly:
  stillOpen 14784
  closed    42379
 
* Contig cvg: GC bias:
[[Image:Bombus.contig10000.png]]
  cat asm.K31.contig100.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.709501493345893
  cat asm.K31.contig1K.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.923151349747983
 
  #100+ ctgs
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  len                  106213    100    140    437    2420  85850      2177      7084      231246190     
  gc                  106213    0.00  33.13  38.00  43.44  100.00    38.74      39        4115066
  cvg                  106213    0.6    39.0  50.8  58.9  160.8      47.38      54        5032329
 
  #1K+ ctgs
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  len                  41965      1000  1860  3295  6338  85850      5110      7726      214421699     
  gc                  41965      19.18  34.31  37.96  41.79  59.25      38.19      38        1602711
  cvg                  41965      8.0    42.0  49.9  55.8  93.8      47.74      51        2003339
 
* Run soap2 to map reads back to the assembly:  "-l 32 -v 2"
  readsAligned            173063062
 
* Locations
  mulberry:/scratch0/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/      # original
   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/  # backup
 
== SOAPdenovo s_1..9 CBCB-corr (CBCB) ** ==
 
* Uses the new 3Kbp library : s_9 (reads have been reversed)
 
* Stats:
<pre style="background:yellow">
  .                    elem      min    q1    q2    q3    max        mean      n50        sum
  scf                  11178      100    111    135    390    5655980    23014      1205321    257251549
  ctg                  10856652  31    .      .      .      85850      57        43        627095607
  ctg100+              106741    100    .      .      .      85850**    2165      6939      231167576
 
  scfSeqContigs        67900      100    340    1788  4373  85819      3437      7586      233372689
  scfSeqContigsClosed  20651      100    128    488    7991  600179    11715      64083      241921308
</pre>
 
* Location
  mulberry:/scratch2/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/      # original
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/  # backup
 
== CA s_1..8 corr OBT (CBCB) ==
 
=== Issues ===
* Used outie s_1,2 lib ; needs to be redone
 
=== runCA spec ===
  doOverlapTrimming =        1
  ovlOverlapper =            ovl
  unitigger =                bog        # cmd buildUnitigs
  *BatchSize =              2000000
  *BlockSize =              2000000
  ovlMemory  =              8GB
  doExtendClearRanges =      0          # fast run
 
=== Gatekeeper ===
* ~ 58X cvg
 
  gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
  LOAD                  STATS       
 
  7                    libInput     
  7                    libLoaded   
  0                    libErrors   
 
  301738113            frgInput     
  301738113            frgLoaded   
  25974519              frgErrors   
 
  144840545            lkgInput     
  124575590            lkgLoaded   
  20264955              lkgErrors   
  0                    lkgWarnings 
 
  275763594            numRandom   
  301738113            numNormal   
 
 
  GLOBAL                STATS       
 
  301738113            FRG         
  7                    LIB         
 
  LibraryName          numActiveFRG  numDeletedFRG  numMatedFRG  readLength  clearLength    #egrep 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT'
  GLOBAL                274282535    27455578      247151634    30524746650  29445986650 
  LegacyUnmatedReads    0            0              0            0            0           
 
  s_1                  10409320      891719        9039810      996306280    994802849      #1134233(11%)
  s_2                  2054575      41129          1987728      197829588    197593453      #224648(11%)
  s_3                  59894531      6328093        53925476    2382786529  2127257586    #3037085(5%) 
  s_5                  60127274      5917882        54512448    2459696212  2209078913    #5%
  s_6                  59813485      6113117        54062058    2396359740  2152656620    #5%
  s_7                  60278536      5914267        54647258    2476326572  2237423989    #5%
  s_8                  21704814      2249371        18976856    2435572545  2347304056    #5%
 
  gatekeeper -dumplibraries -tabular asm.gkpStore
  UID    IID  Orientation  Mean  StdDev  NumFeatures 
  s_1    1    I            4000  400    1582823088 
  s_2    2    I            8000  800    1582823088 
  s_3    3    I            400  40      1582823088 
  s_5    4    I            400  40      1582823088 
  s_6    5    I            400  40      1582823088 
  s_7    6    I            400  40      1582823088 
  s_8    7    I            400  40      1582823088 
 
  s_1,2 lib : ids:1..13,396,744
  overlap jobs that need to be redone : 1036 out of 11476 < 10%
  ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'
 
=== Meryl ===
 
  meryl -Dh -s 0-mercounts/asm-C-ms22-cm0
  Found 23686053415 mers.
  Found 265670791 distinct mers.
  Found 8682767 unique mers.
  Largest mercount is 21,987,616; 668 mers are too big for histogram.
  ...
  #freq  count  %
  1      8682767 0.0327  0.0004
  2      4288602 0.0488  0.0007
  3      3041005 0.0603  0.0011
  ...
  72*     3803015 0.7150  0.3229
  ...
  1048158 1      1.0000  0.9274
 
  fasta2tab.pl  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/0-mercounts/asm.nmers.ovl.fasta  | sort -n -r | head -5
  21987616        ATATAACGTAATACGTTATAAC
  20992872        CGTTATAACGTATTACGTTATA
  20510821        ACGTTATAACGTATTACGTTAT
  20484950        CCATATAACGTAATACGTTATA
  20452950        CATATAACGTAATACGTTATAA
 
  cat egrep.sh
  egrep -c 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/seq_obt/*clean.seq > egrep.count
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2
  ...
 
===  OBT ===
 
* job count
  n=((301738113/2000000)+0.5)=151
  OBT/OVL jobs = n*n/2 =151*76=11476
  ~ 20 min/OBT job =~ 0.33 hr/OBT job  => 0.33*11476/32=5 days
 
* OBT trimming
  key            sum            count          mean
  CLR            31958664367    301738113      105.915239043733
  OBTINITIAL      30779687358    301610679      102.051052900551
  OBTMERGE        30052401719    301610679      99.6397137483318
  OBTCHIMERA      30052138325    301610679      99.638840456972
  LATEST          30052138325    301610679      99.638840456972
 
* use OBTCHIMERA as CLR if want to redo the assembly !!!
 
=== Overlap ===
 
* job count : same as OBT
 
* Overlap counts
  cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o
  Lib    0  5'  199542773  50 
  Lib    0  3'  199542773  50 
  Lib    1  5'  6682072    38 
  Lib    1  3'  6682072    37 
  Lib    2  5'  1306720    37  ???
  Lib    2  3'  1306720    42  ???
  Lib    3  5'  43868663  50 
  Lib    3  3'  43868663  50 
  Lib    4  5'  43939366  51 
  Lib    4  3'  43939366  50 
  Lib    5  5'  43620676  50 
  Lib    5  3'  43620676  50 
  Lib    6  5'  43903838  51 
  Lib    6  3'  43903838  50 
  Lib    7  5'  16221438  50 
  Lib    7  3'  16221438  50
 
* 199542773 out of 257786946 (77%) usable reads had overlaps     
* 7988792  out of 11443537  (69%) usable s_1,2 reads had overlaps
 
=== Frg/Ovl correction ===
 
* Frgcorr jobs: 151
  wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
 
* Ovlcorr jobs: 151
  wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
 
=== OverlapStore ===
 
* Failures
** overlapStore: too many files (max 10240) ; fix: recompile the code
** chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report
 
=== Bog ===
 
* Run time: 14 hr
* default unitigger : does not seem to work on such big datasets
* Utg len stats:
  cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
  elem      min    q1     q2    q3    max        mean      n50        sum           
  63916892  47    85    150    150    22861      125.23    150        8004296170   
 
=== Bog consensus ===
 
* job count :
  wc -l 4-unitigger/asm.partitioningInfo
  110
 
* Failures
  grep FAILED 5-consensus/asm*err
  5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
 
  # Fix:
  # update layout (CA 6.1)
  mkdir utg
 
  # update layout (CA 6.1)
  grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -u  $F[0] -d layout > utg/$F[0].layout\n";'  | ~/bin/scheduler.pl
  grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "~/bin/updateUTGlayout.pl utg/$F[0].layout >  utg/$F[0].layout.update\n ";'              | ~/bin/scheduler.pl
  grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg/$F[0].layout.update\n" ;'            | ~/bin/scheduler.pl  -c 1
  grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | awk '{print $1}' | sort -u | p 'print "touch $F[0].success\n";'                                              | ~/bin/scheduler.pl
  # create success file
  touch 5-consensus/consensus.success
 
  # relaunch runCA
 
=== Cgw ===
 
* Failures
  grep -B 5 "^Failure message:" runCA.log
 
=== Cgw consensus ===
 
* job count : 84
  wc -l 7-CGW/asm.partitionInfo
 
* ctg count: 1698486
  cat 7-CGW/asm.partitionInfo | /nfshomes/dpuiu/bin/sum.pl -i 3 -t ctgs | pretty -o
 
=== Terminator ===
 
* Failure:
** buildPosMap : segFault :ids too long???
  Fix: split asm file(optional) and reassign new accession numbers
  cd split/
  ~/bin/splitAsm.pl < asm.asm => ... CCO.asm, SCF.asm
  cat CCO.asm CLK.asm SCF.asm SLK.asm | ~/bin/newAsmIds.pl | buildPosMap -o asm
 
=== Stats ===
 
* [[Media:Bb.CA.s_1-8.cor.qc|Bb.qc]]
* Stats
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  15009      74    4786  9789  20021  415318    16916.33  29133      253897199     
  ctg                  145036    64    93    116    1035  138630    1498.67    7726      217361459     
  deg                  1507280    54    98    118    146    1434      148.28    137        223502287     
  utg                  2786974    54    90    110    142    25644      197.04    277        549140918     
  singletons          61129865
  usableSeqs          257786946
 
* Stats (100bp+)
                        elem    min      max      mean    n50        sum
contigs(>100)          99429    100      138630    2148    7972      213595680
scaffolds(span)        14992    100      415318    16935  29133      253895676
scaffolds(len)        14991    100      303946    14499  24091      217357577
 
* Library stats (totals)
  lib    ctg      sur      deg      singl    total
  s_1    2251597  162800  3119665  3988095  9522157
  s_2    424821    41107    563656    891795    1921379
  s_3    28880189  1078891  13600909  12735637  56295626
  s_5    28983834  1077334  13561390  12962731  56585289
  s_6    28663974  1062939  13559494  12896855  56183262
  s_7    28784505  1058364  13713200  13020274  56576343
  s_8    10455533  383065  5229814  4634478  20702874
  all    128444453 4864500  63348128  61129865  257786930
 
* Library stats (percentages)
  lib    ctg  sur  deg  singl  total
  s_1    24  2    33  42    100
  s_2    22  2    29  46    100
  s_3    51  2    24  23    100
  s_5    51  2    24  23    100
  s_6    51  2    24  23    100
  s_7    51  2    24  23    100
  s_8    51  2    25  22    100
 
=== Mate redundancy ===
 
* Criteria:
  Given 2 mates 
    1: a1,a2
    2: b1,b1
  We have the following 2 overlaps
    1: a1,b1  a2,b2   
     2: a1,b2  a2,b1 
  At least 75% of one read overlaps the other read
 
* Location:
  cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
 
  wc -l s_1.mates asm.posmap.frgutg.s_1.*
  5320163 s_1.mates                                    # input mates (from FRG file; 2 reads/mate)
  1232734 asm.posmap.frgutg.s_1.mates                  # assembled mates (both reads assembled)
  366262 asm.posmap.frgutg.s_1.redundant_mates        # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy
  312206 asm.posmap.frgutg.s_1.redundant_mate_pairs    # redundant assembled mate pairs
 
  wc -l s_2.mates asm.posmap.frgutg.s_2.*
  1034086 s_2.mates                                    # input mates (from FRG file; 2 reads/mate)
  255905 asm.posmap.frgutg.s_2.mates                  # assembled mates (both reads assembled)
  110570 asm.posmap.frgutg.s_2.redundant_mates        # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy
  1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs    # redundant assembled mate pairs
    11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters
 
=== Runtimes ===
  #    step            hours
  0    gatekeeper      5.36 
  1    initialTrim    2.07 
  2    meryl          9.90 
  9    overlap.sh      11.31 
  11    find            0.00 
  16    overlapStore    13.79 
  17    consolidate    5.12 
  18    merge-trimming  1.41 
  19    chimera(1)      10.04 
  20    overlap.sh      49.86 
  21    find            0.00 
  22    overlapStore    16.68 
  23    frgcorr.sh      5.57 
  24    cat-corrects    0.10 
  25    ovlcorr.sh      4.64 
  26    cat-erates      0.90 
  27    overlapStore    6.67 
  28    buildUnitigs    13.88 
  29    gatekeeper      1.31 
  30    consensus.sh    3.41 
  31    cgw(1)          65.00 
  43    gatekeeper      4.39 
  44    consensus.sh    4.70 
  46    terminator      6.26 
  47    asmOutputFasta  7.75 
  48    dumpSingletons  3.68 
  49    buildPosMap(1)  3.79
 
=== Location ===
  ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/                            # original assembly
  ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR  # dumped OBT clear ranges, 0/1 undel/del
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor                        # backup
 
== CA s_1..8 corr OBT redo (CBCB) ==
 
=== Issues ===
* Fixed outie s_1,2 lib : clear ranges will be different
* s_3..8 : stay the same
 
=== runCA spec ===
  doOverlapTrimming =        0        !!! already done in the prev assembly
  saveOverlaps  =            1        !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted
  ovlOverlapper =            ovl      # same
  unitigger =                bog      # same 
 
  BatchSize =                2000000    # must be the same so overlaps are not redone
  BlockSize =                2000000    # must be the same
  ...
  doExtendClearRanges =      1          # initially set to 2; seems to take too long => reset to 1
 
=== Input ===
 
* Modified frg files:
** dumped "CA s_1..8 corr OBT" assembly gatekeeper
** undelete act:D fragments so the ids stay the save
** reverse s_1,2 reads
 
=== Gatekeeper ===
 
* Make sure the iids match
 
=== Overlap ===
 
* Rerun only the overlap jobs for the s_1,2 reads
 
=== OverlapStore ===
 
*  To build faster in the future
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job 
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                              # 3 min/job
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)
 
=== Frgcorr ===
 
* Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)
 
=== Ovlcorr ===
 
* The error rate on overlaps with error went down by ~ 1.28%
 
=== Bog  ===
 
  cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
  elem      min    q1    q2    q3    max        mean      n50        sum           
  60970504  47    85    150    150    93191      126.17    150        7692801690
 
* Unitig stats
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep -c ^unitig_num_frags                                    # 60,970,523  : total unitigs
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -c ' 1$'                      # 59,776,572  : unitigs with only 1 read
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -v ' 1$' | getSummary.pl -i 2  # stats for unitig with 2+ reads
                elem        min    q1    q2    q3    max        mean      n50        sum           
  1read/utg    59,776,572                                                                59,776,572
  2+reads/utg  1,193,951    2      3      6      28    74715      165.84    4712      198,010,374
 
=== Bog consensus ===
 
* job count :
  wc -l 4-unitigger/asm.partitioningInfo
  110
* Failures: 1 failed unitig
  grep FAILED 5-consensus/asm*err
  5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED. 
  Fix: same as before
 
=== Cgw ===
 
* Failure:
** after 7-0-CGW/asm.ckp.4
  tail runCA.log
  ERROR:  Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)
  37980:Scaffold 6409:
  37981-  Idx    Chunk    (  Start, Variance)  (    End, Variance)      Len
  37982-    0: 60983787    (      0,        0)  (  31646,      823)    31646
  37983-    1:  69691    (  35389,      934)  (  31650,      837)    3739
  37984-    2: 59907028    (  35103,      942)  (  42035,    1122)    6932
  37985-    3: 60799298    (  41640,    1132)  (  64897,    1737)    23257
  37986-    4: 60975976    (  64924,    1784)  (  88439,    2395)    23515
  37987-    5: 60988396    (  88431,    2416)  (  122023,    3289)    33592
  37988-    6: 60988698    (  121551,    3324)  (  123589,    3377)    2038
  37989-    7: 59905266*  (  122331,  -163479)  (  121859,  -163491)      472
  37990-    8: 60988599    (  122275,  170138)  (  134564,  170457)    12289
 
  ./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.
  ./AS_REZ/GapFillREZ.c:8339:      "ERROR:  Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n",  #message
 
  Fix: modfy the code; comment assertion statemants
  cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c
  // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX);
  // assert (min_right_variance >= 0.0);
 
* Preliminary stats
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
  cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
 
  #7-0-CGW/rez.i36.log :
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1820      84    162    4583  82337  3161523    125884.07  667487    229109005      # ids from 0..13113
  ctg                  26453      64    962    3756  10485  255820    8759.59    21800      231717378     
  ctg200+              23343      200    1760  4772  11873  255820    9908.34    21845      231290344
  scf.gaps            877        1      4      12    36    373        27.71      64        24298
 
  #7-2-CGW/rez.i01.log :
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  4903      66    1283  6908  39466  1148626    47212.57  209756    231483215     
  ctg                  23480      64    1370  4397  11757  255820    9858.75    23476      231483360
 
  7-2-CGW/rez.i06.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  3818      64    268    6813  61948  1198480    60723.65  257457    231842914     
  ctg                  24005      64    1165  4161  11439  255820    9659.47    23819      231875677
 
  #7-2-CGW/rez.i11.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  3733      64    268    6880  65584  1391034    62111.99  258823    231864045     
  ctg                  24004      64    1165  4161  11439  255820    9659.98    23819      231878256   
 
  #7-2-CGW/stone.i02.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1854      83    153    4039  52506  2983045    125375.58  860583    232446325     
  ctg                  23863      64    674    3833  11368  255820    9761.94    25892      232949194     
 
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1847      84    152    4440  62397  3337984    125470.25  835387    231743549     
  ctg                  26088      64    429    3213  10218  297786    8941.40    25521      233263147     
 
* TigStore
  tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail
  maID            isPresent      isDeleted      ptID    svID    fileOffset      covStat
  60970540        1               0              0      3      58023924696    84
 
  tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1
  maID            isPresent      isDeleted      ptID    svID    fileOffset
  61001100        1              0              0      32      16307420
 
* Checkpoints
  # 7-0-CGW             
                                                                    #times  iterMax
  CHECKPOINT_AFTER_BUILDING_SCAFFOLDS    "ckp01-ABS"             
  CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"              5        10
  CHECKPOINT_AFTER_CLEANING_SCAFFOLDS    "ckp03-ACD"              1
  CHECKPOINT_DURING_1ST_SCAFF_MERGE      "ckp04-1SM-partial"      12              # we should see more iterations if the scaffolding goes well; also ECR would take longer
  CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"              1
  CHECKPOINT_AFTER_STONES                "ckp06-AS"         
  CHECKPOINT_DURING_2ND_SCAFF_MERGE      "ckp07-2SM-partial" 
  CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"              1
  CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  # 7-2-CGW             
                                                                    #times  iterMax
  CHECKPOINT_AFTER_BUILDING_SCAFFOLDS    "ckp01-ABS"             
  CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"              9        10
  CHECKPOINT_AFTER_CLEANING_SCAFFOLDS    "ckp03-ACD"              1
  ...
  CHECKPOINT_DURING_1ST_SCAFF_MERGE      "ckp04-1SM-partial"      10
  CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"              1
  CHECKPOINT_AFTER_STONES                "ckp06-AS"                1
  CHECKPOINT_DURING_2ND_SCAFF_MERGE      "ckp07-2SM-partial"      1
  CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"              1
  CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  ...
  CHECKPOINT_AFTER_PARTIAL_STONES        "ckp10-PS"                1
  CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS" 
  CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC" 
  CHECKPOINT_AFTER_RESOLVE_SURROGATES    "ckp13-RS" 
  CHECKPOINT_AFTER_OUTPUT                "ckp14-FIN"
 
* Long running time : implemented 2 cgw options -ic, -im ....
* Running times
 
  CGW-0: START Wed Apr 28 22:30:12 ; END Mon May  3 03:34:57
  ECR-1: START Mon May  3 03:34:58 ; END Mon May  3 12:28:54       
  CGW-2: START Mon May  3 12:28:54 ; End Wed May 12 22:05:00
 
* Library insert histograms
  # Example
  cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist
  cat 7-0-CGW/stat/scaffold_final.distlib_2.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_2.cgm.hist
  gnuplot
    set xrange [0:10]
    plot "scaffold_final.distlib_1.cgm.hist" w l, "scaffold_final.distlib_2.cgm.hist" w l,
 
 
* Failure:
** after 7-2-CGW/asm.ckp.42 (after running about 2hrs)
  tail runCA.log
 
  * Screwed up scaffold 3790: Chunk 60994012 has bad mean
  * Scaffold 3790 iteration 1  1 bad CIs out of 33 LSE:2664.55 improvement:1
  * Reinserting chunk 60994012 in scaffold 3790
  * improvement = 0.0721104 LSE = 2664.55
 
  ----------------------------------------
  Failure message:
  scaffolder failed
 
=== ECR ===
 
* Count scaffold gaps : total/closed
  cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap'  # 24635
  cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap'      # 2803
 
=== Terminator ===
 
* Get id range
  grep -m 1 ^7 asm.posmap.utglen asm.posmap.ctglen asm.posmap.deglen asm.posmap.scflen
  asm.posmap.utglen:7180000000000 124   
  asm.posmap.deglen:7180001635156 133   
  asm.posmap.ctglen:7180001635158 124
  asm.posmap.scflen:7180002817601 161   
   =>
   =>
   {LIB                                                                                                                                     
   utg: 7180000000000..7180001635155
   act:A                                                                                                                                   
   ctg/deg:7180001635156..7180002817600
   acc:1                                                                                                                                   
   scf:7180002817601..7180002819521
   ori:I                                                                                                                                   
 
   mea:3000.000                                                                                                                             
=== Stats ===
   std:300.000                                                                                                                             
 
   src:                                                                                                                                    
* Assembly stats
   .                                                                                                                                        
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
   nft:10                                                                                                                                   
  scf                  1921      65    148    3389  55043  3122263    149121.08  982222    286461602     
   fea:                                                                                                                                     
  ctg                  91934      64    100    119    180    255828    2621.59    25123      241013378     
   forceBOGunitigger=1                                                                                                                      
   deg                  1090511    54    103    124    152    7068      139.41    135        152024633     
   isNotRandom=0                                                                                                                           
   utg                  1635156    54    97    118    141    94720      267.81    1636      437917658
   doNotTrustHomopolymerRuns=0                                                                                                             
 
   doRemoveDuplicateReads=0                                                                                                                
* Assembly stats(100bp+)
  doNotQVTrim=0                                                                                                                           
                          elem    min      max      mean    n50        sum
  goodBadQVThreshold=0                                                                                                                     
   contigs(>100)          69270    100      255828    3451    25383      239118680
   doNotOverlapTrim=0                                                                                                                      
  scaffolds(span)        1870    100      3122263  153185  982222    286457118
   usePackedFragments=1                                                                                                                     
  scaffolds(len)          1870    100      2990211  128880  870182    241007381
   illuminaFastQType=illumina                                                                                                               
 
   illuminaSequence=/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt    
=== Location ===
   .                                                                                                                                        
 
   }                                                                                                                                       
  ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/          # original assembly
   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo      # backup
 
== CA s_1, 2 partial ,3..8 corr OBT (CBCB) ** ==
 
=== Issues ===
* Use only s_2 lib reads which seem 8K apart : ~ 14% of the s_2 reads aligned by soap2 to SOAPdenovo Illinois assembly within 8+-2Kbp of one another
* Alignments (very similar results)
 
   soap2-index  Bimp.fasta                #  soap2-index -> 2bwt-builder
   bowtie-build Bimp.fasta  Bimp.fasta
 
   soap2 -D Bimp.fasta.index -a s_2.1.fastq -b s_2.2.fastq -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2.mated.soap2 -2 s_2.single.soap2
   bowtie  Bimp.fasta      -1 s_2.1.fastq -2 s_2.2.fastq -l 32 -p 8 -n 2 -I 6000 -X 10000 >  s_2.mated.bowtie
 
  bowtie  Bimp.fasta        s_2_0.fastq -l 32 -p 8 -n 2                                  >  s_2_0.bowtie
  wc -l s_2.mated.*
  288880 s_2.mated.soap2
  288668 s_2.mated.bowtie
   intersect.pl s_2.mated.*
   288368
 
=== runCA spec ===
   doExtendClearRanges=0  
 
=== Input ===
 
* Gatekeeper: same as prev assembly
* Overlaps: filter out all overlaps for most of the s_2 reads
 
=== Bog  ===
   cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
   elem      min    q1    q2    q3    max        mean      n50        sum           
   61930057  47    85    150    150    113529    125.85    150        7793894671      # max went up from 93191->113529
 
=== Bog consensus ===
 
* Failures
   grep FAILED 5-consensus/asm*err : 113 unitigs between 5.. 33547 frags
  5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
 
* Fix:
  grep    FAILED 5-consensus/asm*err | p 'print "$F[2]\n";' | perl ./fixLayout.pl | scheduler.pl
  grep -c FAILED 5-consensus/*err | grep -v ':0$' | p '/(.+).err/; print "touch $1.success\n";' | sort -u | scheduler.pl
 
=== Cgw ===
 
* Preliminary stats
  #7-0-CGW/stone.i02.log
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  1847      84    152    4440  62397  3337984    125470.25  835387    231743549     
  ctg                  26088      64    429    3213  10218  297786    8941.40    25521      233263147
 
  cat 7-0-CGW/cgw.out
  ...
  Processed 61,930,044 unitigs with 301,738,113 fragments
 
=== Terminator ===
 
* Message counts
  CTP    91502
  SCF    1896
  CLK    3288923
 
cat CLK.asm | grep ^ori: | count.pl
ori:O  979139
ori:A  869982
ori:N  857732
ori:I  582070
 
=== Stats ===
 
* Assembly stats
<pre style="background:yellow">
  .                    elem      min    q1    q2    q3    max        mean      n50        sum
  scf                  1896      76    150    4044  67922  4021294**  151760    1017298**  287738041
  ctg                  92307      63    100    119    186    297795**  2612.99    24781**   241197400
   deg                  1086650    54    103    124    152    6822      139.41    135        151484589
   utg                  1574225    54    98    119    143    123354    274.33    2027      431859732
</pre>


= Assembly =
* Assembly stats(100bp+)
  .                      elem    min      max      mean    n50        sum           
  contigs(>100)          69944    100      297795    3421    25048      239324761     
  scaffolds(span)        1869    100      4021294  153951  1017298    287735536     
  scaffolds(len)          1869    100      3429736  128833  860270    240790206
 
* Library stats
  .                    badLong  badOuttie  badSame  bothChaff  bothDegen  bothSurrogate  diffScaffold  good  oneChaff  oneDegen  oneSurrogate 
  HWI-EAS385_0001:1    0.15    0.06      0.01    19        10.18      0              0.45          46.9  21.41    1.59      0.19         
  HWI-EAS385_0001:2    0.13    0.01      0        84.86      0.02      0              0.27          14.2  0        0.36      0.02         
  HWI-EAS385_0001:3    0.06    0.01      0        14.45      9.03      0.04          0.05          60.11  15.17    0.68      0.16         
  HWI-EAS385_0001:5    0.06    0.01      0        14.84      8.93      0.04          0.05          59.98  15.02    0.65      0.16         
  HWI-EAS385_0001:6    0.06    0.01      0        14.76      9.04      0.04          0.05          59.7  15.29    0.65      0.16         
  HWI-EAS385_0001:7    0.06    0.01      0        14.73      9.12      0.04          0.05          59.51  15.44    0.64      0.16         
  HWI-EAS385_0001:8    0.06    0.01      0        14.05      10.1      0.03          0.05          58.66  15.92    0.7      0.16         
 
* Issues
 
** Gap sizes vary between  -297,562 bp and 772,064  bp.  They should be no longer than ~ 8Kbp (max insert size)
  cat 9-terminator/split/SCF.asm | grep -A 3 "^{CTP" | grep ^mea: | sed 's/mea://' | sort -n | head
  -297562
  -216729
  ...
  162239    # scf7180002754932.314143..476382  : 26 s_1,3,5,6,7 mate pairs link the two sides of the gap
  772064    # scf7180002754479.867224..1639288 : 2  s_1 mate pairs link the two sides of the gap
 
  ~/bin/posmap2scaff.pl 9-terminator/asm.posmap.ctgscf | grep -v "^>" | sort -nk4 | tail
  7180002653372 EB 89    69161
  7180002699293 EB 78    92310
  7180002731234 BE 64936 162239
  7180002726839 BE 46924 772064
 
  grep ^mea: SCF.asm | sed 's/mea://' | getSummary.pl -i 0 -t scf.gap | pretty
  .          elem  min          q1      q2      q3      max        mean    n50    sum           
  scf.gap    91502  -297562      19      222      667      772064      -141    772064  -12910029
  scf.gap-  13788  -297562      -2166  -89      -12      0          -4284    0      -59071991
  scf.gap+  77699  0            84      325      775      772064      594      1194    46161963
 
=== Location ===
  mulberry:/scratch0/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/      # original (to be deleted)
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/  # backup
 
== Other ==
 
=== SOAPdenovo on CA contigs ===
 
  cat asm.ctg.fasta | ~/bin/assignFastaIds.pl | awk '{print $1}' > asm.contig
  lenseq asm.contig > asm.contig.len
 
  echo Edge_num `wc -l asm.contig.len` `wc -l asm.contig.len` > asm.ContigIndex
  echo "index  length  reverseComplement" >> asm.ContigIndex
  lenseq asm.K31.contig | awk '{print $1,$2,1}' >> asm.ContigIndex
 
  SOAPdenovo map  -s SOAPdenovo.config -g asm > SOAPdenovo.log        # ok
  SOAPdenovo scaff -g asm -F >> SOAPdenovo.log                        # fails: a nucmer of files are missing
 
= Assembly comparison =
 
== CA.contigs vs SOAPdenovo.contigs ==
 
* Filter contigs >=100 bp
* Alignment
  nucmer --mum -c 100  CA.fasta SOAPdenovo.fasta
  show-coords -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | p ' $F[19]="." unless($F[19]); print $F[19],"\n";' | count.pl | pretty -o
  [CONTAINS]    80413 
  [CONTAINED]    4646 
  [IDENTITY]    1960 
  [BEGIN]        4152   
  [END]          4034 
  .              157    # partial alignments
 
  # get "broken" contigs
  show-coords -H        /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t all
  show-coords -L 500 -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t 500+
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  all                  118        155    4630  9817  29042  144499    22498.97  54311      2654878   
  500+                16        1844  8205  16551  28426  66880      20514.88  28426      328238 
 
* Location
  mulberry:/scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs
 
== CA.contigs vs SOAPdenovo.scaffolds ==
 
* All ctgs & scf
* Alignment
  #ref/qry all
    .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  SOAPdenovo.scf(span)    12473      100    113    144    498    5933710    20428      1086614    254800874   
  SOAPdenovo.scf(len)    12473      100    113    143    341    5522831    18759.84  1034925    233991535
  CA.ctg                  92307      63    100    119    186    297795    2612      24781      241197400
 
  #best hits
  nucmer --mum -c 40 CA.fasta SOAPdenovo.fasta
  delta-filter -1 40 -i 95 -o 50 -1 SOAPdenovo-vs-CA.40.delta >! SOAPdenovo-vs-CA.40.filter-1.delta
 
= Genbank submission =
 
* Project registration : http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi
  http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=20FF4536-24A4-4D8C-94FA-0755E1326E33
  Project ID: 61101
 
  Submitting Organization:              University of Illinois & University of Maryland
  Sequencing Center:                    Keck Center for Comparative and Functional Genomics, University of Illinois  ??? BCUI ???  Biotechnology Center, Univ. Illinois (BCUI)
  Consortium Name:
  Organism Name:                        Bombus impatiens
  Strain/isolate/breed:
  Locus Tag Prefix:                    BIMP      #3+ letters
  Source of DNA used for sequencing:    haploid males
  Sequencing Method:                    wgs      ???
  Sequencing Technology:                Illumina
  Estimated Genome Size:                250Mb    # the haploid genome size
  Brief description of the importnace:  #2 pollinator in the US
  Genome Coverage                      127x
  Sequencing Technology                Illumina IIx
  Comments to the staff:               
    DNA
    whole genome sequencing
    single genome;
    no annotation
    assembly name BIMP_1.0
    assembly method: Celera Assembler 6.1
    plan to update: ?
    expect to release
    agp:
 
* Sequin:
  Manuscript:       
  Authors:    Hugh Robertson, Matt Hudson, Kim Walden, Aarti Venkat, Tom Newman, Daniela Puiu, Tanja Magoc, David Kelley, Aleksey Zimin, Steven Salzberg, Gene Robinson
 
----
 
* Best assembly (CA)
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/
  ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bombus_impatiens/Assembly/CA/
 
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  ctg                  92307      63    100    119    186    297795    2613      24781      241197400     
  ctg.contam          82        1331  10485  19924  31977  112638    25648      32851      2103095 
  ctg.clean            92225      63    100    119    185    297795    2593      24628      239094305   
  ctgNs                76        75    536    3124  6834  44990      5326      10261      404781       
 
  scf                  1896      76    150    4044  67922  4021294    151761    1017298    287738041
  scf.ctg.contam      24        14315  45008  71878  141921 1116872    141140    164885    3387362       
 
  deg                  1086650    54    103    124    152    6822      139        135        151484589 
  deg.contam          1129      65    114    141    207    1035      198        211        223588       
  deg.clean            1085521    54    103    124    152    6822      139        135        151261001     
  deg.Ns              37        98    145    234    426    914        313        437        11574           
 
* Contaminant list: ???
  /fs/szattic-asmg4/tmagoc/bees_bacteria/IDs
 
== 1st submission ==
 
* Submit all contigs (should have deleted the singletons < 200bp)
 
* Directories & files
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.0.*
 
  server:      ftp-trace.ncbi.nlm.nih.gov
  user:        cbcb_trc
  directory:  test/
  files names: BIMP_1.1.sqn , BIMP_1.1.agp
 
  number contigs:  92,231
  number scaffolds: 1,895
 
* NCBI email (Dec 23rd)
  there are 3 contigs that need to be trimmed:
 
  Contig name Length Span(s) Apparent source
  7180002728824 11695 72..148 Escherichia coli
  7180002730077 25748 22703..25611 Escherichia coli
  7180002740828 30102 134..1688 Escherichia coli
 
  Since these are internal hits, consider whether the contig needs to split (if the E.coli was the 'joining' element).
 
* NOTE: When I blastn these 3 contigs to nr (distant blast) , all 3 contigs had multiple good hits to other bacteria , so probably they are all contaminated and should be removed completely, not just trimmed
 
== 2nd submission ==
 
* Jan 24th
  number of scaffolds: 1191
  number of contigs:  91524
 
* Directories & files
  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.1.*
 
== NCBI release ==


*
* [http://www.ncbi.nlm.nih.gov/nuccore/AEQM00000000 AEQM00000000] WGS
  /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA
* [ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/ ftp] GCA_000188095 to come by end of Feb 2011

Latest revision as of 19:51, 17 May 2011

Data

  • ~ 500Mbp genome (diploid); 250Mbp halpoid
  • Complete mitochondrion genomes:
 NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
 NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
 only 88% identity; no rearrangements, only snps, short indels
  • NCBI TaxId
 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=132113

Traces

  • 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
  Lane   Insert  ReadLen    #Mates          #Reads        Coverage   Orientation     Comments                         #RepeatsReads
  1      4K      124        34,944,099                    14X        outie           865,687(1.2%) reads qual==0      30,576,297 (44%)
  2      8K      124        32,540,640                    13X        outie                                            26,972,424 (41%)
 
  3      400     124        34,745,750                    .                                                           8,319,335 (12%)
  5      400     124        34,601,239                    .                                                           8,398,759 (12%)
  6      400     124        34,553,857                    . 
  7      400     124        34,682,612                    . 
  8      400     124        12,975,839                    . 
 
  9      3K      124        29,615,036                    .         outie            new3kb ; received 05/18/2010
   
  1-8                       219,044,036     438,088,072   108X
  1-9                       248,659,072     497,318,144

  3-8    400     124        151,559,297     303,118,594   75X        innie
  1,9    3-4K    124        64,559,135      129,118,270   32X        outie
  2      8K      124        32,540,640      65,081,280    13X        outie

 Two new 3kb libraries to come; ~ 70% are true pair ends

Adaptors

 >circ
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >circ.revcomp
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 >5                               
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 >3
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[AGATCTCGGTGGTCGCCGTATCATTA][A+]
               0        1         2         3         4 
               123456789012345678901234567890123456789012
 circ          CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
                   .............   ..   ............. 
 circ.revcomp  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 CGGCATTCCTGCTGAACC revcmp GGTTCAGCAGGAATGCCG
              
               0        1         2         3         4 
               123456789012345678901234567890123456789012
                           GGTTCAGCAGGAATGCCG
 5             GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 
               CGGCATTCCTGCTGAACC
 3             CGGCATTCCTGCTGAACCGAGATCGGAAGAG


 #18mers (5')
 CGTAATAACTTCGTATAG
 TCGTATAACTTCGTATAA
 GATCGGAAGAGCGGTTCA
 CGGCATTCCTGCTGAACC
 #18mers (5',3')
 CGTAATAACTTCGTATAG
 TTATACGAAGTTATACGA 
 TCGTATAACTTCGTATAA
 CTATACGAAGTTATTACG  
 GATCGGAAGAGCGGTTCA
 GCAGGAATGCCGAGACCG
 CGGCATTCCTGCTGAACC
 CGTGTAGGGAAAGAGTGT
 #31 mers (5') # jellyfish uses 31mers
 CGTAATAACTTCGTATAGCATACATTATACG
 TCGTATAACTTCGTATAATGTATGCTATACG
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 CGGCATTCCTGCTGAACCGAGATCGGAAGAG

Frequent kmers

                                                        s_1   s_2    s_3
    1  CGTAATAACTTCGTATAGCATA|CGTATAACTTCGTATAATGTAT    15.2  4.59   2.829    # not assembled   circAdapter
    2  AATGTATGCTATACGAAGTTAT|ATAACTTCGTATAGCATACATT    5.56  19.02  0.525    # not assembled   circAdapter
    3  CGTTATAACGTATTACGTTATA|TGTATGCTATACGAAGTTATTA    5.2   4.56   .         
    4  TACATTATACGAAGTTATACGA|TATATGGTATGACGTTATAACG    3.39  1.87   0.365     
    5  ATATAACGTAATACGTTATAAC|TATAACGTATTACGTTATATGG    2.86  2.71   1.595     
    6  ACGTTATAACGTCATCCCATAT|ATACCATATAACGTAATACGTT    2.11  2.62   .
    7  TACGTTATAACGTCATACCATA|TCCATTTTACACTGCAAATGTA    1.98  2.04   1.749  
    8  CTTAAACGTCGAAATATCTCCA|TTACGTTATATGGGATGACGTT    0.87  1.16   1.136  
    9  GAGCGCTTAAACGTCGAAATAT|ATGAAATCTAGCCGTCATTTCC    0.73  1.02   1.442  
   10  ACGTAATACGTTATAACGTCAT|ATTTCCTTCCCGTTGGAGATAT    0.67  1.1    1.477  
   11  CTGCTGAACCGCTCTTCCGATC|GATCGGAAGAGCGGTTCAGCAG    0.61  3.92   0.045    # not assembled   5Adapter
   12  ACGGTCACAGCTTGTCTGTAAG|TGTAAAATGGAGTTCAACGAGC    0.31  1.73   0.798  
   13  ACGGTGGTGCATTACATTTGCA|GGCGAATGGCGCCTGATGCGGT    0.28  0.87   0.949    # not assembled
   14  CAGAAAAGCATCTTACGGATGG|TACATACCTCGCTCTGCTAATC    0.04  2.47   .        # not assembled


  • Location
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
 /fs/szattic-asmg4/Bees/Bombus_impatiens/new3kb/s_1_?_sequence.txt     : new3kb lib received 05/18/2010

 /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 new3kbr test reads

Tasks to figure out

  1. Erroneous reads/bases, which we need to correct or discard
  2. GC bias, so we can compute a-stats properly
  3. Redundancy in the long paired ends, which are lane 1 and lane 2.
  • Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
    • Some reads will begin in the circularization adaptor and thus will have only one usable read
    • Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
    • Most reads will have at least 36bp from each end before hitting the adaptor.
    • Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
  • A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
    • Thus their mate distance is +400bp rather than -3kb or -8kb.
    • It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
    • It is possible to remove these reads by removing
  1. mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
  2. mate pairs where one read begins with the adapter sequence.
    • We are working on this, up to now our paired assembly stages have been disappointing.
  • Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
    • There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
    • Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
    • But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
    • Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
    • You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.

Trimming

  • Quality trimming: trim all bases with fastq quality eq "B" (0)
  cat *fastq | ~/bin/fastq2clq.pl 
  • Adaptor trimming: Align all subsets to adaptors
 C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
 5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 id  len    gc%
 C   42     30.95  
 3   52     55.77  
 5   67     59.70  
 nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq  ...
 delta-filter -l 16 -q   ...
 cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
  • Adaptor/primers median positions in the reads
 Lib       adaptorPos     5'primerPos 3'primerPos 
 s_1       34-75          0-36        0-19
 s_2       2-40           0-36        0-19
  • Mate stats
 Lib        #mates    adaptor.bothMates   adaptor.noMate   adaptor.oneMate   adaptor.oneMate(filtered)   
 s_1        34944099  269156              9804247          24870696          6048164(17.3%)               
 s_2        32540640  91528               16181288         16267824          1061858( 3.2%)   
 total      67484739                                                         7110022(10.5%)
  • Filtered read clr stats
 Lib        #mates     #reads       min    q1     q2     q3     max        mean       n50        sum            
 s_1        6048164    12096328     64     80     95     115    124        96.56      101        1,168,064,632     
 s_2        1061858    2123716      64     80     96     115    124        96.59      101          205,122,226 
 total      7110022    14220044     64     80     95     115    124        96.57      101        1,373,186,858 
  • Filtered read GC% stats
 Lib        #reads       min    q1     q2     q3     max        mean       n50                    
 s_1        12096322     0.00   31.93  36.26  42.35  100.00     37.45      38        
 s_2        2123715      0.00   31.88  36.46  42.48  100.00     37.56      38         
  • Other frequent kmers
 26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces
 10mer : AAAAAAAAAA                               TTTTTTTTTT                 : ~32% of the traces    
 53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG                : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
  • Data processing pipeline
 s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFastq.amos
 s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos
  • Location:
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt                   # original fastq files
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_[1235678]_[12]_sequence.cor.txt    # corrected fastq files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/s_[12]/all/s_[12].64.rev.frg                    # adaptor free s_[12] mated reads
 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[1235678].frg                  # adaptor free & corrected s_[1235678] reads

D Kelly's trimming

 lib                       reads        min  q1   q2   q3   max  mean    n50  sum           #repeatReads
 s_1_1_sequence.cor.txt    5,436,814    30   71   90   115  124  89.70   100  487673256     1,071,779 (20%)
 s_1_2_sequence.cor.txt    5,864,225    30   77   92   110  124  93.27   98   546956864     1,208,767 (20%)
 s_2_1_sequence.cor.txt    1,053,858    30   80   96   118  124  97.16   102  102395785     222,290   (21%)
 s_2_2_sequence.cor.txt    1,041,846    30   78   93   110  124  93.51   99   97426679      208,029   (20%)
 s_3_1_sequence.cor.txt    33,668,302   30   105  124  124  124  111.04  124  3738530761              (12%)
 s_3_2_sequence.cor.txt    32,554,322   30   82   108  124  124  100.42  120  3269039697   
 s_5_1_sequence.cor.txt    33,579,744   30   109  124  124  124  111.65  124  3749192427   
 s_5_2_sequence.cor.txt    32,465,412   30   84   112  124  124  102.13  124  3315553171   
 s_6_1_sequence.cor.txt    33,535,725   30   109  124  124  124  111.60  124  3742602285   
 s_6_2_sequence.cor.txt    32,390,877   30   83   109  124  124  100.92  123  3268875471   
 s_7_1_sequence.cor.txt    33,674,235   30   112  124  124  124  112.19  124  3777917379   
 s_7_2_sequence.cor.txt    32,518,568   30   84   110  124  124  101.60  124  3303807321   
 s_8_1_sequence.cor.txt    11,777,821   30   113  124  124  124  112.30  124  1322658923   
 s_8_2_sequence.cor.txt    12,176,364   30   84   110  124  124  101.51  124  1236034348 

 s_9_1_sequence.cor.txt    21,273,603   30   87   124  124  124  104.52  124  2223545139     #new3kb added 05/18/2010
 s_9_2_sequence.cor.txt    20,769,027   30   86   124  124  124  103.68  124  2153436097     #new3kb added 05/18/2010
 s_1,2                     13,396,743   
 s_1..8                    301,738,113  .  .    .    .    .    .       .    31958664367  
 s_2..8 (64bp+)            275,763,594  .  .    .    .    .    .       .    30693543885  


  • Read/Mate counts
 lib                       reads        mates             range
 s_1                       11301039     5320163    
 s_2                       2095704      1034086           11301040..13396743      
 s_3                       66222624     31857304   
 s_5                       66045156     31820553         
 s_6                       65926602     31748713   
 s_7                       66192803     31891468   
 s_8                       23954185     11168258   
 s_9                       42042630     13245740

 s_1,2                     13,396,743   6,354,249
 s_1-8                     301,738,113  144,840,545
 s_1-9                     343,780,743  158,086,285
  • Files
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)

 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg

Assembly

  • SOAP version 1.04: /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/
  • CA version 6.0 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/

SOAPdenovo s_1..8 CBCB-corr (Illinois)

  • From: Kim Walden 04/30/2010
  • Input: s_1,2,3,5,6,7,8 reads
  • Error correction: yes; CBCB method
  • Kim's e-mail
 I ran it in a stepwise fashion, not adding any of the available options and only changing K=31.
 I then ran their more recent GapCloser algorithm, not the version that is part of the package when using option -F in "scaff".  
 I used their "all" assembly once, and I noticed that it automatically uses the -F option during scaffolding
 Also, I used your data from the ftp site without checking for redundancy in the 3kb and 8kb lanes; did Daniela do the same?
  • Summary
                         elements min_len  max_len   mean    n50        sum
 contigs(>100)           103858   100      85850     2227    7541       231387390      # prior to scaffolding; 
 scaffolds&contigs(>100) 12663    100      4648588   19282   1040109    244171008      # post-scaffolding and gap-filling
 scaffolds(span)         3736     252      4648588   64874   1047164    242369898      # same prior data minus contigs that didn't assemble into scaffolds.
 scaffolds(len; no N's)  3736     172      4585806   64034   1027043    239233801      # computed by CBCB
     
  • Locations
 ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/Bimp_Salz_corr_050310_NewGapCloser.fasta

 # backup
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/
 # online
 http://www.life.illinois.edu/robertson/private/
 Login: robertson
 password: dna

SOAPdenovo s_1..8 CBCB-corr (CBCB)

  • Stats:
                         elem     min      max       mean    n50        sum 
 contigs(all)            9943851  31       85850     60      45         594773555

 contigs(>100)           106213   100      85850     2177    7084       231246190
 scaffolds+contigs(>100) 12473    100      5933710   20428   1086614    254800874    # based on scafSeq file  
 scaffolds(span)         3664     236      5933710   69090   1115496    253148288    # based on scafSeq file
 scaffolds(len)          3664     134      5522831   63411   1059378    232338949 

 readsAssembled          208389968                                                   # based on readOnContig file
 readsFiltered           289681090                                                   # based on readOnContig & log file
 readsTotal              301738113                                                   # based on config file

 readsDeleted            441343                                                      # based on log file

  • Run gapClosing on assembly:
 stillOpen 14784
 closed    42379
  • Contig cvg: GC bias:
Bombus.contig10000.png

 cat asm.K31.contig100.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.709501493345893
 cat asm.K31.contig1K.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3   => -0.923151349747983
 #100+ ctgs
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 len                  106213     100    140    437    2420   85850      2177       7084       231246190      
 gc                   106213     0.00   33.13  38.00  43.44  100.00     38.74      39         4115066
 cvg                  106213     0.6    39.0   50.8   58.9   160.8      47.38      54         5032329
 #1K+ ctgs
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 len                  41965      1000   1860   3295   6338   85850      5110       7726       214421699      
 gc                   41965      19.18  34.31  37.96  41.79  59.25      38.19      38         1602711
 cvg                  41965      8.0    42.0   49.9   55.8   93.8       47.74      51         2003339
  • Run soap2 to map reads back to the assembly: "-l 32 -v 2"
 readsAligned            173063062
  • Locations
 mulberry:/scratch0/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/       # original
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/   # backup

SOAPdenovo s_1..9 CBCB-corr (CBCB) **

  • Uses the new 3Kbp library : s_9 (reads have been reversed)
  • Stats:
  .                    elem       min    q1     q2     q3     max        mean       n50        sum
  scf                  11178      100    111    135    390    5655980    23014      1205321    257251549
  ctg                  10856652   31     .      .      .      85850      57         43         627095607
  ctg100+              106741     100    .      .      .      85850**    2165       6939       231167576

  scfSeqContigs        67900      100    340    1788   4373   85819      3437       7586       233372689
  scfSeqContigsClosed  20651      100    128    488    7991   600179     11715      64083      241921308
  • Location
 mulberry:/scratch2/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/      # original
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/   # backup

CA s_1..8 corr OBT (CBCB)

Issues

  • Used outie s_1,2 lib ; needs to be redone

runCA spec

 doOverlapTrimming =        1
 ovlOverlapper =            ovl
 unitigger =                bog         # cmd buildUnitigs

 *BatchSize =               2000000
 *BlockSize =               2000000

 ovlMemory   =              8GB
 doExtendClearRanges =      0          # fast run

Gatekeeper

  • ~ 58X cvg
 gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
 LOAD                  STATS         
 
 7                     libInput      
 7                     libLoaded     
 0                     libErrors     
 
 301738113             frgInput      
 301738113             frgLoaded     
 25974519              frgErrors     
 
 144840545             lkgInput      
 124575590             lkgLoaded     
 20264955              lkgErrors     
 0                     lkgWarnings   
 
 275763594             numRandom     
 301738113             numNormal     
 
 
 GLOBAL                STATS         
 
 301738113             FRG           
 7                     LIB           
 
 LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength   clearLength    #egrep 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT'
 GLOBAL                274282535     27455578       247151634    30524746650  29445986650  
 LegacyUnmatedReads    0             0              0            0            0            
 
 s_1                   10409320      891719         9039810      996306280    994802849      #1134233(11%)
 s_2                   2054575       41129          1987728      197829588    197593453      #224648(11%)
 s_3                   59894531      6328093        53925476     2382786529   2127257586     #3037085(5%)  
 s_5                   60127274      5917882        54512448     2459696212   2209078913     #5%
 s_6                   59813485      6113117        54062058     2396359740   2152656620     #5%
 s_7                   60278536      5914267        54647258     2476326572   2237423989     #5%
 s_8                   21704814      2249371        18976856     2435572545   2347304056     #5%
 gatekeeper -dumplibraries -tabular asm.gkpStore
 UID    IID  Orientation  Mean  StdDev  NumFeatures  
 s_1    1    I            4000  400     1582823088   
 s_2    2    I            8000  800     1582823088   
 s_3    3    I            400   40      1582823088   
 s_5    4    I            400   40      1582823088   
 s_6    5    I            400   40      1582823088   
 s_7    6    I            400   40      1582823088   
 s_8    7    I            400   40      1582823088   
 s_1,2 lib : ids:1..13,396,744
 overlap jobs that need to be redone : 1036 out of 11476 < 10%
 ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'

Meryl

 meryl -Dh -s 0-mercounts/asm-C-ms22-cm0 
 Found 23686053415 mers.
 Found 265670791 distinct mers.
 Found 8682767 unique mers.
 Largest mercount is 21,987,616; 668 mers are too big for histogram.
 ...
 #freq   count   %
 1       8682767 0.0327  0.0004
 2       4288602 0.0488  0.0007
 3       3041005 0.0603  0.0011
 ...
 72*     3803015 0.7150  0.3229
 ...
 1048158 1       1.0000  0.9274
 fasta2tab.pl  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/0-mercounts/asm.nmers.ovl.fasta   | sort -n -r | head -5
 21987616        ATATAACGTAATACGTTATAAC
 20992872        CGTTATAACGTATTACGTTATA
 20510821        ACGTTATAACGTATTACGTTAT
 20484950        CCATATAACGTAATACGTTATA
 20452950        CATATAACGTAATACGTTATAA
 cat egrep.sh 
 egrep -c 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/seq_obt/*clean.seq > egrep.count
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2
 ...

OBT

  • job count
 n=((301738113/2000000)+0.5)=151
 OBT/OVL jobs = n*n/2 =151*76=11476
 ~ 20 min/OBT job =~ 0.33 hr/OBT job  => 0.33*11476/32=5 days
  • OBT trimming
 key             sum             count           mean
 CLR             31958664367     301738113       105.915239043733
 OBTINITIAL      30779687358     301610679       102.051052900551
 OBTMERGE        30052401719     301610679       99.6397137483318
 OBTCHIMERA      30052138325     301610679       99.638840456972
 LATEST          30052138325     301610679       99.638840456972
  • use OBTCHIMERA as CLR if want to redo the assembly !!!

Overlap

  • job count : same as OBT
  • Overlap counts
 cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o 
 Lib    0  5'  199542773  50  
 Lib    0  3'  199542773  50  
 Lib    1  5'  6682072    38  
 Lib    1  3'  6682072    37  
 Lib    2  5'  1306720    37  ???
 Lib    2  3'  1306720    42  ???
 Lib    3  5'  43868663   50  
 Lib    3  3'  43868663   50  
 Lib    4  5'  43939366   51  
 Lib    4  3'  43939366   50  
 Lib    5  5'  43620676   50  
 Lib    5  3'  43620676   50  
 Lib    6  5'  43903838   51  
 Lib    6  3'  43903838   50  
 Lib    7  5'  16221438   50  
 Lib    7  3'  16221438   50
  • 199542773 out of 257786946 (77%) usable reads had overlaps
  • 7988792 out of 11443537 (69%) usable s_1,2 reads had overlaps

Frg/Ovl correction

  • Frgcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
  • Ovlcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist

OverlapStore

  • Failures
    • overlapStore: too many files (max 10240) ; fix: recompile the code
    • chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report

Bog

  • Run time: 14 hr
  • default unitigger : does not seem to work on such big datasets
  • Utg len stats:
 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 63916892   47     85     150    150    22861      125.23     150        8004296170     

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
 
 # Fix: 
 # update layout (CA 6.1)
 mkdir utg
 
 # update layout (CA 6.1)
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -u  $F[0] -d layout > utg/$F[0].layout\n";'  | ~/bin/scheduler.pl
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "~/bin/updateUTGlayout.pl utg/$F[0].layout >  utg/$F[0].layout.update\n ";'              | ~/bin/scheduler.pl 
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg/$F[0].layout.update\n" ;'             | ~/bin/scheduler.pl  -c 1

 grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | awk '{print $1}' | sort -u | p 'print "touch $F[0].success\n";'                                               | ~/bin/scheduler.pl 
 # create success file
 touch 5-consensus/consensus.success
 
 # relaunch runCA

Cgw

  • Failures
 grep -B 5 "^Failure message:" runCA.log 
 

Cgw consensus

  • job count : 84
 wc -l 7-CGW/asm.partitionInfo
  • ctg count: 1698486
 cat 7-CGW/asm.partitionInfo | /nfshomes/dpuiu/bin/sum.pl -i 3 -t ctgs | pretty -o

Terminator

  • Failure:
    • buildPosMap : segFault :ids too long???
 Fix: split asm file(optional) and reassign new accession numbers
 cd split/
 ~/bin/splitAsm.pl < asm.asm => ... CCO.asm, SCF.asm
 cat CCO.asm CLK.asm SCF.asm SLK.asm | ~/bin/newAsmIds.pl | buildPosMap -o asm

Stats

 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  15009      74     4786   9789   20021  415318     16916.33   29133      253897199      
 ctg                  145036     64     93     116    1035   138630     1498.67    7726       217361459      
 deg                  1507280    54     98     118    146    1434       148.28     137        223502287      
 utg                  2786974    54     90     110    142    25644      197.04     277        549140918      
 singletons           61129865
 usableSeqs           257786946
  • Stats (100bp+)
                       elem     min      max       mean    n50        sum 
contigs(>100)          99429    100      138630    2148    7972       213595680
scaffolds(span)        14992    100      415318    16935   29133      253895676
scaffolds(len)         14991    100      303946    14499   24091      217357577 
  • Library stats (totals)
 lib    ctg       sur      deg       singl     total
 s_1    2251597   162800   3119665   3988095   9522157
 s_2    424821    41107    563656    891795    1921379
 s_3    28880189  1078891  13600909  12735637  56295626
 s_5    28983834  1077334  13561390  12962731  56585289
 s_6    28663974  1062939  13559494  12896855  56183262
 s_7    28784505  1058364  13713200  13020274  56576343
 s_8    10455533  383065   5229814   4634478   20702874
 all    128444453 4864500  63348128  61129865  257786930
  • Library stats (percentages)
 lib    ctg  sur  deg  singl  total
 s_1    24   2    33   42     100
 s_2    22   2    29   46     100
 s_3    51   2    24   23     100
 s_5    51   2    24   23     100
 s_6    51   2    24   23     100
 s_7    51   2    24   23     100
 s_8    51   2    25   22     100

Mate redundancy

  • Criteria:
 Given 2 mates  
   1: a1,a2 
   2: b1,b1
 We have the following 2 overlaps
   1: a1,b1  a2,b2    
   2: a1,b2  a2,b1  
 At least 75% of one read overlaps the other read
  • Location:
 cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
 wc -l s_1.mates asm.posmap.frgutg.s_1.*
 5320163 s_1.mates                                     # input mates (from FRG file; 2 reads/mate)
 1232734 asm.posmap.frgutg.s_1.mates                   # assembled mates (both reads assembled) 
  366262 asm.posmap.frgutg.s_1.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy
  312206 asm.posmap.frgutg.s_1.redundant_mate_pairs    # redundant assembled mate pairs
 
 wc -l s_2.mates asm.posmap.frgutg.s_2.* 
 1034086 s_2.mates                                     # input mates (from FRG file; 2 reads/mate)
  255905 asm.posmap.frgutg.s_2.mates                   # assembled mates (both reads assembled) 
  110570 asm.posmap.frgutg.s_2.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy
 1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs    # redundant assembled mate pairs
   11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters

Runtimes

 #     step            hours
 0     gatekeeper      5.36   
 1     initialTrim     2.07   
 2     meryl           9.90   
 9     overlap.sh      11.31  
 11    find            0.00   
 16    overlapStore    13.79  
 17    consolidate     5.12   
 18    merge-trimming  1.41   
 19    chimera(1)      10.04  
 20    overlap.sh      49.86  
 21    find            0.00   
 22    overlapStore    16.68  
 23    frgcorr.sh      5.57   
 24    cat-corrects    0.10   
 25    ovlcorr.sh      4.64   
 26    cat-erates      0.90   
 27    overlapStore    6.67   
 28    buildUnitigs    13.88  
 29    gatekeeper      1.31   
 30    consensus.sh    3.41   
 31    cgw(1)          65.00  
 43    gatekeeper      4.39   
 44    consensus.sh    4.70   
 46    terminator      6.26   
 47    asmOutputFasta  7.75   
 48    dumpSingletons  3.68   
 49    buildPosMap(1)  3.79

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/                             # original assembly
 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR   # dumped OBT clear ranges, 0/1 undel/del
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor                        # backup

CA s_1..8 corr OBT redo (CBCB)

Issues

  • Fixed outie s_1,2 lib : clear ranges will be different
  • s_3..8 : stay the same

runCA spec

 doOverlapTrimming =        0         !!! already done in the prev assembly
 saveOverlaps  =            1         !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted
 ovlOverlapper =            ovl       # same
 unitigger =                bog       # same   
 
 BatchSize =                2000000    # must be the same so overlaps are not redone
 BlockSize =                2000000    # must be the same
 ...
 doExtendClearRanges =      1          # initially set to 2; seems to take too long => reset to 1 

Input

  • Modified frg files:
    • dumped "CA s_1..8 corr OBT" assembly gatekeeper
    • undelete act:D fragments so the ids stay the save
    • reverse s_1,2 reads

Gatekeeper

  • Make sure the iids match

Overlap

  • Rerun only the overlap jobs for the s_1,2 reads

OverlapStore

  • To build faster in the future
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)

Frgcorr

  • Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)

Ovlcorr

  • The error rate on overlaps with error went down by ~ 1.28%

Bog

 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 60970504   47     85     150    150    93191      126.17     150        7692801690
  • Unitig stats
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep -c ^unitig_num_frags                                    # 60,970,523  : total unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -c ' 1$'                       # 59,776,572  : unitigs with only 1 read
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -v ' 1$' | getSummary.pl -i 2  # stats for unitig with 2+ reads

               elem         min    q1     q2     q3     max        mean       n50        sum            
 1read/utg     59,776,572                                                                59,776,572
 2+reads/utg   1,193,951    2      3      6      28     74715      165.84     4712       198,010,374

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures: 1 failed unitig
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED.  

 Fix: same as before

Cgw

  • Failure:
    • after 7-0-CGW/asm.ckp.4
 tail runCA.log
 ERROR:  Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)

 37980:Scaffold 6409:
 37981-  Idx    Chunk     (   Start, Variance)  (     End, Variance)      Len
 37982-    0: 60983787    (       0,        0)  (   31646,      823)    31646
 37983-    1:   69691     (   35389,      934)  (   31650,      837)     3739
 37984-    2: 59907028    (   35103,      942)  (   42035,     1122)     6932
 37985-    3: 60799298    (   41640,     1132)  (   64897,     1737)    23257
 37986-    4: 60975976    (   64924,     1784)  (   88439,     2395)    23515
 37987-    5: 60988396    (   88431,     2416)  (  122023,     3289)    33592
 37988-    6: 60988698    (  121551,     3324)  (  123589,     3377)     2038
 37989-    7: 59905266*   (  122331,  -163479)  (  121859,  -163491)      472
 37990-    8: 60988599    (  122275,   170138)  (  134564,   170457)    12289
 ./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.
 ./AS_REZ/GapFillREZ.c:8339:       "ERROR:  Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n",  #message
 Fix: modfy the code; comment assertion statemants
 cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c
 // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX);
 // assert (min_right_variance >= 0.0);
  • Preliminary stats
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
 #7-0-CGW/rez.i36.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1820       84     162    4583   82337  3161523    125884.07  667487     229109005       # ids from 0..13113
 ctg                  26453      64     962    3756   10485  255820     8759.59    21800      231717378      
 ctg200+              23343      200    1760   4772   11873  255820     9908.34    21845      231290344 
 scf.gaps             877        1      4      12     36     373        27.71      64         24298
 #7-2-CGW/rez.i01.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  4903       66     1283   6908   39466  1148626    47212.57   209756     231483215      
 ctg                  23480      64     1370   4397   11757  255820     9858.75    23476      231483360
 7-2-CGW/rez.i06.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3818       64     268    6813   61948  1198480    60723.65   257457     231842914      
 ctg                  24005      64     1165   4161   11439  255820     9659.47    23819      231875677 
 #7-2-CGW/rez.i11.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3733       64     268    6880   65584  1391034    62111.99   258823     231864045      
 ctg                  24004      64     1165   4161   11439  255820     9659.98    23819      231878256     
 #7-2-CGW/stone.i02.log
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1854       83     153    4039   52506  2983045    125375.58  860583     232446325      
 ctg                  23863      64     674    3833   11368  255820     9761.94    25892      232949194       
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549      
 ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147      
  • TigStore
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail
 maID            isPresent       isDeleted       ptID    svID    fileOffset      covStat
 60970540        1               0               0       3       58023924696     84
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1 
 maID            isPresent       isDeleted       ptID    svID    fileOffset
 61001100        1               0               0       32      16307420
  • Checkpoints
 # 7-0-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               5        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       12              # we should see more iterations if the scaffolding goes well; also ECR would take longer
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"           
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"  
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1

 # 7-2-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               9        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1 
  ...
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       10
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"                1
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"       1
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  ...
 CHECKPOINT_AFTER_PARTIAL_STONES         "ckp10-PS"                1
 CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS"  
 CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC"   
 CHECKPOINT_AFTER_RESOLVE_SURROGATES     "ckp13-RS"   
 CHECKPOINT_AFTER_OUTPUT                 "ckp14-FIN"
  • Long running time : implemented 2 cgw options -ic, -im ....
  • Running times
 CGW-0: START Wed Apr 28 22:30:12 ; END Mon May  3 03:34:57
 ECR-1: START Mon May  3 03:34:58 ; END Mon May  3 12:28:54        
 CGW-2: START Mon May  3 12:28:54 ; End Wed May 12 22:05:00
  • Library insert histograms
 # Example
 cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist
 cat 7-0-CGW/stat/scaffold_final.distlib_2.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_2.cgm.hist
 gnuplot
   set xrange [0:10]
   plot "scaffold_final.distlib_1.cgm.hist" w l, "scaffold_final.distlib_2.cgm.hist" w l, 


  • Failure:
    • after 7-2-CGW/asm.ckp.42 (after running about 2hrs)
 tail runCA.log
 * Screwed up scaffold 3790: Chunk 60994012 has bad mean
 * Scaffold 3790 iteration 1  1 bad CIs out of 33 LSE:2664.55 improvement:1
 * Reinserting chunk 60994012 in scaffold 3790
 * improvement = 0.0721104 LSE = 2664.55
 ----------------------------------------
 Failure message:
 scaffolder failed

ECR

  • Count scaffold gaps : total/closed
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap'   # 24635
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap'      # 2803

Terminator

  • Get id range
 grep -m 1 ^7 asm.posmap.utglen asm.posmap.ctglen asm.posmap.deglen asm.posmap.scflen 
 asm.posmap.utglen:7180000000000 124    
 asm.posmap.deglen:7180001635156 133     
 asm.posmap.ctglen:7180001635158 124
 asm.posmap.scflen:7180002817601 161    
 =>
 utg: 7180000000000..7180001635155
 ctg/deg:7180001635156..7180002817600
 scf:7180002817601..7180002819521

Stats

  • Assembly stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1921       65     148    3389   55043  3122263    149121.08  982222     286461602      
 ctg                  91934      64     100    119    180    255828     2621.59    25123      241013378      
 deg                  1090511    54     103    124    152    7068       139.41     135        152024633      
 utg                  1635156    54     97     118    141    94720      267.81     1636       437917658
  • Assembly stats(100bp+)
                         elem     min      max       mean    n50        sum 
 contigs(>100)           69270    100      255828    3451    25383      239118680 
 scaffolds(span)         1870     100      3122263   153185  982222     286457118 
 scaffolds(len)          1870     100      2990211   128880  870182     241007381

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/           # original assembly
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo      # backup

CA s_1, 2 partial ,3..8 corr OBT (CBCB) **

Issues

  • Use only s_2 lib reads which seem 8K apart : ~ 14% of the s_2 reads aligned by soap2 to SOAPdenovo Illinois assembly within 8+-2Kbp of one another
  • Alignments (very similar results)
 soap2-index  Bimp.fasta                 #  soap2-index -> 2bwt-builder
 bowtie-build Bimp.fasta  Bimp.fasta 
 soap2 -D Bimp.fasta.index -a s_2.1.fastq -b s_2.2.fastq -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2.mated.soap2 -2 s_2.single.soap2
 bowtie   Bimp.fasta       -1 s_2.1.fastq -2 s_2.2.fastq -l 32 -p 8 -n 2 -I 6000 -X 10000 >  s_2.mated.bowtie
 bowtie   Bimp.fasta         s_2_0.fastq -l 32 -p 8 -n 2                                  >  s_2_0.bowtie

 wc -l s_2.mated.*
 288880 s_2.mated.soap2
 288668 s_2.mated.bowtie

 intersect.pl s_2.mated.*
 288368

runCA spec

 doExtendClearRanges=0 

Input

  • Gatekeeper: same as prev assembly
  • Overlaps: filter out all overlaps for most of the s_2 reads

Bog

 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 61930057   47     85     150    150    113529     125.85     150        7793894671      # max went up from 93191->113529

Bog consensus

  • Failures
 grep FAILED 5-consensus/asm*err : 113 unitigs between 5.. 33547 frags
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
  • Fix:
 grep    FAILED 5-consensus/asm*err | p 'print "$F[2]\n";' | perl ./fixLayout.pl | scheduler.pl 
 grep -c FAILED 5-consensus/*err | grep -v ':0$' | p '/(.+).err/; print "touch $1.success\n";' | sort -u | scheduler.pl

Cgw

  • Preliminary stats
 #7-0-CGW/stone.i02.log
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549      
 ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147
 cat 7-0-CGW/cgw.out
 ...
 Processed 61,930,044 unitigs with 301,738,113 fragments

Terminator

  • Message counts
 CTP     91502
 SCF     1896
 CLK     3288923

cat CLK.asm | grep ^ori: | count.pl ori:O 979139 ori:A 869982 ori:N 857732 ori:I 582070

Stats

  • Assembly stats
  .                    elem       min    q1     q2     q3     max        mean       n50        sum
  scf                  1896       76     150    4044   67922  4021294**  151760     1017298**  287738041
  ctg                  92307      63     100    119    186    297795**   2612.99    24781**    241197400
  deg                  1086650    54     103    124    152    6822       139.41     135        151484589
  utg                  1574225    54     98     119    143    123354     274.33     2027       431859732
  • Assembly stats(100bp+)
 .                       elem     min      max       mean    n50        sum            
 contigs(>100)           69944    100      297795    3421    25048      239324761      
 scaffolds(span)         1869     100      4021294   153951  1017298    287735536      
 scaffolds(len)          1869     100      3429736   128833  860270     240790206
  • Library stats
 .                    badLong  badOuttie  badSame  bothChaff  bothDegen  bothSurrogate  diffScaffold  good   oneChaff  oneDegen  oneSurrogate  
 HWI-EAS385_0001:1    0.15     0.06       0.01     19         10.18      0              0.45          46.9   21.41     1.59      0.19          
 HWI-EAS385_0001:2    0.13     0.01       0        84.86      0.02       0              0.27          14.2   0         0.36      0.02          
 HWI-EAS385_0001:3    0.06     0.01       0        14.45      9.03       0.04           0.05          60.11  15.17     0.68      0.16          
 HWI-EAS385_0001:5    0.06     0.01       0        14.84      8.93       0.04           0.05          59.98  15.02     0.65      0.16          
 HWI-EAS385_0001:6    0.06     0.01       0        14.76      9.04       0.04           0.05          59.7   15.29     0.65      0.16          
 HWI-EAS385_0001:7    0.06     0.01       0        14.73      9.12       0.04           0.05          59.51  15.44     0.64      0.16          
 HWI-EAS385_0001:8    0.06     0.01       0        14.05      10.1       0.03           0.05          58.66  15.92     0.7       0.16          
  • Issues
    • Gap sizes vary between -297,562 bp and 772,064 bp. They should be no longer than ~ 8Kbp (max insert size)
 cat 9-terminator/split/SCF.asm | grep -A 3 "^{CTP" | grep ^mea: | sed 's/mea://' | sort -n | head
 -297562
 -216729
 ...
 162239    # scf7180002754932.314143..476382  : 26 s_1,3,5,6,7 mate pairs link the two sides of the gap
 772064    # scf7180002754479.867224..1639288 : 2  s_1 mate pairs link the two sides of the gap
 ~/bin/posmap2scaff.pl 9-terminator/asm.posmap.ctgscf | grep -v "^>" | sort -nk4 | tail
 7180002653372 EB 89    69161
 7180002699293 EB 78    92310
 7180002731234 BE 64936 162239
 7180002726839 BE 46924 772064
 grep ^mea: SCF.asm | sed 's/mea://' | getSummary.pl -i 0 -t scf.gap | pretty
 .          elem   min          q1      q2       q3       max         mean     n50     sum            
 scf.gap    91502  -297562      19      222      667      772064      -141     772064  -12910029
 scf.gap-   13788  -297562      -2166   -89      -12      0           -4284    0       -59071991
 scf.gap+   77699  0            84      325      775      772064      594      1194    46161963

Location

 mulberry:/scratch0/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/       # original (to be deleted)
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/   # backup

Other

SOAPdenovo on CA contigs

 cat asm.ctg.fasta | ~/bin/assignFastaIds.pl | awk '{print $1}' > asm.contig
 lenseq asm.contig > asm.contig.len
 echo Edge_num `wc -l asm.contig.len` `wc -l asm.contig.len` > asm.ContigIndex
 echo "index   length  reverseComplement" >> asm.ContigIndex
 lenseq asm.K31.contig | awk '{print $1,$2,1}' >> asm.ContigIndex
 SOAPdenovo map   -s SOAPdenovo.config -g asm > SOAPdenovo.log        # ok

 SOAPdenovo scaff -g asm -F >> SOAPdenovo.log                         # fails: a nucmer of files are missing

Assembly comparison

CA.contigs vs SOAPdenovo.contigs

  • Filter contigs >=100 bp
  • Alignment
 nucmer --mum -c 100  CA.fasta SOAPdenovo.fasta

 show-coords -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | p ' $F[19]="." unless($F[19]); print $F[19],"\n";' | count.pl | pretty -o
 [CONTAINS]     80413  
 [CONTAINED]    4646   
 [IDENTITY]     1960   
 [BEGIN]        4152    
 [END]          4034   
 .              157     # partial alignments
 # get "broken" contigs
 show-coords -H        /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t all
 show-coords -L 500 -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t 500+
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 all                  118        155    4630   9817   29042  144499     22498.97   54311      2654878     
 500+                 16         1844   8205   16551  28426  66880      20514.88   28426      328238  
  • Location
 mulberry:/scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs

CA.contigs vs SOAPdenovo.scaffolds

  • All ctgs & scf
  • Alignment
 #ref/qry all
    .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 SOAPdenovo.scf(span)    12473      100    113    144    498    5933710    20428      1086614    254800874    
 SOAPdenovo.scf(len)     12473      100    113    143    341    5522831    18759.84   1034925    233991535
 CA.ctg                  92307      63     100    119    186    297795     2612       24781      241197400
 #best hits
 nucmer --mum -c 40 CA.fasta SOAPdenovo.fasta
 delta-filter -1 40 -i 95 -o 50 -1 SOAPdenovo-vs-CA.40.delta >! SOAPdenovo-vs-CA.40.filter-1.delta

Genbank submission

 http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=20FF4536-24A4-4D8C-94FA-0755E1326E33
 Project ID: 61101
 Submitting Organization:              University of Illinois & University of Maryland
 Sequencing Center:                    Keck Center for Comparative and Functional Genomics, University of Illinois  ??? BCUI ???  Biotechnology Center, Univ. Illinois (BCUI)
 Consortium Name:
 Organism Name:                        Bombus impatiens
 Strain/isolate/breed:
 Locus Tag Prefix:                     BIMP       #3+ letters
 Source of DNA used for sequencing:    haploid males
 Sequencing Method:                    wgs       ???
 Sequencing Technology:                Illumina
 Estimated Genome Size:                250Mb     # the haploid genome size 
 Brief description of the importnace:  #2 pollinator in the US
 Genome Coverage                       127x
 Sequencing Technology                 Illumina IIx 
 Comments to the staff:                
   DNA 
   whole genome sequencing 
   single genome; 
   no annotation
   assembly name BIMP_1.0 
   assembly method: Celera Assembler 6.1 
   plan to update: ?
   expect to release 
   agp:
  • Sequin:
 Manuscript:         
 Authors:     Hugh Robertson, Matt Hudson, Kim Walden, Aarti Venkat, Tom Newman, Daniela Puiu, Tanja Magoc, David Kelley, Aleksey Zimin, Steven Salzberg, Gene Robinson

  • Best assembly (CA)
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/
 ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bombus_impatiens/Assembly/CA/
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg                  92307      63     100    119    186    297795     2613       24781      241197400      
 ctg.contam           82         1331   10485  19924  31977  112638     25648      32851      2103095   
 ctg.clean            92225      63     100    119    185    297795     2593       24628      239094305     
 ctgNs                76         75     536    3124   6834   44990      5326       10261      404781         
 scf                  1896       76     150    4044   67922  4021294    151761     1017298    287738041
 scf.ctg.contam       24         14315  45008  71878  141921 1116872    141140     164885     3387362        
 deg                  1086650    54     103    124    152    6822       139        135        151484589   
 deg.contam           1129       65     114    141    207    1035       198        211        223588         
 deg.clean            1085521    54     103    124    152    6822       139        135        151261001      
 deg.Ns               37         98     145    234    426    914        313        437        11574             
  • Contaminant list: ???
 /fs/szattic-asmg4/tmagoc/bees_bacteria/IDs

1st submission

  • Submit all contigs (should have deleted the singletons < 200bp)
  • Directories & files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/ 
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.0.*
 server:      ftp-trace.ncbi.nlm.nih.gov
 user:        cbcb_trc
 directory:   test/
 files names: BIMP_1.1.sqn , BIMP_1.1.agp
 number contigs:   92,231 
 number scaffolds: 1,895
  • NCBI email (Dec 23rd)
 there are 3 contigs that need to be trimmed:
 Contig name	Length	Span(s)	Apparent source
 7180002728824	11695	72..148	Escherichia coli
 7180002730077	25748	22703..25611	Escherichia coli
 7180002740828	30102	134..1688	Escherichia coli
 Since these are internal hits, consider whether the contig needs to split (if the E.coli was the 'joining' element).
  • NOTE: When I blastn these 3 contigs to nr (distant blast) , all 3 contigs had multiple good hits to other bacteria , so probably they are all contaminated and should be removed completely, not just trimmed

2nd submission

  • Jan 24th
  number of scaffolds: 1191
  number of contigs:   91524
  • Directories & files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.1.*

NCBI release