Bumblebee

From Cbcb
Jump to navigation Jump to search

Data

  • ~ 500Mbp genome (diploid); 250Mbp halpoid
  • Complete mitochondrion genomes:
 NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
 NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
 only 88% identity; no rearrangements, only snps, short indels
  • NCBI TaxId
 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=132113

Traces

  • 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
  Lane   Insert  ReadLen    #Mates          #Reads        Coverage   Orientation     Comments                         #RepeatsReads
  1      4K      124        34,944,099                    14X        outie           865,687(1.2%) reads qual==0      30,576,297 (44%)
  2      8K      124        32,540,640                    13X        outie                                            26,972,424 (41%)
 
  3      400     124        34,745,750                    .                                                           8,319,335 (12%)
  5      400     124        34,601,239                    .                                                           8,398,759 (12%)
  6      400     124        34,553,857                    . 
  7      400     124        34,682,612                    . 
  8      400     124        12,975,839                    . 
 
  9      3K      124        29,615,036                    .         outie            new3kb ; received 05/18/2010
   
  1-8                       219,044,036     438,088,072   108X
  1-9                       248,659,072     497,318,144

  3-8    400     124        151,559,297     303,118,594   75X        innie
  1,9    3-4K    124        64,559,135      129,118,270   32X        outie
  2      8K      124        32,540,640      65,081,280    13X        outie

 Two new 3kb libraries to come; ~ 70% are true pair ends

Adaptors

 >circ
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >circ.revcomp
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 >5
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG[ATATCG]
 >3
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[AGATCTCGGTGGTCGCCGTATCATTA][A+]
 #42mers
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAG
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGG
 #18mers (5')
 CGTAATAACTTCGTATAG
 TCGTATAACTTCGTATAA
 GATCGGAAGAGCGGTTCA
 CGGCATTCCTGCTGAACC
 #18mers (5',3')
 CGTAATAACTTCGTATAG
 TTATACGAAGTTATACGA 
 TCGTATAACTTCGTATAA
 CTATACGAAGTTATTACG  
 GATCGGAAGAGCGGTTCA
 GCAGGAATGCCGAGACCG
 CGGCATTCCTGCTGAACC
 CGTGTAGGGAAAGAGTGT
 #31 mers (5') # jellyfish uses 31mers
 CGTAATAACTTCGTATAGCATACATTATACG
 TCGTATAACTTCGTATAATGTATGCTATACG
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
 CGGCATTCCTGCTGAACCGAGATCGGAAGAG

Frequent kmers

                                                        s_1   s_2    s_3
    1  CGTAATAACTTCGTATAGCATA|CGTATAACTTCGTATAATGTAT    15.2  4.59   2.829    # not assembled   circAdapter
    2  AATGTATGCTATACGAAGTTAT|ATAACTTCGTATAGCATACATT    5.56  19.02  0.525    # not assembled   circAdapter
    3  CGTTATAACGTATTACGTTATA|TGTATGCTATACGAAGTTATTA    5.2   4.56   .         
    4  TACATTATACGAAGTTATACGA|TATATGGTATGACGTTATAACG    3.39  1.87   0.365     
    5  ATATAACGTAATACGTTATAAC|TATAACGTATTACGTTATATGG    2.86  2.71   1.595     
    6  ACGTTATAACGTCATCCCATAT|ATACCATATAACGTAATACGTT    2.11  2.62   .
    7  TACGTTATAACGTCATACCATA|TCCATTTTACACTGCAAATGTA    1.98  2.04   1.749  
    8  CTTAAACGTCGAAATATCTCCA|TTACGTTATATGGGATGACGTT    0.87  1.16   1.136  
    9  GAGCGCTTAAACGTCGAAATAT|ATGAAATCTAGCCGTCATTTCC    0.73  1.02   1.442  
   10  ACGTAATACGTTATAACGTCAT|ATTTCCTTCCCGTTGGAGATAT    0.67  1.1    1.477  
   11  CTGCTGAACCGCTCTTCCGATC|GATCGGAAGAGCGGTTCAGCAG    0.61  3.92   0.045    # not assembled   5Adapter
   12  ACGGTCACAGCTTGTCTGTAAG|TGTAAAATGGAGTTCAACGAGC    0.31  1.73   0.798  
   13  ACGGTGGTGCATTACATTTGCA|GGCGAATGGCGCCTGATGCGGT    0.28  0.87   0.949    # not assembled
   14  CAGAAAAGCATCTTACGGATGG|TACATACCTCGCTCTGCTAATC    0.04  2.47   .        # not assembled


  • Location
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
 /fs/szattic-asmg4/Bees/Bombus_impatiens/new3kb/s_1_?_sequence.txt     : new3kb lib received 05/18/2010

 /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 new3kbr test reads

Tasks to figure out

  1. Erroneous reads/bases, which we need to correct or discard
  2. GC bias, so we can compute a-stats properly
  3. Redundancy in the long paired ends, which are lane 1 and lane 2.
  • Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
    • Some reads will begin in the circularization adaptor and thus will have only one usable read
    • Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
    • Most reads will have at least 36bp from each end before hitting the adaptor.
    • Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
  • A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
    • Thus their mate distance is +400bp rather than -3kb or -8kb.
    • It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
    • It is possible to remove these reads by removing
  1. mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
  2. mate pairs where one read begins with the adapter sequence.
    • We are working on this, up to now our paired assembly stages have been disappointing.
  • Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
    • There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
    • Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
    • But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
    • Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
    • You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.

Trimming

  • Quality trimming: trim all bases with fastq quality eq "B" (0)
  cat *fastq | ~/bin/fastq2clq.pl 
  • Adaptor trimming: Align all subsets to adaptors
 C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
 5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 id  len    gc%
 C   42     30.95  
 3   52     55.77  
 5   67     59.70  
 nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq  ...
 delta-filter -l 16 -q   ...
 cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
  • Adaptor/primers median positions in the reads
 Lib       adaptorPos     5'primerPos 3'primerPos 
 s_1       34-75          0-36        0-19
 s_2       2-40           0-36        0-19
  • Mate stats
 Lib        #mates    adaptor.bothMates   adaptor.noMate   adaptor.oneMate   adaptor.oneMate(filtered)   
 s_1        34944099  269156              9804247          24870696          6048164(17.3%)               
 s_2        32540640  91528               16181288         16267824          1061858( 3.2%)   
 total      67484739                                                         7110022(10.5%)
  • Filtered read clr stats
 Lib        #mates     #reads       min    q1     q2     q3     max        mean       n50        sum            
 s_1        6048164    12096328     64     80     95     115    124        96.56      101        1,168,064,632     
 s_2        1061858    2123716      64     80     96     115    124        96.59      101          205,122,226 
 total      7110022    14220044     64     80     95     115    124        96.57      101        1,373,186,858 
  • Filtered read GC% stats
 Lib        #reads       min    q1     q2     q3     max        mean       n50                    
 s_1        12096322     0.00   31.93  36.26  42.35  100.00     37.45      38        
 s_2        2123715      0.00   31.88  36.46  42.48  100.00     37.56      38         
  • Other frequent kmers
 26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces
 10mer : AAAAAAAAAA                               TTTTTTTTTT                 : ~32% of the traces    
 53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG                : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
  • Data processing pipeline
 s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFastq.amos
 s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos
  • Location:
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt                   # original fastq files
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_[1235678]_[12]_sequence.cor.txt    # corrected fastq files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/s_[12]/all/s_[12].64.rev.frg                    # adaptor free s_[12] mated reads
 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[1235678].frg                  # adaptor free & corrected s_[1235678] reads

D Kelly's trimming

 lib                       reads        min  q1   q2   q3   max  mean    n50  sum           #repeatReads
 s_1_1_sequence.cor.txt    5,436,814    30   71   90   115  124  89.70   100  487673256     1,071,779 (20%)
 s_1_2_sequence.cor.txt    5,864,225    30   77   92   110  124  93.27   98   546956864     1,208,767 (20%)
 s_2_1_sequence.cor.txt    1,053,858    30   80   96   118  124  97.16   102  102395785     222,290   (21%)
 s_2_2_sequence.cor.txt    1,041,846    30   78   93   110  124  93.51   99   97426679      208,029   (20%)
 s_3_1_sequence.cor.txt    33,668,302   30   105  124  124  124  111.04  124  3738530761              (12%)
 s_3_2_sequence.cor.txt    32,554,322   30   82   108  124  124  100.42  120  3269039697   
 s_5_1_sequence.cor.txt    33,579,744   30   109  124  124  124  111.65  124  3749192427   
 s_5_2_sequence.cor.txt    32,465,412   30   84   112  124  124  102.13  124  3315553171   
 s_6_1_sequence.cor.txt    33,535,725   30   109  124  124  124  111.60  124  3742602285   
 s_6_2_sequence.cor.txt    32,390,877   30   83   109  124  124  100.92  123  3268875471   
 s_7_1_sequence.cor.txt    33,674,235   30   112  124  124  124  112.19  124  3777917379   
 s_7_2_sequence.cor.txt    32,518,568   30   84   110  124  124  101.60  124  3303807321   
 s_8_1_sequence.cor.txt    11,777,821   30   113  124  124  124  112.30  124  1322658923   
 s_8_2_sequence.cor.txt    12,176,364   30   84   110  124  124  101.51  124  1236034348 

 s_9_1_sequence.cor.txt    21,273,603   30   87   124  124  124  104.52  124  2223545139     #new3kb added 05/18/2010
 s_9_2_sequence.cor.txt    20,769,027   30   86   124  124  124  103.68  124  2153436097     #new3kb added 05/18/2010
 s_1,2                     13,396,743   
 s_1..8                    301,738,113  .  .    .    .    .    .       .    31958664367  
 s_2..8 (64bp+)            275,763,594  .  .    .    .    .    .       .    30693543885  


  • Read/Mate counts
 lib                       reads        mates             range
 s_1                       11301039     5320163    
 s_2                       2095704      1034086           11301040..13396743      
 s_3                       66222624     31857304   
 s_5                       66045156     31820553         
 s_6                       65926602     31748713   
 s_7                       66192803     31891468   
 s_8                       23954185     11168258   
 s_9                       42042630     13245740

 s_1,2                     13,396,743   6,354,249
 s_1-8                     301,738,113  144,840,545
 s_1-9                     343,780,743  158,086,285
  • Files
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)

 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg

Assembly

  • SOAP version 1.04: /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/
  • CA version 6.0 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/

SOAPdenovo s_1..8 CBCB-corr (Illinois)

  • From: Kim Walden 04/30/2010
  • Input: s_1,2,3,5,6,7,8 reads
  • Error correction: yes; CBCB method
  • Kim's e-mail
 I ran it in a stepwise fashion, not adding any of the available options and only changing K=31.
 I then ran their more recent GapCloser algorithm, not the version that is part of the package when using option -F in "scaff".  
 I used their "all" assembly once, and I noticed that it automatically uses the -F option during scaffolding
 Also, I used your data from the ftp site without checking for redundancy in the 3kb and 8kb lanes; did Daniela do the same?
  • Summary
                         elements min_len  max_len   mean    n50        sum
 contigs(>100)           103858   100      85850     2227    7541       231387390      # prior to scaffolding; 
 scaffolds&contigs(>100) 12663    100      4648588   19282   1040109    244171008      # post-scaffolding and gap-filling
 scaffolds(span)         3736     252      4648588   64874   1047164    242369898      # same prior data minus contigs that didn't assemble into scaffolds.
 scaffolds(len; no N's)  3736     172      4585806   64034   1027043    239233801      # computed by CBCB
     
  • Locations
 ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/Bimp_Salz_corr_050310_NewGapCloser.fasta

 # backup
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC.redo/
 # online
 http://www.life.illinois.edu/robertson/private/
 Login: robertson
 password: dna

SOAPdenovo s_1..8 CBCB-corr (CBCB)

  • Stats:
                         elem     min      max       mean    n50        sum 
 contigs(all)            9943851  31       85850     60      45         594773555

 contigs(>100)           106213   100      85850     2177    7084       231246190
 scaffolds+contigs(>100) 12473    100      5933710   20428   1086614    254800874    # based on scafSeq file  
 scaffolds(span)         3664     236      5933710   69090   1115496    253148288    # based on scafSeq file
 scaffolds(len)          3664     134      5522831   63411   1059378    232338949 

 readsAssembled          208389968                                                   # based on readOnContig file
 readsFiltered           289681090                                                   # based on readOnContig & log file
 readsTotal              301738113                                                   # based on config file

 readsDeleted            441343                                                      # based on log file

  • Run gapClosing on assembly:
 stillOpen 14784
 closed    42379
  • Contig cvg: GC bias:
Bombus.contig10000.png

 cat asm.K31.contig100.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3  => -0.709501493345893
 cat asm.K31.contig1K.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3   => -0.923151349747983
 #100+ ctgs
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 len                  106213     100    140    437    2420   85850      2177       7084       231246190      
 gc                   106213     0.00   33.13  38.00  43.44  100.00     38.74      39         4115066
 cvg                  106213     0.6    39.0   50.8   58.9   160.8      47.38      54         5032329
 #1K+ ctgs
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 len                  41965      1000   1860   3295   6338   85850      5110       7726       214421699      
 gc                   41965      19.18  34.31  37.96  41.79  59.25      38.19      38         1602711
 cvg                  41965      8.0    42.0   49.9   55.8   93.8       47.74      51         2003339
  • Run soap2 to map reads back to the assembly: "-l 32 -v 2"
 readsAligned            173063062
  • Locations
 mulberry:/scratch0/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/       # original
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor/   # backup

SOAPdenovo s_1..9 CBCB-corr (CBCB) **

  • Uses the new 3Kbp library : s_9 (reads have been reversed)
  • Stats:
  .                    elem       min    q1     q2     q3     max        mean       n50        sum
  scf                  11178      100    111    135    390    5655980    23014      1205321    257251549
  ctg                  10856652   31     .      .      .      85850      57         43         627095607
  ctg100+              106741     100    .      .      .      85850**    2165       6939       231167576

  scfSeqContigs        67900      100    340    1788   4373   85819      3437       7586       233372689
  scfSeqContigsClosed  20651      100    128    488    7991   600179     11715      64083      241921308
  • Location
 mulberry:/scratch2/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/      # original
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/   # backup

CA s_1..8 corr OBT (CBCB)

Issues

  • Used outie s_1,2 lib ; needs to be redone

runCA spec

 doOverlapTrimming =        1
 ovlOverlapper =            ovl
 unitigger =                bog         # cmd buildUnitigs

 *BatchSize =               2000000
 *BlockSize =               2000000

 ovlMemory   =              8GB
 doExtendClearRanges =      0          # fast run

Gatekeeper

  • ~ 58X cvg
 gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
 LOAD                  STATS         
 
 7                     libInput      
 7                     libLoaded     
 0                     libErrors     
 
 301738113             frgInput      
 301738113             frgLoaded     
 25974519              frgErrors     
 
 144840545             lkgInput      
 124575590             lkgLoaded     
 20264955              lkgErrors     
 0                     lkgWarnings   
 
 275763594             numRandom     
 301738113             numNormal     
 
 
 GLOBAL                STATS         
 
 301738113             FRG           
 7                     LIB           
 
 LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength   clearLength    #egrep 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT'
 GLOBAL                274282535     27455578       247151634    30524746650  29445986650  
 LegacyUnmatedReads    0             0              0            0            0            
 
 s_1                   10409320      891719         9039810      996306280    994802849      #1134233(11%)
 s_2                   2054575       41129          1987728      197829588    197593453      #224648(11%)
 s_3                   59894531      6328093        53925476     2382786529   2127257586     #3037085(5%)  
 s_5                   60127274      5917882        54512448     2459696212   2209078913     #5%
 s_6                   59813485      6113117        54062058     2396359740   2152656620     #5%
 s_7                   60278536      5914267        54647258     2476326572   2237423989     #5%
 s_8                   21704814      2249371        18976856     2435572545   2347304056     #5%
 gatekeeper -dumplibraries -tabular asm.gkpStore
 UID    IID  Orientation  Mean  StdDev  NumFeatures  
 s_1    1    I            4000  400     1582823088   
 s_2    2    I            8000  800     1582823088   
 s_3    3    I            400   40      1582823088   
 s_5    4    I            400   40      1582823088   
 s_6    5    I            400   40      1582823088   
 s_7    6    I            400   40      1582823088   
 s_8    7    I            400   40      1582823088   
 s_1,2 lib : ids:1..13,396,744
 overlap jobs that need to be redone : 1036 out of 11476 < 10%
 ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'

Meryl

 meryl -Dh -s 0-mercounts/asm-C-ms22-cm0 
 Found 23686053415 mers.
 Found 265670791 distinct mers.
 Found 8682767 unique mers.
 Largest mercount is 21,987,616; 668 mers are too big for histogram.
 ...
 #freq   count   %
 1       8682767 0.0327  0.0004
 2       4288602 0.0488  0.0007
 3       3041005 0.0603  0.0011
 ...
 72*     3803015 0.7150  0.3229
 ...
 1048158 1       1.0000  0.9274
 fasta2tab.pl  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/0-mercounts/asm.nmers.ovl.fasta   | sort -n -r | head -5
 21987616        ATATAACGTAATACGTTATAAC
 20992872        CGTTATAACGTATTACGTTATA
 20510821        ACGTTATAACGTATTACGTTAT
 20484950        CCATATAACGTAATACGTTATA
 20452950        CATATAACGTAATACGTTATAA
 cat egrep.sh 
 egrep -c 'ATATAACGTAATACGTTATAAC|GTTATAACGTATTACGTTATAT' /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/seq_obt/*clean.seq > egrep.count
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2
 ...

OBT

  • job count
 n=((301738113/2000000)+0.5)=151
 OBT/OVL jobs = n*n/2 =151*76=11476
 ~ 20 min/OBT job =~ 0.33 hr/OBT job  => 0.33*11476/32=5 days
  • OBT trimming
 key             sum             count           mean
 CLR             31958664367     301738113       105.915239043733
 OBTINITIAL      30779687358     301610679       102.051052900551
 OBTMERGE        30052401719     301610679       99.6397137483318
 OBTCHIMERA      30052138325     301610679       99.638840456972
 LATEST          30052138325     301610679       99.638840456972
  • use OBTCHIMERA as CLR if want to redo the assembly !!!

Overlap

  • job count : same as OBT
  • Overlap counts
 cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o 
 Lib    0  5'  199542773  50  
 Lib    0  3'  199542773  50  
 Lib    1  5'  6682072    38  
 Lib    1  3'  6682072    37  
 Lib    2  5'  1306720    37  ???
 Lib    2  3'  1306720    42  ???
 Lib    3  5'  43868663   50  
 Lib    3  3'  43868663   50  
 Lib    4  5'  43939366   51  
 Lib    4  3'  43939366   50  
 Lib    5  5'  43620676   50  
 Lib    5  3'  43620676   50  
 Lib    6  5'  43903838   51  
 Lib    6  3'  43903838   50  
 Lib    7  5'  16221438   50  
 Lib    7  3'  16221438   50
  • 199542773 out of 257786946 (77%) usable reads had overlaps
  • 7988792 out of 11443537 (69%) usable s_1,2 reads had overlaps

Frg/Ovl correction

  • Frgcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
  • Ovlcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist

OverlapStore

  • Failures
    • overlapStore: too many files (max 10240) ; fix: recompile the code
    • chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report

Bog

  • Run time: 14 hr
  • default unitigger : does not seem to work on such big datasets
  • Utg len stats:
 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 63916892   47     85     150    150    22861      125.23     150        8004296170     

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
 
 # Fix: 
 # update layout (CA 6.1)
 mkdir utg
 
 # update layout (CA 6.1)
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -u  $F[0] -d layout > utg/$F[0].layout\n";'  | ~/bin/scheduler.pl
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "~/bin/updateUTGlayout.pl utg/$F[0].layout >  utg/$F[0].layout.update\n ";'              | ~/bin/scheduler.pl 
 grep FAILED 5-consensus/asm*err | awk '{print $3}' | sort -u -n | p 'print "tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg/$F[0].layout.update\n" ;'             | ~/bin/scheduler.pl  -c 1

 grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | awk '{print $1}' | sort -u | p 'print "touch $F[0].success\n";'                                               | ~/bin/scheduler.pl 
 # create success file
 touch 5-consensus/consensus.success
 
 # relaunch runCA

Cgw

  • Failures
 grep -B 5 "^Failure message:" runCA.log 
 

Cgw consensus

  • job count : 84
 wc -l 7-CGW/asm.partitionInfo
  • ctg count: 1698486
 cat 7-CGW/asm.partitionInfo | /nfshomes/dpuiu/bin/sum.pl -i 3 -t ctgs | pretty -o

Terminator

  • Failure:
    • buildPosMap : segFault :ids too long???
 Fix: split asm file(optional) and reassign new accession numbers
 cd split/
 ~/bin/splitAsm.pl < asm.asm => ... CCO.asm, SCF.asm
 cat CCO.asm CLK.asm SCF.asm SLK.asm | ~/bin/newAsmIds.pl | buildPosMap -o asm

Stats

 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  15009      74     4786   9789   20021  415318     16916.33   29133      253897199      
 ctg                  145036     64     93     116    1035   138630     1498.67    7726       217361459      
 deg                  1507280    54     98     118    146    1434       148.28     137        223502287      
 utg                  2786974    54     90     110    142    25644      197.04     277        549140918      
 singletons           61129865
 usableSeqs           257786946
  • Stats (100bp+)
                       elem     min      max       mean    n50        sum 
contigs(>100)          99429    100      138630    2148    7972       213595680
scaffolds(span)        14992    100      415318    16935   29133      253895676
scaffolds(len)         14991    100      303946    14499   24091      217357577 
  • Library stats (totals)
 lib    ctg       sur      deg       singl     total
 s_1    2251597   162800   3119665   3988095   9522157
 s_2    424821    41107    563656    891795    1921379
 s_3    28880189  1078891  13600909  12735637  56295626
 s_5    28983834  1077334  13561390  12962731  56585289
 s_6    28663974  1062939  13559494  12896855  56183262
 s_7    28784505  1058364  13713200  13020274  56576343
 s_8    10455533  383065   5229814   4634478   20702874
 all    128444453 4864500  63348128  61129865  257786930
  • Library stats (percentages)
 lib    ctg  sur  deg  singl  total
 s_1    24   2    33   42     100
 s_2    22   2    29   46     100
 s_3    51   2    24   23     100
 s_5    51   2    24   23     100
 s_6    51   2    24   23     100
 s_7    51   2    24   23     100
 s_8    51   2    25   22     100

Mate redundancy

  • Criteria:
 Given 2 mates  
   1: a1,a2 
   2: b1,b1
 We have the following 2 overlaps
   1: a1,b1  a2,b2    
   2: a1,b2  a2,b1  
 At least 75% of one read overlaps the other read
  • Location:
 cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
 wc -l s_1.mates asm.posmap.frgutg.s_1.*
 5320163 s_1.mates                                     # input mates (from FRG file; 2 reads/mate)
 1232734 asm.posmap.frgutg.s_1.mates                   # assembled mates (both reads assembled) 
  366262 asm.posmap.frgutg.s_1.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy
  312206 asm.posmap.frgutg.s_1.redundant_mate_pairs    # redundant assembled mate pairs
 
 wc -l s_2.mates asm.posmap.frgutg.s_2.* 
 1034086 s_2.mates                                     # input mates (from FRG file; 2 reads/mate)
  255905 asm.posmap.frgutg.s_2.mates                   # assembled mates (both reads assembled) 
  110570 asm.posmap.frgutg.s_2.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy
 1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs    # redundant assembled mate pairs
   11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters

Runtimes

 #     step            hours
 0     gatekeeper      5.36   
 1     initialTrim     2.07   
 2     meryl           9.90   
 9     overlap.sh      11.31  
 11    find            0.00   
 16    overlapStore    13.79  
 17    consolidate     5.12   
 18    merge-trimming  1.41   
 19    chimera(1)      10.04  
 20    overlap.sh      49.86  
 21    find            0.00   
 22    overlapStore    16.68  
 23    frgcorr.sh      5.57   
 24    cat-corrects    0.10   
 25    ovlcorr.sh      4.64   
 26    cat-erates      0.90   
 27    overlapStore    6.67   
 28    buildUnitigs    13.88  
 29    gatekeeper      1.31   
 30    consensus.sh    3.41   
 31    cgw(1)          65.00  
 43    gatekeeper      4.39   
 44    consensus.sh    4.70   
 46    terminator      6.26   
 47    asmOutputFasta  7.75   
 48    dumpSingletons  3.68   
 49    buildPosMap(1)  3.79

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/                             # original assembly
 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR   # dumped OBT clear ranges, 0/1 undel/del
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor                        # backup

CA s_1..8 corr OBT redo (CBCB)

Issues

  • Fixed outie s_1,2 lib : clear ranges will be different
  • s_3..8 : stay the same

runCA spec

 doOverlapTrimming =        0         !!! already done in the prev assembly
 saveOverlaps  =            1         !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted
 ovlOverlapper =            ovl       # same
 unitigger =                bog       # same   
 
 BatchSize =                2000000    # must be the same so overlaps are not redone
 BlockSize =                2000000    # must be the same
 ...
 doExtendClearRanges =      1          # initially set to 2; seems to take too long => reset to 1 

Input

  • Modified frg files:
    • dumped "CA s_1..8 corr OBT" assembly gatekeeper
    • undelete act:D fragments so the ids stay the save
    • reverse s_1,2 reads

Gatekeeper

  • Make sure the iids match

Overlap

  • Rerun only the overlap jobs for the s_1,2 reads

OverlapStore

  • To build faster in the future
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)

Frgcorr

  • Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)

Ovlcorr

  • The error rate on overlaps with error went down by ~ 1.28%

Bog

 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 60970504   47     85     150    150    93191      126.17     150        7692801690
  • Unitig stats
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep -c ^unitig_num_frags                                    # 60,970,523  : total unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -c ' 1$'                       # 59,776,572  : unitigs with only 1 read
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep -v ' 1$' | getSummary.pl -i 2  # stats for unitig with 2+ reads

               elem         min    q1     q2     q3     max        mean       n50        sum            
 1read/utg     59,776,572                                                                59,776,572
 2+reads/utg   1,193,951    2      3      6      28     74715      165.84     4712       198,010,374

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures: 1 failed unitig
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED.  

 Fix: same as before

Cgw

  • Failure:
    • after 7-0-CGW/asm.ckp.4
 tail runCA.log
 ERROR:  Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)

 37980:Scaffold 6409:
 37981-  Idx    Chunk     (   Start, Variance)  (     End, Variance)      Len
 37982-    0: 60983787    (       0,        0)  (   31646,      823)    31646
 37983-    1:   69691     (   35389,      934)  (   31650,      837)     3739
 37984-    2: 59907028    (   35103,      942)  (   42035,     1122)     6932
 37985-    3: 60799298    (   41640,     1132)  (   64897,     1737)    23257
 37986-    4: 60975976    (   64924,     1784)  (   88439,     2395)    23515
 37987-    5: 60988396    (   88431,     2416)  (  122023,     3289)    33592
 37988-    6: 60988698    (  121551,     3324)  (  123589,     3377)     2038
 37989-    7: 59905266*   (  122331,  -163479)  (  121859,  -163491)      472
 37990-    8: 60988599    (  122275,   170138)  (  134564,   170457)    12289
 ./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.
 ./AS_REZ/GapFillREZ.c:8339:       "ERROR:  Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n",  #message
 Fix: modfy the code; comment assertion statemants
 cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c
 // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX);
 // assert (min_right_variance >= 0.0);
  • Preliminary stats
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
 #7-0-CGW/rez.i36.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1820       84     162    4583   82337  3161523    125884.07  667487     229109005       # ids from 0..13113
 ctg                  26453      64     962    3756   10485  255820     8759.59    21800      231717378      
 ctg200+              23343      200    1760   4772   11873  255820     9908.34    21845      231290344 
 scf.gaps             877        1      4      12     36     373        27.71      64         24298
 #7-2-CGW/rez.i01.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  4903       66     1283   6908   39466  1148626    47212.57   209756     231483215      
 ctg                  23480      64     1370   4397   11757  255820     9858.75    23476      231483360
 7-2-CGW/rez.i06.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3818       64     268    6813   61948  1198480    60723.65   257457     231842914      
 ctg                  24005      64     1165   4161   11439  255820     9659.47    23819      231875677 
 #7-2-CGW/rez.i11.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3733       64     268    6880   65584  1391034    62111.99   258823     231864045      
 ctg                  24004      64     1165   4161   11439  255820     9659.98    23819      231878256     
 #7-2-CGW/stone.i02.log
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1854       83     153    4039   52506  2983045    125375.58  860583     232446325      
 ctg                  23863      64     674    3833   11368  255820     9761.94    25892      232949194       
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549      
 ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147      
  • TigStore
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail
 maID            isPresent       isDeleted       ptID    svID    fileOffset      covStat
 60970540        1               0               0       3       58023924696     84
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1 
 maID            isPresent       isDeleted       ptID    svID    fileOffset
 61001100        1               0               0       32      16307420
  • Checkpoints
 # 7-0-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               5        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       12              # we should see more iterations if the scaffolding goes well; also ECR would take longer
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"           
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"  
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1

 # 7-2-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               9        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1 
  ...
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       10
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"                1
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"       1
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1
  ...
 CHECKPOINT_AFTER_PARTIAL_STONES         "ckp10-PS"                1
 CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS"  
 CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC"   
 CHECKPOINT_AFTER_RESOLVE_SURROGATES     "ckp13-RS"   
 CHECKPOINT_AFTER_OUTPUT                 "ckp14-FIN"
  • Long running time : implemented 2 cgw options -ic, -im ....
  • Running times
 CGW-0: START Wed Apr 28 22:30:12 ; END Mon May  3 03:34:57
 ECR-1: START Mon May  3 03:34:58 ; END Mon May  3 12:28:54        
 CGW-2: START Mon May  3 12:28:54 ; End Wed May 12 22:05:00
  • Library insert histograms
 # Example
 cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist
 cat 7-0-CGW/stat/scaffold_final.distlib_2.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_2.cgm.hist
 gnuplot
   set xrange [0:10]
   plot "scaffold_final.distlib_1.cgm.hist" w l, "scaffold_final.distlib_2.cgm.hist" w l, 


  • Failure:
    • after 7-2-CGW/asm.ckp.42 (after running about 2hrs)
 tail runCA.log
 * Screwed up scaffold 3790: Chunk 60994012 has bad mean
 * Scaffold 3790 iteration 1  1 bad CIs out of 33 LSE:2664.55 improvement:1
 * Reinserting chunk 60994012 in scaffold 3790
 * improvement = 0.0721104 LSE = 2664.55
 ----------------------------------------
 Failure message:
 scaffolder failed

ECR

  • Count scaffold gaps : total/closed
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap'   # 24635
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap'      # 2803

Terminator

  • Get id range
 grep -m 1 ^7 asm.posmap.utglen asm.posmap.ctglen asm.posmap.deglen asm.posmap.scflen 
 asm.posmap.utglen:7180000000000 124    
 asm.posmap.deglen:7180001635156 133     
 asm.posmap.ctglen:7180001635158 124
 asm.posmap.scflen:7180002817601 161    
 =>
 utg: 7180000000000..7180001635155
 ctg/deg:7180001635156..7180002817600
 scf:7180002817601..7180002819521

Stats

  • Assembly stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1921       65     148    3389   55043  3122263    149121.08  982222     286461602      
 ctg                  91934      64     100    119    180    255828     2621.59    25123      241013378      
 deg                  1090511    54     103    124    152    7068       139.41     135        152024633      
 utg                  1635156    54     97     118    141    94720      267.81     1636       437917658
  • Assembly stats(100bp+)
                         elem     min      max       mean    n50        sum 
 contigs(>100)           69270    100      255828    3451    25383      239118680 
 scaffolds(span)         1870     100      3122263   153185  982222     286457118 
 scaffolds(len)          1870     100      2990211   128880  870182     241007381

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/           # original assembly
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo      # backup

CA s_1, 2 partial ,3..8 corr OBT (CBCB) **

Issues

  • Use only s_2 lib reads which seem 8K apart : ~ 14% of the s_2 reads aligned by soap2 to SOAPdenovo Illinois assembly within 8+-2Kbp of one another
  • Alignments (very similar results)
 soap2-index  Bimp.fasta                 #  soap2-index -> 2bwt-builder
 bowtie-build Bimp.fasta  Bimp.fasta 
 soap2 -D Bimp.fasta.index -a s_2.1.fastq -b s_2.2.fastq -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2.mated.soap2 -2 s_2.single.soap2
 bowtie   Bimp.fasta       -1 s_2.1.fastq -2 s_2.2.fastq -l 32 -p 8 -n 2 -I 6000 -X 10000 >  s_2.mated.bowtie
 bowtie   Bimp.fasta         s_2_0.fastq -l 32 -p 8 -n 2                                  >  s_2_0.bowtie

 wc -l s_2.mated.*
 288880 s_2.mated.soap2
 288668 s_2.mated.bowtie

 intersect.pl s_2.mated.*
 288368

runCA spec

 doExtendClearRanges=0 

Input

  • Gatekeeper: same as prev assembly
  • Overlaps: filter out all overlaps for most of the s_2 reads

Bog

 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 61930057   47     85     150    150    113529     125.85     150        7793894671      # max went up from 93191->113529

Bog consensus

  • Failures
 grep FAILED 5-consensus/asm*err : 113 unitigs between 5.. 33547 frags
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.
  • Fix:
 grep    FAILED 5-consensus/asm*err | p 'print "$F[2]\n";' | perl ./fixLayout.pl | scheduler.pl 
 grep -c FAILED 5-consensus/*err | grep -v ':0$' | p '/(.+).err/; print "touch $1.success\n";' | sort -u | scheduler.pl

Cgw

  • Preliminary stats
 #7-0-CGW/stone.i02.log
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1847       84     152    4440   62397  3337984    125470.25  835387     231743549      
 ctg                  26088      64     429    3213   10218  297786     8941.40    25521      233263147
 cat 7-0-CGW/cgw.out
 ...
 Processed 61,930,044 unitigs with 301,738,113 fragments

Terminator

  • Message counts
 CTP     91502
 SCF     1896
 CLK     3288923

cat CLK.asm | grep ^ori: | count.pl ori:O 979139 ori:A 869982 ori:N 857732 ori:I 582070

Stats

  • Assembly stats
  .                    elem       min    q1     q2     q3     max        mean       n50        sum
  scf                  1896       76     150    4044   67922  4021294**  151760     1017298**  287738041
  ctg                  92307      63     100    119    186    297795**   2612.99    24781**    241197400
  deg                  1086650    54     103    124    152    6822       139.41     135        151484589
  utg                  1574225    54     98     119    143    123354     274.33     2027       431859732
  • Assembly stats(100bp+)
 .                       elem     min      max       mean    n50        sum            
 contigs(>100)           69944    100      297795    3421    25048      239324761      
 scaffolds(span)         1869     100      4021294   153951  1017298    287735536      
 scaffolds(len)          1869     100      3429736   128833  860270     240790206
  • Library stats
 .                    badLong  badOuttie  badSame  bothChaff  bothDegen  bothSurrogate  diffScaffold  good   oneChaff  oneDegen  oneSurrogate  
 HWI-EAS385_0001:1    0.15     0.06       0.01     19         10.18      0              0.45          46.9   21.41     1.59      0.19          
 HWI-EAS385_0001:2    0.13     0.01       0        84.86      0.02       0              0.27          14.2   0         0.36      0.02          
 HWI-EAS385_0001:3    0.06     0.01       0        14.45      9.03       0.04           0.05          60.11  15.17     0.68      0.16          
 HWI-EAS385_0001:5    0.06     0.01       0        14.84      8.93       0.04           0.05          59.98  15.02     0.65      0.16          
 HWI-EAS385_0001:6    0.06     0.01       0        14.76      9.04       0.04           0.05          59.7   15.29     0.65      0.16          
 HWI-EAS385_0001:7    0.06     0.01       0        14.73      9.12       0.04           0.05          59.51  15.44     0.64      0.16          
 HWI-EAS385_0001:8    0.06     0.01       0        14.05      10.1       0.03           0.05          58.66  15.92     0.7       0.16          
  • Issues
    • Gap sizes vary between -297,562 bp and 772,064 bp. They should be no longer than ~ 8Kbp (max insert size)
 cat 9-terminator/split/SCF.asm | grep -A 3 "^{CTP" | grep ^mea: | sed 's/mea://' | sort -n | head
 -297562
 -216729
 ...
 162239    # scf7180002754932.314143..476382  : 26 s_1,3,5,6,7 mate pairs link the two sides of the gap
 772064    # scf7180002754479.867224..1639288 : 2  s_1 mate pairs link the two sides of the gap
 ~/bin/posmap2scaff.pl 9-terminator/asm.posmap.ctgscf | grep -v "^>" | sort -nk4 | tail
 7180002653372 EB 89    69161
 7180002699293 EB 78    92310
 7180002731234 BE 64936 162239
 7180002726839 BE 46924 772064
 grep ^mea: SCF.asm | sed 's/mea://' | getSummary.pl -i 0 -t scf.gap | pretty
 .          elem   min          q1      q2       q3       max         mean     n50     sum            
 scf.gap    91502  -297562      19      222      667      772064      -141     772064  -12910029
 scf.gap-   13788  -297562      -2166   -89      -12      0           -4284    0       -59071991
 scf.gap+   77699  0            84      325      775      772064      594      1194    46161963

Location

 mulberry:/scratch0/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/       # original (to be deleted)
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/   # backup

Other

SOAPdenovo on CA contigs

 cat asm.ctg.fasta | ~/bin/assignFastaIds.pl | awk '{print $1}' > asm.contig
 lenseq asm.contig > asm.contig.len
 echo Edge_num `wc -l asm.contig.len` `wc -l asm.contig.len` > asm.ContigIndex
 echo "index   length  reverseComplement" >> asm.ContigIndex
 lenseq asm.K31.contig | awk '{print $1,$2,1}' >> asm.ContigIndex
 SOAPdenovo map   -s SOAPdenovo.config -g asm > SOAPdenovo.log        # ok

 SOAPdenovo scaff -g asm -F >> SOAPdenovo.log                         # fails: a nucmer of files are missing

Assembly comparison

CA.contigs vs SOAPdenovo.contigs

  • Filter contigs >=100 bp
  • Alignment
 nucmer --mum -c 100  CA.fasta SOAPdenovo.fasta

 show-coords -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | p ' $F[19]="." unless($F[19]); print $F[19],"\n";' | count.pl | pretty -o
 [CONTAINS]     80413  
 [CONTAINED]    4646   
 [IDENTITY]     1960   
 [BEGIN]        4152    
 [END]          4034   
 .              157     # partial alignments
 # get "broken" contigs
 show-coords -H        /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t all
 show-coords -L 500 -H /scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs/CA-vs-SOAPdenovo.filter-q.delta | grep -v '\[' | awk '{print $18,$12}' | sort -u | getSummary.pl -i 1 -t 500+
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 all                  118        155    4630   9817   29042  144499     22498.97   54311      2654878     
 500+                 16         1844   8205   16551  28426  66880      20514.88   28426      328238  
  • Location
 mulberry:/scratch0/Bombus_impatiens/Assembly/CA-vs-SOAPdenovo.contigs

CA.contigs vs SOAPdenovo.scaffolds

  • All ctgs & scf
  • Alignment
 #ref/qry all
    .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 SOAPdenovo.scf(span)    12473      100    113    144    498    5933710    20428      1086614    254800874    
 SOAPdenovo.scf(len)     12473      100    113    143    341    5522831    18759.84   1034925    233991535
 CA.ctg                  92307      63     100    119    186    297795     2612       24781      241197400
 #best hits
 nucmer --mum -c 40 CA.fasta SOAPdenovo.fasta
 delta-filter -1 40 -i 95 -o 50 -1 SOAPdenovo-vs-CA.40.delta >! SOAPdenovo-vs-CA.40.filter-1.delta

Genbank submission

 http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=20FF4536-24A4-4D8C-94FA-0755E1326E33
 Project ID: 61101
 Submitting Organization:              University of Illinois & University of Maryland
 Sequencing Center:                    Keck Center for Comparative and Functional Genomics, University of Illinois  ??? BCUI ???  Biotechnology Center, Univ. Illinois (BCUI)
 Consortium Name:
 Organism Name:                        Bombus impatiens
 Strain/isolate/breed:
 Locus Tag Prefix:                     BIMP       #3+ letters
 Source of DNA used for sequencing:    haploid males
 Sequencing Method:                    wgs       ???
 Sequencing Technology:                Illumina
 Estimated Genome Size:                250Mb     # the haploid genome size 
 Brief description of the importnace:  #2 pollinator in the US
 Comments to the staff:                
   DNA 
   whole genome sequencing 
   single genome; 
   no annotation
   assembly name BIMP_1.0 
   assembly method: Celera Assembler 6.1 
   plan to update: ?
   expect to release 
   agp:
  • Sequin:
 Manuscript:         
 Authors:     Hugh Robertson, Matt Hudson, Kim Walden, Aarti Venkat, Tom Newman, Daniela Puiu, Tanja Magoc, David Kelley, Aleksey Zimin, Steven Salzberg, Gene Robinson

  • Best assembly (CA)
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/
 ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bombus_impatiens/Assembly/CA/
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg                  92307      63     100    119    186    297795     2613       24781      241197400      
 ctg.contam           82         1331   10485  19924  31977  112638     25648      32851      2103095   
 ctg.clean            92225      63     100    119    185    297795     2593       24628      239094305     
 ctgNs                76         75     536    3124   6834   44990      5326       10261      404781         
 scf                  1896       76     150    4044   67922  4021294    151761     1017298    287738041
 scf.ctg.contam       24         14315  45008  71878  141921 1116872    141140     164885     3387362        
 deg                  1086650    54     103    124    152    6822       139        135        151484589   
 deg.contam           1129       65     114    141    207    1035       198        211        223588         
 deg.clean            1085521    54     103    124    152    6822       139        135        151261001      
 deg.Ns               37         98     145    234    426    914        313        437        11574             
  • Contaminant list: ???
 /fs/szattic-asmg4/tmagoc/bees_bacteria/IDs

1st submission

  • Submit all contigs (should have deleted the singletons < 200bp)
  • Directories & files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/ 
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.0.*
 server:      ftp-trace.ncbi.nlm.nih.gov
 user:        cbcb_trc
 directory:   test/
 files names: BIMP_1.1.sqn , BIMP_1.1.agp
 number contigs:   92,231 
 number scaffolds: 1,895
  • NCBI email (Dec 23rd)
 there are 3 contigs that need to be trimmed:
 Contig name	Length	Span(s)	Apparent source
 7180002728824	11695	72..148	Escherichia coli
 7180002730077	25748	22703..25611	Escherichia coli
 7180002740828	30102	134..1688	Escherichia coli
 Since these are internal hits, consider whether the contig needs to split (if the E.coli was the 'joining' element).
  • NOTE: When I blastn these 3 contigs to nr (distant blast) , all 3 contigs had multiple good hits to other bacteria , so probably they are all contaminated and should be removed completely, not just trimmed

2nd submission

  • Jan 24th
  number of scaffolds: 1191
  number of contigs:   91524
  • Directories & files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/genbank/sequin/BIMP_1.1.*

NCBI release