Bumblebee

From Cbcb
Jump to navigation Jump to search

Data

  • ~ 500Mbp genome
  • Complete mitochondrion genomes:
 NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
 NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
 only 88% identity; no rearrangements, only snps, short indels

Traces

  • 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
 Lane   Insert            ReadLen    #Mates         #Reads         Coverage   Orientation     Comments
 1      3K(2..6,avg 4K)   124        34,944,099                    14X        outie           865,687(1.2%) reads have qual==0
 2      8K(7..9,avg 8K)   124        32,540,640                    13X        outie

 3      400               124        34,745,750                    .                           gDNA ; originally thought to be 500bp insert instead of 400
 5      400               124        34,601,239                    . 
 6      400               124        34,553,857                    . 
 7      400               124        34,682,612                    . 
 8      400               124        12,975,839                    . 
  
 3-8                                 151,559,297     303,118,594   75X
 1-8                                 219,044,036     438,088,072   108X
 Two new 3kb libraries to come; ~ 70% are true pair ends
  • Adaptors
 >circularizarion
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >circularizarion.revcomp
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG 
 >5
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 >3
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
  • Location
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
 /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt            : 88 reads

Tasks to figure out

  1. Erroneous reads/bases, which we need to correct or discard
  2. GC bias, so we can compute a-stats properly
  3. Redundancy in the long paired ends, which are lane 1 and lane 2.
  • Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
    • Some reads will begin in the circularization adaptor and thus will have only one usable read
    • Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
    • Most reads will have at least 36bp from each end before hitting the adaptor.
    • Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
  • A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
    • Thus their mate distance is +400bp rather than -3kb or -8kb.
    • It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
    • It is possible to remove these reads by removing
  1. mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
  2. mate pairs where one read begins with the adapter sequence.
    • We are working on this, up to now our paired assembly stages have been disappointing.
  • Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
    • There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
    • Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
    • But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
    • Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
    • You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.

Trimming

  • Quality trimming: trim all bases with fastq quality eq "B" (0)
  cat *fastq | ~/bin/fastq2clq.pl 
  • Adaptor trimming: Align all subsets to adaptors
 C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
 5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 id  len    gc%
 C   42     30.95  
 3   52     55.77  
 5   67     59.70  
 nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq  ...
 delta-filter -l 16 -q   ...
 cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
  • Adaptor/primers median positions in the reads
 Lib       adaptorPos     5'primerPos 3'primerPos 
 s_1       34-75          0-36        0-19
 s_2       2-40           0-36        0-19
  • Mate stats
 Lib        #mates    adaptor.bothMates   adaptor.noMate   adaptor.oneMate   adaptor.oneMate(filtered)   
 s_1        34944099  269156              9804247          24870696          6048164(17.3%)               
 s_2        32540640  91528               16181288         16267824          1061858( 3.2%)   
 total      67484739                                                         7110022(10.5%)
  • Filtered read clr stats
 Lib        #mates     #reads       min    q1     q2     q3     max        mean       n50        sum            
 s_1        6048164    12096328     64     80     95     115    124        96.56      101        1,168,064,632     
 s_2        1061858    2123716      64     80     96     115    124        96.59      101          205,122,226 
 total      7110022    14220044     64     80     95     115    124        96.57      101        1,373,186,858 
  • Filtered read GC% stats
 Lib        #reads       min    q1     q2     q3     max        mean       n50                    
 s_1        12096322     0.00   31.93  36.26  42.35  100.00     37.45      38        
 s_2        2123715      0.00   31.88  36.46  42.48  100.00     37.56      38         
  • Other frequent kmers
 26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces
 10mer : AAAAAAAAAA                               TTTTTTTTTT                 : ~32% of the traces    
 53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG                : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
  • Data processing pipeline
 s_[12]    : /nfshomes/dpuiu/bin/fastq2adaptorFreeFrg.amos
 s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos
  • Location:
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt                   # original fastq files
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_[1235678]_[12]_sequence.cor.txt    # corrected fastq files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/s_[12]/all/s_[12].64.rev.frg                    # adaptor free s_[12] mated reads
 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[1235678].frg                  # adaptor free & corrected s_[1235678] reads

D Kelly's trimming

  • Read lens stats
 lib                       reads      min  q1   q2   q3   max  mean    n50  sum          
 s_1_1_sequence.cor.txt    5436814    30   71   90   115  124  89.70   100  487673256    
 s_1_2_sequence.cor.txt    5864225    30   77   92   110  124  93.27   98   546956864    
 s_2_1_sequence.cor.txt    1053858    30   80   96   118  124  97.16   102  102395785    
 s_2_2_sequence.cor.txt    1041846    30   78   93   110  124  93.51   99   97426679     
 s_3_1_sequence.cor.txt    33668302   30   105  124  124  124  111.04  124  3738530761   
 s_3_2_sequence.cor.txt    32554322   30   82   108  124  124  100.42  120  3269039697   
 s_5_1_sequence.cor.txt    33579744   30   109  124  124  124  111.65  124  3749192427   
 s_5_2_sequence.cor.txt    32465412   30   84   112  124  124  102.13  124  3315553171   
 s_6_1_sequence.cor.txt    33535725   30   109  124  124  124  111.60  124  3742602285   
 s_6_2_sequence.cor.txt    32390877   30   83   109  124  124  100.92  123  3268875471   
 s_7_1_sequence.cor.txt    33674235   30   112  124  124  124  112.19  124  3777917379   
 s_7_2_sequence.cor.txt    32518568   30   84   110  124  124  101.60  124  3303807321   
 s_8_1_sequence.cor.txt    11777821   30   113  124  124  124  112.30  124  1322658923   
 s_8_2_sequence.cor.txt    12176364   30   84   110  124  124  101.51  124  1236034348 
  
 total                     301,738,113  .  .    .    .    .    .       .    31958664367  
 total (64bp+)             275,763,594  .  .    .    .    .    .       .    30693543885  
 s_1,2                     13,396,743
  • Read/Mate counts
 lib                       reads        mates            
 s_1                       11301039     5320163    
 s_2                       2095704      1034086    
 s_3                       66222624     31857304   
 s_5                       66045156     31820553         
 s_6                       65926602     31748713   
 s_7                       66192803     31891468   
 s_8                       23954185     11168258   

 total                     301738113    144,840,545
 s_1,2                     13,396,743   6,354,249
  • Files
 /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg         
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_?_[12]_sequence.cor.txt      # fastq ; s_[12] m

Assembly

CA.bog s_1

  • Input  : 31*2M s_1 reads trimmed to 100bp & formatted by fastqToCA
  • Spec file
 unitigger =                      bog
 ovlOverlapper =                  mer
 obtMerThreshold =                400
 ovlMerThreshold =                120
 doOverlapTrimming =              0
  • Unitigger : max utg len=852bp
  • Consensus after unitigger : 3 out of 129 jobs failed
  • Location
 /scratch1/Bombus_impatiens/Assembly/31M

CA.utg s_1,2 filtered

  • Input: 7*2M s_1 & s_2 reads formatted by convert-fasta-to-v2.pl
  • Filter:
    • only one read from the mate contains the adaptor in the 64..124bp range
    • both reads from the mate have good quality in the 0..64bp range
  • Spec file
 unitigger            =  utg  
 ovlOverlapper        =  ovl  
 obtMerThreshold      =  400
 ovlMerThreshold      =  200
 doOverlapTrimming    =  0     => reads < 64 bp atre trimmed by the gatekeeper; no need to trim
  • ovlStore stats
 #ovls/read
 reads      min    q1     q2     q3     max        mean       n50        sum            
 14220044   0      1      5      31     514        31.30      131        445090932 
  • final stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg                  2          1004   1004   1055   1055   1055       1029.50    1055       2059           
 deg                  1433143    64     115    129    164    1028       145.79     145        208938949   
 assembled            7557472
 singletons           6662572
  • Mate redundancy
 cat 9-terminator/asm.posmap.frgdeg  | sort -nk2 -nk3 | perl ~/bin/posmap2ovl.pl | p 'chop $F[1]; chop $F[2]; print "$F[1] $F[2]\n"; print "$F[2] $F[1]\n";' | count.pl -m 2 | more
  • Location
 /scratch1/Bombus_impatiens/Assembly/14M.redo

SOAPdenovo s_3..8 orig (CBCB)

  • Kmer=23
  • Input: s_3,5,6,7,8 original reads
  • Error correction: no
  • Summary
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  571,114    100    106    115    141    156066     162.19     137        92,628,126       
 ctg                  54049561   24     24     28     42     2049       35.04      35         1,893,660,606     
 gapSeq               49518      28     123    143    181    424        152.97     158        7,574,733
  • Longest scaff:
 scaffold672    156066 35.63
 cat asm.scafSeq | ~/bin/scafffasta2scaff.pl | tail -21 | getSummary.pl -i 2  -t ctg
 cat asm.scafSeq | ~/bin/scafffasta2scaff.pl | tail -21 | head -20| getSummary.pl -i 3 -t gap
 .             elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg           21         481    2943   5480   9853   19523      7268.48    12202      152638         
 gap           20         1      166    183    246    302        171.40     233        3428

SOAPdenovo s_3..8 corr (CBCB)

  • Kmer=31
  • Input: s_3,5,6,7,8 reads
  • Error correction: yes
  • Summary
                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                106,217    100    140    434    2418   102317     2,177      7099       23,1261,742      

SOAPdenovo s_3..8 corr (Illinois)

  • Kmer=31
  • Input: s_3,5,6,7,8 reads
  • Error correction: yes; UIUC method
  • Summary
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  17,550                                            13,460                236,225,169
 ctg                  4004102    31                          97,618     1916       6020       230,333,709
 ctg100bp+            120167     100

SOAPdenovo s_1..8 corr (Illinois)

  • Kmer=31
  • Input: s_1,2,3,5,6,7,8 reads
  • Error correction: yes; CBCB method
  • Summary
 .                    elem       min    q1     q2     q3     max        mean       n50        sum   
 ctg200bp+            43603      201    752    2786   9518   167534     7637.63    19895      333,023,446         
  • Locations
 ftp://stan.cropsci.uiuc.edu/download/bombus1-k64-contigs.fa.min200
 ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC/

SOAPdenovo s_1..8 corr (Illinois)

  • From: Kim Walden 04/30/2010
  • Input: s_1,2,3,5,6,7,8 reads
  • Error correction: yes; CBCB method
  • Summary
 # computed by UIUC
                         elements min_len  max_len  mean    n50          sum
 contigs(>100)           103858   100      85850    2227    7541         231387390      # prior to scaffolding; 
 scaffolds&contigs(>100) 12663    100      4648588  19282   1040109      244171008      # post-scaffolding and gap-filling
 scaffolds               3736     252      4648588  64874   1047164      242369898      # same prior data minus contigs that didn't assemble into scaffolds.
 # computed by CBCB
                         elem     min      max        mean    n50        sum
 contigs                 8927     100      37465      201     178        1801110        
 scaffolds+contigs       12663    100      4648588    19282   1040109    244171008
 scaffolds               3736     252      4648588    64874   1047164    242369898
  • Location
 http://www.life.illinois.edu/robertson/private/
 Login: robertson
 password: dna

CA s_1..8 corr OBT (CBCB) *

Issues

  • Used outie s_1,2 lib ; needs to be redone

runCA spec

 doOverlapTrimming =        1
 ovlOverlapper =            ovl
 unitigger =                bog         # cmd buildUnitigs

 *BatchSize =               2000000
 *BlockSize =               2000000

 ovlMemory   =              8GB
 doExtendClearRanges =      0          # fast run

Gatekeeper

 gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
 LOAD                  STATS         
 
 7                     libInput      
 7                     libLoaded     
 0                     libErrors     
 
 301738113             frgInput      
 301738113             frgLoaded     
 25974519              frgErrors     
 
 144840545             lkgInput      
 124575590             lkgLoaded     
 20264955              lkgErrors     
 0                     lkgWarnings   
 
 275763594             numRandom     
 301738113             numNormal     
 
 
 GLOBAL                STATS         
 
 301738113             FRG           
 7                     LIB           
 
 LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength   clearLength  
 GLOBAL                274282535     27455578       247151634    30524746650  29445986650  
 LegacyUnmatedReads    0             0              0            0            0            
 
 s_1                   10409320      891719         9039810      996306280    994802849    
 s_2                   2054575       41129          1987728      197829588    197593453    
 s_3                   59894531      6328093        53925476     2382786529   2127257586   
 s_5                   60127274      5917882        54512448     2459696212   2209078913   
 s_6                   59813485      6113117        54062058     2396359740   2152656620   
 s_7                   60278536      5914267        54647258     2476326572   2237423989   
 s_8                   21704814      2249371        18976856     2435572545   2347304056   
 gatekeeper -dumplibraries -tabular asm.gkpStore
 UID    IID  Orientation  Mean  StdDev  NumFeatures  
 s_1    1    I            4000  400     1582823088   
 s_2    2    I            8000  800     1582823088   
 s_3    3    I            400   40      1582823088   
 s_5    4    I            400   40      1582823088   
 s_6    5    I            400   40      1582823088   
 s_7    6    I            400   40      1582823088   
 s_8    7    I            400   40      1582823088   
 s_1,2 lib : ids:1..13,396,744
 overlap jobs that need to be redone : 1036 out of 11476 < 10%
 ~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);' 

Meryl

 meryl -Dh -s 0-mercounts/asm-C-ms22-cm0 
 Found 23686053415 mers.
 Found 265670791 distinct mers.
 Found 8682767 unique mers.
 ...
 #freq   count   %
 1       8682767 0.0327  0.0004
 2       4288602 0.0488  0.0007
 72*     3803015 0.7150  0.3229
 73      3797615 0.7293  0.3346

OBT

  • job count
 n=((301738113/2000000)+0.5)=151
 OBT/OVL jobs = n*n/2 =151*76=11476
 ~ 20 min/OBT job =~ 0.33 hr/OBT job  => 0.33*11476/32=5 days
  • OBT trimming
 key             sum             count           mean
 CLR             31958664367     301738113       105.915239043733
 OBTINITIAL      30779687358     301610679       102.051052900551
 OBTMERGE        30052401719     301610679       99.6397137483318
 OBTCHIMERA      30052138325     301610679       99.638840456972
 LATEST          30052138325     301610679       99.638840456972
  • use OBTCHIMERA as CLR if want to redo the assembly !!!

Overlap

  • job count : same as OBT
  • Overlap counts
 cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o 
 Lib    0  5'  199542773  50  
 Lib    0  3'  199542773  50  
 Lib    1  5'  6682072    38  
 Lib    1  3'  6682072    37  
 Lib    2  5'  1306720    37  ???
 Lib    2  3'  1306720    42  ???
 Lib    3  5'  43868663   50  
 Lib    3  3'  43868663   50  
 Lib    4  5'  43939366   51  
 Lib    4  3'  43939366   50  
 Lib    5  5'  43620676   50  
 Lib    5  3'  43620676   50  
 Lib    6  5'  43903838   51  
 Lib    6  3'  43903838   50  
 Lib    7  5'  16221438   50  
 Lib    7  3'  16221438   50
  • 199542773 out of 257786946 (77%) usable reads had overlaps
  • 7988792 out of 11443537 (69%) usable s_1,2 reads had overlaps

Frg/Ovl correction

  • Frgcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
  • Ovlcorr jobs: 151
 wc -l 3-overlapcorrection/cat-corrects.frgcorrlist

OverlapStore

  • Failures
    • overlapStore: too many files (max 10240) ; fix: recompile the code
    • chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report

Bog

  • Run time: 14 hr
  • default unitigger : does not seem to work on such big datasets
  • Utg len stats:
 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 63916892   47     85     150    150    22861      125.23     150        8004296170     

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED.  Could not align fragment 264046761.

 Fix:
 # update layout
 setenv UTG 62979707
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u $UTG -d layout > utg.$UTG.layout
 ~/bin/updateUTGlayout.pl utg.$UTG.layout >  utg.$UTG.layout.update
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u $UTG -R utg.$UTG.layout.update  
 # create success files
 touch 5-consensus/asm_090.success 5-consensus/consensus.success
 # relaunch runCA

Cgw

  • Failures
 grep -B 5 "^Failure message:" runCA.log 
 

Cgw consensus

  • job count : 84
 wc -l 7-CGW/asm.partitionInfo
  • ctg count: 1698486
 cat 7-CGW/asm.partitionInfo | /nfshomes/dpuiu/bin/sum.pl -i 3 -t ctgs | pretty -o

Terminator

  • Failure:
    • buildPosMap : segFault :ids too long???

Stats

 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  15009      74     4786   9789   20021  415318     16916.33   29133      253897199      
 ctg                  145036     64     93     116    1035   138630     1498.67    7726       217361459      
 deg                  1507280    54     98     118    146    1434       148.28     137        223502287      
 utg                  2786974    54     90     110    142    25644      197.04     277        549140918      
 singletons           61129865
 usableSeqs           257786946
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg200+              42345      200    1561   2803   4840   138630     4878.79    8448       206592377      
 deg200+              182588     200    224    284    503    1434       384.31     464        70171056       
 ctg200+,deg200+      224933     200    230    340    745    138630     1230.43    4669       276763433
  • Library stats (totals)
 lib    ctg       sur      deg       singl     total
 s_1    2251597   162800   3119665   3988095   9522157
 s_2    424821    41107    563656    891795    1921379
 s_3    28880189  1078891  13600909  12735637  56295626
 s_5    28983834  1077334  13561390  12962731  56585289
 s_6    28663974  1062939  13559494  12896855  56183262
 s_7    28784505  1058364  13713200  13020274  56576343
 s_8    10455533  383065   5229814   4634478   20702874
 all    128444453 4864500  63348128  61129865  257786930
  • Library stats (percentages)
 lib    ctg  sur  deg  singl  total
 s_1    24   2    33   42     100
 s_2    22   2    29   46     100
 s_3    51   2    24   23     100
 s_5    51   2    24   23     100
 s_6    51   2    24   23     100
 s_7    51   2    24   23     100
 s_8    51   2    25   22     100

Mate redundancy

  • Criteria:
 Given 2 mates  
   1: a1,a2 
   2: b1,b1
 We have the following 2 overlaps
   1: a1,b1  a2,b2    
   2: a1,b2  a2,b1  
 At least 75% of one read overlaps the other read
  • Location:
 cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
 wc -l s_1.mates asm.posmap.frgutg.s_1.*
 5320163 s_1.mates                                     # input mates (from FRG file; 2 reads/mate)
 1232734 asm.posmap.frgutg.s_1.mates                   # assembled mates (both reads assembled) 
  366262 asm.posmap.frgutg.s_1.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy
  312206 asm.posmap.frgutg.s_1.redundant_mate_pairs    # redundant assembled mate pairs
 
 wc -l s_2.mates asm.posmap.frgutg.s_2.* 
 1034086 s_2.mates                                     # input mates (from FRG file; 2 reads/mate)
  255905 asm.posmap.frgutg.s_2.mates                   # assembled mates (both reads assembled) 
  110570 asm.posmap.frgutg.s_2.redundant_mates         # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy
 1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs    # redundant assembled mate pairs
   11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters

Runtimes

 #     step            hours
 0     gatekeeper      5.36   
 1     initialTrim     2.07   
 2     meryl           9.90   
 9     overlap.sh      11.31  
 11    find            0.00   
 16    overlapStore    13.79  
 17    consolidate     5.12   
 18    merge-trimming  1.41   
 19    chimera(1)      10.04  
 20    overlap.sh      49.86  
 21    find            0.00   
 22    overlapStore    16.68  
 23    frgcorr.sh      5.57   
 24    cat-corrects    0.10   
 25    ovlcorr.sh      4.64   
 26    cat-erates      0.90   
 27    overlapStore    6.67   
 28    buildUnitigs    13.88  
 29    gatekeeper      1.31   
 30    consensus.sh    3.41   
 31    cgw(1)          65.00  
 43    gatekeeper      4.39   
 44    consensus.sh    4.70   
 46    terminator      6.26   
 47    asmOutputFasta  7.75   
 48    dumpSingletons  3.68   
 49    buildPosMap(1)  3.79

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/ # assembly
 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR : dumped OBT clear ranges, 0/1 undel/del

CA s_1..8 corr OBT redo (CBCB) *

Issues

  • Fixed outie s_1,2 lib : clear ranges will be different
  • s_3..8 : stay the same

runCA spec

 doOverlapTrimming =        0         !!! already done in the prev assembly
 saveOverlaps  =            1         !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted
 ovlOverlapper =            ovl       # same
 unitigger =                bog       # same   
 
 BatchSize =                2000000    # must be the same so overlaps are not redone
 BlockSize =                2000000    # must be the same
 ...
 doExtendClearRanges =      1          # initially set to 2; seems to take too long => reset to 1 

Input

  • Modified frg files:
    • dumped "CA s_1..8 corr OBT" assembly gatekeeper
    • undelete act:D fragments so the ids stay the save
    • reverse s_1,2 reads

Gatekeeper

  • Make sure the iids match

Overlap

  • Rerun only the overlap jobs for the s_1,2 reads

OverlapStore

  • To build faster in the future
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)

Frgcorr

  • Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)

Ovlcorr

  • The error rate on overlaps with error went down by ~ 1.28%

Bog

 cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
 elem       min    q1     q2     q3     max        mean       n50        sum            
 60970504   47     85     150    150    93191      126.17     150        7692801690

Bog consensus

  • job count :
 wc -l 4-unitigger/asm.partitioningInfo
 110
  • Failures
 grep FAILED 5-consensus/asm*err
 5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED.  

 Fix: same as before

Cgw

  • Failure:
    • after 7-0-CGW/asm.ckp.4
 tail runCA.log
 ERROR:  Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)

 37980:Scaffold 6409:
 37981-  Idx    Chunk     (   Start, Variance)  (     End, Variance)      Len
 37982-    0: 60983787    (       0,        0)  (   31646,      823)    31646
 37983-    1:   69691     (   35389,      934)  (   31650,      837)     3739
 37984-    2: 59907028    (   35103,      942)  (   42035,     1122)     6932
 37985-    3: 60799298    (   41640,     1132)  (   64897,     1737)    23257
 37986-    4: 60975976    (   64924,     1784)  (   88439,     2395)    23515
 37987-    5: 60988396    (   88431,     2416)  (  122023,     3289)    33592
 37988-    6: 60988698    (  121551,     3324)  (  123589,     3377)     2038
 37989-    7: 59905266*   (  122331,  -163479)  (  121859,  -163491)      472
 37990-    8: 60988599    (  122275,   170138)  (  134564,   170457)    12289
 ./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.
 ./AS_REZ/GapFillREZ.c:8339:       "ERROR:  Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n",  #message
 Fix: modfy the code; comment assertion statemants
 cat /nfshomes/dpuiu/szdevel/SourceForge/wgs/src/AS_REZ/GapFillREZ.c
 // assert (0.0 <= ref_variance && ref_variance <= FLT_MAX);
 // assert (min_right_variance >= 0.0);
  • Preliminary stats
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
 cat rezlog/rez.i36.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
 #7-0-CGW/rez.i36.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  1820       84     162    4583   82337  3161523    125884.07  667487     229109005       # ids from 0..13113
 ctg                  26453      64     962    3756   10485  255820     8759.59    21800      231717378      
 ctg200+              23343      200    1760   4772   11873  255820     9908.34    21845      231290344 
 scf.gaps             877        1      4      12     36     373        27.71      64         24298
 #7-2-CGW/rez.i01.log :
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  4903       66     1283   6908   39466  1148626    47212.57   209756     231483215      
 ctg                  23480      64     1370   4397   11757  255820     9858.75    23476      231483360
 7-2-CGW/rez.i06.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3818       64     268    6813   61948  1198480    60723.65   257457     231842914      
 ctg                  24005      64     1165   4161   11439  255820     9659.47    23819      231875677 
 #7-2-CGW/rez.i11.log 
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  3733       64     268    6880   65584  1391034    62111.99   258823     231864045      
 ctg                  24004      64     1165   4161   11439  255820     9659.98    23819      231878256      


  • TigStore
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail
 maID            isPresent       isDeleted       ptID    svID    fileOffset      covStat
 60970540        1               0               0       3       58023924696     84
 tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1 
 maID            isPresent       isDeleted       ptID    svID    fileOffset
 61001100        1               0               0       32      16307420
  • Checkpoints
 # 7-0-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               5        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1
 CHECKPOINT_DURING_1ST_SCAFF_MERGE       "ckp04-1SM-partial"       12                     # we should see more iterations if the scaffolding goes well; also ECR would take longer
 CHECKPOINT_AFTER_1ST_SCAFF_MERGE        "ckp05-1SM"               1
 CHECKPOINT_AFTER_STONES                 "ckp06-AS"           
 CHECKPOINT_DURING_2ND_SCAFF_MERGE       "ckp07-2SM-partial"  
 CHECKPOINT_AFTER_2ND_SCAFF_MERGE        "ckp08-2SM"               1
 CHECKPOINT_AFTER_FINAL_ROCKS            "ckp09-FR"                1

 CHECKPOINT_AFTER_PARTIAL_STONES         "ckp10-PS"   
 CHECKPOINT_AFTER_FINAL_CONTAINED_STONES "ckp11-FCS"  
 CHECKPOINT_AFTER_FINAL_CLEANUP          "ckp12-FC"   
 CHECKPOINT_AFTER_RESOLVE_SURROGATES     "ckp13-RS"   
 CHECKPOINT_AFTER_OUTPUT                 "ckp14-FIN"
 # 7-2-CGW              
                                                                   #times   iterMax
 CHECKPOINT_AFTER_BUILDING_SCAFFOLDS     "ckp01-ABS"               
 CHECKPOINT_DURING_CLEANING_SCAFFOLDS    "ckp02-DCS"               9        10
 CHECKPOINT_AFTER_CLEANING_SCAFFOLDS     "ckp03-ACD"               1                      # failed after it
  • Long running time : implemented 2 cgw options -ic, -im ....
  • Running times
 CGW-0: START Wed Apr 28 22:30:12 ; END Mon May  3 03:34:57
 ECR-1: START Mon May  3 03:34:58 ; END Mon May  3 12:28:54        
 CGW-2: START Mon May  3 12:28:54 ...
  • Library insert histograms
 # Example
 cat 7-0-CGW/stat/scaffold_final.distlib_1.cgm | p 'print int($F[0]/1000),"\n";' | count.pl | sort -n > 7-0-CGW/stat/scaffold_final.distlib_1.cgm.hist
 gnuplot
   plot "scaffold_final.distlib_1.cgm.hist" w l


  • Failure:
    • after 7-2-CGW/asm.ckp.42 (after running about 2hrs)
 tail runCA.log
 * Screwed up scaffold 3790: Chunk 60994012 has bad mean
 * Scaffold 3790 iteration 1  1 bad CIs out of 33 LSE:2664.55 improvement:1
 * Reinserting chunk 60994012 in scaffold 3790
 * improvement = 0.0721104 LSE = 2664.55
 ----------------------------------------
 Failure message:
 scaffolder failed

ECR

  • Count scaffold gaps : total/closed
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap'   # 24635
 cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap'      # 2803

Location

 ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/ # assembly

CA s_1,3..8 corr OBT (CBCB) *

Issues

  • Don't use s_2 lib : seems like bimodal distribuition, CA "cannot handle it"

runCA spec

 doExtendClearRanges=0 

Input

  • Gatekeeper: same as prev assembly
  • Overlaps: filter out all overlaps for s_2 reads

CA s_1..8 corr noOBT (CBCB) *

Issues

  • Used outie s_1,2 lib ; needs to be redone

runCA spec

 doOverlapTrimming =              0
 ovlOverlapper =                  ovl
 unitigger =                      utg  # cmd unitigger

Overlap

 cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o
 Lib    0  5'  212707959  51  
 Lib    0  3'  212707959  51  
 Lib    1  5'  7387977    39  
 Lib    1  3'  7387977    39  
 Lib    2  5'  1413291    40  ???
 Lib    2  3'  1413291    45  ???
 Lib    3  5'  46756517   51  
 Lib    3  3'  46756517   51  
 Lib    4  5'  46786172   52  
 Lib    4  3'  46786172   51  
 Lib    5  5'  46531611   52  
 Lib    5  3'  46531611   51  
 Lib    6  5'  46891503   52  
 Lib    6  3'  46891503   51  
 Lib    7  5'  16940888   52  
 Lib    7  3'  16940888   51  

Bog

  • Utg len stats:
 elem       min    q1     q2     q3     max        mean       n50        sum            
 71125093   66     95     150    150    23390      130.41     150        9275136839     

Location

 walnut:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.noOBT