Data
- ~ 500B genome
- Complete mitochondrion genomes:
NC_011923.1 15468 14.67 Bombus hypocrita sapporoensis mitochondrion, complete genome
NC_010967.1 16434 13.22 Bombus ignitus mitochondrion, complete genome
only 88% identity; no rearrangements, only snps, short indels
Traces
- 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
Lane Insert ReadLen #Mates #Reads Coverage Orientation Comments
1 3K(2..6,avg 4K) 124 34,944,099 14X outie 865,687(1.2%) reads have qual==0
2 8K(7..9,avg 8K) 124 32,540,640 13X outie
3 400 124 34,745,750 . gDNA ; originally thought to be 500bp insert instead of 400
5 400 124 34,601,239 .
6 400 124 34,553,857 .
7 400 124 34,682,612 .
8 400 124 12,975,839 .
3-8 151,559,297 303,118,594 75X
1-8 219,044,036 438,088,072 108X
Two new 3kb libraries to come; ~ 70% are true pair ends
>circularizarion
CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
>circularizarion.revcomp
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
>5
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
>3
CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs
/fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt : 88 reads
Tasks to figure out
- Erroneous reads/bases, which we need to correct or discard
- GC bias, so we can compute a-stats properly
- Redundancy in the long paired ends, which are lane 1 and lane 2.
- Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
- Some reads will begin in the circularization adaptor and thus will have only one usable read
- Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
- Most reads will have at least 36bp from each end before hitting the adaptor.
- Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
- A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
- Thus their mate distance is +400bp rather than -3kb or -8kb.
- It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
- It is possible to remove these reads by removing
- mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
- mate pairs where one read begins with the adapter sequence.
- We are working on this, up to now our paired assembly stages have been disappointing.
- Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
- There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
- Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
- But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
- Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
- You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.
Trimming
- Quality trimming: trim all bases with fastq quality eq "B" (0)
cat *fastq | ~/bin/fastq2clq.pl
- Adaptor trimming: Align all subsets to adaptors
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
id len gc%
C 42 30.95
3 52 55.77
5 67 59.70
nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq ...
delta-filter -l 16 -q ...
cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
- Adaptor/primers median positions in the reads
Lib adaptorPos 5'primerPos 3'primerPos
s_1 34-75 0-36 0-19
s_2 2-40 0-36 0-19
Lib #mates adaptor.bothMates adaptor.noMate adaptor.oneMate adaptor.oneMate(filtered)
s_1 34944099 269156 9804247 24870696 6048164(17.3%)
s_2 32540640 91528 16181288 16267824 1061858( 3.2%)
total 67484739 7110022(10.5%)
Lib #mates #reads min q1 q2 q3 max mean n50 sum
s_1 6048164 12096328 64 80 95 115 124 96.56 101 1,168,064,632
s_2 1061858 2123716 64 80 96 115 124 96.59 101 205,122,226
total 7110022 14220044 64 80 95 115 124 96.57 101 1,373,186,858
Lib #reads min q1 q2 q3 max mean n50
s_1 12096322 0.00 31.93 36.26 42.35 100.00 37.45 38
s_2 2123715 0.00 31.88 36.46 42.48 100.00 37.56 38
26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces
10mer : AAAAAAAAAA TTTTTTTTTT : ~32% of the traces
53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
s_[12] : /nfshomes/dpuiu/bin/fastq2adaptorFreeFrg.amos
s_[35678] : /nfshomes/dpuiu/bin/fastqfrg.amos
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt # original fastq files
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_[1235678]_[12]_sequence.cor.txt # corrected fastq files
/fs/szattic-asmg5/Bees/Bombus_impatiens/s_[12]/all/s_[12].64.rev.frg # adaptor free s_[12] mated reads
/fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[1235678].frg # adaptor free & corrected s_[1235678] reads
D Kelly's trimming
lib reads min q1 q2 q3 max mean n50 sum
s_1_1_sequence.cor.txt 5436814 30 71 90 115 124 89.70 100 487673256
s_1_2_sequence.cor.txt 5864225 30 77 92 110 124 93.27 98 546956864
s_2_1_sequence.cor.txt 1053858 30 80 96 118 124 97.16 102 102395785
s_2_2_sequence.cor.txt 1041846 30 78 93 110 124 93.51 99 97426679
s_3_1_sequence.cor.txt 33668302 30 105 124 124 124 111.04 124 3738530761
s_3_2_sequence.cor.txt 32554322 30 82 108 124 124 100.42 120 3269039697
s_5_1_sequence.cor.txt 33579744 30 109 124 124 124 111.65 124 3749192427
s_5_2_sequence.cor.txt 32465412 30 84 112 124 124 102.13 124 3315553171
s_6_1_sequence.cor.txt 33535725 30 109 124 124 124 111.60 124 3742602285
s_6_2_sequence.cor.txt 32390877 30 83 109 124 124 100.92 123 3268875471
s_7_1_sequence.cor.txt 33674235 30 112 124 124 124 112.19 124 3777917379
s_7_2_sequence.cor.txt 32518568 30 84 110 124 124 101.60 124 3303807321
s_8_1_sequence.cor.txt 11777821 30 113 124 124 124 112.30 124 1322658923
s_8_2_sequence.cor.txt 12176364 30 84 110 124 124 101.51 124 1236034348
total 301,738,113 . . . . . . . 31958664367
total (64bp+) 275,763,594 . . . . . . . 30693543885
s_1,2 13,396,743
lib reads mates
s_1 11301039 5320163
s_2 2095704 1034086
s_3 66222624 31857304
s_5 66045156 31820553
s_6 65926602 31748713
s_7 66192803 31891468
s_8 23954185 11168258
total 301738113 144,840,545
s_1,2 13,396,743 6,354,249
/fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/s_[12].rev.frg
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_[35678].frg
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_?_[12]_sequence.cor.txt # fastq ; s_[12] m
Assembly
CA.bog s_1
- Input : 31*2M s_1 reads trimmed to 100bp & formatted by fastqToCA
unitigger = bog
ovlOverlapper = mer
obtMerThreshold = 400
ovlMerThreshold = 120
doOverlapTrimming = 0
- Unitigger : max utg len=852bp
- Consensus after unitigger : 3 out of 129 jobs failed
/scratch1/Bombus_impatiens/Assembly/31M
CA.utg s_1,2 filtered
- Input: 7*2M s_1 & s_2 reads formatted by convert-fasta-to-v2.pl
- Filter:
- only one read from the mate contains the adaptor in the 64..124bp range
- both reads from the mate have good quality in the 0..64bp range
unitigger = utg
ovlOverlapper = ovl
obtMerThreshold = 400
ovlMerThreshold = 200
doOverlapTrimming = 0 => reads < 64 bp atre trimmed by the gatekeeper; no need to trim
#ovls/read
reads min q1 q2 q3 max mean n50 sum
14220044 0 1 5 31 514 31.30 131 445090932
. elem min q1 q2 q3 max mean n50 sum
ctg 2 1004 1004 1055 1055 1055 1029.50 1055 2059
deg 1433143 64 115 129 164 1028 145.79 145 208938949
assembled 7557472
singletons 6662572
cat 9-terminator/asm.posmap.frgdeg | sort -nk2 -nk3 | perl ~/bin/posmap2ovl.pl | p 'chop $F[1]; chop $F[2]; print "$F[1] $F[2]\n"; print "$F[2] $F[1]\n";' | count.pl -m 2 | more
/scratch1/Bombus_impatiens/Assembly/14M.redo
SOAPdenovo s_3..8 orig (CBCB)
- Kmer=23
- Input: s_3,5,6,7,8 original reads
- Error correction: no
- Summary
. elem min q1 q2 q3 max mean n50 sum
scf 571,114 100 106 115 141 156066 162.19 137 92,628,126
ctg 54049561 24 24 28 42 2049 35.04 35 1,893,660,606
gapSeq 49518 28 123 143 181 424 152.97 158 7,574,733
scaffold672 156066 35.63
cat asm.scafSeq | ~/bin/scafffasta2scaff.pl | tail -21 | getSummary.pl -i 2 -t ctg
cat asm.scafSeq | ~/bin/scafffasta2scaff.pl | tail -21 | head -20| getSummary.pl -i 3 -t gap
. elem min q1 q2 q3 max mean n50 sum
ctg 21 481 2943 5480 9853 19523 7268.48 12202 152638
gap 20 1 166 183 246 302 171.40 233 3428
SOAPdenovo s_3..8 corr (CBCB)
- Kmer=31
- Input: s_3,5,6,7,8 reads
- Error correction: yes
- Summary
elem min q1 q2 q3 max mean n50 sum
scf 106,217 100 140 434 2418 102317 2,177 7099 23,1261,742
SOAPdenovo s_3..8 corr (Illinois)
- Kmer=31
- Input: s_3,5,6,7,8 reads
- Error correction: yes; UIUC method
- Summary
. elem min q1 q2 q3 max mean n50 sum
scf 17,550 13,460 236,225,169
ctg 4004102 31 97,618 1916 6020 230,333,709
ctg100bp+ 120167 100
SOAPdenovo s_1..8 corr (Illinois)
- Kmer=31
- Input: s_1,2,3,5,6,7,8 reads
- Error correction: yes; CBCB method
- Summary
. elem min q1 q2 q3 max mean n50 sum
ctg200bp+ 43603 201 752 2786 9518 167534 7637.63 19895 333,023,446
ftp://stan.cropsci.uiuc.edu/download/bombus1-k64-contigs.fa.min200
ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC/
SOAPdenovo s_1..8 corr (Illinois)
- From: Kim Walden 04/30/2010
- Input: s_1,2,3,5,6,7,8 reads
- Error correction: yes; CBCB method
- Summary
elements min_len. max_len. mean n50 sum
contigs(>100) 103858 100 85850 2227 7541 231387390
scaffolds&contigs(>100) 16030 100 1044604 14995 166497 240381124
scaffolds 6956 253 1044604 34258 168763 238300263
CA s_1..8 corr OBT (CBCB) *
Issues
- Used outie s_1,2 lib ; needs to be redone
runCA spec
doOverlapTrimming = 1
ovlOverlapper = ovl
unitigger = bog # cmd buildUnitigs
*BatchSize = 2000000
*BlockSize = 2000000
ovlMemory = 8GB
Gatekeeper
gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear ...
LOAD STATS
7 libInput
7 libLoaded
0 libErrors
301738113 frgInput
301738113 frgLoaded
25974519 frgErrors
144840545 lkgInput
124575590 lkgLoaded
20264955 lkgErrors
0 lkgWarnings
275763594 numRandom
301738113 numNormal
GLOBAL STATS
301738113 FRG
7 LIB
LibraryName numActiveFRG numDeletedFRG numMatedFRG readLength clearLength
GLOBAL 274282535 27455578 247151634 30524746650 29445986650
LegacyUnmatedReads 0 0 0 0 0
s_1 10409320 891719 9039810 996306280 994802849
s_2 2054575 41129 1987728 197829588 197593453
s_3 59894531 6328093 53925476 2382786529 2127257586
s_5 60127274 5917882 54512448 2459696212 2209078913
s_6 59813485 6113117 54062058 2396359740 2152656620
s_7 60278536 5914267 54647258 2476326572 2237423989
s_8 21704814 2249371 18976856 2435572545 2347304056
gatekeeper -dumplibraries -tabular asm.gkpStore
UID IID Orientation Mean StdDev NumFeatures
s_1 1 I 4000 400 1582823088
s_2 2 I 8000 800 1582823088
s_3 3 I 400 40 1582823088
s_5 4 I 400 40 1582823088
s_6 5 I 400 40 1582823088
s_7 6 I 400 40 1582823088
s_8 7 I 400 40 1582823088
s_1,2 lib : ids:1..13,396,744
overlap jobs that need to be redone : 1036 out of 11476 < 10%
~/bin/getOvlJobsCount.pl 2000000 2000000 301738113 | p '/(\d+)-(\d+)\s+(\d+)-(\d+)/; print $_ if($1<14000000 or $3<14000000);'
Meryl
meryl -Dh -s 0-mercounts/asm-C-ms22-cm0
Found 23686053415 mers.
Found 265670791 distinct mers.
Found 8682767 unique mers.
...
#freq count %
1 8682767 0.0327 0.0004
2 4288602 0.0488 0.0007
72* 3803015 0.7150 0.3229
73 3797615 0.7293 0.3346
OBT
n=((301738113/2000000)+0.5)=151
OBT/OVL jobs = n*n/2 =151*76=11476
~ 20 min/OBT job =~ 0.33 hr/OBT job => 0.33*11476/32=5 days
key sum count mean
CLR 31958664367 301738113 105.915239043733
OBTINITIAL 30779687358 301610679 102.051052900551
OBTMERGE 30052401719 301610679 99.6397137483318
OBTCHIMERA 30052138325 301610679 99.638840456972
LATEST 30052138325 301610679 99.638840456972
- use OBTCHIMERA as CLR if want to redo the assembly !!!
Overlap
cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print " $F[1]"}' | pretty -o
Lib 0 5' 199542773 50
Lib 0 3' 199542773 50
Lib 1 5' 6682072 38
Lib 1 3' 6682072 37
Lib 2 5' 1306720 37 ???
Lib 2 3' 1306720 42 ???
Lib 3 5' 43868663 50
Lib 3 3' 43868663 50
Lib 4 5' 43939366 51
Lib 4 3' 43939366 50
Lib 5 5' 43620676 50
Lib 5 3' 43620676 50
Lib 6 5' 43903838 51
Lib 6 3' 43903838 50
Lib 7 5' 16221438 50
Lib 7 3' 16221438 50
- 199542773 out of 257786946 (77%) usable reads had overlaps
- 7988792 out of 11443537 (69%) usable s_1,2 reads had overlaps
Frg/Ovl correction
wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
wc -l 3-overlapcorrection/cat-corrects.frgcorrlist
OverlapStore
- Failures
- overlapStore: too many files (max 10240) ; fix: recompile the code
- chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report
Bog
- Run time: 14 hr
- default unitigger : does not seem to work on such big datasets
- Utg len stats:
cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
elem min q1 q2 q3 max mean n50 sum
63916892 47 85 150 150 22861 125.23 150 8004296170
Bog consensus
wc -l 4-unitigger/asm.partitioningInfo
110
grep FAILED 5-consensus/asm*err
5-consensus/asm_090.err:MultiAlignUnitig()-- Unitig 62979707 FAILED. Could not align fragment 264046761.
Fix:
# update layout
setenv UTG 62979707
tigStore -g asm.gkpStore -t asm.tigStore 2 -u $UTG -d layout > utg.$UTG.layout
~/bin/updateUTGlayout.pl utg.$UTG.layout > utg.$UTG.layout.update
tigStore -g asm.gkpStore -t asm.tigStore 2 -u $UTG -R utg.$UTG.layout.update
# create success files
touch 5-consensus/asm_090.success 5-consensus/consensus.success
# relaunch runCA
Cgw
grep -B 5 "^Failure message:" runCA.log
Cgw consensus
wc -l 7-CGW/asm.partitionInfo
cat 7-CGW/asm.partitionInfo | /nfshomes/dpuiu/bin/sum.pl -i 3 -t ctgs | pretty -o
Terminator
- Failure:
- buildPosMap : segFault :ids too long???
Stats
. elem min q1 q2 q3 max mean n50 sum
scf 15009 74 4786 9789 20021 415318 16916.33 29133 253897199
ctg 145036 64 93 116 1035 138630 1498.67 7726 217361459
deg 1507280 54 98 118 146 1434 148.28 137 223502287
utg 2786974 54 90 110 142 25644 197.04 277 549140918
singletons 61129865
usableSeqs 257786946
. elem min q1 q2 q3 max mean n50 sum
ctg200+ 42345 200 1561 2803 4840 138630 4878.79 8448 206592377
deg200+ 182588 200 224 284 503 1434 384.31 464 70171056
ctg200+,deg200+ 224933 200 230 340 745 138630 1230.43 4669 276763433
lib ctg sur deg singl total
s_1 2251597 162800 3119665 3988095 9522157
s_2 424821 41107 563656 891795 1921379
s_3 28880189 1078891 13600909 12735637 56295626
s_5 28983834 1077334 13561390 12962731 56585289
s_6 28663974 1062939 13559494 12896855 56183262
s_7 28784505 1058364 13713200 13020274 56576343
s_8 10455533 383065 5229814 4634478 20702874
all 128444453 4864500 63348128 61129865 257786930
- Library stats (percentages)
lib ctg sur deg singl total
s_1 24 2 33 42 100
s_2 22 2 29 46 100
s_3 51 2 24 23 100
s_5 51 2 24 23 100
s_6 51 2 24 23 100
s_7 51 2 24 23 100
s_8 51 2 25 22 100
Mate redundancy
Given 2 mates
1: a1,a2
2: b1,b1
We have the following 2 overlaps
1: a1,b1 a2,b2
2: a1,b2 a2,b1
At least 75% of one read overlaps the other read
cd /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor/9-terminator/
wc -l s_1.mates asm.posmap.frgutg.s_1.*
5320163 s_1.mates # input mates (from FRG file; 2 reads/mate)
1232734 asm.posmap.frgutg.s_1.mates # assembled mates (both reads assembled)
366262 asm.posmap.frgutg.s_1.redundant_mates # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 29.7% redundancy
312206 asm.posmap.frgutg.s_1.redundant_mate_pairs # redundant assembled mate pairs
wc -l s_2.mates asm.posmap.frgutg.s_2.*
1034086 s_2.mates # input mates (from FRG file; 2 reads/mate)
255905 asm.posmap.frgutg.s_2.mates # assembled mates (both reads assembled)
110570 asm.posmap.frgutg.s_2.redundant_mates # redundant assembled mates (there is another s_1 assembled mate pair with overlapping reads) => 43.2% redundancy
1871747 asm.posmap.frgutg.s_2.redundant_mate_pairs # redundant assembled mate pairs
11937 asm.posmap.frgutg.s_2.redundant_mate_clusters # redundant assembled mate clusters
Runtimes
# step hours
0 gatekeeper 5.36
1 initialTrim 2.07
2 meryl 9.90
9 overlap.sh 11.31
11 find 0.00
16 overlapStore 13.79
17 consolidate 5.12
18 merge-trimming 1.41
19 chimera(1) 10.04
20 overlap.sh 49.86
21 find 0.00
22 overlapStore 16.68
23 frgcorr.sh 5.57
24 cat-corrects 0.10
25 ovlcorr.sh 4.64
26 cat-erates 0.90
27 overlapStore 6.67
28 buildUnitigs 13.88
29 gatekeeper 1.31
30 consensus.sh 3.41
31 cgw(1) 65.00
43 gatekeeper 4.39
44 consensus.sh 4.70
46 terminator 6.26
47 asmOutputFasta 7.75
48 dumpSingletons 3.68
49 buildPosMap(1) 3.79
Location
ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/ # assembly
ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor/asm.gkpStore.OBTCHIMERACLR : dumped OBT clear ranges, 0/1 undel/del
CA s_1..8 corr OBT redo (CBCB) *
Issues
- Fixed outie s_1,2 lib : clear ranges will be different
- s_3..8 : stay the same
runCA spec
doOverlapTrimming = 0 !!! already done in the prev assembly
saveOverlaps = 1 !!! it was actually set to 0 for this run; otherwise the ovb.gz files get deleted
ovlOverlapper = ovl # same
unitigger = bog # same
BatchSize = 2000000 # must be the same so overlaps are not redone
BlockSize = 2000000 # must be the same
...
Input
- Modified frg files:
- dumped from the prev assembly gatekeeper
- undelete act:D fragments so the ids stay the save
- reverse s_1,2 reads
Gatekeeper
Overlap
- Rerun only the overlap jobs for the s_1,2 reads
OverlapStore
- To build faster in the future
e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl # 13min/job
e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }' # 3 min/job
total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)
Frgcorr
- Initial failure: on 144 out of 151 jobs : the gatekeeper id's originally did not match (5 reads missing from s_5 lib shifted all the other reads)
Ovlcorr
- The error rate on overlaps with error went down by ~ 1.28%
Bog
cat 4-unitigger/asm.cga.0 | perl ~/bin/cga2utgLen.pl | getSummary.pl -i 0 -j 1
elem min q1 q2 q3 max mean n50 sum
60970504 47 85 150 150 93191 126.17 150 7692801690
Bog consensus
wc -l 4-unitigger/asm.partitioningInfo
110
grep FAILED 5-consensus/asm*err
5-consensus/asm_022.err:MultiAlignUnitig()-- Unitig 115198 FAILED.
Fix: same as before
Cgw
tail runCA.log
ERROR: Negative variance for chunk 59905266* in scaffold 6409 (AEnd=-163479.174700 BEnd=-163491.446700)
37980:Scaffold 6409:
37981- Idx Chunk ( Start, Variance) ( End, Variance) Len
37982- 0: 60983787 ( 0, 0) ( 31646, 823) 31646
37983- 1: 69691 ( 35389, 934) ( 31650, 837) 3739
37984- 2: 59907028 ( 35103, 942) ( 42035, 1122) 6932
37985- 3: 60799298 ( 41640, 1132) ( 64897, 1737) 23257
37986- 4: 60975976 ( 64924, 1784) ( 88439, 2395) 23515
37987- 5: 60988396 ( 88431, 2416) ( 122023, 3289) 33592
37988- 6: 60988698 ( 121551, 3324) ( 123589, 3377) 2038
37989- 7: 59905266* ( 122331, -163479) ( 121859, -163491) 472
37990- 8: 60988599 ( 122275, 170138) ( 134564, 170457) 12289
./AS_REZ/GapFillREZ.C:5666: int Estimate_Chunk_Ends(VarArrayStack_Entry_t*, int, int, LengthT*, LengthT*, ChunkInstanceT*, float*, Scaffold_Fill_t*, int*, int*, int*): Assertion `min_right_variance >= 0.0' failed.
./AS_REZ/GapFillREZ.c:8339: "ERROR: Negative variance for chunk %d in scaffold %d (AEnd=%f BEnd=%f)\n", #message
- Preliminary stats 7-0-CGW/rez.i26.log :
grep -B 1000000 "^Scaffold Edges" rezlog/rez.i26.log | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
grep -B 1000000 "^Scaffold Edges" rezlog/rez.i26.log | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg
grep -B 1000000 "^Scaffold Edges" rezlog/rez.i26.log | grep ':' | grep -v Scaff | getSummary.pl -i 8 -min 200 -t ctg200+
grep -B 1000000 "^Scaffold Edges" rezlog/rez.i26.log | grep ':' | p 'if(/^Scaff/) { print $count-1,"\n"; $count=0; } else { $count++}' | getSummary.pl -i 0 -min 1 -t scf.gaps
. elem min q1 q2 q3 max mean n50 sum
scf 7343 64 2946 11197 41936 523510 31581.31 81748 231901574
ctg 32315 64 1029 3415 8580 183629 7176.28 16619 231901574
ctg200+ 28499 200 1720 4241 9765 183629 8118.99 16699 231382982
scf.gaps 3622 1 2 4 9 121 6.89 12 24972
- Preliminary stats 7-0-CGW/rez.i36.log :
. elem min q1 q2 q3 max mean n50 sum
scf 1820 84 162 4583 82337 3161523 125884.07 667487 229109005 # ids from 0..13113
ctg 26453 64 962 3756 10485 255820 8759.59 21800 231717378
ctg200+ 28500 200 1720 4241 9765 183629 8118.71 16699 231383358
scf.gaps 877 1 4 12 36 373 27.71 64 24298
- Preliminary stats 7-2-CGW/rez.i01.log :
. elem min q1 q2 q3 max mean n50 sum
scf 4903 66 1283 6908 39466 1148626 47212.57 209756 231483215
ctg 23480 64 1370 4397 11757 255820 9858.75 23476 231483360
tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D unitigs | tail
maID isPresent isDeleted ptID svID fileOffset covStat
60970540 1 0 0 3 58023924696 84
tigStore -g asm.gkpStore -t asm.tigStore/ 32 -D contigs | tail -1
maID isPresent isDeleted ptID svID fileOffset
61001100 1 0 0 32 16307420
ECR
- Count scaffold gaps : total/closed
cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^examining gap' # 24635
cat 7-1-ECR/extendClearRanges-scaffold.*.err | grep -c '^closed gap' # 2803
Location
ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo/ # assembly
CA s_1..8 corr noOBT (CBCB) *
Issues
- Used outie s_1,2 lib ; needs to be redone
runCA spec
doOverlapTrimming = 0
ovlOverlapper = ovl
unitigger = utg # cmd unitigger
Overlap
cat asm.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print " $F[1]"}' | pretty -o
Lib 0 5' 212707959 51
Lib 0 3' 212707959 51
Lib 1 5' 7387977 39
Lib 1 3' 7387977 39
Lib 2 5' 1413291 40 ???
Lib 2 3' 1413291 45 ???
Lib 3 5' 46756517 51
Lib 3 3' 46756517 51
Lib 4 5' 46786172 52
Lib 4 3' 46786172 51
Lib 5 5' 46531611 52
Lib 5 3' 46531611 51
Lib 6 5' 46891503 52
Lib 6 3' 46891503 51
Lib 7 5' 16940888 52
Lib 7 3' 16940888 51
Bog
elem min q1 q2 q3 max mean n50 sum
71125093 66 95 150 150 23390 130.41 150 9275136839
Location
walnut:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.noOBT