Bumblebee
Jump to navigation
Jump to search
Data
- ~ 500B genome
- Complete mitochondrion genomes:
NC_011923.1 15468 14.67 Bombus hypocrita sapporoensis mitochondrion, complete genome NC_010967.1 16434 13.22 Bombus ignitus mitochondrion, complete genome only 88% identity; no rearrangements, only snps, short indels
Traces
- 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
Lane Insert ReadLen #Mates #Reads Coverage Comments 1 3K(2..6,avg 4K) 124 34,944,099 14X 865,687(1.2%) reads have qual==0 2 8K(7..9,avg 8K) 124 32,540,640 13X 3 400 124 34,745,750 . gDNA ; originally thought to be 500bp insert instead of 400 5 400 124 34,601,239 . 6 400 124 34,553,857 . 7 400 124 34,682,612 . 8 400 124 12,975,839 . 3-8 151,559,297 303,118,594 75X 1-8 219,044,036 438,088,072 108X
Two new 3kb libraries to come; ~ 70% are true pair ends
- Adaptors
>circularizarion CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA >circularizarion.revcomp TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG >5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG >3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
- Location
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[1235678]_[12]_sequence.txt : s_1-8 libs /fs/szattic-asmg4/Bees/Bombus_impatiens/3kb_Bimpatiens.txt : 88 reads
Tasks to figure out
- Erroneous reads/bases, which we need to correct or discard
- GC bias, so we can compute a-stats properly
- Redundancy in the long paired ends, which are lane 1 and lane 2.
- Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
- Some reads will begin in the circularization adaptor and thus will have only one usable read
- Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
- Most reads will have at least 36bp from each end before hitting the adaptor.
- Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
- A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
- Thus their mate distance is +400bp rather than -3kb or -8kb.
- It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
- It is possible to remove these reads by removing
- mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
- mate pairs where one read begins with the adapter sequence.
- We are working on this, up to now our paired assembly stages have been disappointing.
- Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
- There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
- Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
- But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
- Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
- You'll also need to throw out all the reads where one of the mates /starts/ with linker, I assume you'd do that anyway. We're also working on ways round this; one might be when we get a better assembly we'll find some better characteristic to filter the unrecircularized reads on.
Trimming
- Quality trimming: trim all bases with fastq quality eq "B" (0)
cat *fastq | ~/bin/fastq2clq.pl
- Adaptor trimming: Align all subsets to adaptors
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA 3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT 5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
id len gc% C 42 30.95 3 52 55.77 5 67 59.70
nucmer -l 8 -c 16 -b 32 -g 32 adaptors.seq ... delta-filter -l 16 -q ... cat *.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 ...
- Adaptor/primers median positions in the reads
Lib adaptorPos 5'primerPos 3'primerPos s_1 34-75 0-36 0-19 s_2 2-40 0-36 0-19
- Mate stats
Lib #mates adaptor.bothMates adaptor.noMate adaptor.oneMate adaptor.oneMate(filtered) s_1 34944099 269156 9804247 24870696 6048164(17.3%) s_2 32540640 91528 16181288 16267824 1061858( 3.2%) total 67484739 7110022(10.5%)
- Filtered read clr stats
Lib #mates #reads min q1 q2 q3 max mean n50 sum s_1 6048164 12096328 64 80 95 115 124 96.56 101 1,168,064,632 s_2 1061858 2123716 64 80 96 115 124 96.59 101 205,122,226 total 7110022 14220044 64 80 95 115 124 96.57 101 1,373,186,858
- Filtered read GC% stats
Lib #reads min q1 q2 q3 max mean n50 s_1 12096322 0.00 31.93 36.26 42.35 100.00 37.45 38 s_2 2123715 0.00 31.88 36.46 42.48 100.00 37.56 38
- Other frequent kmers
26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces 10mer : AAAAAAAAAA TTTTTTTTTT : ~32% of the traces 53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
- Location:
/fs/szattic-asmg4/Bees/Bombus_impatiens/
ginkgo:/scratch1/Bombus_impatiens/Data/error_free/frg/*frg #frg files /fs/szattic-asmg5/Bees/Bombus_impatiens/error_free/frg/*frg #frg files
D Kelly's trimming
- Reads
. elem min q1 q2 q3 max mean n50 sum s_1_1_sequence.cor.txt 5436814 30 71 90 115 124 89.70 100 487673256 s_1_2_sequence.cor.txt 5864225 30 77 92 110 124 93.27 98 546956864 s_2_1_sequence.cor.txt 1053858 30 80 96 118 124 97.16 102 102395785 s_2_2_sequence.cor.txt 1041846 30 78 93 110 124 93.51 99 97426679 s_3_1_sequence.cor.txt 33668302 30 105 124 124 124 111.04 124 3738530761 s_3_2_sequence.cor.txt 32554322 30 82 108 124 124 100.42 120 3269039697 s_5_1_sequence.cor.txt 33579744 30 109 124 124 124 111.65 124 3749192427 s_5_2_sequence.cor.txt 32465412 30 84 112 124 124 102.13 124 3315553171 s_6_1_sequence.cor.txt 33535725 30 109 124 124 124 111.60 124 3742602285 s_6_2_sequence.cor.txt 32390877 30 83 109 124 124 100.92 123 3268875471 s_7_1_sequence.cor.txt 33674235 30 112 124 124 124 112.19 124 3777917379 s_7_2_sequence.cor.txt 32518568 30 84 110 124 124 101.60 124 3303807321 s_8_1_sequence.cor.txt 11777821 30 113 124 124 124 112.30 124 1322658923 s_8_2_sequence.cor.txt 12176364 30 84 110 124 124 101.51 124 1236034348 total 301,738,113 . . . . . . . 31958664367 total (64bp+) 275,763,594 . . . . . . . 30693543885
- Mates
s_1.mates 5320163 s_2.mates 1034086 s_3.mates 31857304 s_5.mates 31820553 s_6.mates 31748713 s_7.mates 31891468 s_8.mates 11168258 total 144,840,545
- Files
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/frg/s_?.frg # frg /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/s_?_[12]_sequence.cor.txt # fastq
Assembly
CA.bog s_1
- Input : 31*2M s_1 reads trimmed to 100bp & formatted by fastqToCA
- Spec file
unitigger = bog ovlOverlapper = mer obtMerThreshold = 400 ovlMerThreshold = 120 doOverlapTrimming = 0
- Unitigger : max utg len=852bp
- Consensus after unitigger : 3 out of 129 jobs failed
- Location
/scratch1/Bombus_impatiens/Assembly/31M
CA.utg s_1,2 filtered
- Input: 7*2M s_1 & s_2 reads formatted by convert-fasta-to-v2.pl
- Filter:
- only one read from the mate contains the adaptor in the 64..124bp range
- both reads from the mate have good quality in the 0..64bp range
- Spec file
unitigger = utg ovlOverlapper = ovl obtMerThreshold = 400 ovlMerThreshold = 200 doOverlapTrimming = 0 => reads < 64 bp atre trimmed by the gatekeeper; no need to trim
- ovlStore stats
#ovls/read reads min q1 q2 q3 max mean n50 sum 14220044 0 1 5 31 514 31.30 131 445090932
- final stats
. elem min q1 q2 q3 max mean n50 sum ctg 2 1004 1004 1055 1055 1055 1029.50 1055 2059 deg 1433143 64 115 129 164 1028 145.79 145 208938949 assembled 7557472 singletons 6662572
- Mate redundancy
cat 9-terminator/asm.posmap.frgdeg | sort -nk2 -nk3 | perl ~/bin/posmap2ovl.pl | p 'chop $F[1]; chop $F[2]; print "$F[1] $F[2]\n"; print "$F[2] $F[1]\n";' | count.pl -m 2 | more
- Location
/scratch1/Bombus_impatiens/Assembly/14M.redo
SOAPdenovo s_3..8 orig (CBCB)
- Kmer=23
- Input: s_3,5,6,7,8 original reads
- Error correction: no
- Summary
. elem min q1 q2 q3 max mean n50 sum scf 571,114 100 106 115 141 156066 162.19 137 92,628,126 ctg 54049561 24 24 28 42 2049 35.04 35 1,893,660,606 gapSeq 49518 28 123 143 181 424 152.97 158 7,574,733
SOAPdenovo s_3..8 corr (CBCB)
- Kmer=31
- Input: s_3,5,6,7,8 reads
- Error correction: yes
- Summary
elem min q1 q2 q3 max mean n50 sum scf 106,217 100 140 434 2418 102317 2,177 7099 23,1261,742
SOAPdenovo s_3..8 corr (Illinois)
- Kmer=31
- Input: s_3,5,6,7,8 reads
- Error correction: yes; UIUC method
- Summary
. elem min q1 q2 q3 max mean n50 sum scf 17,550 13,460 236,225,169 ctg 4004102 31 97,618 1916 6020 230,333,709 ctg100bp+ 120167 100
SOAPdenovo s_1..8 corr (Illinois)
- Kmer=31
- Input: s_1,2,3,5,6,7,8 reads
- Error correction: yes; CBCB method
- Summary
. elem min q1 q2 q3 max mean n50 sum ctg200bp+ 43603 201 752 2786 9518 167534 7637.63 19895 333,023,446
- Locations
ftp://stan.cropsci.uiuc.edu/download/bombus1-k64-contigs.fa.min200 ginkgo:/scratch1/Bombus_impatiens/Assembly/SOAPdenovo.s_1-8.cor.UIUC/
CA s_1..8 corr OBT (CBCB)
- CA spec
doOverlapTrimming = 1 ovlOverlapper = ovl unitigger = bog *BatchSize = 2000000 *BlockSize = 2000000 ovlMemory = 8GB
- Meryl
meryl -Dh -s 0-mercounts/asm-C-ms22-cm0 Found 23686053415 mers. Found 265670791 distinct mers. Found 8682767 unique mers. ... #freq count % 1 8682767 0.0327 0.0004 2 4288602 0.0488 0.0007 72* 3803015 0.7150 0.3229 73 3797615 0.7293 0.3346
- OBT/Overlaps jobs
n=((301738113/2000000)+0.5)=151 OBT/OVL jobs = n*n/2 =151*76=11476 ~ 20 min/OBT job =~ 0.33 hr/OBT job => 0.33*11476/32=5 days
- OBT trimming
key sum count mean CLR 31958664367 301738113 105.915239043733 OBTINITIAL 30779687358 301610679 102.051052900551 OBTMERGE 30052401719 301610679 99.6397137483318 LATEST 30052138325 301610679 99.638840456972
- Failures:
- overlapStore: too many files (max 10240) ; fix: recompile the code
- chimera: FAILED file seems to be truncated but contains records from all libs; fix: link asm.chimera.report.FAILED asm.chimera.report
- Location:
ginkgo:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor
CA s_1..8 corr noOBT (CBCB)
- CA spec
doOverlapTrimming = 0 ovlOverlapper = ovl unitigger = utg
- Location:
walnut:/scratch1/Bombus_impatiens/Assembly/CA.s_1-8.cor.noOBT