Bumblebee
Jump to navigation
Jump to search
Data
- ~ 500B genome
- Complete mitochondrion genomes:
NC_011923.1 15468 14.67 Bombus hypocrita sapporoensis mitochondrion, complete genome NC_010967.1 16434 13.22 Bombus ignitus mitochondrion, complete genome only 88% identity; no rearrangements, only snps, short indels
Traces
- 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
Lane Insert ReadLen #Mates Coverage Comments 1 3K(2..6,avg 4K) 124 34,944,099 14X 865,687(1.2%) reads have qual==0 2 8K(7..9,avg 8K) 124 32,540,640 13X 3 500(450..600) 124 34,745,750 # gDNA 5 500 34,601,239 6 500 34,553,857 7 500 34,682,612 8 500 12,975,839
- Adaptors
>circularizarion CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA >circularizarion.revcomp TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG >5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG >3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Tasks to figure out
- Erroneous reads/bases, which we need to correct or discard
- GC bias, so we can compute a-stats properly
- Redundancy in the long paired ends, which are lane 1 and lane 2.
- Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
- Some reads will begin in the circularization adaptor and thus will have only one usable read
- Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
- Most reads will have at least 36bp from each end before hitting the adaptor.
- Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
- A small but significant number of reads from the 3kb and 8kb libraries are not recircularized.
- Thus their mate distance is +400bp rather than -3kb or -8kb.
- It's apparently the result of a faulty batch of cre recombinase. This causes problems with contiging and scaffolding.
- It is possible to remove these reads by removing
- mate pairs where neither read is trimmed (thus no adapter ligation may have occurred)
- mate pairs where one read begins with the adapter sequence.
- We are working on this, up to now our paired assembly stages have been disappointing.
- Matt's idea is to exclude all mate pair reads that don't have evidence of the linker with flanking useful sequence, as a way to avoid uncircularized molecules that will give misleading "mate pairs" only 400 bp apart.
- There has been no trimming of the adaptor, which is the 42 base 454 adaptor, so its presence can be used to indicate potentially good mate pairs.
- Even tossing half the mate pairs might not be a problem, as we have perhaps too many anyway.
- But you will also need to toss redundant mate pairs, and that will indeed reduce the total a lot.
- Just to be clear - the 500 base mate pairs should have no such problems, except that as Matt has found from his preliminary assembly, the mean fragment length is actually 400 bp rather than 500 bp (and the 3 and 8 kb PE reads are typically shorter than nominally given, e.g. more like 2.5 and 6 kb).
Trimming
- Keep only the first 100bp (last 24 bp are anyway low qual) otherwise gatekeeper "Seg fault" ? Too much seq discarded
- Quality trimming
cat s_1_*_sequence.*.txt | ~/bin/fastq2clb.pl > s_1_sequence.clb
- Adaptor trimming: Align all subsets to adaptors
>C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA >3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT >5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
id len gc% C 42 30.95 3 52 55.77 5 67 59.70
nucmer -l 8 -c 16 -b 8 -g 8 adaptors.seq s_1_1_sequence.00.seq -p s_1_1_sequence.00 delta-filter -l 16 -q s_1_1_sequence.00.delta > s_1_1_sequence.00.filter-q.delta ... cat s_1_*_sequence.*.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 -minLen 64 > s_1_sequence.clv
Adaptor positions . elem min q1 q2 q3 max mean %elem C.5 25181056 0 2 34 68 108 38 36% C.3 25181056 16 43 75 108 124 75 5.5 374742 0 0 0 0 108 3 0.53% 5.3 374742 17 36 36 67 124 46 3.5 143332 0 0 0 11 108 10 0.20% 3.3 143332 16 18 19 28 124 30
- Stats
. elem <=64 >64 min q1 q2 q3 max mean n50 sum orig 69888198 0 69888198 124 124 124 124 124 124 124 8666136552 clq 69888198 7724022 62164176 0 89 111 124 124 96.76 117 6762346722 clv 69888198 18607136 51281062 0 0 124 124 124 86.96 124 6077231064 clr 69888198 24677952 45210246 0 0 88 115 124 67.31 113 4704368689
- Other frequent kmers
26mer : ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT : ~10% of the traces 10mer : AAAAAAAAAA TTTTTTTTTT : ~32% of the traces 53mer: CGATTTCCATGGCGTCGTTTGAGGATTCCAATACGGCGAACCTGTTGTGAGTG : ~2% of the mito seqs (either begin or end); not present in the 2 complete mito's (probably ok)
- Location:
/fs/szattic-asmg4/Bees/Bombus_impatiens
D Kelly's trimming
438088072 total reads 109166398 reads were thrown away 148886138 reads were corrected and/or trimmed (to a min length of 30 bp)
Assembly
- Trimming
No OBT adaptors in the seqs
- Kmers
meryl -Dh -s 0-mercounts/asm-C-ms22-cm1 >! 22mers.hist Found 3136399464 mers. Found 379123530 distinct mers. Found 201257394 unique mers. Largest mercount is 12006651; 90 mers are too big for histogram.
countKmers.pl most frequent 42mer : CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA ~ 20% of the seqs : circularization adapter
- Overlapper
#overlaps/read reads 0count min q1 q2 q3 max mean n50 sum 62164168 21589472 0 0 4 12 324 11.42 38 709902310
- Unitigger : max utg len=852bp
- Consensus after unitigger : 3 out of 129 jobs failed
- Location
/fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA