Bumblebee
Revision as of 12:11, 5 March 2010
Data
- ~ 500B genome
Original
- 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
Lane  Insert            ReadLen  #Reads
1     3K(2..6,avg 4K)   124      34,944,099
2     8K(7..9,avg 8K)   124      32,540,640
3     500(450..600)     124      34,745,750  # gDNA
5     500                        34,601,239
6     500                        34,553,857
7     500                        34,682,612
8     500                        12,975,839
- Adaptors
>circularizarion
CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
>circularizarion.revcomp
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
>5
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
>3
CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
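As a sanity check on the adaptor file, the two circularization entries should be reverse complements of each other. A minimal sketch, assuming plain upper-case A/C/G/T sequences (no IUPAC ambiguity codes); the helper name `reverse_complement` is made up for this example:

```python
# Complement table for unambiguous DNA bases (assumption: no IUPAC codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Sequences copied from the adaptor list above.
circ = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"
circ_revcomp = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"

# True when the two listed entries are consistent with each other.
print(reverse_complement(circ) == circ_revcomp)
```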
Tasks to figure out
- Erroneous reads/bases, which we need to correct or discard
- GC bias, so we can compute a-stats properly
- Redundancy in the long paired ends, which are lane 1 and lane 2.
- Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
  - Some reads will begin in the circularization adaptor and thus will have only one usable read
  - Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
  - Most reads will have at least 36bp from each end before hitting the adaptor.
  - Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
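The four read categories listed above can be illustrated with a small classifier. This is only a sketch, not the pipeline used here: it looks for exact 16-mer seeds of the circularization adaptor (either strand) instead of doing a real alignment, the seed length `k=16` is an arbitrary choice, and all function names and category labels are hypothetical. Only the 36bp threshold comes from the notes.

```python
ADAPTOR = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"
_COMP = str.maketrans("ACGT", "TGCA")


def adaptor_seeds(adaptor, k=16):
    """Collect every k-mer of the adaptor and of its reverse complement."""
    rc = adaptor.translate(_COMP)[::-1]
    return {s[i:i + k] for s in (adaptor, rc) for i in range(len(s) - k + 1)}


def first_adaptor_hit(read, seeds, k=16):
    """Return the first read position whose k-mer is an adaptor seed, or None."""
    for i in range(len(read) - k + 1):
        if read[i:i + k] in seeds:
            return i
    return None


def classify_read(read, seeds, k=16, min_usable=36):
    """Assign a read to one of the four cases described in the list above."""
    pos = first_adaptor_hit(read, seeds, k)
    if pos is None:
        return "no_adaptor"          # nothing to trim
    if pos == 0:
        return "starts_in_adaptor"   # read begins inside the adaptor
    if pos < min_usable:
        return "short_prefix"        # only a few bases before the adaptor
    return "usable_prefix"           # at least min_usable bases before it


seeds = adaptor_seeds(ADAPTOR)
print(classify_read("ACGT" * 10 + ADAPTOR, seeds))
```

Exact seed matching will miss adaptor copies with sequencing errors inside every seed window, and adaptor fragments shorter than `k` at the end of a read, which is one reason an aligner is the more robust choice on real data.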
Trimming
- Keep only the first 100bp of each read (the last 24bp are low quality anyway); otherwise gatekeeper fails with a "Seg fault"
- Adaptor trimming:
  - Split the data set into 1M-read subsets
  - Align all subsets to adaptors
nucmer -l 8 -c 16 -b 8 -g 8 adaptors.seq subset.seq -p subset
delta-filter -l 16 -q subset.delta > subset.filter-q.delta
  - Identify CLV
~/bin/delta2clr53.pl -5 5,3 -minLen 64 -maxLen 100 < subset.filter-q.delta > subset.clv
- Location:
/fs/szattic-asmg4/Bees/Bombus_impatiens
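The first two trimming steps (keep the first 100bp, then split into 1M-read subsets) could be sketched as below. This assumes the reads are in 4-line FASTQ records, which the notes do not state; the function names are hypothetical, and writing each chunk out to a subset file is left to the caller.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, seq, plus, qual) tuples; assumes 4 lines per record."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:
            return  # end of input (an incomplete trailing record is dropped)
        yield record


def truncate(record, keep=100):
    """Keep only the first `keep` bases and their quality values."""
    header, seq, plus, qual = record
    return header, seq[:keep], plus, qual[:keep]


def chunked(records, size=1_000_000):
    """Group records into lists of at most `size` records each."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Toy input: five 124bp reads, split into subsets of two reads.
toy = []
for i in range(5):
    toy += [f"@read{i}", "A" * 124, "+", "I" * 124]

trimmed = (truncate(r) for r in read_fastq(toy))
for n, subset in enumerate(chunked(trimmed, size=2)):
    print(n, len(subset), len(subset[0][1]))
```

With a real file, `read_fastq(open(path))` streams the records, so only one subset is held in memory at a time.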
Assembly
- Trimming
No OBT adaptors in the seqs
- Kmers
meryl -Dh -s 0-mercounts/asm-C-ms22-cm1 >! 22mers.hist
Found 3136399464 mers.
Found 379123530 distinct mers.
Found 201257394 unique mers.
Largest mercount is 12006651; 90 mers are too big for histogram.
countKmers.pl
most frequent 42mer : CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA ~ 20% of the seqs : circularization adaptor
- Unitigger : max utg len=852bp
- Consensus after unitigger : 3 out of 129 jobs failed
- Location
/fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA
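To make the three mer totals in the Kmers step concrete: "mers" is the total number of k-mer occurrences, "distinct" is the number of different k-mers, and "unique" is the number of k-mers seen exactly once. A minimal in-memory sketch (fine for toy input, not for a data set of this size; `count_kmers` and `summarize` are hypothetical names and are not part of meryl or countKmers.pl):

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer occurrence over an iterable of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def summarize(counts):
    """Return (total, distinct, unique) in the sense used by the meryl output."""
    total = sum(counts.values())
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return total, distinct, unique


# Toy example: the most frequent k-mer plays the role of the adaptor 42-mer.
toy_counts = count_kmers(["ACGTACGT", "ACGTTTTT"], 4)
print(summarize(toy_counts), toy_counts.most_common(1))

# Ratios derived from the 22-mer figures reported above.
total, distinct, unique = 3136399464, 379123530, 201257394
print(f"unique / distinct = {unique / distinct:.3f}")
print(f"mean occurrences per distinct mer = {total / distinct:.2f}")
```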