Bumblebee: Difference between revisions
Revision as of 15:20, 4 March 2010
Data
- ~ 500B genome
- 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
Lane  Insert              ReadLen  #Reads
1     3Kbp(2..6,avg 4)    124      34,944,099
2     8Kbp(7..9,avg 8)    124      32,540,640
3     gDNA ~500                    34,745,750
5     gDNA                         34,601,239
6     gDNA                         34,553,857
7     gDNA                         34,682,612
8     gDNA                         12,975,839
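As a quick sanity check on the lane table, the per-lane counts can be totalled. This is a small Python sketch, not part of the original notes. The dictionary just restates the #Reads column, lanes 1 and 2 are treated as the long paired-end lanes, and whether #Reads counts single reads or pairs is not stated here, so the totals are plain sums of the listed numbers.

```python
# Per-lane counts copied from the #Reads column (lane 4 was not used).
READS_PER_LANE = {
    1: 34_944_099,
    2: 32_540_640,
    3: 34_745_750,
    5: 34_601_239,
    6: 34_553_857,
    7: 34_682_612,
    8: 12_975_839,
}

# Lanes holding the long paired ends (3Kbp and 8Kbp inserts).
LONG_INSERT_LANES = {1, 2}

total = sum(READS_PER_LANE.values())
long_total = sum(n for lane, n in READS_PER_LANE.items() if lane in LONG_INSERT_LANES)
gdna_total = total - long_total

if __name__ == "__main__":
    print(f"all lanes   : {total:,}")
    print(f"long inserts: {long_total:,}")
    print(f"gDNA lanes  : {gdna_total:,}")
```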
- Tasks to figure out:
1. Erroneous reads/bases, which we need to correct or discard
2. GC bias, so we can compute a-stats properly
3. Redundancy in the long paired ends, which are lane 1 and lane 2.
- The 454 protocol was used to circularize the DNA for sequencing on the Illumina instrument. As a result:
  - Some reads begin in the circularization adaptor, so only one read of the pair is usable.
  - Some reads have only a few bases of DNA sequence before hitting the circularization adaptor.
  - Most reads have at least 36bp from each end before hitting the adaptor.
  - Many reads have no adaptor to trim (>125bp of DNA sequence at both ends of the adaptor).
- circularization adaptors
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG - revcomp -> CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
AGCATATTGAAGCATATTACATACGATATGCTTCAATAATGC
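The adaptor cases described above can be sketched in Python as follows. This is an illustration only, not the trimming procedure actually used. The function names are made up for this example, the search is an exact full-length match (real reads would need mismatch tolerance and partial matches at the read end), and the 36bp threshold is borrowed from the note about usable read length.

```python
# Complement table for DNA; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

ADAPTOR = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"
ADAPTOR_RC = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"


def revcomp(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify(read: str, min_usable: int = 36) -> str:
    """Classify a read by where the adaptor (either strand) first appears.

    Hypothetical categories for illustration:
      no_adaptor        - adaptor not found, nothing to trim
      starts_in_adaptor - read begins in the adaptor
      short_prefix      - fewer than min_usable bases before the adaptor
      usable_prefix     - at least min_usable bases before the adaptor
    """
    hits = [p for p in (read.find(ADAPTOR), read.find(ADAPTOR_RC)) if p >= 0]
    if not hits:
        return "no_adaptor"
    start = min(hits)
    if start == 0:
        return "starts_in_adaptor"
    if start < min_usable:
        return "short_prefix"
    return "usable_prefix"


if __name__ == "__main__":
    # The two adaptor strings listed above are reverse complements of each other.
    print(revcomp(ADAPTOR) == ADAPTOR_RC)
    print(classify("ACGT" * 10 + ADAPTOR))
```

A read classified as `usable_prefix` or `short_prefix` would be cut at the adaptor start; whether short prefixes are kept is a policy choice that these notes do not settle.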
- Formatting: keep only the first 100bp of each read (the last 24bp are low quality anyway)
- Location:
/fs/szattic-asmg4/Bees/Bombus_impatiens
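The "keep only the first 100bp" formatting step can be sketched in Python as below. It assumes the reads are stored as plain 4-line FASTQ records (header, sequence, `+`, qualities) with no line wrapping, which these notes do not state. `trim_fastq` is a hypothetical helper name, not an existing tool.

```python
from typing import Iterable, Iterator


def trim_fastq(lines: Iterable[str], keep: int = 100) -> Iterator[str]:
    """Yield FASTQ lines with sequence and quality cut to the first `keep` bases.

    Assumes strict 4-line records; trailing newlines are stripped from the output.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield header
            yield seq[:keep]   # first `keep` bases
            yield plus
            yield qual[:keep]  # matching quality values
            record = []


if __name__ == "__main__":
    # Toy 124bp record; a real run would iterate over an open FASTQ file instead.
    toy = ["@read1\n", "A" * 124 + "\n", "+\n", "I" * 124 + "\n"]
    for out_line in trim_fastq(toy):
        print(out_line)
```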
Assembly
- Meryl
meryl -Dh -s 0-mercounts/asm-C-ms22-cm1 >! 22mers.hist
Found 3136399464 mers.
Found 379123530 distinct mers.
Found 201257394 unique mers.
Largest mercount is 12006651; 90 mers are too big for histogram.
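The meryl summary above can be turned into a couple of ratios with a short Python sketch. The parsing is tied to the exact wording of the output shown here, and the derived quantities (fraction of distinct 22-mers seen only once, mean occurrences per distinct 22-mer) are illustrative summaries chosen for this example, not numbers reported by meryl itself.

```python
import re

# Summary lines copied from the meryl output above.
MERYL_SUMMARY = """\
Found 3136399464 mers.
Found 379123530 distinct mers.
Found 201257394 unique mers.
"""


def parse_summary(text: str) -> dict:
    """Extract the total, distinct and unique mer counts from the summary text."""
    counts = {}
    for label, key in (("", "total"), ("distinct ", "distinct"), ("unique ", "unique")):
        match = re.search(rf"Found (\d+) {label}mers\.", text)
        if match is None:
            raise ValueError(f"missing '{key}' line in meryl summary")
        counts[key] = int(match.group(1))
    return counts


def summarize(counts: dict) -> dict:
    """Derive simple ratios from the parsed counts."""
    return {
        # Share of distinct 22-mers that occur exactly once.
        "unique_fraction": counts["unique"] / counts["distinct"],
        # Average number of occurrences per distinct 22-mer.
        "mean_multiplicity": counts["total"] / counts["distinct"],
    }


if __name__ == "__main__":
    stats = summarize(parse_summary(MERYL_SUMMARY))
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")
```

A large share of single-occurrence k-mers is often attributed to sequencing errors, which would tie in with task 1 above, but that interpretation is an assumption here.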
- countKmers
most frequent 22mer : AGCATACATTATACGAAGTTAT ~ 16% of the seqs
most frequent 42mer : CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA ~ 10% of the seqs (pPAC7.9124-9165)
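Both frequent k-mers reported above come from the circularization adaptor: the 42-mer is the adaptor's reverse complement and the 22-mer is a substring of it. The Python sketch below checks that and shows a naive way to count k-mers and the share of reads containing one. It is a stand-in for illustration, not the countKmers tool itself, and it only suits small inputs.

```python
from collections import Counter
from typing import Iterable

ADAPTOR_RC = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"  # most frequent 42mer
TOP_22MER = "AGCATACATTATACGAAGTTAT"                        # most frequent 22mer


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def fraction_containing(reads: Iterable[str], kmer: str) -> float:
    """Fraction of reads that contain `kmer` at least once."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(kmer in read for read in reads) / len(reads)


if __name__ == "__main__":
    # The 22-mer sits inside the adaptor reverse complement.
    print(TOP_22MER in ADAPTOR_RC)

    # Toy reads for demonstration only; not real data.
    toy_reads = ["GG" + TOP_22MER + "CC", "ACGT" * 8, TOP_22MER, "TTTT"]
    print(count_kmers(toy_reads, 22)[TOP_22MER])
    print(fraction_containing(toy_reads, TOP_22MER))
```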
- Location
/fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA